Scrapling Web Scraping — MCP-Native Guidance

Guidance Layer + MCP Integration
Use this skill for strategy and patterns. For execution, call Scrapling's MCP server via mcporter.

Quick Start (MCP)

1. Install Scrapling with MCP support

CODEBLOCK0

2. Add to OpenClaw MCP config

CODEBLOCK1

3. Call via mcporter

CODEBLOCK2

Execution vs Guidance

Task	Tool	Example
Fetch a page	mcporter	mcporter call scrapling fetch_page --url URL
Extract with CSS

Fetcher Selection Guide

CODEBLOCK3

Decision Tree

1. Static HTML? → Fetcher (10-100x faster than a browser)
2. Need JS execution? → DynamicFetcher
3. Getting blocked? → StealthyFetcher
4. Complex session? → Use Session variants
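The decision tree can be written down as a small helper. This is only a sketch of the selection logic: `choose_fetcher` and its flags are hypothetical names, not part of Scrapling, and the function returns class names as strings rather than importing anything.

```python
def choose_fetcher(needs_js: bool = False,
                   blocked: bool = False,
                   needs_session: bool = False) -> tuple[str, bool]:
    """Return (fetcher name, use a session variant?) per the decision tree."""
    if blocked:
        # Anti-bot protection takes priority over everything else.
        name = "StealthyFetcher"
    elif needs_js:
        # The page only renders its content after JavaScript runs.
        name = "DynamicFetcher"
    else:
        # Static HTML: plain HTTP is the cheapest option.
        name = "Fetcher"
    return name, needs_session


print(choose_fetcher())                       # static page
print(choose_fetcher(needs_js=True))          # SPA
print(choose_fetcher(blocked=True, needs_session=True))
```

Checking `blocked` before `needs_js` reflects the assumption that the stealth fetcher is also browser-based, so it covers JavaScript rendering as well.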

MCP Fetch Modes

- fetch_page: HTTP fetcher
- fetch_dynamic: Browser-based with Playwright
- fetch_stealthy: Anti-bot bypass mode
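When mcporter is driven from a script, the fetch mode is just the tool name in the command line. The sketch below builds the argument list for `mcporter call scrapling <mode> --url <url>`. `build_mcporter_command` is a hypothetical helper, and it assumes all three modes accept `--url` the same way `fetch_page` does in the Quick Start.

```python
import shlex

# Tool names listed under "MCP Fetch Modes".
MODES = ("fetch_page", "fetch_dynamic", "fetch_stealthy")


def build_mcporter_command(mode: str, url: str) -> list[str]:
    """Build the argv for one mcporter call; does not execute anything."""
    if mode not in MODES:
        raise ValueError(f"unknown fetch mode: {mode!r}")
    return ["mcporter", "call", "scrapling", mode, "--url", url]


cmd = build_mcporter_command("fetch_page", "https://example.com")
print(shlex.join(cmd))
```

The list can be passed to `subprocess.run(cmd)` as is. Keeping it as a list avoids shell-quoting problems with URLs that contain `&` or `?`.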

Anti-Bot Escalation Ladder

Level 1: Polite HTTP

CODEBLOCK4

Level 2: Session Persistence

CODEBLOCK5

Level 3: Stealth Mode

CODEBLOCK6

Level 4: Proxy Rotation

See references/proxy-rotation.md
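The ladder is meant to be climbed one level at a time, moving up only when the cheaper level fails. A minimal sketch of that control flow follows. The fetch callables are stubs, and `fetch_with_escalation` plus the level labels are hypothetical names; in practice each callable would wrap the matching Scrapling fetcher or MCP tool.

```python
from typing import Callable, Optional

# A fetch attempt returns the page body, or None when the request looks blocked.
FetchAttempt = Callable[[str], Optional[str]]


def fetch_with_escalation(url: str,
                          ladder: list[tuple[str, FetchAttempt]]) -> tuple[str, str]:
    """Try each level in order and return (level name, body) for the first success."""
    for level, attempt in ladder:
        body = attempt(url)
        if body is not None:
            return level, body
    raise RuntimeError(f"all {len(ladder)} levels were blocked for {url}")


# Stub attempts standing in for real fetchers.
def always_blocked(url: str) -> Optional[str]:
    return None


def succeeds(url: str) -> Optional[str]:
    return f"<html>{url}</html>"


ladder = [
    ("polite_http", always_blocked),
    ("session", always_blocked),
    ("stealth", succeeds),
]
level, body = fetch_with_escalation("https://example.com", ladder)
print(level)
```

Starting at the lowest level keeps most requests on the fast HTTP path and reserves browser-based fetching for pages that actually need it.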

Adaptive Scraping (Anti-Fragile)

Scrapling can survive website redesigns using adaptive selectors:

CODEBLOCK7

MCP usage:
CODEBLOCK8
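To make the idea concrete: an adaptive selector stores a description of the matched element and, when the selector later stops matching, looks for the most similar element in the new DOM. The sketch below illustrates that idea with the standard library only. It is not Scrapling's algorithm; `fingerprint`, `relocate`, and the 0.6 threshold are assumptions chosen for the example.

```python
from difflib import SequenceMatcher
from typing import Optional


def fingerprint(tag: str, classes: list[str], text: str) -> str:
    """Flatten an element's tag, classes, and text into one comparable string."""
    return f"{tag}|{' '.join(sorted(classes))}|{text.strip()}"


def relocate(saved: str, candidates: list[dict],
             threshold: float = 0.6) -> Optional[dict]:
    """Return the candidate most similar to the saved fingerprint, if similar enough."""
    best, best_score = None, 0.0
    for candidate in candidates:
        score = SequenceMatcher(None, saved, fingerprint(**candidate)).ratio()
        if score > best_score:
            best, best_score = candidate, score
    return best if best_score >= threshold else None


# First run: remember what the matched element looked like.
saved = fingerprint("div", ["product"], "Blue Widget $9.99")

# Later run: the site was redesigned and ".product" no longer matches.
redesigned = [
    {"tag": "div", "classes": ["footer"], "text": "Contact us"},
    {"tag": "article", "classes": ["product-card"], "text": "Blue Widget $9.99"},
]
match = relocate(saved, redesigned)
print(match["classes"] if match else "no match")
```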

Spider Framework (Large Crawls)

When to use Spiders vs direct fetching:

- ✅ Spider: 10+ pages, concurrency needed, resume capability, proxy rotation
- ✅ Direct: 1-5 pages, quick extraction, simple flow

Basic Spider Pattern

CODEBLOCK9

Advanced: Multi-Session Spider

CODEBLOCK10

Spider Features

- Pause/Resume: crawldir parameter saves checkpoints
- Streaming: async for item in spider.stream() for real-time processing
- Auto-retry: Configurable retry on blocked requests
- Export: Built-in to_json(), to_jsonl()
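Streaming means items are handled as they are produced, not collected until the crawl finishes. The sketch below shows the consuming side of that pattern with plain asyncio. `stream_items` is a hypothetical stand-in for `spider.stream()`, so the example runs without Scrapling installed.

```python
import asyncio
from typing import AsyncIterator


async def stream_items() -> AsyncIterator[dict]:
    """Hypothetical stand-in for spider.stream(): yields items one at a time."""
    for i in range(3):
        await asyncio.sleep(0)  # give control back to the loop, as network I/O would
        yield {"id": i}


async def consume() -> list[int]:
    seen = []
    async for item in stream_items():
        # Each item is processed as soon as it arrives.
        seen.append(item["id"])
    return seen


ids = asyncio.run(consume())
print(ids)
```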

CLI & Interactive Shell

Terminal Extraction (No Code)

CODEBLOCK11

Interactive Shell

CODEBLOCK12

Parser API (Beyond CSS/XPath)

BeautifulSoup-Style Methods

CODEBLOCK13

Auto-Generated Selectors

CODEBLOCK14

Proxy Rotation

CODEBLOCK15

Common Recipes

Pagination Patterns

CODEBLOCK16
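Two common pagination patterns are numbered pages and a next-page link. The URL handling for both can be done with the standard library, as sketched here. `page_urls` and `resolve_next` are hypothetical helpers, and the `page` query parameter is an assumption about the target site.

```python
from typing import Optional
from urllib.parse import urlencode, urljoin


def page_urls(base: str, first: int, last: int, param: str = "page") -> list[str]:
    """Numbered pages: build one URL per page number, inclusive."""
    sep = "&" if "?" in base else "?"
    return [f"{base}{sep}{urlencode({param: n})}" for n in range(first, last + 1)]


def resolve_next(current_url: str, href: Optional[str]) -> Optional[str]:
    """Next-page link: turn a possibly relative href into an absolute URL."""
    return urljoin(current_url, href) if href else None


print(page_urls("https://example.com/products", 1, 3))
print(resolve_next("https://example.com/products?page=1", "?page=2"))
```

`resolve_next` returns `None` when no next link was found, which gives the crawl loop a natural stopping condition.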

Login Sessions

CODEBLOCK17

Next.js Data Extraction

CODEBLOCK18
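As background for this recipe: Next.js pages built with the Pages Router embed their initial props as JSON in a `<script id="__NEXT_DATA__">` tag, so the data can often be read without rendering the page. The sketch below uses only the standard library. It may differ from the recipe the author intended, and `extract_next_data` is a hypothetical helper.

```python
import json
import re
from typing import Optional

# Matches the JSON payload inside <script id="__NEXT_DATA__" ...>...</script>.
NEXT_DATA_RE = re.compile(
    r'<script[^>]*\bid="__NEXT_DATA__"[^>]*>(.*?)</script>',
    re.DOTALL,
)


def extract_next_data(html: str) -> Optional[dict]:
    """Return the parsed __NEXT_DATA__ payload, or None if the tag is absent."""
    match = NEXT_DATA_RE.search(html)
    return json.loads(match.group(1)) if match else None


sample = (
    '<html><body><div id="__next"></div>'
    '<script id="__NEXT_DATA__" type="application/json">'
    '{"props": {"pageProps": {"products": [{"name": "Widget", "price": 9.99}]}}}'
    '</script></body></html>'
)
data = extract_next_data(sample)
print(data["props"]["pageProps"]["products"][0]["name"])
```

Pages built with the newer App Router do not necessarily include this tag, so the `None` case should be handled.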

Output Formats

CODEBLOCK19
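For scraped items, the two formats that come up most often are a single JSON array and JSON Lines (one object per line). The difference is sketched below with the standard library. `to_json_text` and `to_jsonl_text` are hypothetical helpers and do not show how Scrapling's own exporters are implemented.

```python
import json

items = [
    {"name": "Widget", "price": "9.99"},
    {"name": "Gadget", "price": "19.50"},
]


def to_json_text(rows: list[dict]) -> str:
    """One JSON array: easy to load in one call, but must be written in one piece."""
    return json.dumps(rows, indent=2)


def to_jsonl_text(rows: list[dict]) -> str:
    """JSON Lines: one object per line, so a crawl can append as it goes."""
    return "".join(json.dumps(row) + "\n" for row in rows)


print(to_json_text(items))
print(to_jsonl_text(items), end="")
```

JSON Lines tends to suit large or resumable crawls, since a partial file is still valid up to its last complete line.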

Performance Tips

1. Use HTTP fetcher when possible: 10-100x faster than browser
2. Impersonate browsers: impersonate='chrome' for TLS fingerprinting
3. HTTP/3 support: INLINECODE15
4. Limit resources: disable_resources=True in Dynamic/Stealthy
5. Connection pooling: Reuse sessions across requests

Guardrails (Always)

- Only scrape content you're authorized to access
- Respect robots.txt and ToS
- Add delays (download_delay) for large crawls
- Don't bypass paywalls or authentication without permission
- Never scrape personal/sensitive data
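The robots.txt and delay guardrails can be checked in code before a crawl starts. A minimal sketch using Python's `urllib.robotparser` follows. The robots.txt content and the `my-bot` user agent are made up for the example, and `allowed_by_robots` is a hypothetical helper.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content; a real crawl would download it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def load_robots(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_by_robots(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    return parser.can_fetch(user_agent, url)


robots = load_robots(ROBOTS_TXT)
print(allowed_by_robots(robots, "my-bot", "https://example.com/products"))
print(allowed_by_robots(robots, "my-bot", "https://example.com/private/report"))
print(robots.crawl_delay("my-bot"))
```

When the site declares a crawl delay, it is a reasonable lower bound for `download_delay`.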

References

- references/mcp-setup.md: Detailed MCP configuration
- INLINECODE19: Anti-bot handling strategies
- references/proxy-rotation.md: Proxy setup and rotation
- INLINECODE21: Advanced crawling patterns
- INLINECODE22: Quick API reference
- INLINECODE23: Official docs links

Scripts

- scripts/scrapling_scrape.py: Quick one-off extraction
- INLINECODE25: Test connectivity and anti-bot indicators

Scrapling Web Scraping: MCP-Native Guide

Guidance Layer + MCP Integration
Use this skill for strategy and pattern design. For execution, call Scrapling's MCP server via mcporter.

Quick Start (MCP)

1. Install Scrapling with MCP support

```bash
pip install "scrapling[mcp]"
```

Or install with full features:

```bash
pip install "scrapling[mcp,playwright]"
python -m playwright install chromium
```

2. Add to OpenClaw MCP config

```json
{
  "mcpServers": {
    "scrapling": {
      "command": "python",
      "args": ["-m", "scrapling.mcp"]
    }
  }
}
```

3. Call via mcporter

```bash
mcporter call scrapling fetch_page --url https://example.com
```

Execution vs Guidance

Task	Tool	Example
Fetch a page	mcporter	mcporter call scrapling fetch_page --url URL
Extract with CSS

Fetcher Selection Guide

```
┌─────────────────┐     ┌──────────────────┐     ┌──────────────────┐
│ Fetcher         │────▶│ DynamicFetcher   │────▶│ StealthyFetcher  │
│ (HTTP)          │     │ (browser/JS)     │     │ (anti-bot)       │
└─────────────────┘     └──────────────────┘     └──────────────────┘
  Fastest                 JS rendering             Cloudflare,
  Static pages            SPA, React/Vue           Turnstile, etc.
```

Decision Tree

1. Static HTML? → Fetcher (10-100x faster)
2. Need JS execution? → DynamicFetcher
3. Getting blocked? → StealthyFetcher
4. Complex session? → Use Session variants

MCP Fetch Modes

- fetch_page: HTTP fetcher
- fetch_dynamic: Browser-based with Playwright
- fetch_stealthy: Anti-bot bypass mode

Anti-Bot Escalation Ladder

Level 1: Polite HTTP

```python
# MCP call: fetch_page with options
{
    "url": "https://example.com",
    "headers": {"User-Agent": "..."},
    "delay": 2.0,
}
```

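Level 2: Session Persistence

```python
from scrapling.fetchers import FetcherSession

# Use a session to keep cookies/state across requests
FetcherSession(impersonate="chrome")  # TLS fingerprint impersonation
```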

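Level 3: Stealth Mode

```python
from scrapling.fetchers import StealthyFetcher

url = "https://example.com"

# MCP: fetch_stealthy
StealthyFetcher.fetch(
    url,
    headless=True,
    solve_cloudflare=True,  # auto-solve Turnstile
    network_idle=True,
)
```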

Level 4: Proxy Rotation

See references/proxy-rotation.md

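Adaptive Scraping (Anti-Fragile)

Scrapling can cope with website redesigns by using adaptive selectors:

```python
# page is the response returned by a fetcher

# First run: save fingerprints
products = page.css(".product", auto_save=True)

# Later runs: relocate automatically if the DOM changed
products = page.css(".product", adaptive=True)
```

MCP usage:

```bash
mcporter call scrapling css_select \
  --selector ".product" \
  --adaptive true \
  --auto-save true
```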

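Spider Framework (Large Crawls)

When to use Spiders vs direct fetching:

- ✅ Spider: 10+ pages, concurrency needed, resume capability, proxy rotation
- ✅ Direct: 1-5 pages, quick extraction, simple flow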

Basic Spider Pattern

```python
from scrapling.spiders import Spider, Response


class ProductSpider(Spider):
    name = "products"
    start_urls = ["https://example.com/products"]
    concurrent_requests = 10
    download_delay = 1.0

    async def parse(self, response: Response):
        for product in response.css(".product"):
            yield {
                "name": product.css("h2::text").get(),
                "price": product.css(".price::text").get(),
                "url": response.url,
            }

        # Follow pagination
        next_page = response.css(".next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page)


# Run with resume capability
result = ProductSpider(crawldir="./crawl_data").start()
result.items.to_jsonl("products.jsonl")
```

Advanced: Multi-Session Spider

```python
from scrapling.spiders import Spider, Request, Response
from scrapling.fetchers import FetcherSession, AsyncStealthySession


class MultiSessionSpider(Spider):
    name = "multi"
    start_urls = ["https://example.com/"]

    def configure_sessions(self, manager):
        # Fast HTTP session for ordinary pages
        manager.add("fast", FetcherSession(impersonate="chrome"))
        # Stealth browser session, created lazily on first use
        manager.add("stealth", AsyncStealthySession(headless=True), lazy=True)

    async def parse(self, response: Response):
        for link in response.css("a::attr(href)").getall():
            if "/protected/" in link:
                yield Request(link, sid="stealth")
            else:
                yield Request(link, sid="fast")
```

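Spider Features

- Pause/Resume: crawldir parameter saves checkpoints
- Streaming: async for item in spider.stream() for real-time processing
- Auto-retry: Configurable retry on blocked requests
- Export: Built-in to_json(), to_jsonl()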

CLI & Interactive Shell

Terminal Extraction (No Code)

```bash
# Extract as markdown
scrapling extract get https://example.com content.md

# Extract specific elements
scrapling extract get https://example.com content.txt \
  --css-selector ".article" \
  --impersonate chrome

# Stealth mode
scrapling extract stealthy-fetch https://protected.com content.md \
  --no-headless \
  --solve-cloudflare
```

Interactive Shell

```bash
scrapling shell

# In the shell:
>>> page = Fetcher.get("https://example.com")
>>> page.css("h1::text").get()
>>> page.find_all("div", class_="item")
```

Parser API (Beyond CSS/XPath)

BeautifulSoup-Style Methods

```python
import re

# Find by attributes
page.find_all("div", {"class": "product", "data-id": True})
page.find_all("div", class_="product", id=re.compile(r"item-\d+"))

# Text search
page.find_by_text("Add to Cart", tag="button")
page.find_by_regex(r"\$\d+\.\d{2}")

# Similarity
similar = first.find_similar()  # find visually/structurally similar elements
below = first.below_elements()  # elements below this one in the DOM
```

Auto-Generated Selectors

```python
# Get a robust selector for any element
element = page.css(".product")[0]
selector = element.auto_css_selector()  # returns a stable CSS path
xpath = element.auto_xpath()
```

Proxy Rotation

```python
from scrapling.fetchers import FetcherSession
from scrapling.spiders import ProxyRotator

# Cyclic rotation
rotator = ProxyRotator(
    [
        "http://proxy1:8080",
        "http://proxy2:8080",
        "http://user:pass@proxy3:8080",
    ],
    strategy="cyclic",
)

# Use with any session
with FetcherSession(proxy=rotator.next()) as session:
    page = session.get("https://example.com")
```
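For intuition about what a cyclic strategy does, here is a standard-library sketch that hands out proxies in round-robin order. `CyclicRotator` is a hypothetical class written for this example and is not the implementation behind `ProxyRotator`.

```python
from itertools import cycle


class CyclicRotator:
    """Round-robin over a fixed proxy list, wrapping around at the end."""

    def __init__(self, proxies: list[str]):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._pool = cycle(proxies)

    def next(self) -> str:
        # Each call returns the next proxy in order.
        return next(self._pool)


rotator = CyclicRotator(["http://proxy1:8080", "http://proxy2:8080"])
picked = [rotator.next() for _ in range(5)]
print(picked)
```

Round-robin spreads requests evenly across the pool, which is usually what is wanted when no proxy is known to be better than another.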

Common Recipes

Pagination Patterns

```python
# Page numbers
for page_num in range(1, 11):
    url = f"https://example.com/products?page={page_num}"
    ...

# Next-page button
while next_page := response.css(".next a::attr(href)").get():
    yield response.follow(next
```
