OpenClaw Ultra Scraping

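Powered by MyClaw.ai, the AI personal assistant platform that gives every user a full server with complete code control. Part of the MyClaw open skills ecosystem.

Handles everything from single-page extraction to large-scale concurrent crawls, with anti-bot bypass.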

Setup

Run once before first use:

```bash
bash scripts/setup.sh
```

This installs Scrapling and all browser dependencies into /opt/scrapling-venv.

Quick Start: CLI Script

The bundled scripts/scrape.py provides a unified CLI:

```bash
PYTHON=/opt/scrapling-venv/bin/python3

# Simple fetch (JSON output)
$PYTHON scripts/scrape.py fetch https://example.com --css '.content'

# Extract text
$PYTHON scripts/scrape.py extract https://example.com --css 'h1'

# Stealth mode (bypass Cloudflare)
$PYTHON scripts/scrape.py fetch https://protected-site.com --stealth --solve-cloudflare --css '.data'

# Dynamic mode (full browser rendering)
$PYTHON scripts/scrape.py fetch https://spa-site.com --dynamic --css '.product'

# Extract links
$PYTHON scripts/scrape.py links https://example.com --filter '\.pdf$'

# Multi-page crawl
$PYTHON scripts/scrape.py crawl https://example.com --depth 2 --concurrency 10 --css '.item' -o results.json

# Output formats: json, jsonl, csv, text, markdown, html
$PYTHON scripts/scrape.py fetch https://example.com -f markdown -o page.md
```
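When the CLI is driven from another Python program, it can help to assemble the argument list in one place. The sketch below is a minimal illustration that uses only the `fetch` flags shown above. The helper names `build_fetch_command` and `run_fetch` are hypothetical. It also assumes the script accepts flags in any order and prints its result to stdout when `-o` is omitted, neither of which is confirmed here.

```python
import subprocess
from typing import List, Optional

PYTHON = "/opt/scrapling-venv/bin/python3"  # venv interpreter from the Setup section
SCRIPT = "scripts/scrape.py"                # bundled CLI script


def build_fetch_command(
    url: str,
    css: Optional[str] = None,
    stealth: bool = False,
    solve_cloudflare: bool = False,
    dynamic: bool = False,
    fmt: Optional[str] = None,
    output: Optional[str] = None,
) -> List[str]:
    """Assemble an argv list for `scrape.py fetch` (hypothetical helper)."""
    cmd = [PYTHON, SCRIPT, "fetch", url]
    if stealth:
        cmd.append("--stealth")
    if solve_cloudflare:
        cmd.append("--solve-cloudflare")
    if dynamic:
        cmd.append("--dynamic")
    if css:
        cmd += ["--css", css]
    if fmt:
        cmd += ["-f", fmt]
    if output:
        cmd += ["-o", output]
    return cmd


def run_fetch(**kwargs) -> str:
    """Run the command and return stdout (assumed to carry the result).

    Raises subprocess.CalledProcessError if the script exits non-zero.
    """
    result = subprocess.run(
        build_fetch_command(**kwargs), capture_output=True, text=True, check=True
    )
    return result.stdout


# Building the argv list has no side effects, so it is safe to inspect first.
print(build_fetch_command("https://example.com", css=".content"))
```

Passing a list to `subprocess.run` avoids shell quoting issues with selectors such as `.item > a`, which would otherwise need escaping on the command line.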

Quick Start: Python

For complex tasks, write Python directly against the venv:

```python
#!/opt/scrapling-venv/bin/python3
from scrapling.fetchers import Fetcher, StealthyFetcher

# Simple HTTP request
page = Fetcher.get("https://example.com", impersonate="chrome")
titles = page.css("h1::text").getall()

# Bypass Cloudflare
page = StealthyFetcher.fetch("https://protected.com", headless=True, solve_cloudflare=True)
data = page.css(".product").getall()
```

Fetcher Selection Guide

Scenario	Fetcher	Flag
Normal sites, fast scraping	Fetcher	(default)
JS-rendered SPAs
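One way to apply a selection guide like this in code is to start with the cheapest fetcher and escalate only when a response looks blocked. The following is a sketch under stated assumptions, not part of the skill: `fetch_with_escalation` and `BLOCKED_STATUSES` are hypothetical names, the set of "blocked" status codes is a guess, and the page object is assumed to expose a `status` attribute. Stub fetchers stand in for the real ones so the example runs without network access.

```python
from types import SimpleNamespace
from typing import Any, Callable, Iterable, Tuple

# Status codes treated as "blocked"; this set is an assumption, not a Scrapling rule.
BLOCKED_STATUSES = frozenset({403, 429, 503})


def fetch_with_escalation(
    url: str, strategies: Iterable[Tuple[str, Callable[[str], Any]]]
) -> Tuple[str, Any]:
    """Try each (name, fetch) pair in order; return the first unblocked page.

    Each fetch callable takes a URL and returns an object that is assumed
    to carry the HTTP status code in a `status` attribute.
    """
    last_error = None
    for name, fetch in strategies:
        try:
            page = fetch(url)
        except Exception as exc:  # one failing strategy should not stop the escalation
            last_error = exc
            continue
        # A page without a `status` attribute is treated as not blocked.
        if getattr(page, "status", None) not in BLOCKED_STATUSES:
            return name, page
    raise RuntimeError(f"all strategies failed or were blocked for {url}") from last_error


# Stub fetchers so the sketch runs without Scrapling or a network connection.
def plain(url: str) -> SimpleNamespace:
    return SimpleNamespace(status=403, body="")


def stealth(url: str) -> SimpleNamespace:
    return SimpleNamespace(status=200, body="<h1>ok</h1>")


name, page = fetch_with_escalation(
    "https://example.com", [("plain", plain), ("stealth", stealth)]
)
print(name, page.status)  # prints: stealth 200
```

In real use the stubs would be replaced by wrappers around the calls from the Python quick start, for example `lambda u: Fetcher.get(u, impersonate="chrome")` followed by `lambda u: StealthyFetcher.fetch(u, headless=True, solve_cloudflare=True)`.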

Selector Cheat Sheet

```python
page.css(".class")                     # CSS selector
page.css(".class::text").getall()      # Text extraction
page.xpath('//div[@id="main"]')        # XPath
page.find_all("div", class_="item")    # BS4-style
page.find_by_text("keyword")           # Text search
page.css(".item", adaptive=True)       # Adaptive (survives page redesigns)
```

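Advanced Features

- Adaptive tracking: use auto_save=True on the first run and adaptive=True on later runs, so elements are still found after a site redesign
- Proxy rotation: pass proxy="http://host:port" or use ProxyRotator
- Sessions: FetcherSession, StealthySession, and DynamicSession persist cookies and state
- Spider framework: Scrapy-like concurrent crawling with pause/resume
- Async support: all fetchers have async variants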
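To make the proxy rotation item concrete, here is a minimal round-robin sketch built on the standard library. It is an illustration of the general idea only: `RoundRobinProxies` is a hypothetical class and does not reflect how Scrapling's ProxyRotator is implemented or configured. The proxy addresses are placeholders.

```python
from itertools import cycle
from typing import List, Set


class RoundRobinProxies:
    """Hypothetical helper: hands out proxies in order, skipping ones marked bad."""

    def __init__(self, proxies: List[str]) -> None:
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._proxies = list(proxies)
        self._bad: Set[str] = set()
        self._cycle = cycle(self._proxies)

    def mark_bad(self, proxy: str) -> None:
        """Exclude a proxy from future rotation, e.g. after repeated failures."""
        self._bad.add(proxy)

    def get(self) -> str:
        """Return the next healthy proxy in round-robin order."""
        # Inspect each proxy at most once per call so this never loops forever.
        for _ in range(len(self._proxies)):
            proxy = next(self._cycle)
            if proxy not in self._bad:
                return proxy
        raise RuntimeError("no healthy proxies left")


pool = RoundRobinProxies(
    ["http://10.0.0.1:8080", "http://10.0.0.2:8080", "http://10.0.0.3:8080"]
)
first = pool.get()
pool.mark_bad("http://10.0.0.2:8080")
second = pool.get()  # skips the proxy that was just marked bad
third = pool.get()
print(first, second, third)
# prints: http://10.0.0.1:8080 http://10.0.0.3:8080 http://10.0.0.1:8080
```

The value returned by `pool.get()` has the `http://host:port` form, so it could be passed as the `proxy=` argument described in the list above.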

For full API details, see references/api-reference.md
