OpenClaw Ultra Scraping
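Powered by MyClaw.ai, the AI personal assistant platform that gives every user a full server with complete code control. Part of the MyClaw open skills ecosystem.
Handles everything from single-page extraction to large-scale concurrent crawls, with anti-bot bypass.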
Setup
Run once before first use:
```bash
bash scripts/setup.sh
```
This installs Scrapling and all browser dependencies into `/opt/scrapling-venv`.
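Before running anything else, it can help to confirm that the setup step actually produced an interpreter. The following is a minimal sketch that assumes the venv path stated above; `venv_ready` is a hypothetical helper name and is not part of the bundled scripts.

```python
import os

# Path taken from the setup note; adjust if the venv lives elsewhere.
VENV_PYTHON = "/opt/scrapling-venv/bin/python3"


def venv_ready(python_path=VENV_PYTHON):
    """Return True if the interpreter exists and is executable (hypothetical helper)."""
    return os.path.isfile(python_path) and os.access(python_path, os.X_OK)


if __name__ == "__main__":
    if venv_ready():
        print("Scrapling venv found")
    else:
        print("venv missing; run: bash scripts/setup.sh")
```

This only checks that the interpreter file is present; it does not verify that Scrapling or the browser dependencies were installed correctly.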
Quick Start: CLI Script
The bundled `scripts/scrape.py` provides a unified CLI:
```bash
PYTHON=/opt/scrapling-venv/bin/python3

# Simple fetch (JSON output)
$PYTHON scripts/scrape.py fetch https://example.com --css ".content"

# Extract text
$PYTHON scripts/scrape.py extract https://example.com --css "h1"

# Stealth mode (bypass Cloudflare)
$PYTHON scripts/scrape.py fetch https://protected-site.com --stealth --solve-cloudflare --css ".data"

# Dynamic mode (full browser rendering)
$PYTHON scripts/scrape.py fetch https://spa-site.com --dynamic --css ".product"

# Extract links (quote the pattern so the shell leaves the backslash alone)
$PYTHON scripts/scrape.py links https://example.com --filter '\.pdf$'

# Multi-page crawl
$PYTHON scripts/scrape.py crawl https://example.com --depth 2 --concurrency 10 --css ".item" -o results.json

# Output formats: json, jsonl, csv, text, markdown, html
$PYTHON scripts/scrape.py fetch https://example.com -f markdown -o page.md
```
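When the CLI is driven from another program, assembling the argument list in code avoids shell-quoting mistakes. The sketch below uses only the subcommands and flags shown in the examples above; `build_command` and its parameter names are hypothetical, and it does not validate which flag combinations the script actually accepts.

```python
import shlex

PYTHON = "/opt/scrapling-venv/bin/python3"
SCRIPT = "scripts/scrape.py"


def build_command(subcommand, url, css=None, stealth=False,
                  solve_cloudflare=False, dynamic=False, fmt=None, output=None):
    """Assemble an argv list for scripts/scrape.py (hypothetical helper)."""
    argv = [PYTHON, SCRIPT, subcommand, url]
    if stealth:
        argv.append("--stealth")
    if solve_cloudflare:
        argv.append("--solve-cloudflare")
    if dynamic:
        argv.append("--dynamic")
    if css:
        argv += ["--css", css]
    if fmt:
        argv += ["-f", fmt]
    if output:
        argv += ["-o", output]
    return argv


cmd = build_command("fetch", "https://example.com",
                    css=".content", fmt="markdown", output="page.md")
# Print a copy-pasteable shell line; subprocess.run(cmd, check=True) would execute it.
print(shlex.join(cmd))
```

Passing the list straight to `subprocess.run` skips the shell entirely, so selectors and regex patterns reach the script unchanged.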
Quick Start: Python
For complex tasks, write Python directly using the venv:
```python
#!/opt/scrapling-venv/bin/python3
from scrapling.fetchers import Fetcher, StealthyFetcher

# Simple HTTP request
page = Fetcher.get("https://example.com", impersonate="chrome")
titles = page.css("h1::text").getall()

# Bypass Cloudflare
page = StealthyFetcher.fetch("https://protected.com", headless=True, solve_cloudflare=True)
data = page.css(".product").getall()
```
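Text pulled out with `getall()` often carries stray whitespace or repeated entries. A small post-processing step can tidy it; this is a sketch that works on any list of strings, and `clean_texts` is a hypothetical helper rather than part of Scrapling.

```python
def clean_texts(values):
    """Strip whitespace, drop empty strings, and de-duplicate while keeping order."""
    seen = set()
    cleaned = []
    for value in values:
        text = str(value).strip()
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned


# With a real page: titles = clean_texts(page.css("h1::text").getall())
sample = ["  Welcome ", "", "Welcome", "Pricing\n"]
print(clean_texts(sample))  # ['Welcome', 'Pricing']
```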
Fetcher Selection Guide
| Scenario | Fetcher | Flag |
|---|---|---|
| Normal sites, fast scraping | `Fetcher` | (default) |
| JS-rendered SPAs | `DynamicFetcher` | `--dynamic` |
| Cloudflare/anti-bot protected | `StealthyFetcher` | `--stealth` |
| Cloudflare Turnstile challenge | `StealthyFetcher` | `--stealth --solve-cloudflare` |
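The table can be encoded as a lookup so that calling code picks flags by scenario instead of hard-coding them. In this sketch the fetcher names and flags are copied from the table, while the scenario keys and the `flags_for` helper are hypothetical labels chosen for illustration.

```python
# Scenario keys are hypothetical; fetcher names and flags come from the table.
FETCHER_GUIDE = {
    "normal": ("Fetcher", []),
    "js_spa": ("DynamicFetcher", ["--dynamic"]),
    "anti_bot": ("StealthyFetcher", ["--stealth"]),
    "turnstile": ("StealthyFetcher", ["--stealth", "--solve-cloudflare"]),
}


def flags_for(scenario):
    """Return (fetcher_name, cli_flags) for a scenario key, or raise ValueError."""
    try:
        fetcher, flags = FETCHER_GUIDE[scenario]
    except KeyError:
        raise ValueError(f"unknown scenario: {scenario!r}") from None
    # Return a copy so callers cannot mutate the shared table.
    return fetcher, list(flags)


print(flags_for("turnstile"))
```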
Selector Cheat Sheet
```python
page.css(".class")                    # CSS selector
page.css(".class::text").getall()     # Text extraction
page.xpath('//div[@id="main"]')       # XPath
page.find_all("div", class_="item")   # BS4-style
page.find_by_text("keyword")          # Text search
page.css(".item", adaptive=True)      # Adaptive (survives page redesigns)
```
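Advanced Features
- Adaptive tracking: use `auto_save=True` on the first run and `adaptive=True` on later runs, so elements are still found after a site redesign
- Proxy rotation: pass `proxy="http://host:port"` or use `ProxyRotator`
- Sessions: `FetcherSession`, `StealthySession`, `DynamicSession` for cookie/state persistence
- Spider framework: Scrapy-like concurrent crawling with pause/resume
- Async support: all fetchers have async variants
For full API details, read `references/api-reference.md`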
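To illustrate the idea behind proxy rotation, here is a plain round-robin picker built on the standard library. It is a sketch of the general technique and not the interface of `ProxyRotator`, which is not shown here; `make_proxy_picker` and the host names are placeholders.

```python
from itertools import cycle


def make_proxy_picker(proxies):
    """Return a zero-argument function that yields proxies in round-robin order."""
    proxies = list(proxies)
    if not proxies:
        raise ValueError("at least one proxy is required")
    pool = cycle(proxies)
    return lambda: next(pool)


# Placeholder hosts for illustration only.
next_proxy = make_proxy_picker(["http://host1:8080", "http://host2:8080"])
for _ in range(3):
    print(next_proxy())
```

Each value returned by `next_proxy()` could then be supplied as the `proxy=` argument mentioned in the list above.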