Scrapling Skill
Use the scrapling CLI to scrape websites with adaptive parsing and anti-bot bypass.
When to Use
✅ USE this skill to:
- Scrape static or dynamic websites
- Bypass Cloudflare, captcha, or bot detection
- Extract structured data (HTML/JSON) from web pages
- Handle JavaScript-rendered content
- Get clean HTML without extra scripts/CSS
When NOT to Use
❌ DON'T use this skill for:
- Simple HTTP requests → use web_fetch
- Full browser automation → use the browser tool
- API-based data → use direct API calls
- Local file processing → use file tools
Setup
CODEBLOCK0
Common Commands
Basic Scrape
CODEBLOCK1
With Headers/Timeouts
CODEBLOCK2
Extract Specific Elements
CODEBLOCK3
JSON Output with Fields
CODEBLOCK4
MCP Integration
Scrapling supports MCP (Model Context Protocol) for AI agents:
CODEBLOCK5
Then configure your agent to use the scrape tool via MCP.
Examples
Scrape News Article
CODEBLOCK6
Extract Product Data
CODEBLOCK7
Handle Cloudflare
CODEBLOCK8
Notes
- Default timeout: 10 seconds
- Auto-detects the best output format (html/json/text)
- Handles dynamic content via a headless browser when needed
- Be rate-limit friendly: add delays between requests
JSON Output Format
CODEBLOCK9
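Scrapling Skill
Use the scrapling command-line tool to scrape websites with adaptive parsing and anti-bot bypass.
When to Use
✅ USE this skill to:
- Scrape static or dynamic websites
- Bypass Cloudflare, CAPTCHAs, or bot detection
- Extract structured data (HTML/JSON) from web pages
- Handle JavaScript-rendered content
- Get clean HTML without extra scripts/CSS
When NOT to Use
❌ DON'T use this skill for:
- Simple HTTP requests → use web_fetch
- Full browser automation → use the browser tool
- API-based data → use direct API calls
- Local file processing → use file tools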
Setup
bash
# Install the CLI
pipx install scrapling
scrapling --version
Common Commands
Basic Scrape
bash
# Get clean HTML
scrapling https://example.com -o html
# Get the JSON structure
scrapling https://example.com -o json
# Save to a file
scrapling https://example.com -o output.html
With Headers/Timeouts
bash
# Custom request header (quoted so the shell passes it as one argument)
scrapling https://example.com --headers "User-Agent: Mozilla/5.0"
# Timeout (seconds)
scrapling https://slow-site.com --timeout 30
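Slow sites may need more than one attempt. The sketch below retries a scrape with progressively longer timeouts, assuming the `--timeout` and `-o html` flags behave as in the commands above and that the CLI exits non-zero on failure. The function name `scrape_with_retry` and the 10/30/60-second ladder are illustrative choices, not part of the CLI.

```shell
# Sketch: retry a scrape with progressively longer timeouts.
# Assumption: `scrapling URL --timeout SECONDS -o html` exits non-zero on failure.
# scrape_with_retry is a hypothetical helper, not a scrapling command.
scrape_with_retry() {
    url=$1
    for t in 10 30 60; do
        # Stop at the first attempt that succeeds.
        if scrapling "$url" --timeout "$t" -o html; then
            return 0
        fi
        echo "attempt with ${t}s timeout failed for $url" >&2
    done
    # Every attempt failed.
    return 1
}
```

A call might look like `scrape_with_retry https://slow-site.com > page.html`; progress messages go to stderr, so the redirected file only receives the page output.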
Extract Specific Elements
bash
# XPath extraction
scrapling https://example.com -e "//div[@class='content']" -o html
# CSS selector
scrapling https://example.com -e "div.content" -o html
JSON Output with Fields
bash
# Extract the title and meta description
scrapling https://example.com \
  --fields title,meta_description \
  -o json
MCP Integration
Scrapling supports MCP (Model Context Protocol) for AI agents:
bash
# Start the MCP server
scrapling mcp start
Then configure your agent to use the scrape tool via MCP.
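How the agent is configured depends on the MCP client. As a rough sketch, several clients read a JSON file with an `mcpServers` map that names the command to launch. The helper below writes such a file, assuming your client uses that layout and that the server started by `scrapling mcp start` can be launched by the client this way. The helper name `write_mcp_config` and the server key `scrapling` are illustrative.

```shell
# Sketch: write an MCP client config that launches the Scrapling server.
# Assumption: the client reads an "mcpServers" map (check your agent's docs).
# write_mcp_config is a hypothetical helper; pass the path your client expects.
write_mcp_config() {
    config_path=$1
    cat > "$config_path" <<'EOF'
{
  "mcpServers": {
    "scrapling": {
      "command": "scrapling",
      "args": ["mcp", "start"]
    }
  }
}
EOF
}
```

The config path is left as a parameter because each client looks for its config file in a different location.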
Examples
Scrape News Article
bash
scrapling https://example.com/news/article-123 \
  --fields title,author,publish_date,content \
  -o json
Extract Product Data
bash
scrapling https://shop.example.com/products \
  -e "//div[@class='product']" \
  -o html
Handle Cloudflare
bash
# Scrapling automatically bypasses most protections
scrapling https://protected-site.com -o html
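Notes
- Default timeout: 10 seconds
- Auto-detects the best output format (html/json/text)
- Handles dynamic content via a headless browser when needed
- Be rate-limit friendly: add delays between requests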
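The last note can be sketched as a small loop that pauses between requests. This assumes the `scrapling URL -o html` form used elsewhere in this skill. The function name `scrape_politely` and the `SCRAPE_DELAY` variable are hypothetical, and the 2-second default is an arbitrary starting point.

```shell
# Sketch: scrape several URLs one at a time with a pause between requests.
# scrape_politely and SCRAPE_DELAY are illustrative names, not CLI features.
scrape_politely() {
    delay=${SCRAPE_DELAY:-2}   # seconds to wait between requests
    first=1
    for url in "$@"; do
        # Sleep before every request except the first one.
        if [ "$first" -eq 0 ]; then
            sleep "$delay"
        fi
        first=0
        # Report a failure but keep going with the remaining URLs.
        scrapling "$url" -o html || echo "failed: $url" >&2
    done
}
```

For example, `SCRAPE_DELAY=5 scrape_politely https://example.com/a https://example.com/b` would wait five seconds between the two requests.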
JSON Output Format
json
{
  "title": "Page title",
  "meta_description": "Description text",
  "content": "<clean HTML>",
  "links": ["http://...", "..."],
  "images": [{"src": "...", "alt": "..."}]
}
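Output in this shape can be post-processed with standard tools. The sketch below assumes `jq` is installed (it is a separate tool, not part of Scrapling) and that the JSON has the `title` and `links` keys shown above. The function name `summarize_page` is illustrative.

```shell
# Sketch: print the page title, then one link per line, from JSON on stdin.
# Assumption: jq is available and the input has "title" and "links" keys.
# summarize_page is a hypothetical helper.
summarize_page() {
    # -r prints raw strings; the trailing ? skips iteration if "links" is absent.
    jq -r '.title, .links[]?'
}
```

It could be combined with the JSON command from this skill as `scrapling https://example.com -o json | summarize_page`.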