website-scraper-pro网页抓取工具

Run a local script to scrape a single web page into clean markdown or deterministic JSON with Crawl4AI. Use when: user needs direct page retrieval from a URL, JS-aware single-page scraping, or deterministic query-focused narrowing without internal AI processing. Invoke by reading this SKILL.md then running: uv run /root/.openclaw/workspace/skills/website-scraper-pro/src/main.py

作者: admin | 来源: ClawHub

Skill: Website Scraper Pro

When to use

- The user wants the content of a single web page from a specific URL.
The user wants clean markdown extracted from an article, docs page, blog post, or landing page.
The user wants a JS-aware scrape for a page that depends on client-side rendering.
The user wants deterministic query-focused narrowing of one page without using an AI model inside the skill.
The user wants structured JSON output with markdown, title, links, and metadata.

When NOT to use

- The user wants a broad web search across multiple sources.
The user wants a site-wide crawl, recursive crawl, or multi-page extraction workflow.
The user wants AI summarization, synthesis, or answer generation inside the scraper itself.
The user wants authenticated browser automation or interactive form submission.

Commands

Scrape a page to markdown

CODEBLOCK0

Scrape a JS-heavy page

CODEBLOCK1

Scrape a page and narrow by query

CODEBLOCK2

Return deterministic JSON

CODEBLOCK3

Examples

CODEBLOCK4

Output

- Default output is clean markdown for a single page.
INLINECODE0 keeps the output deterministic and non-LLM.
INLINECODE1 returns deterministic JSON with fields such as title, url, markdown, links, and metadata when available.

Notes

- This v1 does not use AI models internally. It is a deterministic retrieval tool only.
The skill is single-page only. It does not do deep crawling, site maps, schema extraction, or RAG.
INLINECODE7 reads the inline # /// script dependency block in main.py and installs crawl4ai in an isolated environment.
If browser setup is missing, run one-time setup commands such as:

- uv run --with crawl4ai crawl4ai-setup - uv run --with crawl4ai python -m playwright install chromium

- Do NOT use web search for this workflow when a direct URL is available.
Call uv run src/main.py directly as shown above.

技能：Website Scraper Pro

使用场景

- 用户需要从特定URL获取单个网页的内容
用户希望从文章、文档页面、博客文章或落地页提取干净的Markdown格式内容
用户需要对依赖客户端渲染的页面进行支持JavaScript的抓取
用户希望在技能内部不使用AI模型的情况下，对单个页面进行确定性的查询聚焦筛选
用户需要包含Markdown、标题、链接和元数据的结构化JSON输出

禁止使用场景

- 用户需要进行跨多个来源的广泛网络搜索
用户需要进行全站爬取、递归爬取或多页面提取工作流
用户希望在抓取器内部进行AI摘要、综合或答案生成
用户需要进行经过身份验证的浏览器自动化或交互式表单提交

命令

将页面抓取为Markdown格式

bash
uv run /root/.openclaw/workspace/skills/website-scraper-pro/src/main.py

抓取JavaScript密集型页面

bash
uv run /root/.openclaw/workspace/skills/website-scraper-pro/src/main.py --js

抓取页面并按查询条件筛选

bash
uv run /root/.openclaw/workspace/skills/website-scraper-pro/src/main.py --query <文本>

返回确定性JSON

bash
uv run /root/.openclaw/workspace/skills/website-scraper-pro/src/main.py --format json

示例

bash

默认Markdown抓取

uv run /root/.openclaw/workspace/skills/website-scraper-pro/src/main.py https://example.com

支持JavaScript的抓取

uv run /root/.openclaw/workspace/skills/website-scraper-pro/src/main.py https://example.com --js

查询聚焦检索

uv run /root/.openclaw/workspace/skills/website-scraper-pro/src/main.py https://example.com --query 文档示例

JSON输出

uv run /root/.openclaw/workspace/skills/website-scraper-pro/src/main.py https://example.com --format json

输出

- 默认输出为单个页面的干净Markdown格式
--query 参数保持输出确定性和非LLM特性
--format json 返回确定性JSON，包含 title、url、markdown、links 和 metadata 等字段（如可用）

注意事项

- 此v1版本内部不使用AI模型，仅作为确定性检索工具
该技能仅支持单页面操作，不进行深度爬取、站点地图、模式提取或RAG
uv run 读取 main.py 中的内联 # /// script 依赖块，并在隔离环境中安装 crawl4ai
如果缺少浏览器设置，请运行一次性设置命令，例如：

- uv run --with crawl4ai crawl4ai-setup - uv run --with crawl4ai python -m playwright install chromium

- 当直接提供URL时，请勿对此工作流使用网络搜索
直接按上述方式调用 uv run src/main.py

website-scraper-pro网页抓取工具

website-scraper-pro

Skill: Website Scraper Pro

When to use

When NOT to use