Web Fetcher
Smart web content fetcher for Claude Code. Automatically detects platform and uses the best strategy to fetch articles or download videos.
Quick Start
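```bash
# Fetch an article
python3 {SKILL_DIR}/fetcher.py URL -o ~/docs/

# Download a video
python3 {SKILL_DIR}/fetcher.py https://b23.tv/xxx -o ~/videos/

# Batch fetch from a file
python3 {SKILL_DIR}/fetcher.py --urls-file urls.txt -o ~/docs/
```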
Install Dependencies
Install only what you need — dependencies are checked at runtime:
| Dependency | Purpose | Install |
|---|---|---|
| scrapling | Article fetching (HTTP + browser) | `pip install scrapling` |
| yt-dlp | Video download | `pip install yt-dlp` |
| camoufox | Anti-detection browser (Xiaohongshu, Weibo) | `pip install camoufox && python3 -m camoufox fetch` |
| html2text | HTML to Markdown conversion | `pip install html2text` |
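A runtime dependency check of the kind described above could look roughly like this. It is a minimal sketch, not the skill's actual `check_dependency`: the helper name `missing_dependencies` and the `INSTALL_HINTS` mapping are hypothetical, and it assumes each package is importable under the module name shown (yt-dlp installs as `yt_dlp`).

```python
import importlib.util

# Hypothetical mapping from importable module name to the install command
# listed in the table above. Not taken from the skill's source.
INSTALL_HINTS = {
    "scrapling": "pip install scrapling",
    "yt_dlp": "pip install yt-dlp",
    "camoufox": "pip install camoufox && python3 -m camoufox fetch",
    "html2text": "pip install html2text",
}


def missing_dependencies(modules):
    """Return {module: install hint} for top-level modules that cannot be found."""
    missing = {}
    for name in modules:
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(name) is None:
            missing[name] = INSTALL_HINTS.get(name, f"pip install {name}")
    return missing
```

Checking only the modules a given route needs keeps the "install only what you need" promise: an article fetch never has to complain about a missing video downloader.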
Smart Routing
The fetcher automatically detects the platform from the URL:
| Platform | Method | Notes |
|---|---|---|
| mp.weixin.qq.com | scrapling | Extracts `data-src` images, handles SVG placeholders |
| *.feishu.cn | Virtual scroll | Collects all blocks via scrolling, downloads images with cookies |
| zhuanlan.zhihu.com | scrapling | `.Post-RichText` selector |
| www.zhihu.com | scrapling | `.RichContent` selector |
| www.toutiao.com | scrapling | Handles `toutiaoimg.com` base64 placeholders |
| www.xiaohongshu.com | camoufox | Anti-bot protection requires stealth browser |
| www.weibo.com | camoufox | Anti-bot protection requires stealth browser |
| bilibili.com / b23.tv | yt-dlp | Video download, supports quality selection |
| youtube.com / youtu.be | yt-dlp | Video download |
| douyin.com | yt-dlp | Video download |
| Unknown URLs | scrapling | Generic fetch with fallback tiers |
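The routing table can be read as a hostname lookup with a generic default. The sketch below illustrates that idea only; the `ROUTES` list and `pick_method` are hypothetical names, and the real `lib.router.route` returns a richer config (selector, post-processing) that is not modelled here. Method names follow the `--method` values of the CLI.

```python
from urllib.parse import urlparse

# Hypothetical route table mirroring the rows above (domain suffix, method).
ROUTES = [
    ("mp.weixin.qq.com", "scrapling"),
    ("feishu.cn", "feishu"),
    ("zhuanlan.zhihu.com", "scrapling"),
    ("www.zhihu.com", "scrapling"),
    ("www.toutiao.com", "scrapling"),
    ("www.xiaohongshu.com", "camoufox"),
    ("www.weibo.com", "camoufox"),
    ("bilibili.com", "ytdlp"),
    ("b23.tv", "ytdlp"),
    ("youtube.com", "ytdlp"),
    ("youtu.be", "ytdlp"),
    ("douyin.com", "ytdlp"),
]


def pick_method(url):
    """Return the fetch method for a URL, defaulting to the generic scrapling path."""
    host = (urlparse(url).hostname or "").lower()
    for domain, method in ROUTES:
        # Match the domain itself or any subdomain of it.
        if host == domain or host.endswith("." + domain):
            return method
    return "scrapling"
```

Matching on a dot-prefixed suffix rather than a plain substring avoids treating a lookalike host such as `notbilibili.com` as Bilibili.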
CLI Reference
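```
python3 {SKILL_DIR}/fetcher.py [URL] [OPTIONS]

Arguments:
  url                        URL to fetch

Options:
  -o, --output DIR           Output directory (default: current directory)
  -q, --quality QUALITY      Video quality, e.g. 1080, 720 (default: 1080)
  --method METHOD            Force a method: scrapling, camoufox, ytdlp, feishu
  --selector CSS             Force a CSS selector for content extraction
  --urls-file FILE           File with URLs (one per line, # for comments)
  --audio-only               Extract audio only (video download)
  --no-images                Skip image download (articles)
  --cookies-browser BROWSER  Browser to take cookies from (e.g. chrome, firefox)
```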
Platform Notes
WeChat (mp.weixin.qq.com)
- Images use the `data-src` attribute with `mmbiz.qpic.cn` URLs
- Visible `<img>` tags contain SVG placeholders (lazy loading)
- Image download requires the `Referer: https://mp.weixin.qq.com/` header
- Scrapling GET usually works; no browser needed
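The WeChat image handling above can be sketched with the standard library alone. This is an illustration under assumptions, not the skill's own post-processing: `DataSrcCollector` and `wechat_image_requests` are hypothetical names, and the function only prepares (URL, headers) pairs without downloading anything.

```python
from html.parser import HTMLParser

WECHAT_REFERER = "https://mp.weixin.qq.com/"


class DataSrcCollector(HTMLParser):
    """Collect real image URLs from data-src, ignoring placeholder src values."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        data_src = dict(attrs).get("data-src")
        if data_src:  # <img> tags without data-src are skipped
            self.urls.append(data_src)


def wechat_image_requests(html):
    """Return (url, headers) pairs; each request carries the required Referer."""
    collector = DataSrcCollector()
    collector.feed(html)
    return [(url, {"Referer": WECHAT_REFERER}) for url in collector.urls]
```

The returned headers can be passed to whichever HTTP client performs the download.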
Feishu (*.feishu.cn)
- Uses virtual scroll: content blocks are rendered on demand
- The fetcher scrolls through the entire document, collecting `[data-block-id]` elements
- Images require an authenticated fetch (cookies) and are downloaded via the browser's fetch API
- May show "Unable to print" artifacts, which are auto-cleaned
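The scroll-and-collect loop can be sketched independently of any browser. The example below assumes two hypothetical callables, `read_visible` (returns the currently rendered `(block_id, text)` pairs) and `scroll_down` (advances the viewport); in the real fetcher these would be backed by a browser page. It also assumes that a pass with no new block ids means the end of the document, which is a simplification.

```python
def collect_blocks(read_visible, scroll_down, max_steps=500):
    """Scroll until a pass yields no new block ids; return blocks in first-seen order.

    read_visible() -> iterable of (block_id, text) currently rendered
    scroll_down()  -> advances the viewport by one step
    """
    blocks = {}  # dicts preserve insertion order, so document order is kept
    for _ in range(max_steps):
        new_count = 0
        for block_id, text in read_visible():
            if block_id not in blocks:
                blocks[block_id] = text
                new_count += 1
        if new_count == 0:
            break  # nothing new rendered: assume the end was reached
        scroll_down()
    return list(blocks.items())
```

Keying on the block id is what makes overlapping viewports safe: a block seen in two consecutive passes is stored once.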
Bilibili
- Short links (`b23.tv`) are auto-resolved
- For premium/member content, use `--cookies-browser chrome`
- Default quality is 1080p, adjustable with `-q`
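One plausible way to turn the quality and cookie options into a yt-dlp invocation is shown below. How the skill actually builds its command is not documented here, so treat `ytdlp_format` and `ytdlp_command` as hypothetical helpers; the yt-dlp options used (`-f`, `-P`, `--cookies-from-browser`) and the `height<=` format filter are standard yt-dlp features.

```python
def ytdlp_format(quality=1080, audio_only=False):
    """Build a yt-dlp format selector capped at the requested height."""
    if audio_only:
        return "bestaudio/best"
    # Prefer separate video+audio streams; fall back to a single combined file.
    return f"bestvideo[height<={quality}]+bestaudio/best[height<={quality}]"


def ytdlp_command(url, output_dir, quality=1080, cookies_browser=None):
    """Return the argument list for a yt-dlp call (nothing is executed here)."""
    cmd = ["yt-dlp", "-f", ytdlp_format(quality), "-P", output_dir]
    if cookies_browser:
        cmd += ["--cookies-from-browser", cookies_browser]
    cmd.append(url)
    return cmd
```

The list can be handed to `subprocess.run(cmd, check=True)` once yt-dlp is installed.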
Troubleshooting
| Problem | Solution |
|---|---|
| `scrapling not found` | `pip install scrapling` |
| `yt-dlp not found` | `pip install yt-dlp` |
| Article content too short | Try `--method camoufox` for JS-heavy pages |
| Feishu returns login page | The doc may require authentication |
| Bilibili 403 | Use `--cookies-browser chrome` |
| Image download fails | Check network; WeChat images need the Referer header (auto-handled) |
Manual Usage
When the CLI doesn't fit your needs, use the modules directly:
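```python
from lib.router import route, check_dependency
from lib.article import fetch_article
from lib.video import fetch_video
from lib.feishu import fetch_feishu

url = "https://mp.weixin.qq.com/s/xxx"

# Route a URL
r = route(url)
# {"type": "article", "method": "scrapling", "selector": "#js_content", "post": "wx_images"}

# Fetch an article
fetch_article(url, output_dir="/tmp/out", route_config=r)

# Download a video
fetch_video(url, output_dir="/tmp/out", quality=720)

# Fetch a Feishu document
fetch_feishu(url, output_dir="/tmp/out")
```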
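Skill name: web-fetcher
Description:
Web Fetcher
Smart web content fetcher for Claude Code. Automatically detects the platform and uses the best strategy to fetch articles or download videos.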
Quick Start
```bash
# Fetch an article
python3 {SKILL_DIR}/fetcher.py URL -o ~/docs/

# Download a video
python3 {SKILL_DIR}/fetcher.py https://b23.tv/xxx -o ~/videos/

# Batch fetch from a file
python3 {SKILL_DIR}/fetcher.py --urls-file urls.txt -o ~/docs/
```
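Install Dependencies
Install only what you need; dependencies are checked at runtime:
| Dependency | Purpose | Install |
|---|---|---|
| scrapling | Article fetching (HTTP + browser) | `pip install scrapling` |
| yt-dlp | Video download | `pip install yt-dlp` |
| camoufox | Anti-detection browser (Xiaohongshu, Weibo) | `pip install camoufox && python3 -m camoufox fetch` |
| html2text | HTML to Markdown conversion | `pip install html2text` |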
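Smart Routing
The fetcher automatically detects the platform from the URL:
| Platform | Method | Notes |
|---|---|---|
| mp.weixin.qq.com | scrapling | Extracts `data-src` images, handles SVG placeholders |
| *.feishu.cn | Virtual scroll | Collects all blocks via scrolling, downloads images with cookies |
| zhuanlan.zhihu.com | scrapling | `.Post-RichText` selector |
| www.zhihu.com | scrapling | `.RichContent` selector |
| www.toutiao.com | scrapling | Handles `toutiaoimg.com` base64 placeholders |
| www.xiaohongshu.com | camoufox | Anti-bot protection requires stealth browser |
| www.weibo.com | camoufox | Anti-bot protection requires stealth browser |
| bilibili.com / b23.tv | yt-dlp | Video download, supports quality selection |
| youtube.com / youtu.be | yt-dlp | Video download |
| douyin.com | yt-dlp | Video download |
| Unknown URLs | scrapling | Generic fetch with fallback tiers |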
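CLI Reference
```
python3 {SKILL_DIR}/fetcher.py [URL] [OPTIONS]

Arguments:
  url                        URL to fetch

Options:
  -o, --output DIR           Output directory (default: current directory)
  -q, --quality QUALITY      Video quality, e.g. 1080, 720 (default: 1080)
  --method METHOD            Force a method: scrapling, camoufox, ytdlp, feishu
  --selector CSS             Force a CSS selector for content extraction
  --urls-file FILE           File with URLs (one per line, # for comments)
  --audio-only               Extract audio only (video download)
  --no-images                Skip image download (articles)
  --cookies-browser BROWSER  Browser to take cookies from (e.g. chrome, firefox)
```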
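The `--urls-file` format (one URL per line, `#` for comments) is simple enough to sketch. The parser below is an illustration only; `read_urls` is a hypothetical name, and it assumes that comments occupy a whole line, since a `#` inside a URL is a legitimate fragment marker.

```python
def read_urls(lines):
    """Return URLs from an iterable of lines, skipping blanks and # comment lines."""
    urls = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        urls.append(line)  # a '#' later in the line is kept as a URL fragment
    return urls
```

Because it accepts any iterable of strings, it can be used directly on an open file: `with open("urls.txt", encoding="utf-8") as f: urls = read_urls(f)`.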
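Platform Notes
WeChat (mp.weixin.qq.com)
- Images use the `data-src` attribute with `mmbiz.qpic.cn` URLs
- Visible `<img>` tags contain SVG placeholders (lazy loading)
- Image download requires the `Referer: https://mp.weixin.qq.com/` header
- Scrapling GET usually works; no browser needed
Feishu (*.feishu.cn)
- Uses virtual scroll: content blocks are rendered on demand
- The fetcher scrolls through the entire document, collecting `[data-block-id]` elements
- Images require an authenticated fetch (cookies) and are downloaded via the browser's fetch API
- May show "Unable to print" artifacts, which are auto-cleaned
Bilibili
- Short links (`b23.tv`) are auto-resolved
- For premium/member content, use `--cookies-browser chrome`
- Default quality is 1080p, adjustable with `-q`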
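Troubleshooting
| Problem | Solution |
|---|---|
| `scrapling not found` | `pip install scrapling` |
| `yt-dlp not found` | `pip install yt-dlp` |
| Article content too short | Try `--method camoufox` for JS-heavy pages |
| Feishu returns login page | The doc may require authentication |
| Bilibili 403 | Use `--cookies-browser chrome` |
| Image download fails | Check network; WeChat images need the Referer header (auto-handled) |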
Manual Usage
When the CLI doesn't fit your needs, use the modules directly:
```python
from lib.router import route, check_dependency
from lib.article import fetch_article
from lib.video import fetch_video
from lib.feishu import fetch_feishu

url = "https://mp.weixin.qq.com/s/xxx"

# Route a URL
r = route(url)
# {"type": "article", "method": "scrapling", "selector": "#js_content", "post": "wx_images"}

# Fetch an article
fetch_article(url, output_dir="/tmp/out", route_config=r)

# Download a video
fetch_video(url, output_dir="/tmp/out", quality=720)

# Fetch a Feishu document
fetch_feishu(url, output_dir="/tmp/out")
```