URL to Markdown
Fetches any URL via baoyu-fetch CLI (Chrome CDP + site-specific adapters) and converts it to clean markdown.
CLI Setup
Important: The CLI source is vendored in the scripts/vendor/baoyu-fetch/ subdirectory of this skill.
Agent Execution Instructions:
- 1. Determine this SKILL.md file's directory path as INLINECODE2
- CLI entry point = INLINECODE3
- Resolve
${BUN_X} runtime: if bun installed → bun; if npx available → npx -y bun; else suggest installing bun - INLINECODE9 = INLINECODE10
- Replace all
${READER} in this document with the resolved value
Preferences (EXTEND.md)
Check EXTEND.md existence (priority order):
CODEBLOCK0
CODEBLOCK1
| Path | Location |
|---|
| INLINECODE12 | Project directory |
| INLINECODE13 |
User home |
| Result | Action |
|---|
| Found | Read, parse, apply settings |
| Not found |
MUST run first-time setup (see below) — do NOT silently create defaults |
EXTEND.md Supports: Download media by default | Default output directory
First-Time Setup (BLOCKING)
CRITICAL: When EXTEND.md is not found, you MUST use AskUserQuestion to ask the user for their preferences before creating EXTEND.md. NEVER create EXTEND.md with defaults without asking. This is a BLOCKING operation — do NOT proceed with any conversion until setup is complete.
Use AskUserQuestion with ALL questions in ONE call:
Question 1 — header: "Media", question: "How to handle images and videos in pages?"
- - "Ask each time (Recommended)" — After saving markdown, ask whether to download media
- "Always download" — Always download media to local imgs/ and videos/ directories
- "Never download" — Keep original remote URLs in markdown
Question 2 — header: "Output", question: "Default output directory?"
- - "url-to-markdown (Recommended)" — Save to ./url-to-markdown/{domain}/{slug}.md
- (User may choose "Other" to type a custom path)
Question 3 — header: "Save", question: "Where to save preferences?"
- - "User (Recommended)" — ~/.baoyu-skills/ (all projects)
- "Project" — .baoyu-skills/ (this project only)
After user answers, create EXTEND.md at the chosen location, confirm "Preferences saved to [path]", then continue.
Full reference: references/config/first-time-setup.md
Supported Keys
| Key | Default | Values | Description |
|---|
| INLINECODE16 | INLINECODE17 | INLINECODE18 / 1 / INLINECODE20 | INLINECODE21 = prompt each time, 1 = always download, 0 = never |
| INLINECODE24 |
empty | path or empty | Default output directory (empty =
./url-to-markdown/) |
EXTEND.md → CLI mapping:
| EXTEND.md key | CLI argument | Notes |
|---|
| INLINECODE26 | INLINECODE27 | Requires --output to be set |
| INLINECODE29 |
Agent constructs
--output ./posts/{domain}/{slug}.md | Agent generates path, not a direct CLI flag |
Value priority:
- 1. CLI arguments (
--download-media, --output) - EXTEND.md
- Skill defaults
Features
- - Chrome CDP for full JavaScript rendering via
baoyu-fetch CLI - Site-specific adapters: X/Twitter, YouTube, Hacker News, generic (Defuddle)
- Automatic adapter selection based on URL, or force with INLINECODE34
- Interaction gate detection: Cloudflare, reCAPTCHA, hCAPTCHA, custom challenges
- Two capture modes: headless (default) or interactive with wait-for-interaction
- Clean markdown output with YAML front matter
- Structured JSON output available via INLINECODE35
- X/Twitter: extracts tweets, threads, and X Articles with media
- YouTube: transcript/caption extraction, chapters, cover images
- Hacker News: threaded comment parsing with proper nesting
- Generic: Defuddle extraction with Readability fallback
- Download images and videos to local directories
- Chrome profile persistence for authenticated sessions
- Debug artifact output for troubleshooting
Usage
CODEBLOCK2
Options
| Option | Description |
|---|
| INLINECODE36 | URL to fetch |
| INLINECODE37 |
Output file path (default: stdout) |
|
--format <type> | Output format:
markdown (default) or
json |
|
--json | Shorthand for
--format json |
|
--adapter <name> | Force adapter:
x,
youtube,
hn, or
generic (default: auto-detect) |
|
--headless | Force headless Chrome (no visible window) |
|
--wait-for <mode> | Interaction wait mode:
none (default),
interaction, or
force |
|
--wait-for-interaction | Alias for
--wait-for interaction |
|
--wait-for-login | Alias for
--wait-for interaction |
|
--timeout <ms> | Page load timeout (default: 30000) |
|
--interaction-timeout <ms> | Login/CAPTCHA wait timeout (default: 600000 = 10 min) |
|
--interaction-poll-interval <ms> | Poll interval for interaction checks (default: 1500) |
|
--download-media | Download images/videos to local
imgs/ and
videos/, rewrite markdown links. Requires
--output |
|
--media-dir <dir> | Base directory for downloaded media (default: same as
--output directory) |
|
--cdp-url <url> | Reuse existing Chrome DevTools Protocol endpoint |
|
--browser-path <path> | Custom Chrome/Chromium binary path |
|
--chrome-profile-dir <path> | Chrome user data directory (default:
BAOYU_CHROME_PROFILE_DIR env or
./baoyu-skills/chrome-profile) |
|
--debug-dir <dir> | Write debug artifacts (document.json, markdown.md, page.html, network.json) |
Capture Modes
| Mode | Behavior | Use When |
|---|
| Default | Headless Chrome, auto-extract on network idle | Public pages, static content |
| INLINECODE72 |
Explicit headless (same as default) | Clarify intent |
|
--wait-for interaction | Opens visible Chrome, auto-detects login/CAPTCHA gates, waits for them to clear, then continues | Login-required, CAPTCHA-protected |
|
--wait-for force | Opens visible Chrome, auto-detects OR accepts Enter keypress to continue | Complex flows, lazy loading, paywalls |
Interaction gate auto-detection:
- - Cloudflare Turnstile / "just a moment" pages
- Google reCAPTCHA
- hCaptcha
- Custom challenge / verification screens
Wait-for-interaction workflow:
- 1. Run with
--wait-for interaction → Chrome opens visibly - CLI auto-detects login/CAPTCHA gates
- User completes login or solves CAPTCHA in the browser
- CLI auto-detects gate cleared → captures page
- If
--wait-for force is used, user can also press Enter to trigger capture manually
Agent Quality Gate
CRITICAL: The agent must treat default headless capture as provisional. Some sites render differently in headless mode and can silently return low-quality content without causing the CLI to fail.
After every headless run, the agent MUST inspect the saved markdown output.
Quality checks the agent must perform
- 1. Confirm the markdown title matches the target page, not a generic site shell
- Confirm the body contains the expected article or page content, not just navigation, footer, or a generic error
- Watch for obvious failure signs:
-
Application error
-
This page could not be found
- Login, signup, subscribe, or verification shells
- Extremely short markdown for a page that should be long-form
- Raw framework payloads or mostly boilerplate content
- 4. If the result is low quality, incomplete, or clearly wrong, do not accept the run as successful just because the CLI exited with code 0
Tip: Use --format json to get structured output including status, login.state, and interaction fields for programmatic quality assessment. A "status": "needs_interaction" response means the page requires manual interaction.
Recovery workflow the agent must follow
- 1. First run headless (default) unless there is already a clear reason to use interaction mode
- Review markdown quality immediately after the run
- If the content is low quality or indicates login/CAPTCHA:
-
--wait-for interaction for auto-detected gates (login, CAPTCHA, Cloudflare)
-
--wait-for force when the page needs manual browsing, scroll loading, or complex interaction
- 4. If
--wait-for is used, tell the user exactly what to do:
- If login is required, ask them to sign in in the browser
- If CAPTCHA appears, ask them to solve it
- If the page needs time to load, ask them to wait until content is visible
- For
--wait-for force: tell them to press Enter when ready
- 5. If JSON output shows
"status": "needs_interaction", switch to --wait-for interaction automatically
Output Path Generation
The agent must construct the output file path since baoyu-fetch does not auto-generate paths.
Algorithm:
- 1. Determine base directory from EXTEND.md
default_output_dir or default INLINECODE92 - Extract domain from URL (e.g.,
example.com) - Generate slug from URL path or page title (kebab-case, 2-6 words)
- Construct:
{base_dir}/{domain}/{slug}/{slug}.md — each URL gets its own directory so media files stay isolated - Conflict resolution: append timestamp INLINECODE95
Pass the constructed path to --output. Media files (--download-media) are saved into subdirectories next to the markdown file, keeping each URL's assets self-contained.
Output Format
Markdown output to stdout (or file with --output) as clean markdown text.
JSON output (--format json) returns structured data including:
- -
adapter — which adapter handled the URL - INLINECODE101 —
"ok" or INLINECODE103 - INLINECODE104 — login state detection (
logged_in, logged_out, unknown) - INLINECODE108 — interaction gate details (kind, provider, prompt)
- INLINECODE109 — structured content (url, title, author, publishedAt, content blocks, metadata)
- INLINECODE110 — collected media assets with url, kind, role
- INLINECODE111 — converted markdown text
- INLINECODE112 — media download results (when
--download-media used)
When --download-media is enabled:
- - Images are saved to
imgs/ next to the output file (or in --media-dir) - Videos are saved to
videos/ next to the output file (or in --media-dir) - Markdown media links are rewritten to local relative paths
Built-in Adapters
| Adapter | URLs | Key Features |
|---|
| INLINECODE119 | x.com, twitter.com | Tweets, threads, X Articles, media, login detection |
| INLINECODE120 |
youtube.com, youtu.be | Transcript/captions, chapters, cover image, metadata |
|
hn | news.ycombinator.com | Threaded comments, story metadata, nested replies |
|
generic | Any URL (fallback) | Defuddle extraction, Readability fallback, auto-scroll, network idle detection |
Adapter is auto-selected based on URL. Use --adapter <name> to override.
Media Download Workflow
Based on download_media setting in EXTEND.md:
| Setting | Behavior |
|---|
| INLINECODE125 (always) | Run CLI with INLINECODE126 |
| INLINECODE127 (never) |
Run CLI with
--output <path> (no media download) |
|
ask (default) | Follow the ask-each-time flow below |
Ask-Each-Time Flow
- 1. Run CLI without
--download-media with --output <path> → markdown saved - Check saved markdown for remote media URLs (
https:// in image/video links) - If no remote media found → done, no prompt needed
- If remote media found → use
AskUserQuestion:
- header: "Media", question: "Download N images/videos to local files?"
- "Yes" — Download to local directories
- "No" — Keep remote URLs
- 5. If user confirms → run CLI again with
--download-media --output <same-path> (overwrites markdown with localized links)
Environment Variables
| Variable | Description |
|---|
| INLINECODE135 | Chrome user data directory (can also use --chrome-profile-dir) |
Troubleshooting: Chrome not found → use --browser-path. Timeout → increase --timeout. Login/CAPTCHA pages → use --wait-for interaction. Debug → use --debug-dir to inspect captured HTML and network logs.
YouTube Notes
- - YouTube adapter extracts transcripts/captions automatically when available
- Transcript format:
[MM:SS] Text segment with chapter headings - Transcript availability depends on YouTube exposing a caption track. Videos with captions disabled or restricted playback may produce description-only output
- Use
--wait-for force if the page needs time to finish loading player metadata
X/Twitter Notes
- - Extracts single tweets, threads, and X Articles
- Auto-detects login state; if logged out and content requires auth, JSON output will show INLINECODE143
- Use
--wait-for interaction for login-protected content
Hacker News Notes
- - Parses threaded comments with proper nesting and reply hierarchy
- Includes story metadata (title, URL, author, score, comment count)
- Shows comment deletion/dead status
Extension Support
Custom configurations via EXTEND.md. See Preferences section for paths and supported options.
URL 转 Markdown
通过 baoyu-fetch CLI(Chrome CDP + 站点特定适配器)获取任意 URL,并将其转换为干净的 Markdown 格式。
CLI 配置
重要提示:CLI 源代码已内置于本技能的 scripts/vendor/baoyu-fetch/ 子目录中。
代理执行说明:
- 1. 确定本 SKILL.md 文件所在的目录路径为 {baseDir}
- CLI 入口点 = {baseDir}/scripts/vendor/baoyu-fetch/src/cli.ts
- 解析 ${BUNX} 运行时:如果已安装 bun → 使用 bun;如果可用 npx → 使用 npx -y bun;否则建议安装 bun
- ${READER} = ${BUNX} {baseDir}/scripts/vendor/baoyu-fetch/src/cli.ts
- 将本文档中所有 ${READER} 替换为解析后的值
偏好设置(EXTEND.md)
检查 EXTEND.md 是否存在(优先级顺序):
bash
macOS、Linux、WSL、Git Bash
test -f .baoyu-skills/baoyu-url-to-markdown/EXTEND.md && echo project
test -f ${XDG
CONFIGHOME:-$HOME/.config}/baoyu-skills/baoyu-url-to-markdown/EXTEND.md && echo xdg
test -f $HOME/.baoyu-skills/baoyu-url-to-markdown/EXTEND.md && echo user
powershell
PowerShell(Windows)
if (Test-Path .baoyu-skills/baoyu-url-to-markdown/EXTEND.md) { project }
$xdg = if ($env:XDG
CONFIGHOME) { $env:XDG
CONFIGHOME } else { $HOME/.config }
if (Test-Path $xdg/baoyu-skills/baoyu-url-to-markdown/EXTEND.md) { xdg }
if (Test-Path $HOME/.baoyu-skills/baoyu-url-to-markdown/EXTEND.md) { user }
| 路径 | 位置 |
|---|
| .baoyu-skills/baoyu-url-to-markdown/EXTEND.md | 项目目录 |
| $HOME/.baoyu-skills/baoyu-url-to-markdown/EXTEND.md |
用户主目录 |
必须运行首次设置(见下文)—— 不要静默创建默认值 |
EXTEND.md 支持:默认下载媒体 | 默认输出目录
首次设置(阻塞操作)
关键提示:当未找到 EXTEND.md 时,你必须使用 AskUserQuestion 询问用户的偏好,然后再创建 EXTEND.md。切勿在未询问的情况下使用默认值创建 EXTEND.md。这是一个阻塞操作 —— 在设置完成之前,不要进行任何转换。
使用 AskUserQuestion 在一次调用中提出所有问题:
问题 1 — 标题:媒体,问题:如何处理页面中的图片和视频?
- - 每次都询问(推荐) — 保存 Markdown 后,询问是否下载媒体
- 始终下载 — 始终将媒体下载到本地 imgs/ 和 videos/ 目录
- 从不下载 — 在 Markdown 中保留原始远程 URL
问题 2 — 标题:输出,问题:默认输出目录?
- - url-to-markdown(推荐) — 保存到 ./url-to-markdown/{domain}/{slug}.md
- (用户可选择其他以输入自定义路径)
问题 3 — 标题:保存,问题:偏好设置保存位置?
- - 用户(推荐) — ~/.baoyu-skills/(所有项目)
- 项目 — .baoyu-skills/(仅此项目)
用户回答后,在所选位置创建 EXTEND.md,确认偏好设置已保存至 [路径],然后继续。
完整参考:references/config/first-time-setup.md
支持的键
| 键 | 默认值 | 值 | 描述 |
|---|
| downloadmedia | ask | ask / 1 / 0 | ask = 每次提示,1 = 始终下载,0 = 从不 |
| defaultoutput_dir |
空 | 路径或空 | 默认输出目录(空 = ./url-to-markdown/) |
EXTEND.md → CLI 映射:
| EXTEND.md 键 | CLI 参数 | 备注 |
|---|
| downloadmedia: 1 | --download-media | 需要同时设置 --output |
| defaultoutput_dir: ./posts/ |
代理构建 --output ./posts/{domain}/{slug}.md | 代理生成路径,非直接 CLI 标志 |
值优先级:
- 1. CLI 参数(--download-media、--output)
- EXTEND.md
- 技能默认值
功能特性
- - 通过 baoyu-fetch CLI 使用 Chrome CDP 实现完整的 JavaScript 渲染
- 站点特定适配器:X/Twitter、YouTube、Hacker News、通用(Defuddle)
- 基于 URL 自动选择适配器,或使用 --adapter 强制指定
- 交互门检测:Cloudflare、reCAPTCHA、hCAPTCHA、自定义挑战
- 两种捕获模式:无头模式(默认)或带等待交互的交互模式
- 带有 YAML 前置元数据的干净 Markdown 输出
- 通过 --format json 支持结构化 JSON 输出
- X/Twitter:提取推文、线程和带媒体的 X 文章
- YouTube:提取转录/字幕、章节、封面图片
- Hacker News:带正确嵌套的线程化评论解析
- 通用:Defuddle 提取,附带 Readability 回退
- 将图片和视频下载到本地目录
- Chrome 配置文件持久化,支持已认证会话
- 调试工件输出,便于故障排查
使用方法
bash
默认:无头捕获,Markdown 输出到标准输出
${READER}
保存到文件
${READER} --output article.md
保存并下载媒体
${READER} --output article.md --download-media
无头模式(显式)
${READER} --headless --output article.md
等待交互(登录/CAPTCHA)— 自动检测并继续
${READER} --wait-for interaction --output article.md
等待交互 — 手动控制(按 Enter 继续)
${READER} --wait-for force --output article.md
JSON 输出
${READER} --format json --output article.json
强制指定适配器
${READER} --adapter youtube --output transcript.md
连接到现有 Chrome
${READER} --cdp-url http://localhost:9222 --output article.md
调试工件
${READER} --output article.md --debug-dir ./debug/
选项
| 选项 | 描述 |
|---|
| <url> | 要获取的 URL |
| --output <path> |
输出文件路径(默认:标准输出) |
| --format | 输出格式:markdown(默认)或 json |
| --json | --format json 的简写 |
| --adapter | 强制适配器:x、youtube、hn 或 generic(默认:自动检测) |
| --headless | 强制无头 Chrome(无可见窗口) |
| --wait-for | 交互等待模式:none(默认)、interaction 或 force |
| --wait-for-interaction | --wait-for interaction 的别名 |
| --wait-for-login | --wait-for interaction 的别名 |
| --timeout | 页面加载超时(默认:30000) |
| --interaction-timeout | 登录/CAPTCHA 等待超时(默认:600000 = 10 分钟) |
| --interaction-poll-interval | 交互检查轮询间隔(默认:1500) |
| --download-media | 将图片/视频下载到本地 imgs/ 和 videos/,重写 Markdown 链接。需要