返回顶部
w

webpage-export

Export webpages into clean local TXT, DOCX, and PDF files with source metadata, fallback extraction logic, and browser-assisted recovery for difficult pages. Useful for archiving articles, policy pages, WeChat posts, official notices, and other webpages before downstream analysis or sharing.

作者: admin | 来源: ClawHub
源自
ClawHub
版本
V 1.0.1
安全检测
已通过
127
下载量
0
收藏
概述
安装方式
版本历史

webpage-export

# Webpage Export Use this skill to turn a webpage URL into local files that downstream agents can archive, send, or reference. ## Core workflow 1. Run `scripts/export_webpage.py <url>` to create a TXT snapshot first. 2. Treat TXT as the baseline extracted record. 3. Add `--docx` when the user wants a Word document. 4. Add `--pdf` when Chrome/Chromium is available and the user wants a PDF. 5. Keep the generated JSON metadata file; it records extraction quality, paths, warnings, and partial-failure status for downstream agents. 6. Save outputs to an explicit `--outdir` when the user provides one; otherwise let the script use its local default export folder under the current working directory. 7. For accuracy-sensitive work, keep original title, original URL, and extracted source metadata. ## Commands ### TXT only ```bash python3 scripts/export_webpage.py "<url>" ``` ### TXT + DOCX ```bash python3 scripts/export_webpage.py "<url>" --docx ``` ### TXT + PDF ```bash python3 scripts/export_webpage.py "<url>" --pdf ``` ### TXT + DOCX + PDF with explicit output folder ```bash python3 scripts/export_webpage.py "<url>" --docx --pdf --outdir ./exports/temp ``` ## Runtime requirements - Requires `python3`. - Requires `curl` for baseline webpage fetching. - PDF export requires Chrome or Chromium. - Browser-assisted fallback requires `node` and the `playwright` package. - DOCX export on macOS requires `textutil`. ## Safety and execution notes - This skill fetches arbitrary URLs and may use a headless browser for difficult pages. - Browser-assisted fallback executes page JavaScript and should be used only when needed. - Prefer explicit `--outdir` values for production or shared environments. ## What the script does - Fetch the page with `curl` - Extract title/source/publish-time when available - Try multiple body candidates before falling back to a full-page text snapshot - Score extraction quality and emit warnings for suspicious/partial results - Strip HTML into readable text for a TXT snapshot - Convert TXT to DOCX using `textutil` on macOS - Render webpage to PDF using Chrome/Chromium headless printing when available - Emit a JSON metadata file with status, paths, word count, quality, and warnings ## Format choice - Prefer **TXT** as the baseline extracted record. - Prefer **DOCX** when the user wants an editable or shareable document. - Prefer **PDF** when the user wants page-like rendering or easier direct viewing. - For important work, do not treat PDF as the only source of truth. ## Chrome/Chromium PDF path When the user wants PDF, prefer Chrome/Chromium headless printing because it preserves Chinese text and webpage layout better than ad-hoc PDF generation. Read `references/chrome-pdf-guide.md` when: - you need the exact Chrome PDF logic - PDF output is incomplete or suspicious - Chrome emits warnings and you need to judge whether the result is still usable - you need fallback decisions ## Accuracy and fallbacks Read `references/accuracy-and-fallbacks.md` when: - source accuracy matters - webpage metadata is incomplete - a field cannot be extracted cleanly - you need fallback behavior after a partial extraction ## Delivery decisions Read `references/delivery-rules.md` when: - deciding whether to deliver TXT, DOCX, PDF, or a combination - preparing files for downstream agents or user delivery - choosing archive placement under the local workspace ## Limitations - Some highly dynamic or anti-bot pages may extract only partially. - PDF depends on Chrome/Chromium being installed. - DOCX depends on macOS `textutil`. - If a page is blocked in lightweight fetch mode, use this skill's curl-based extraction path before giving up. ## Accuracy rule Accuracy is the top standard. Keep original title, original URL, and extracted source metadata. If any field is uncertain, mark it as missing instead of guessing.

标签

skill ai

通过对话安装

该技能支持在以下平台通过对话安装:

OpenClaw WorkBuddy QClaw Kimi Claude

方式一:安装 SkillHub 和技能

帮我安装 SkillHub 和 webpage-export-1776062521 技能

方式二:设置 SkillHub 为优先技能安装源

设置 SkillHub 为我的优先技能安装源,然后帮我安装 webpage-export-1776062521 技能

通过命令行安装

skillhub install webpage-export-1776062521

下载 Zip 包

⬇ 下载 webpage-export v1.0.1

文件大小: 10.87 KB | 发布时间: 2026-4-17 16:33

v1.0.1 最新 2026-4-17 16:33
- Updates default output folder behavior: if no --outdir is specified, outputs now default to a local export folder under the current working directory.
- Adds runtime requirements section clarifying dependencies such as python3, curl, Chrome/Chromium, node, and playwright for various export functions.
- Adds safety and execution notes, particularly regarding headless browser usage and best practices for production environments.
- Example commands and documentation reflect new output and requirements, replacing hardcoded paths with generic, workspace-relative locations.
- No functional code changes—documentation update for improved clarity and user guidance.

Archiver·手机版·闲社网·闲社论坛·羊毛社区· 多链控股集团有限公司 · 苏ICP备2025199260号-1

Powered by Discuz! X5.0   © 2024-2025 闲社网·线报更新论坛·羊毛分享社区·http://xianshe.com

p2p_official_large
返回顶部