
scraper

Scrape documents from Notion, DocSend, PDFs, and other sources into local PDF files. Use when the user needs to download, archive, or convert web documents to PDF format. Supports authentication flows for protected documents and session persistence via profiles. Returns local file paths to downloaded PDFs.

Author: admin | Source: ClawHub | Version: v0.0.1 | Security check: passed | Downloads: 2,030 | Favorites: 2

# docs-scraper

CLI tool that scrapes documents from various sources into local PDF files using browser automation.

## Installation

```bash
npm install -g docs-scraper
```

## Quick start

Scrape any document URL to PDF:

```bash
docs-scraper scrape https://example.com/document
```

Returns a local path: `~/.docs-scraper/output/1706123456-abc123.pdf`

## Basic scraping

**Scrape with daemon** (recommended, keeps the browser warm):

```bash
docs-scraper scrape <url>
```

**Scrape with a named profile** (for authenticated sites):

```bash
docs-scraper scrape <url> -p <profile-name>
```

**Scrape with pre-filled data** (e.g., email for DocSend):

```bash
docs-scraper scrape <url> -D email=user@example.com
```

**Direct mode** (single-shot, no daemon):

```bash
docs-scraper scrape <url> --no-daemon
```

## Authentication workflow

When a document requires authentication (login, email verification, passcode):

1. The initial scrape returns a job ID:

   ```bash
   docs-scraper scrape https://docsend.com/view/xxx
   # Output: Scrape blocked
   # Job ID: abc123
   ```

2. Retry with data:

   ```bash
   docs-scraper update abc123 -D email=user@example.com
   # or with a password
   docs-scraper update abc123 -D email=user@example.com -D password=1234
   ```

A scripted version of this retry flow is sketched after the supported-sources list below.

## Profile management

Profiles store session cookies for authenticated sites.

```bash
docs-scraper profiles list               # List saved profiles
docs-scraper profiles clear              # Clear all profiles
docs-scraper scrape <url> -p myprofile   # Use a profile
```

## Daemon management

The daemon keeps browser instances warm for faster scraping.

```bash
docs-scraper daemon status   # Check status
docs-scraper daemon start    # Start manually
docs-scraper daemon stop     # Stop daemon
```

Note: the daemon auto-starts when running scrape commands.

## Cleanup

PDFs are stored in `~/.docs-scraper/output/`. The daemon automatically cleans up files older than 1 hour. Manual cleanup:

```bash
docs-scraper cleanup                   # Delete all PDFs
docs-scraper cleanup --older-than 1h   # Delete PDFs older than 1 hour
```

## Job management

```bash
docs-scraper jobs list   # List blocked jobs awaiting auth
```

## Supported sources

- **Direct PDF links** - Downloads the PDF directly
- **Notion pages** - Exports the Notion page to PDF
- **DocSend documents** - Handles the DocSend viewer
- **LLM fallback** - Uses the Claude API for any other webpage
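The blocked-job retry flow from the authentication workflow can be scripted. Below is a minimal sketch, not from the official docs: it assumes the `Job ID:` output line shown above, and the URL and email are placeholders.

```bash
#!/usr/bin/env bash
# Sketch: scrape a URL and, if the scrape comes back blocked, retry once
# with an email. Assumes the "Job ID: <id>" output line shown above;
# adjust the parsing if the real output format differs.
url="https://docsend.com/view/xxx"   # placeholder URL
email="user@example.com"             # placeholder email

output="$(docs-scraper scrape "$url")"
printf '%s\n' "$output"

# Extract the job ID, if the scrape was blocked.
job_id="$(printf '%s\n' "$output" | awk '/Job ID:/ {print $3}')"

if [ -n "$job_id" ]; then
  docs-scraper update "$job_id" -D email="$email"
fi
```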
---

## Scraper Reference

Each scraper accepts specific `-D` data fields. Use the appropriate fields based on the URL type.

### DirectPdfScraper

**Handles:** URLs ending in `.pdf`

**Data fields:** None (downloads directly)

**Example:**

```bash
docs-scraper scrape https://example.com/document.pdf
```

---

### DocsendScraper

**Handles:** `docsend.com/view/*`, `docsend.com/v/*`, and subdomains (e.g., `org-a.docsend.com`)

**URL patterns:**

- Documents: `https://docsend.com/view/{id}` or `https://docsend.com/v/{id}`
- Folders: `https://docsend.com/view/s/{id}`
- Subdomains: `https://{subdomain}.docsend.com/view/{id}`

**Data fields:**

| Field | Type | Description |
|-------|------|-------------|
| `email` | email | Email address for document access |
| `password` | password | Passcode/password for protected documents |
| `name` | text | Your name (required for NDA-gated documents) |

**Examples:**

```bash
# Pre-fill email for DocSend
docs-scraper scrape https://docsend.com/view/abc123 -D email=user@example.com

# With password protection
docs-scraper scrape https://docsend.com/view/abc123 -D email=user@example.com -D password=secret123

# With NDA name requirement
docs-scraper scrape https://docsend.com/view/abc123 -D email=user@example.com -D name="John Doe"

# Retry a blocked job
docs-scraper update abc123 -D email=user@example.com -D password=secret123
```

**Notes:**

- DocSend may require any combination of email, password, and name
- Folders are scraped as a table-of-contents PDF with document links
- The scraper auto-checks NDA checkboxes when a name is provided

---

### NotionScraper

**Handles:** `notion.so/*`, `*.notion.site/*`

**Data fields:**

| Field | Type | Description |
|-------|------|-------------|
| `email` | email | Notion account email |
| `password` | password | Notion account password |

**Examples:**

```bash
# Public page (no auth needed)
docs-scraper scrape https://notion.so/Public-Page-abc123

# Private page with login
docs-scraper scrape https://notion.so/Private-Page-abc123 \
  -D email=user@example.com -D password=mypassword

# Custom domain
docs-scraper scrape https://docs.company.notion.site/Page-abc123
```

**Notes:**

- Public Notion pages don't require authentication
- Toggle blocks are automatically expanded before PDF generation
- Uses session profiles to persist login across scrapes

---

### LlmFallbackScraper

**Handles:** Any URL not matched by other scrapers (automatic fallback)

**Data fields:** Dynamic - determined by Claude analyzing the page

The LLM scraper uses Claude to analyze the page HTML and detect:

- Login forms (extracts field names dynamically)
- Cookie banners (auto-dismisses)
- Expandable content (auto-expands)
- CAPTCHAs (reports as blocked)
- Paywalls (reports as blocked)

**Common dynamic fields:**

| Field | Type | Description |
|-------|------|-------------|
| `email` | email | Login email (if detected) |
| `password` | password | Login password (if detected) |
| `username` | text | Username (if login uses a username) |

**Examples:**

```bash
# Generic webpage (no auth)
docs-scraper scrape https://example.com/article

# Webpage requiring login
docs-scraper scrape https://members.example.com/article \
  -D email=user@example.com -D password=secret

# When blocked, check the job for required fields
docs-scraper jobs list
# Then retry with the fields the scraper detected
docs-scraper update abc123 -D username=myuser -D password=secret
```

**Notes:**

- Requires the `ANTHROPIC_API_KEY` environment variable
- Field names are extracted from the page's actual form fields
- Limited to 2 login attempts before failing
- CAPTCHAs require manual intervention

---

## Data field summary

| Scraper | email | password | name | Other |
|---------|-------|----------|------|-------|
| DirectPdf | - | - | - | - |
| DocSend | ✓ | ✓ | ✓ | - |
| Notion | ✓ | ✓ | - | - |
| LLM Fallback | ✓\* | ✓\* | - | Dynamic\* |

\* Fields detected dynamically from page analysis

## Environment setup (optional)

Only needed for the LLM fallback scraper:

```bash
export ANTHROPIC_API_KEY=your_key
```

Optional browser settings:

```bash
export BROWSER_HEADLESS=true   # Set to false for debugging
```

## Common patterns

**Archive a Notion page:**

```bash
docs-scraper scrape https://notion.so/My-Page-abc123
```

**Download a protected DocSend:**

```bash
docs-scraper scrape https://docsend.com/view/xxx
# If blocked:
docs-scraper update <job-id> -D email=user@example.com -D password=1234
```

**Batch scraping with profiles** (a loop version is sketched below):

```bash
docs-scraper scrape https://site.com/doc1 -p mysite
docs-scraper scrape https://site.com/doc2 -p mysite
```

## Output

- **Success**: Local file path (e.g., `~/.docs-scraper/output/1706123456-abc123.pdf`)
- **Blocked**: Job ID + required credential types

## Troubleshooting

- **Timeout**: `docs-scraper daemon stop && docs-scraper daemon start`
- **Auth fails**: `docs-scraper jobs list` to check pending jobs
- **Disk full**: `docs-scraper cleanup` to remove old PDFs (a scheduled variant is sketched below)
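For larger batches, the profile pattern above extends naturally to a loop. A minimal sketch, assuming a plain-text `urls.txt` with one URL per line; the file name and the `mysite` profile are placeholders:

```bash
#!/usr/bin/env bash
# Sketch: scrape every URL in urls.txt with a shared session profile,
# so a login captured on the first document carries over to the rest.
# urls.txt and the "mysite" profile are placeholders.
while IFS= read -r url; do
  [ -z "$url" ] && continue            # skip blank lines
  docs-scraper scrape "$url" -p mysite
done < urls.txt
```

Because the daemon auto-starts on the first `scrape` call and keeps the browser warm, the remaining scrapes skip the browser startup cost.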
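Since the automatic hourly cleanup is handled by the daemon, a scheduled manual cleanup can be useful when the daemon is not kept running. A sketch crontab entry, assuming `docs-scraper` resolves on cron's PATH (use an absolute path otherwise):

```bash
# Sketch crontab entry: run the documented cleanup hourly, deleting PDFs
# older than 1 hour. Assumes docs-scraper is on cron's PATH.
0 * * * * docs-scraper cleanup --older-than 1h
```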

Tags

skill ai

Install via conversation

This skill can be installed via conversation on the following platforms:

OpenClaw WorkBuddy QClaw Kimi Claude

Method 1: Install SkillHub and the skill

Help me install SkillHub and the links-to-pdfs-1776371772 skill

Method 2: Set SkillHub as the preferred skill source

Set SkillHub as my preferred skill installation source, then help me install the links-to-pdfs-1776371772 skill

Install via command line

skillhub install links-to-pdfs-1776371772

Download Zip package

⬇ Download scraper v0.0.1

File size: 3.2 KB | Published: 2026-04-17 14:09

v0.0.1 (latest) · 2026-04-17 14:09
Initial public release of the links-to-pdfs skill.

- Scrapes documents from Notion, DocSend, direct PDFs, and other web sources into local PDF files.
- Supports authentication workflows and session persistence via profiles for protected documents.
- Includes a command-line interface with profile and job management, daemon for faster scrapes, and automatic cleanup.
- Provides fallback to LLM-based scraping for unsupported or dynamic websites.
- Returns local file paths to downloaded PDFs.
