
scraper

Scrape documents from Notion, DocSend, PDFs, and other sources into local PDF files. Use when the user needs to download, archive, or convert web documents to PDF format. Supports authentication flows for protected documents and session persistence via profiles. Returns local file paths to downloaded PDFs.

Author: admin | Source: ClawHub | Version: v0.0.1 | Security check: passed | Downloads: 2,030 | Favorites: 2

# docs-scraper

CLI tool that scrapes documents from various sources into local PDF files using browser automation.

## Installation

```bash
npm install -g docs-scraper
```

## Quick start

Scrape any document URL to PDF:

```bash
docs-scraper scrape https://example.com/document
```

Returns a local path: `~/.docs-scraper/output/1706123456-abc123.pdf`

## Basic scraping

**Scrape with daemon** (recommended, keeps the browser warm):

```bash
docs-scraper scrape <url>
```

**Scrape with a named profile** (for authenticated sites):

```bash
docs-scraper scrape <url> -p <profile-name>
```

**Scrape with pre-filled data** (e.g., email for DocSend):

```bash
docs-scraper scrape <url> -D email=user@example.com
```

**Direct mode** (single-shot, no daemon):

```bash
docs-scraper scrape <url> --no-daemon
```

## Authentication workflow

When a document requires authentication (login, email verification, passcode):

1. The initial scrape returns a job ID:

   ```bash
   docs-scraper scrape https://docsend.com/view/xxx
   # Output: Scrape blocked
   # Job ID: abc123
   ```

2. Retry with data:

   ```bash
   docs-scraper update abc123 -D email=user@example.com
   # or with a password
   docs-scraper update abc123 -D email=user@example.com -D password=1234
   ```

A scripted version of this retry flow is sketched after the supported-sources list below.

## Profile management

Profiles store session cookies for authenticated sites.

```bash
docs-scraper profiles list               # List saved profiles
docs-scraper profiles clear              # Clear all profiles
docs-scraper scrape <url> -p myprofile   # Use a profile
```

## Daemon management

The daemon keeps browser instances warm for faster scraping.

```bash
docs-scraper daemon status   # Check status
docs-scraper daemon start    # Start manually
docs-scraper daemon stop     # Stop daemon
```

Note: the daemon auto-starts when running scrape commands.

## Cleanup

PDFs are stored in `~/.docs-scraper/output/`. The daemon automatically cleans up files older than 1 hour. Manual cleanup:

```bash
docs-scraper cleanup                   # Delete all PDFs
docs-scraper cleanup --older-than 1h   # Delete PDFs older than 1 hour
```

## Job management

```bash
docs-scraper jobs list   # List blocked jobs awaiting auth
```

## Supported sources

- **Direct PDF links** - Downloads the PDF directly
- **Notion pages** - Exports the Notion page to PDF
- **DocSend documents** - Handles the DocSend viewer
- **LLM fallback** - Uses the Claude API for any other webpage
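The blocked-job retry flow from the authentication workflow can be scripted. Below is a minimal sketch, not from the official docs: it assumes the `Job ID:` output line shown above, and the URL and email are placeholders.

```bash
#!/usr/bin/env bash
# Sketch: scrape a URL and, if the scrape comes back blocked, retry once
# with an email. Assumes the "Job ID: <id>" output line shown above;
# adjust the parsing if the real output format differs.
url="https://docsend.com/view/xxx"   # placeholder URL
email="user@example.com"             # placeholder email

output="$(docs-scraper scrape "$url")"
printf '%s\n' "$output"

# Extract the job ID, if the scrape was blocked.
job_id="$(printf '%s\n' "$output" | awk '/Job ID:/ {print $3}')"

if [ -n "$job_id" ]; then
  docs-scraper update "$job_id" -D email="$email"
fi
```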
---

## Scraper Reference

Each scraper accepts specific `-D` data fields. Use the appropriate fields based on the URL type.

### DirectPdfScraper

**Handles:** URLs ending in `.pdf`

**Data fields:** None (downloads directly)

**Example:**

```bash
docs-scraper scrape https://example.com/document.pdf
```

---

### DocsendScraper

**Handles:** `docsend.com/view/*`, `docsend.com/v/*`, and subdomains (e.g., `org-a.docsend.com`)

**URL patterns:**

- Documents: `https://docsend.com/view/{id}` or `https://docsend.com/v/{id}`
- Folders: `https://docsend.com/view/s/{id}`
- Subdomains: `https://{subdomain}.docsend.com/view/{id}`

**Data fields:**

| Field | Type | Description |
|-------|------|-------------|
| `email` | email | Email address for document access |
| `password` | password | Passcode/password for protected documents |
| `name` | text | Your name (required for NDA-gated documents) |

**Examples:**

```bash
# Pre-fill email for DocSend
docs-scraper scrape https://docsend.com/view/abc123 -D email=user@example.com

# With password protection
docs-scraper scrape https://docsend.com/view/abc123 -D email=user@example.com -D password=secret123

# With NDA name requirement
docs-scraper scrape https://docsend.com/view/abc123 -D email=user@example.com -D name="John Doe"

# Retry a blocked job
docs-scraper update abc123 -D email=user@example.com -D password=secret123
```

**Notes:**

- DocSend may require any combination of email, password, and name
- Folders are scraped as a table-of-contents PDF with document links
- The scraper auto-checks NDA checkboxes when a name is provided

---

### NotionScraper

**Handles:** `notion.so/*`, `*.notion.site/*`

**Data fields:**

| Field | Type | Description |
|-------|------|-------------|
| `email` | email | Notion account email |
| `password` | password | Notion account password |

**Examples:**

```bash
# Public page (no auth needed)
docs-scraper scrape https://notion.so/Public-Page-abc123

# Private page with login
docs-scraper scrape https://notion.so/Private-Page-abc123 \
  -D email=user@example.com -D password=mypassword

# Custom domain
docs-scraper scrape https://docs.company.notion.site/Page-abc123
```

**Notes:**

- Public Notion pages don't require authentication
- Toggle blocks are automatically expanded before PDF generation
- Uses session profiles to persist login across scrapes

---

### LlmFallbackScraper

**Handles:** Any URL not matched by other scrapers (automatic fallback)

**Data fields:** Dynamic - determined by Claude analyzing the page

The LLM scraper uses Claude to analyze the page HTML and detect:

- Login forms (extracts field names dynamically)
- Cookie banners (auto-dismisses)
- Expandable content (auto-expands)
- CAPTCHAs (reports as blocked)
- Paywalls (reports as blocked)

**Common dynamic fields:**

| Field | Type | Description |
|-------|------|-------------|
| `email` | email | Login email (if detected) |
| `password` | password | Login password (if detected) |
| `username` | text | Username (if login uses a username) |

**Examples:**

```bash
# Generic webpage (no auth)
docs-scraper scrape https://example.com/article

# Webpage requiring login
docs-scraper scrape https://members.example.com/article \
  -D email=user@example.com -D password=secret

# When blocked, check the job for required fields
docs-scraper jobs list
# Then retry with the fields the scraper detected
docs-scraper update abc123 -D username=myuser -D password=secret
```

**Notes:**

- Requires the `ANTHROPIC_API_KEY` environment variable
- Field names are extracted from the page's actual form fields
- Limited to 2 login attempts before failing
- CAPTCHAs require manual intervention

---

## Data field summary

| Scraper | email | password | name | Other |
|---------|-------|----------|------|-------|
| DirectPdf | - | - | - | - |
| DocSend | ✓ | ✓ | ✓ | - |
| Notion | ✓ | ✓ | - | - |
| LLM Fallback | ✓\* | ✓\* | - | Dynamic\* |

\* Fields detected dynamically from page analysis

## Environment setup (optional)

Only needed for the LLM fallback scraper:

```bash
export ANTHROPIC_API_KEY=your_key
```

Optional browser settings:

```bash
export BROWSER_HEADLESS=true   # Set to false for debugging
```

## Common patterns

**Archive a Notion page:**

```bash
docs-scraper scrape https://notion.so/My-Page-abc123
```

**Download a protected DocSend:**

```bash
docs-scraper scrape https://docsend.com/view/xxx
# If blocked:
docs-scraper update <job-id> -D email=user@example.com -D password=1234
```

**Batch scraping with profiles** (a loop version is sketched below):

```bash
docs-scraper scrape https://site.com/doc1 -p mysite
docs-scraper scrape https://site.com/doc2 -p mysite
```

## Output

- **Success**: Local file path (e.g., `~/.docs-scraper/output/1706123456-abc123.pdf`)
- **Blocked**: Job ID + required credential types

## Troubleshooting

- **Timeout**: `docs-scraper daemon stop && docs-scraper daemon start`
- **Auth fails**: `docs-scraper jobs list` to check pending jobs
- **Disk full**: `docs-scraper cleanup` to remove old PDFs (a scheduled variant is sketched below)
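For larger batches, the profile pattern above extends naturally to a loop. A minimal sketch, assuming a plain-text `urls.txt` with one URL per line; the file name and the `mysite` profile are placeholders:

```bash
#!/usr/bin/env bash
# Sketch: scrape every URL in urls.txt with a shared session profile,
# so a login captured on the first document carries over to the rest.
# urls.txt and the "mysite" profile are placeholders.
while IFS= read -r url; do
  [ -z "$url" ] && continue            # skip blank lines
  docs-scraper scrape "$url" -p mysite
done < urls.txt
```

Because the daemon auto-starts on the first `scrape` call and keeps the browser warm, the remaining scrapes skip the browser startup cost.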
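Since the automatic hourly cleanup is handled by the daemon, a scheduled manual cleanup can be useful when the daemon is not kept running. A sketch crontab entry, assuming `docs-scraper` resolves on cron's PATH (use an absolute path otherwise):

```bash
# Sketch crontab entry: run the documented cleanup hourly, deleting PDFs
# older than 1 hour. Assumes docs-scraper is on cron's PATH.
0 * * * * docs-scraper cleanup --older-than 1h
```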

Tags

skill ai

Install via conversation

This skill can be installed via conversation on the following platforms:

OpenClaw WorkBuddy QClaw Kimi Claude

Method 1: Install SkillHub and the skill

Help me install SkillHub and the links-to-pdfs-1776371772 skill

Method 2: Set SkillHub as the preferred skill source

Set SkillHub as my preferred skill installation source, then help me install the links-to-pdfs-1776371772 skill

Install via command line

skillhub install links-to-pdfs-1776371772

Download Zip package

⬇ Download scraper v0.0.1

File size: 3.2 KB | Published: 2026-04-17 14:09

v0.0.1 (latest) · 2026-04-17 14:09
Initial public release of the links-to-pdfs skill.

- Scrapes documents from Notion, DocSend, direct PDFs, and other web sources into local PDF files.
- Supports authentication workflows and session persistence via profiles for protected documents.
- Includes a command-line interface with profile and job management, daemon for faster scrapes, and automatic cleanup.
- Provides fallback to LLM-based scraping for unsupported or dynamic websites.
- Returns local file paths to downloaded PDFs.
