Doc-Process — Document Intelligence Skill
Step 0 — Auto-Setup (run once on first use)
Before invoking any script for the first time in a session, check whether the script dependencies are available. If any are missing, run the setup script automatically — no prompting needed:
```bash
bash skills/doc-process/setup.sh
```
This installs all Python packages (pymupdf, Pillow, pytesseract, opencv-python-headless, numpy, img2pdf, pdfplumber, openai-whisper) and attempts to install system binaries (tesseract, ffmpeg) via brew or apt depending on the platform.
When to run Step 0:
- First time any script-assisted mode is used in a session
- After a fresh `clawhub install piyush-zinc/doc-process`
- If a script fails with `ModuleNotFoundError` or `ImportError`
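The dependency check described in Step 0 could look roughly like the sketch below. It is not the logic of `setup.sh`; the helper names are hypothetical, and the import names (`fitz` for pymupdf, `PIL` for Pillow, `cv2` for opencv-python-headless, `whisper` for openai-whisper) are the usual ones for those packages rather than something this document states.

```python
import importlib.util
import shutil

# pip package name -> import name (assumed mapping; the two often differ).
PYTHON_MODULES = {
    "pymupdf": "fitz",
    "Pillow": "PIL",
    "pytesseract": "pytesseract",
    "opencv-python-headless": "cv2",
    "numpy": "numpy",
    "img2pdf": "img2pdf",
    "pdfplumber": "pdfplumber",
    "openai-whisper": "whisper",
}
SYSTEM_BINARIES = ["tesseract", "ffmpeg"]


def missing_python_packages(modules=PYTHON_MODULES):
    """Return pip package names whose import module cannot be found."""
    return [pkg for pkg, mod in modules.items()
            if importlib.util.find_spec(mod) is None]


def missing_binaries(binaries=SYSTEM_BINARIES):
    """Return binaries that are not on PATH."""
    return [name for name in binaries if shutil.which(name) is None]


def setup_needed():
    """True when at least one Step 0 dependency is absent."""
    return bool(missing_python_packages() or missing_binaries())


if __name__ == "__main__":
    print("missing packages:", missing_python_packages())
    print("missing binaries:", missing_binaries())
```

If `setup_needed()` returns `True`, the setup script from Step 0 would be the next thing to run.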
To install Python packages only (no system packages):
```bash
bash skills/doc-process/setup.sh --light
```
Or install directly from the skill's requirements file:
```bash
pip install -r skills/doc-process/requirements.txt
```
Note: openai-whisper downloads its model (~140 MB) on first audio transcription — not at install time.
Overview
This skill handles all document-related tasks using Claude's native vision/language capabilities for reading and analysis, and Python scripts for file-output operations. Most modes require no installation — only the file-output scripts need third-party libraries.
How Features Are Implemented
| Feature | Implementation | External libraries |
|---|---|---|
| OCR / reading images | Claude built-in vision | None |
| MRZ decoding (passport/ID) | Claude reads MRZ visually, applies ICAO algorithm | None |
| PDF reading | Claude reads PDF text layer or visually | None |
| Form autofill | Claude reads form fields, outputs fill table | None |
| Contract analysis | Claude applies reference rule set | None |
| Receipt / invoice scanning | Claude reads image or PDF | None |
| Bank statement (PDF) | Claude reads PDF pages | None |
| Bank statement (CSV) | `statement_parser.py` (pure stdlib) | None |
| Expense logging | `expense_logger.py` (pure stdlib) | None |
| Bank report generation | `report_generator.py` (pure stdlib) | None |
| Resume / CV parsing | Claude reads document | None |
| Medical summarizer | Claude reads document | None |
| Legal redaction (display) | Claude marks up output | None |
| Legal redaction (file output) | `redactor.py` | `pymupdf` (PDF); `Pillow` + `pytesseract` (image) |
| Meeting minutes (text/PDF) | Claude reads document | None |
| Translation | Claude's multilingual capabilities | None |
| Document categorizer | Claude reads first 1–2 pages (with consent gate) | None |
| Timeline logging | `timeline_manager.py` (pure stdlib) | None |
| Table extraction (PDF) | `table_extractor.py` | `pdfplumber` |
| Audio transcription | `audio_transcriber.py` | `openai-whisper` + `ffmpeg` |
| Doc scan / perspective correction | `doc_scanner.py` | `opencv-python-headless`, `numpy`, `Pillow`; `img2pdf` optional |
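The MRZ row above refers to the ICAO algorithm. As an illustration, the check-digit rule from ICAO Doc 9303 (repeating weights 7, 3, 1, sum modulo 10) can be sketched as follows. The function names are hypothetical, and the sample values are taken from the specimen passport in the ICAO documentation, not from this skill.

```python
def mrz_char_value(ch):
    """Map one MRZ character to its numeric value (digits, A-Z = 10-35, '<' = 0)."""
    if ch in "0123456789":
        return int(ch)
    if "A" <= ch <= "Z":
        return ord(ch) - ord("A") + 10
    if ch == "<":
        return 0
    raise ValueError(f"invalid MRZ character: {ch!r}")


def mrz_check_digit(field):
    """Weighted sum with repeating weights 7, 3, 1, taken modulo 10."""
    weights = (7, 3, 1)
    total = sum(mrz_char_value(ch) * weights[i % 3] for i, ch in enumerate(field))
    return total % 10


def field_is_valid(field, check_char):
    """Compare the computed digit with the one printed in the MRZ."""
    return check_char.isdigit() and mrz_check_digit(field) == int(check_char)


print(mrz_check_digit("L898902C3"))   # 6 (document number)
print(mrz_check_digit("740812"))      # 2 (date of birth, YYMMDD)
print(field_is_valid("120415", "9"))  # True (expiry date)
```

A mismatch between the computed and printed digit is a sign that a character was misread, which is a reasonable point to mark the field `[UNREADABLE]` rather than guess.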
Dependencies & Installation
No installation required for core functionality
Reading, analysis, form filling, contract review, receipt scanning, bank statement analysis (PDF), resume parsing, ID scanning, medical summarising, redaction markup, meeting minutes, and translation all run on Claude's built-in capabilities.
Optional — install only for file-output scripts
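```bash
# PII redaction to PDF/image file (redactor.py)
pip install "pymupdf>=1.23"       # required for PDF redaction
pip install "Pillow>=10.0"        # required for image redaction
pip install "pytesseract>=0.3"    # required for image redaction (also needs: brew install tesseract)

# Document scanning / perspective correction (doc_scanner.py)
pip install "opencv-python-headless>=4.9" "numpy>=1.24" "Pillow>=10.0"
pip install "img2pdf>=0.5"        # optional, for PDF output; falls back to Pillow if missing

# Table extraction from PDF (table_extractor.py)
pip install "pdfplumber>=0.11"

# Audio transcription (audio_transcriber.py)
# Also needs the ffmpeg binary: brew install ffmpeg / apt install ffmpeg
pip install "openai-whisper>=20231117"
```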
All dependencies are also listed in requirements.txt at the repository root.
Binary dependencies
| Binary | Required by | Install |
|---|---|---|
| `tesseract` | `redactor.py` (image mode) | `brew install tesseract` / `apt install tesseract-ocr` |
| `ffmpeg` | `audio_transcriber.py` | `brew install ffmpeg` / `apt install ffmpeg` |
Network access
`openai-whisper` downloads model files (~140 MB) from OpenAI/HuggingFace servers on first run only. They are cached at `~/.cache/whisper/`. All other scripts are fully local after installation.
Script Reference
| Script | Dependencies | Purpose | Example |
|---|---|---|---|
| `redactor.py` | pymupdf; Pillow + pytesseract (image mode) | PII redaction to file (PDF/image/text) | `python scripts/redactor.py --file doc.pdf --mode full --log` |
| `doc_scanner.py` | opencv-python-headless, numpy, Pillow; img2pdf optional | Document scanning: edge detection, perspective correction, scan-quality output | `python scripts/doc_scanner.py --input photo.jpg --output scanned.png --mode bw` |
| `expense_logger.py` | None | Add/list/edit/delete expense entries in CSV | `python scripts/expense_logger.py add --date 2024-03-15 --merchant "Starbucks" --amount 13.12 --file expenses.csv` |
| `statement_parser.py` | None | Parse bank CSV export, categorize transactions | `python scripts/statement_parser.py --file statement.csv --output categorized.json` |
| `report_generator.py` | None | Format categorized JSON into a markdown report | `python scripts/report_generator.py --file categorized.json --type bank` |
| `timeline_manager.py` | None | Manage opt-in document processing timeline | `python scripts/timeline_manager.py show` |
| `audio_transcriber.py` | openai-whisper, ffmpeg | Transcribe audio files to text | `python scripts/audio_transcriber.py --file meeting.mp3 --output transcript.txt` |
| `table_extractor.py` | pdfplumber | Extract tables from PDFs to CSV or JSON | `python scripts/table_extractor.py --file document.pdf --output data.csv` |
All scripts import only what they declare. Scripts with no declared deps use Python stdlib only. You can verify any script: "show me the source of [script name]".
Script Import Verification
| Script | Stdlib imports | Third-party | Network |
|---|---|---|---|
| `timeline_manager.py` | argparse, json, sys, datetime, pathlib, uuid, collections | None | Never |
| `redactor.py` | argparse, re, sys, pathlib, dataclasses | pymupdf (PDF); Pillow + pytesseract (image) | Never |
| `doc_scanner.py` | argparse, json, sys, time, pathlib | opencv-python-headless, numpy, Pillow; img2pdf optional | Never |
| `expense_logger.py` | argparse, csv, json, sys, pathlib | None | Never |
| `statement_parser.py` | argparse, csv, json, re, sys, collections, datetime, pathlib | None | Never |
| `report_generator.py` | argparse, json, sys, collections, pathlib | None | Never |
| `utils.py` | re, unicodedata, datetime, pathlib | None | Never |
| `audio_transcriber.py` | argparse, sys, pathlib | openai-whisper | First-run model download only |
| `table_extractor.py` | argparse, csv, io, json, sys, pathlib | pdfplumber | Never |
Privacy & Data Handling
| Aspect | Policy |
|---|---|
| Document content | Read locally within this session only. Not stored, indexed, or transmitted. |
| Personal data for form autofill | Used only to complete the current form. Not written to any file. Not retained after session. |
| Timeline log | Opt-in only. Confirmed by user before any entry is written. Contains no raw document content — only category-level summaries. |
| Redacted output files | Written only to a path the user explicitly confirms. |
| Audio transcripts | Written to a local file the user specifies. Model download on first Whisper use only. |
| No telemetry | This skill has no analytics, usage reporting, or network calls beyond what is listed above. |
Step 1 — Identify the Mode
Explicit intent → go directly to the matching mode
| Mode | User intent signals | Typical file types |
|---|---|---|
| Document Categorizer | "process this", "what is this?", "analyze this", "help with this", no clear intent | Any |
| Form Autofill | fill, autofill, fill out, complete this form | PDF form, image, screenshot |
| Contract Analyzer | review, summarize, contract, agreement, risks, red flags, NDA, lease | PDF, text |
| Receipt Scanner | receipt, invoice, log expense, scan this bill | Photo, image, PDF |
| Bank Statement Analyzer | bank statement, transactions, subscriptions, categorize spending | PDF, CSV |
| Resume / CV Parser | parse resume, extract cv, what's on this resume, scan resume | PDF, image, text |
| ID & Passport Scanner | scan id, read passport, extract from id card, scan my passport | Photo, image, PDF |
| Medical Summarizer | lab report, blood test, prescription, discharge summary, medical results | PDF, image, text |
| Legal Redactor | redact, remove pii, anonymize, censor sensitive info | PDF, text, image |
| Meeting Minutes | meeting minutes, action items, summarize meeting, transcribe meeting | Text, PDF, image, audio |
| Table Extractor | extract table, table to csv, get data from pdf, table to json | PDF, image, text |
| Document Translator | translate this, translate to [language], document translation | Any |
| Document Timeline | show my timeline, document history, what have I processed, save timeline | — |
| Doc Scan | scan this photo, make this look scanned, correct perspective, dewarp, clean this photo, digitize this, straighten this | Photo, image |
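As a rough illustration of how the intent signals in the table map to modes, the sketch below scores a request against a subset of the keywords and falls back to the Document Categorizer when nothing matches. The function name, the scoring rule, and the whole-word matching are assumptions for the example; the skill itself relies on Claude's judgment rather than a keyword table.

```python
import re

# Subset of the signals from the table above, lowercased.
MODE_SIGNALS = {
    "Form Autofill": ["autofill", "fill out", "complete this form"],
    "Contract Analyzer": ["contract", "agreement", "red flags", "nda", "lease"],
    "Receipt Scanner": ["receipt", "invoice", "log expense", "scan this bill"],
    "Bank Statement Analyzer": ["bank statement", "transactions", "subscriptions"],
    "Legal Redactor": ["redact", "remove pii", "anonymize"],
    "Table Extractor": ["extract table", "table to csv", "table to json"],
    "Doc Scan": ["correct perspective", "dewarp", "straighten this"],
}
FALLBACK_MODE = "Document Categorizer"


def count_hits(text, signals):
    """Count signals that appear as whole words or phrases in the text."""
    return sum(
        1 for s in signals
        if re.search(r"\b" + re.escape(s) + r"\b", text)
    )


def identify_mode(request):
    """Return the mode with the most signal hits, or the fallback if none match."""
    text = request.lower()
    best_mode, best_hits = FALLBACK_MODE, 0
    for mode, signals in MODE_SIGNALS.items():
        hits = count_hits(text, signals)
        if hits > best_hits:
            best_mode, best_hits = mode, hits
    return best_mode


print(identify_mode("Please redact this PDF"))          # Legal Redactor
print(identify_mode("Review this NDA for red flags"))   # Contract Analyzer
print(identify_mode("help with this"))                  # Document Categorizer
```

Whole-word matching matters here: a plain substring test would find "lease" inside "please" and misroute the first request.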
Ambiguous intent → Document Categorizer (with consent gate)
If the user uploads a file without a clear mode signal, do not read it yet. Ask:
"I can classify this document automatically to suggest the best mode — that requires me to read the first 1–2 pages. Or you can choose directly:
| Option | Best for |
|---|---|
| Form Autofill | Forms with fill-in fields |
| Contract Analyzer | Agreements, NDAs, leases |
| Receipt Scanner | Receipts, invoices |
| Bank Statement Analyzer | Bank/credit card statements |
| Resume Parser | CVs, resumes |
| ID Scanner | Passports, IDs, driver's licenses |
| Medical Summarizer | Lab reports, prescriptions |
| Legal Redactor | Any document with PII to remove |
| Meeting Minutes | Notes or recordings |
| Table Extractor | Documents with data tables |
| Translator | Non-English documents |
| Doc Scan | Document photo needing perspective correction |
Shall I classify it, or which mode would you like?"
Only read the document after the user confirms.
Step 2 — Read the Document
Use the Read tool on the uploaded file. For images, read them visually. For PDFs over 10 pages, read in page ranges.
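Reading a long PDF in page ranges amounts to splitting the page count into consecutive chunks. A minimal sketch follows; the chunk size of 10 is an assumption (the text only says that PDFs over 10 pages are read in ranges), and `page_ranges` is a hypothetical helper.

```python
def page_ranges(total_pages, chunk_size=10):
    """Return 1-based inclusive (start, end) ranges covering all pages."""
    if total_pages < 0 or chunk_size < 1:
        raise ValueError("total_pages must be >= 0 and chunk_size >= 1")
    return [
        (start, min(start + chunk_size - 1, total_pages))
        for start in range(1, total_pages + 1, chunk_size)
    ]


print(page_ranges(23))  # [(1, 10), (11, 20), (21, 23)]
print(page_ranges(8))   # [(1, 8)]
```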
For audio files (Meeting Minutes mode only): confirm before running — this requires openai-whisper and downloads a model on first run:
"Transcribing this audio requires the openai-whisper library. On first use it downloads a model file (~140 MB). Is that OK?"
If yes:
CODEBLOCK4
If no: ask if the user can provide a text transcript.
For document photos (Doc Scan mode): read the image visually first to assess quality and detect the document type before running the scanner script.
Step 3 — Execute the Mode
Load and follow the matching reference file in full:
| Mode | Reference file |
|---|---|
| Document Categorizer | INLINECODE63 |
| Form Autofill | `references/form-autofill.md` |
| Contract Analyzer | `references/contract-analyzer.md` |
| Receipt Scanner | `references/receipt-scanner.md` |
| Bank Statement Analyzer | `references/bank-statement-analyzer.md` |
| Resume / CV Parser | `references/resume-parser.md` |
| ID & Passport Scanner | `references/id-scanner.md` |
| Medical Summarizer | `references/medical-summarizer.md` |
| Legal Redactor | `references/legal-redactor.md` |
| Meeting Minutes | `references/meeting-minutes.md` |
| Table Extractor | `references/table-extractor.md` |
| Document Translator | `references/document-translator.md` |
| Document Timeline | `references/document-timeline.md` |
| Doc Scan | `references/doc-scan.md` |
Step 4 — Redactor: PII Rule Coverage
The redactor.py script covers the following PII categories across 50+ rule types for global document types (bank statements, contracts, medical records, invoices, share-purchase agreements, government forms, and more).
Category 1 — Personal Identifiers (light, standard, and full modes)
| Rule | Examples |
|---|---|
| SSN (US) | 123-45-6789 |
| SIN (Canada) | 123-456-789 |
| UK National Insurance Number | AB 12 34 56 C |
| Australian TFN | 123 456 789 |
| Australian Medicare number | 1234 56789 1 |
| Indian Aadhaar | 1234 5678 9012 |
| Passport number | A12345678 |
| Driver's license | keyword-anchored |
| UK NHS number | 943 476 5919 |
| National / voter ID | keyword-anchored |
| Vehicle VIN | keyword-anchored 17-char code |
| NRIC (Singapore) | S1234567A |
| Medical record (MRN) | keyword-anchored |
| Indian PAN | AABCW6386P |
| Email address | any@domain.com |
| Phone number | all international formats; date/reference false-positives suppressed |
| Street address | BLK/BLOCK/FLAT/UNIT/APT prefix + number + street name + type (Street, Ave, Rd, Hill, Close, Quay, Park, etc.) |
| Unit / apartment number | #02-01, Unit 3B, Apt 4C, Flat 12 |
| P.O. Box | PO Box 1234 |
| US ZIP / CA postal | 10001, M5V 3A8 |
| UK postcode | SW1A 2AA |
| International 6-digit postal | Singapore 229572, Bangalore 560067 |
| IPv4 address | 192.168.1.1 |
| MAC address | AA:BB:CC:DD:EE:FF |
| Date of birth | keyword + numeric/month-name formats |
| Age | "Age: 34" |
| Labeled name (50+ field keywords) | Bill To, Shipper, Attention, Buyer, Seller, Patient, Employee, Plaintiff, Trustee, Shareholder, Director, Tenant, Lender, Beneficiary, etc. |
| Honorific prefix + name | Mr./Mrs./Ms./Dr./Prof./Rev./Hon./Mx. + name |
Category 2 — Financial Data (standard + full mode)
| Rule | Examples |
|---|---|
| Credit / debit card number | 4111 1111 1111 1111 |
| Card CVV | CVV: 123 |
| Card expiry | 03/26 |
| Bank account number | keyword-anchored |
| IBAN | IBAN country-code validated (GB, DE, FR, etc.) |
| ABA / routing number | "Routing No." and "ABA No." |
| UK Sort code | 20-00-00 |
| Australian BSB | 063-000 |
| Indian IFSC code | HDFC0000001 |
| SWIFT / BIC code | allows space in code (e.g. CHAS US33) |
| Salary / compensation | salary, CTC, gross/net pay, take-home, remuneration |
| Credit score | keyword-anchored |
| Loan / mortgage amount | keyword-anchored |
| Tax figures | AGI, taxable income, tax paid |
| Net worth / total assets | keyword-anchored |
| Cryptocurrency wallet | Bitcoin, Ethereum |
Category 3 — Sensitive / Protected (full mode only)
HIV/AIDS status, blood type, mental health diagnoses (expanded), reproductive health, substance use history, sexual orientation / gender identity, disability, criminal record, genetic information, immigration status, minor's name, attorney–client privilege, trade secrets.
Redaction modes
| Flag | Categories | Use case |
|---|---|---|
| `--mode light` | Cat 1 only | Sharing docs where financial details can remain |
| `--mode standard` | Cat 1 + 2 (default) | General privacy protection |
| `--mode full` | Cat 1 + 2 + 3 | Legal filings, healthcare, immigration, HR |
| `--custom REGEX` | Cat 0 + selected mode | Domain-specific or proprietary terms |
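The mode table above reduces to a small lookup from mode name to active rule categories. The sketch below is an illustration only: `active_categories` is a hypothetical helper, not a function in `redactor.py`, and treating custom patterns as category 0 follows the "Cat 0 + selected mode" wording in the table.

```python
MODE_CATEGORIES = {
    "light": {1},
    "standard": {1, 2},   # default
    "full": {1, 2, 3},
}


def active_categories(mode="standard", custom_patterns=()):
    """Return the set of rule categories applied for a mode.

    Custom regexes are modelled as category 0, added on top of the mode.
    """
    if mode not in MODE_CATEGORIES:
        raise ValueError(f"unknown mode: {mode!r}")
    categories = set(MODE_CATEGORIES[mode])
    if custom_patterns:
        categories.add(0)
    return categories


print(sorted(active_categories("light")))                        # [1]
print(sorted(active_categories("full")))                         # [1, 2, 3]
print(sorted(active_categories("standard", [r"PROJ-\d{4}"])))    # [0, 1, 2]
```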
How PDF redaction works
1. Word bounding boxes are extracted from the PDF layout engine.
2. PII is detected using a single-pass, non-overlapping regex engine.
3. Matched spans are mapped back to word bounding boxes.
4. PyMuPDF redaction annotations (solid black fill) are placed on the exact word rects.
5. INLINECODE82 burns the black fills in and removes the underlying text data from the content stream, so redacted text cannot be copy-pasted or extracted.
6. The file is saved incrementally: every non-redacted element (fonts, images, vector graphics, metadata) is left completely untouched.
7. The original file is never modified; output is always a separate copy.
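The detection and mapping steps above (single-pass matching, then mapping spans back to word boxes) can be sketched with the standard library alone. This is a simplified illustration, not the code in `redactor.py`: the three patterns are stand-ins for the much larger rule set, words are assumed to arrive as `(text, bbox)` pairs, and all function names are hypothetical.

```python
import re

# One alternation means one left-to-right scan; finditer never returns
# overlapping matches, and the first alternative that matches wins.
PII_PATTERN = re.compile(
    r"(?P<ssn>\b\d{3}-\d{2}-\d{4}\b)"
    r"|(?P<email>\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b)"
    r"|(?P<ipv4>\b(?:\d{1,3}\.){3}\d{1,3}\b)"
)


def find_pii(text):
    """Return (rule_name, start, end) for each non-overlapping match."""
    return [(m.lastgroup, m.start(), m.end()) for m in PII_PATTERN.finditer(text)]


def words_with_offsets(words):
    """Join words with single spaces and record each word's character span."""
    spans, parts, pos = [], [], 0
    for text, bbox in words:
        spans.append((pos, pos + len(text), bbox))
        parts.append(text)
        pos += len(text) + 1  # +1 for the joining space
    return " ".join(parts), spans


def boxes_to_redact(words):
    """Return the bounding boxes of every word touched by a PII match."""
    text, spans = words_with_offsets(words)
    hits = find_pii(text)
    return [
        bbox for start, end, bbox in spans
        if any(start < h_end and h_start < end for _, h_start, h_end in hits)
    ]


words = [
    ("SSN:", (0, 0, 30, 10)),
    ("123-45-6789", (35, 0, 100, 10)),
    ("mail", (0, 12, 25, 22)),
    ("a@b.com", (30, 12, 80, 22)),
]
print(boxes_to_redact(words))  # [(35, 0, 100, 10), (30, 12, 80, 22)]
```

The returned rectangles are what a PDF library would then cover with redaction annotations.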
Step 5 — Doc Scan: How It Works
The doc_scanner.py script converts a document photo into a professional scan in 7 steps:
1. Multi-strategy edge detection: tries three approaches in order: (A) Canny on greyscale; (B) morphological gradient; (C) colour/brightness threshold. Stops at first success.
2. Sub-pixel corner refinement: `cv2.cornerSubPix` refines the four corner points to sub-pixel accuracy for the most precise warp.
3. Perspective warp: a four-point transform using Lanczos interpolation flattens the document to a rectangle.
4. Shadow removal: per-channel background estimation plus normalisation removes cast shadows and uneven lighting without affecting text.
5. Scan-quality enhancement: mode-specific. BW = adaptive threshold (block size auto-scaled to resolution) + stroke repair + denoising; Gray = auto-levels + CLAHE + unsharp mask; Color = white-balance + CLAHE + sharpening.
6. Scanner border: an 8 px white border simulates the scanner bed edge.
7. DPI-tagged output: saved with embedded DPI metadata (default 300 DPI, print quality).
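The geometry behind the four-point transform in the steps above can be sketched without OpenCV: order the detected corners consistently, then size the output rectangle from the longest opposing edges. This shows a common way to do it, not necessarily what `doc_scanner.py` does, and the function names are hypothetical.

```python
import math


def order_corners(points):
    """Order four (x, y) points as top-left, top-right, bottom-right, bottom-left.

    Uses the usual heuristic: top-left has the smallest x + y, bottom-right
    the largest; top-right has the smallest y - x, bottom-left the largest.
    """
    if len(points) != 4:
        raise ValueError("exactly four corner points are required")
    top_left = min(points, key=lambda p: p[0] + p[1])
    bottom_right = max(points, key=lambda p: p[0] + p[1])
    top_right = min(points, key=lambda p: p[1] - p[0])
    bottom_left = max(points, key=lambda p: p[1] - p[0])
    return [top_left, top_right, bottom_right, bottom_left]


def warped_size(points):
    """Return (width, height) of the flattened page in whole pixels."""
    tl, tr, br, bl = order_corners(points)
    width = max(math.dist(tl, tr), math.dist(bl, br))
    height = max(math.dist(tl, bl), math.dist(tr, br))
    return round(width), round(height)


corners = [(210, 5), (0, 290), (10, 12), (220, 300)]  # arbitrary input order
print(order_corners(corners))  # [(10, 12), (210, 5), (220, 300), (0, 290)]
print(warped_size(corners))    # (220, 295)
```

With OpenCV installed, the ordered corners and target size would typically feed `cv2.getPerspectiveTransform` and `cv2.warpPerspective` (with `cv2.INTER_LANCZOS4` for Lanczos interpolation).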
When auto-detection fails
If the script reports "corners_detected": false:
1. Offer manual corner hints: ask the user approximately where the four corners of the document are.
2. Use `--no-warp` to at least apply enhancement without perspective correction.
3. Provide photography tips (see `references/doc-scan.md`, Step 8).
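Handling the failure case could look like the sketch below. It assumes the scanner's report is available as a JSON string; only the `corners_detected` key comes from the text above, and both the function name and the returned action labels are hypothetical.

```python
import json

FALLBACKS = [
    "ask for manual corner hints",
    "rerun with --no-warp",
    "share photography tips",
]


def next_actions(report_text):
    """Return fallback actions when corner detection failed, else an empty list."""
    report = json.loads(report_text)
    # A missing key is treated like a failed detection, to stay on the safe side.
    if report.get("corners_detected", False):
        return []
    return list(FALLBACKS)


print(next_actions('{"corners_detected": true}'))   # []
print(next_actions('{"corners_detected": false}'))
```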
Step 6 — Document Timeline (Opt-In)
Off by default. After completing the first document task in a session, ask once:
"Would you like me to keep a processing log for this session? It records document type, filename, and a category-level summary (no raw content, no personal data) to ~/.doc-process-timeline.json on your local machine. Entirely optional — yes or no."
- Yes → confirm "Timeline logging is on." Log current and subsequent documents. Announce each with "Logged to your timeline."
- No → confirm "No log will be kept." Do not run any timeline script. Do not ask again this session.
- No response / unsure → treat as No.
Summary rules (strictly enforced): the --summary argument must never contain names, ID numbers, dates of birth, addresses, account numbers, card numbers, medical values, or any data that could identify a person. Category-level descriptions only.
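A simple pre-check on the `--summary` text can catch the most obvious violations before anything is written. The sketch below is a heuristic, not the skill's enforcement mechanism: the patterns and the function name are assumptions, and regexes like these cannot recognise personal names, so the category-level rule still has to be applied when the summary is composed.

```python
import re

UNSAFE_PATTERNS = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),           # email-like strings
    re.compile(r"\d{4,}"),                              # long digit runs (IDs, accounts, cards, years)
    re.compile(r"\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b"),   # numeric dates
]


def summary_is_safe(summary):
    """Return False if the summary contains anything resembling an identifier."""
    return not any(p.search(summary) for p in UNSAFE_PATTERNS)


print(summary_is_safe("Bank statement, 3 months, spending categorized"))  # True
print(summary_is_safe("Passport no. A12345678"))                          # False
print(summary_is_safe("Lab report for j.doe@example.com"))                # False
```

The check is deliberately strict: it rejects any four-digit run, including years, because a false rejection costs only a rewording.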
Step 7 — Deliver Output
Present output in clean tables with section headers as specified in each reference file. Always end with an action prompt relevant to the mode. For Doc Scan, always offer to continue processing the scanned output.
General Principles
- Never hallucinate field values. Unknown values → `[MISSING]` or `[UNREADABLE]`.
- Flag risks conservatively: when in doubt, include it.
- Keep summaries scannable with tables and bullets.
- Do not echo sensitive data beyond what is necessary for the immediate task.
- Always include relevant disclaimers (medical, legal, privacy) where required by the reference guide.
- Timeline is opt-in per session. Never log without confirmed consent.
- Personal data for form autofill is session-only. Never write it to a file.
- Before running any script with third-party deps, run `bash skills/doc-process/setup.sh` automatically if deps are not yet installed (see Step 0). No need to ask: the setup script is safe and idempotent.
- Categorize before asking, but only after confirming the user wants auto-classification.
- For Doc Scan: always assess the image visually first; never process non-document images.