Doc-Process — Document Intelligence Skill

Step 0 — Auto-Setup (run once on first use)

Before invoking any script for the first time in a session, check whether the script dependencies are available. If any are missing, run the setup script automatically — no prompting needed:

    bash skills/doc-process/setup.sh

This installs all Python packages (pymupdf, Pillow, pytesseract, opencv-python-headless, numpy, img2pdf, pdfplumber, openai-whisper) and attempts to install system binaries (tesseract, ffmpeg) via brew or apt depending on the platform.

When to run Step 0:

- First time any script-assisted mode is used in a session
- After a fresh clawhub install piyush-zinc/doc-process
- If a script fails with ModuleNotFoundError or ImportError

To install Python packages only (no system packages):
    bash skills/doc-process/setup.sh --light

Or install directly from the skill's requirements file:
    pip install -r skills/doc-process/requirements.txt

Note: openai-whisper downloads its model (~140 MB) on first audio transcription — not at install time.
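
The dependency check that Step 0 calls for could be sketched as follows. This is a minimal illustration, not the skill's own code: the function name `missing_dependencies` is hypothetical, and the import names (for example `fitz` for pymupdf, `cv2` for opencv-python-headless) are the usual ones for those packages rather than something the skill states.

```python
import importlib.util
import shutil

# pip package name -> module name used at import time (assumed, see lead-in).
PYTHON_MODULES = {
    "pymupdf": "fitz",
    "Pillow": "PIL",
    "pytesseract": "pytesseract",
    "opencv-python-headless": "cv2",
    "numpy": "numpy",
    "img2pdf": "img2pdf",
    "pdfplumber": "pdfplumber",
    "openai-whisper": "whisper",
}
SYSTEM_BINARIES = ["tesseract", "ffmpeg"]


def missing_dependencies(modules=PYTHON_MODULES, binaries=SYSTEM_BINARIES):
    """Return (missing_packages, missing_binaries) without importing anything."""
    missing_packages = [
        package
        for package, module in modules.items()
        if importlib.util.find_spec(module) is None
    ]
    missing_binaries = [name for name in binaries if shutil.which(name) is None]
    return missing_packages, missing_binaries


if __name__ == "__main__":
    packages, tools = missing_dependencies()
    if packages or tools:
        print("Missing:", packages + tools)
        print("Run: bash skills/doc-process/setup.sh")
    else:
        print("All dependencies available.")
```

Using `find_spec` rather than a real import keeps the check fast and avoids side effects such as loading OpenCV just to see whether it exists.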

Overview

This skill handles all document-related tasks using Claude's native vision/language capabilities for reading and analysis, and Python scripts for file-output operations. Most modes require no installation — only the file-output scripts need third-party libraries.

How Features Are Implemented

Feature	Implementation	External libraries
OCR / reading images	Claude built-in vision	None
MRZ decoding (passport/ID)

Dependencies & Installation

No installation required for core functionality

Reading, analysis, form filling, contract review, receipt scanning, bank statement analysis (PDF), resume parsing, ID scanning, medical summarising, redaction markup, meeting minutes, and translation all run on Claude's built-in capabilities.

Optional — install only for file-output scripts

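    # PII redaction to PDF/image files (redactor.py)
    pip install "pymupdf>=1.23"        # required for PDF redaction
    pip install "Pillow>=10.0"         # required for image redaction
    pip install "pytesseract>=0.3"     # required for image redaction (also needs: brew install tesseract)

    # Document scanning / perspective correction (doc_scanner.py)
    pip install "opencv-python-headless>=4.9" "numpy>=1.24" "Pillow>=10.0"
    pip install "img2pdf>=0.5"         # optional, for PDF output; falls back to Pillow if missing

    # Table extraction from PDF (table_extractor.py)
    pip install "pdfplumber>=0.11"

    # Audio transcription (audio_transcriber.py)
    # Also needs the ffmpeg binary: brew install ffmpeg / apt install ffmpeg
    pip install "openai-whisper>=20231117"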

All dependencies are also listed in requirements.txt at the repository root.

Binary dependencies

Binary	Required by	Install
tesseract	redactor.py (image mode)	brew install tesseract / apt install tesseract-ocr
ffmpeg	audio_transcriber.py	brew install ffmpeg / apt install ffmpeg

Network access

openai-whisper downloads model files (~140 MB) from OpenAI/HuggingFace servers on first run only. The files are cached at ~/.cache/whisper/. All other scripts are fully local after installation.

Script Reference

Script	Dependencies	Purpose	Example
redactor.py	pymupdf; Pillow + pytesseract (image mode)	PII redaction to file (PDF/image/text)	python scripts/redactor.py --file doc.pdf --mode full --log
doc_scanner.py

All scripts import only what they declare. Scripts with no declared deps use Python stdlib only. You can verify any script: "show me the source of [script name]".

Script Import Verification

Script	Stdlib imports	Third-party	Network
timeline_manager.py	argparse, json, sys, datetime, pathlib, uuid, collections	None	Never
redactor.py

Privacy & Data Handling

Aspect	Policy
Document content	Read locally within this session only. Not stored, indexed, or transmitted.
Personal data for form autofill	Used only to complete the current form. Not written to any file. Not retained after session.
Timeline log	Opt-in only. Confirmed by user before any entry is written. Contains no raw document content, only category-level summaries.
Redacted output files	Written only to a path the user explicitly confirms.
Audio transcripts	Written to a local file the user specifies. Model download on first Whisper use only.
No telemetry	This skill has no analytics, usage reporting, or network calls beyond what is listed above.

Step 1 — Identify the Mode

Explicit intent → go directly to the matching mode

Mode	User intent signals	Typical file types
Document Categorizer	"process this", "what is this?", "analyze this", "help with this", no clear intent	Any
Form Autofill

Ambiguous intent → Document Categorizer (with consent gate)

If the user uploads a file without a clear mode signal, do not read it yet. Ask:

"I can classify this document automatically to suggest the best mode — that requires me to read the first 1–2 pages. Or you can choose directly:
Option	Best for
Form Autofill	Forms with fill-in fields
Contract Analyzer	Agreements, NDAs, leases
Receipt Scanner	Receipts, invoices
Bank Statement Analyzer	Bank/credit card statements
Resume Parser	CVs, resumes
ID Scanner	Passports, IDs, driver's licenses
Medical Summarizer	Lab reports, prescriptions
Legal Redactor	Any document with PII to remove
Meeting Minutes	Notes or recordings
Table Extractor	Documents with data tables
Translator	Non-English documents
Doc Scan	Document photo needing perspective correction

Shall I classify it, or which mode would you like?"

Only read the document after the user confirms.
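
The routing rule in Step 1 (explicit intent goes straight to a mode, ambiguous intent waits for consent) might be sketched like this. The keyword lists are hypothetical, loosely derived from the "Best for" column above; the skill's real intent signals are not reproduced here, and `choose_mode` and `next_action` are illustrative names.

```python
# Hypothetical keyword lists, for illustration only.
MODE_KEYWORDS = {
    "Contract Analyzer": ("contract", "nda", "lease"),
    "Receipt Scanner": ("receipt", "invoice"),
    "Resume Parser": ("resume", "cv"),
    "Translator": ("translate",),
}


def choose_mode(message: str):
    """Return a mode name for explicit intent, or None when intent is ambiguous."""
    words = set(message.lower().replace("?", " ").replace(",", " ").split())
    for mode, keywords in MODE_KEYWORDS.items():
        if words & set(keywords):
            return mode
    return None


def next_action(message: str, consent_to_classify=None):
    """Decide what to do before any document content is read."""
    mode = choose_mode(message)
    if mode is not None:
        return ("run_mode", mode)
    if consent_to_classify is None:
        return ("ask_user", None)  # show the option list first
    if consent_to_classify:
        return ("run_mode", "Document Categorizer")
    return ("ask_user", None)  # user declined classification; ask for a mode
```

The point of the sketch is the ordering: nothing is read until either a mode is clear from the message or the user has agreed to classification.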

Step 2 — Read the Document

Use the Read tool on the uploaded file. For images, read them visually. For PDFs over 10 pages, read in page ranges.
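
Splitting a long PDF into page ranges can be as simple as the sketch below. The chunk size of 10 is an assumption taken from the 10-page threshold above; the skill does not state how large each range should be, and `page_ranges` is a hypothetical helper.

```python
def page_ranges(total_pages: int, chunk_size: int = 10):
    """Return inclusive 1-based (start, end) page ranges covering the document."""
    if total_pages < 1 or chunk_size < 1:
        raise ValueError("total_pages and chunk_size must be positive")
    return [
        (start, min(start + chunk_size - 1, total_pages))
        for start in range(1, total_pages + 1, chunk_size)
    ]
```

For a 23-page document, `page_ranges(23)` returns `[(1, 10), (11, 20), (21, 23)]`.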

For audio files (Meeting Minutes mode only): confirm before running — this requires openai-whisper and downloads a model on first run:

"Transcribing this audio requires the openai-whisper library. On first use it downloads a model file (~140 MB). Is that OK?"

If yes:
CODEBLOCK4

If no: ask if the user can provide a text transcript.

For document photos (Doc Scan mode): read the image visually first to assess quality and detect the document type before running the scanner script.

Step 3 — Execute the Mode

Load and follow the matching reference file in full:

Mode	Reference file
Document Categorizer	INLINECODE63
Form Autofill

Step 4 — Redactor: PII Rule Coverage

The redactor.py script covers the following PII categories across 50+ rule types for global document types (bank statements, contracts, medical records, invoices, share-purchase agreements, government forms, and more).

Category 1 — Personal Identifiers (standard + light mode)

Rule	Examples
SSN (US)	123-45-6789
SIN (Canada)

Category 2 — Financial Data (standard + full mode)

Rule	Examples
Credit / debit card number	4111 1111 1111 1111
Card CVV

Category 3 — Sensitive / Protected (full mode only)

HIV/AIDS status, blood type, mental health diagnoses (expanded), reproductive health, substance use history, sexual orientation / gender identity, disability, criminal record, genetic information, immigration status, minor's name, attorney–client privilege, trade secrets.

Redaction modes

Flag	Categories	Use case
INLINECODE78	Cat 1 only	Sharing docs where financial details can remain
INLINECODE79
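
The relationship between modes and categories can be pictured as a small lookup. This is a sketch under assumptions: the mode names come from the category headings above, "full" is assumed to include all three categories, and the three sample rule names are hypothetical stand-ins for the script's 50+ rules.

```python
from dataclasses import dataclass

# Assumed mapping: light = Cat 1, standard = Cat 1 + 2, full = Cat 1 + 2 + 3.
MODE_CATEGORIES = {
    "light": {1},
    "standard": {1, 2},
    "full": {1, 2, 3},
}


@dataclass(frozen=True)
class Rule:
    name: str
    category: int  # 1 = personal identifiers, 2 = financial, 3 = sensitive


# Hypothetical sample rules, one per category.
SAMPLE_RULES = [
    Rule("ssn_us", 1),
    Rule("card_number", 2),
    Rule("blood_type", 3),
]


def active_rules(mode: str):
    """Return the names of the rules that apply in the given redaction mode."""
    try:
        allowed = MODE_CATEGORIES[mode]
    except KeyError:
        raise ValueError(f"unknown redaction mode: {mode!r}") from None
    return [rule.name for rule in SAMPLE_RULES if rule.category in allowed]
```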

How PDF redaction works

1. Word bounding boxes are extracted from the PDF layout engine
2. PII is detected using a single-pass, non-overlapping regex engine
3. Matched spans are mapped back to word bounding boxes
4. PyMuPDF redaction annotations (solid black fill) are placed on the exact word rects
5. INLINECODE82 burns the black fills in and removes the underlying text data from the content stream, so redacted text cannot be copy-pasted or extracted
6. The file is saved incrementally: every non-redacted element (fonts, images, vector graphics, metadata) is left completely untouched
7. The original file is never modified; output is always a separate copy
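
Steps 2 and 3 above can be illustrated with the standard library alone. In this sketch the two patterns are simplified assumptions (the real script has its own rule set), words are split on whitespace instead of coming from a PDF layout engine, and all function names are hypothetical. In PyMuPDF the word list would typically come from `page.get_text("words")`, which also carries each word's rectangle.

```python
import re

# Two illustrative rules only; the patterns are simplified assumptions.
PATTERNS = {
    "SSN": r"\b\d{3}-\d{2}-\d{4}\b",
    "CARD": r"\b(?:\d{4}[ -]?){3}\d{4}\b",
}
# One alternation with named groups: finditer scans the text once and
# never returns overlapping matches.
COMBINED = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in PATTERNS.items())
)


def word_spans(text):
    """Return (start, end, word) for each whitespace-delimited word."""
    return [(m.start(), m.end(), m.group()) for m in re.finditer(r"\S+", text)]


def find_pii(text):
    """Return (rule_name, start, end) for each non-overlapping match."""
    return [(m.lastgroup, m.start(), m.end()) for m in COMBINED.finditer(text)]


def words_to_redact(text):
    """Map matched character spans back to the indices of the words they touch."""
    words = word_spans(text)
    hits = set()
    for _, start, end in find_pii(text):
        for index, (w_start, w_end, _) in enumerate(words):
            if w_start < end and start < w_end:  # character spans overlap
                hits.add(index)
    return sorted(hits)
```

A card number written with spaces matches as one span but covers four words, which is why the mapping step works on overlap rather than exact equality. Each resulting word index would then correspond to one rectangle to cover.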

Step 5 — Doc Scan: How It Works

The doc_scanner.py script converts a document photo into a professional scan in 7 steps:

1. Multi-strategy edge detection: tries three approaches in order: (A) Canny on greyscale; (B) Morphological gradient; (C) Colour/brightness threshold. Stops at first success.
2. Sub-pixel corner refinement: cv2.cornerSubPix makes the four corner points accurate to sub-pixel level for the most precise warp.
3. Perspective warp: four-point transform using Lanczos interpolation flattens the document to a perfect rectangle.
4. Shadow removal: per-channel background estimation + normalisation removes cast shadows and uneven lighting without affecting text.
5. Scan-quality enhancement: mode-specific: BW = adaptive threshold (block size auto-scaled to resolution) + stroke repair + denoising; Gray = auto-levels + CLAHE + unsharp mask; Color = white-balance + CLAHE + sharpening.
6. Scanner border: 8 px white border simulates scanner bed edge.
7. DPI-tagged output: saved with embedded DPI metadata (default 300 DPI, print quality).
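
The geometry behind the perspective warp in step 3 can be sketched without OpenCV. The code below shows a common way to order four detected corners and derive the output size; it is an assumption about how such a transform is usually prepared, not a copy of doc_scanner.py, and both function names are hypothetical. With OpenCV, the ordered corners and target size would typically be passed to `cv2.getPerspectiveTransform` and `cv2.warpPerspective` (with `flags=cv2.INTER_LANCZOS4` for Lanczos interpolation).

```python
import math


def order_corners(points):
    """Order four (x, y) points as top-left, top-right, bottom-right, bottom-left.

    Assumes image coordinates, where y grows downward.
    """
    if len(points) != 4:
        raise ValueError("expected exactly four corner points")
    top_left = min(points, key=lambda p: p[0] + p[1])
    bottom_right = max(points, key=lambda p: p[0] + p[1])
    top_right = max(points, key=lambda p: p[0] - p[1])
    bottom_left = min(points, key=lambda p: p[0] - p[1])
    return [top_left, top_right, bottom_right, bottom_left]


def target_size(ordered):
    """Width and height of the flattened page: the longer of each pair of opposite edges."""
    tl, tr, br, bl = ordered
    width = max(math.dist(tl, tr), math.dist(bl, br))
    height = max(math.dist(tl, bl), math.dist(tr, br))
    return round(width), round(height)
```

Taking the longer edge of each pair avoids shrinking the page when the photo was taken at an angle.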

When auto-detection fails

If the script reports "corners_detected": false:

1. Offer manual corner hints: ask the user where the four corners of the document are approximately
2. Use --no-warp to at least apply enhancement without perspective correction
3. Provide photography tips (see references/doc-scan.md Step 8)
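
Handling the failure case might look like the sketch below. It assumes the script's report is a JSON object; only the `corners_detected` key is taken from the text above, and `fallback_steps` is a hypothetical helper. A report without the key is treated as a failed detection.

```python
import json


def fallback_steps(report_json: str):
    """Return the follow-up actions to offer when corner detection failed."""
    report = json.loads(report_json)
    if report.get("corners_detected", False):
        return []
    return [
        "ask the user for approximate positions of the four corners",
        "re-run with --no-warp to apply enhancement only",
        "share photography tips from references/doc-scan.md Step 8",
    ]
```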

Step 6 — Document Timeline (Opt-In)

Off by default. After completing the first document task in a session, ask once:

"Would you like me to keep a processing log for this session? It records document type, filename, and a category-level summary (no raw content, no personal data) to ~/.doc-process-timeline.json on your local machine. Entirely optional — yes or no."

- Yes → confirm "Timeline logging is on." Log current and subsequent documents. Announce each with "Logged to your timeline."
- No → confirm "No log will be kept." Do not run any timeline script. Do not ask again this session.
- No response / unsure → treat as No.

Summary rules (strictly enforced): the --summary argument must never contain names, ID numbers, dates of birth, addresses, account numbers, card numbers, medical values, or any data that could identify a person. Category-level descriptions only.
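
The consent rule and the summary rule could be backed by two small checks before anything is logged. This is a heuristic sketch: the patterns are assumptions for illustration, both function names are hypothetical, and a regex cannot detect personal names, so it supplements careful wording rather than replacing it.

```python
import re

# Heuristic patterns only; an assumption for illustration, not the skill's rule set.
_PII_HINTS = [
    re.compile(r"\d{3,}"),                               # long digit runs: IDs, accounts, cards
    re.compile(r"\b\d{1,4}[/-]\d{1,2}[/-]\d{1,4}\b"),    # date-like values
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),              # email addresses
]


def timeline_enabled(response):
    """Only an explicit yes turns logging on; silence or doubt counts as no."""
    return isinstance(response, str) and response.strip().lower() in {"yes", "y"}


def summary_is_category_level(summary: str) -> bool:
    """Reject summaries that look like they contain identifying values."""
    return not any(pattern.search(summary) for pattern in _PII_HINTS)
```

A summary such as "Bank statement, monthly spending by category" passes, while one that contains an account number or a date is rejected and should be rewritten before logging.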

Step 7 — Deliver Output

Present output in clean tables with section headers as specified in each reference file. Always end with an action prompt relevant to the mode. For Doc Scan, always offer to continue processing the scanned output.

General Principles

- Never hallucinate field values. Unknown values → [MISSING] or [UNREADABLE].
- Flag risks conservatively: when in doubt, include it.
- Keep summaries scannable with tables and bullets.
- Do not echo sensitive data beyond what is necessary for the immediate task.
- Always include relevant disclaimers (medical, legal, privacy) where required by the reference guide.
- Timeline is opt-in per session. Never log without confirmed consent.
- Personal data for form autofill is session-only. Never write it to a file.
- Before running any script with third-party deps, run bash skills/doc-process/setup.sh automatically if deps are not yet installed (see Step 0). No need to ask; the setup script is safe and idempotent.
- Categorize before asking, but only after confirming the user wants auto-classification.
- For Doc Scan: always assess the image visually first; never process non-document images.
