PDF Skill
Complete guide for PDF operations using Python libraries and CLI tools.
⚡ Feature Cheat Sheet
One-line lookup for every supported operation — find the right tool instantly.
| What you want to do | Command / Script | One-liner example |
|---|---|---|
| 📖 Extract text | `scripts/extract_text.py` | `python scripts/extract_text.py doc.pdf` |
| 📊 Extract tables → Excel | `scripts/extract_tables.py` | `python scripts/extract_tables.py report.pdf -o tables.xlsx` |
| 🔗 Merge PDFs | `scripts/merge_pdfs.py` | `python scripts/merge_pdfs.py "*.pdf" -o merged.pdf` |
| ✂️ Split PDF | `scripts/split_pdf.py` | `python scripts/split_pdf.py big.pdf --each` |
| 🔄 Rotate pages | `scripts/batch_convert.py rotate` | `python scripts/batch_convert.py rotate input.pdf -d 90` |
| 🔀 Reorder pages | `scripts/reorder_pdf.py` | `python scripts/reorder_pdf.py input.pdf --order "3,1,2,4-" -o reordered.pdf` |
| 💧 Add text watermark | `scripts/watermark.py` | `python scripts/watermark.py doc.pdf -t "CONFIDENTIAL"` |
| 🖼️ Add image watermark | `scripts/watermark.py` | `python scripts/watermark.py doc.pdf --image logo.png --alpha 0.3` |
| 🔒 Encrypt PDF | pypdf (inline) | see Password Protect below |
| 📝 Fill PDF form | `scripts/fill_pdf_form.py` | `python scripts/fill_pdf_form.py form.pdf -o filled.pdf --set name="Alice"` |
| 🔍 Check form fields | `scripts/check_fillable_fields.py` | `python scripts/check_fillable_fields.py form.pdf` |
| 🖼️ OCR scanned PDF | `scripts/ocr_pdf.py` | `python scripts/ocr_pdf.py scan.pdf --lang eng` |
| 📄 Create PDF from scratch | reportlab (inline) | see `references/create.md` |
| 📦 Batch operations | `scripts/batch_convert.py` | `python scripts/batch_convert.py merge --help` |
| 📏 Compress / optimize | `scripts/compress_pdf.py` | `python scripts/compress_pdf.py input.pdf -o output.pdf --quality medium` |
| ℹ️ View PDF info | `scripts/pdf_info.py` | `python scripts/pdf_info.py input.pdf` |
| 🖼️→📄 Images to PDF | `scripts/images_to_pdf.py` | `python scripts/images_to_pdf.py "photos/*.jpg" -o album.pdf --page-size A4` |
| 📄→🖼️ PDF to images | `scripts/pdf_to_images.py` | `python scripts/pdf_to_images.py input.pdf -o pages/ --format png --dpi 200` |
| 🔎 Compare two PDFs | `scripts/compare_pdf.py` | `python scripts/compare_pdf.py old.pdf new.pdf -o diff_report.html` |
| 🔧 Repair corrupted PDF | `scripts/repair_pdf.py` | `python scripts/repair_pdf.py broken.pdf -o fixed.pdf` |
| 🔤 List fonts | `scripts/list_fonts.py` | `python scripts/list_fonts.py input.pdf` |
💡 Run any script with --help to see all available options.
Quick Decision Guide
CODEBLOCK0
Installation
Linux (Ubuntu/Debian)
CODEBLOCK1
macOS (Homebrew)
CODEBLOCK2
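⚠️ macOS note: `tesseract-lang` must be installed separately; otherwise OCR for non-English languages such as Chinese or Japanese will fail. After installing, run `tesseract --list-langs` to confirm the available languages.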
Verify Installation
CODEBLOCK3
Core Operations
Read & Extract Text
CODEBLOCK4
→ For advanced extraction options, see references/extract.md
Extract Tables → DataFrame
CODEBLOCK5
Merge PDFs
CODEBLOCK6
Split PDF
CODEBLOCK7
Rotate Pages
CODEBLOCK8
Password Protect
CODEBLOCK9
CLI Quick Reference (qpdf)
CODEBLOCK10
Available Scripts
Use these scripts directly — no need to rewrite from scratch:
| Script | Purpose |
|---|---|
| `scripts/extract_text.py` | Extract all text, page by page, to .txt |
| `scripts/extract_tables.py` | Extract all tables to .xlsx |
| `scripts/merge_pdfs.py` | Merge multiple PDFs from a glob pattern |
| `scripts/split_pdf.py` | Split by page ranges |
| `scripts/reorder_pdf.py` | Reorder pages (flexible syntax: "3,1,2,4-") |
| `scripts/watermark.py` | Add text or image watermark |
| `scripts/ocr_pdf.py` | Full OCR pipeline for scanned PDFs |
| `scripts/batch_convert.py` | Batch operations (merge/split/rotate) CLI |
| `scripts/check_fillable_fields.py` | List all form fields in a PDF |
| `scripts/fill_pdf_form.py` | Fill AcroForm fields programmatically |
| `scripts/create_test_form.py` | Generate a sample fillable PDF form for testing |
| `scripts/compress_pdf.py` | Compress / optimize PDF to reduce file size |
| `scripts/pdf_info.py` | View PDF metadata, page count, encryption, fonts |
| `scripts/images_to_pdf.py` | Convert images (JPG/PNG/etc.) to PDF |
| `scripts/pdf_to_images.py` | Convert PDF pages to PNG/JPEG images |
| `scripts/compare_pdf.py` | Compare two PDFs and generate diff report |
| `scripts/repair_pdf.py` | Attempt to repair corrupted PDF files |
| `scripts/list_fonts.py` | List all fonts used in a PDF |
Run any script with --help to see its options.
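The `"3,1,2,4-"` order syntax listed for `scripts/reorder_pdf.py` can be illustrated with a small parser. This is a sketch, not the script's actual code: it assumes pages are numbered from 1, that `A-B` is an inclusive range, and that a trailing `4-` means "page 4 through the last page". The names `parse_page_order` and `reorder_pdf` are hypothetical.

```python
def parse_page_order(spec: str, total_pages: int) -> list[int]:
    """Expand a 1-based order spec such as "3,1,2,4-" into 0-based page indices."""
    indices: list[int] = []
    for token in spec.split(","):
        token = token.strip()
        if not token:
            continue
        if "-" in token:
            # "A-B" is an inclusive range; an open end ("4-") runs to the last page.
            start_text, end_text = token.split("-", 1)
            start = int(start_text) if start_text else 1
            end = int(end_text) if end_text else total_pages
        else:
            start = end = int(token)
        if not (1 <= start <= end <= total_pages):
            raise ValueError(f"page range {token!r} is outside 1..{total_pages}")
        indices.extend(range(start - 1, end))
    return indices


def reorder_pdf(src: str, dst: str, spec: str) -> None:
    """Write the pages of src to dst in the order described by spec."""
    from pypdf import PdfReader, PdfWriter  # imported lazily; requires pypdf

    reader = PdfReader(src)
    writer = PdfWriter()
    for index in parse_page_order(spec, len(reader.pages)):
        writer.add_page(reader.pages[index])
    with open(dst, "wb") as f:
        writer.write(f)
```

For a six-page file, `parse_page_order("3,1,2,4-", 6)` yields the indices for pages 3, 1, 2, 4, 5, 6 in that order.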
Reference Files
Load these when you need deeper guidance:
- `references/create.md`: Building PDFs from scratch with reportlab (Platypus, Canvas, styles, tables, headers/footers)
- `references/extract.md`: Advanced text/table/image extraction, coordinate-based cropping, word-level data
- `references/security.md`: Watermarks, encryption, permissions, digital signatures
- `references/ocr.md`: OCR pipeline, language packs, image preprocessing, quality tuning
- `FORMS.md`: Complete guide to PDF form filling (AcroForm + XFA, pdf-lib JS)
Quick Reference Table
| Task | Best Tool | Key Method |
|---|---|---|
| Extract text | pdfplumber | `page.extract_text()` |
| Extract tables | pdfplumber | `page.extract_tables()` |
| Merge PDFs | pypdf | `writer.append()` |
| Split PDFs | pypdf | one page per writer |
| Rotate pages | pypdf | `page.rotate(90)` |
| Reorder pages | pypdf | `writer.add_page(reader.pages[i])` |
| Create PDF | reportlab | Platypus or Canvas |
| Watermark | pypdf + reportlab | `page.merge_page()` |
| Encrypt | pypdf | `writer.encrypt()` |
| Fill form | pypdf / pdf-lib | see FORMS.md |
| OCR scanned | pytesseract | see `references/ocr.md` |
| Compress PDF | qpdf + pypdf | `compress_identical_objects()` |
| View PDF info | pypdf | `PdfReader` metadata + fields |
| Images → PDF | reportlab | `canvas.drawImage()` |
| PDF → images | pdf2image | `convert_from_path()` |
| Compare PDFs | pdfplumber + difflib | text diff per page |
| Repair PDF | qpdf / pypdf | `qpdf --linearize` or re-write |
| List fonts | pypdf | page `/Resources` → `/Font` |
| CLI merge | qpdf | `--empty --pages` |
| Extract images | pypdf / pdfimages | `page.images` |
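The "Compare PDFs" row pairs pdfplumber with difflib for a per-page text diff. A minimal sketch of that idea follows. `diff_pages` and `page_texts` are hypothetical helper names, and `scripts/compare_pdf.py` may well work differently; the diff itself uses only the standard library, so it can be tried on plain strings.

```python
import difflib


def diff_pages(old_pages: list[str], new_pages: list[str]) -> dict[int, list[str]]:
    """Return {1-based page number: unified diff lines} for pages whose text differs."""
    changes: dict[int, list[str]] = {}
    for i in range(max(len(old_pages), len(new_pages))):
        # A page missing from one document is treated as empty text.
        old_text = old_pages[i] if i < len(old_pages) else ""
        new_text = new_pages[i] if i < len(new_pages) else ""
        diff = list(difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile=f"old/page-{i + 1}",
            tofile=f"new/page-{i + 1}",
            lineterm="",
        ))
        if diff:
            changes[i + 1] = diff
    return changes


def page_texts(path: str) -> list[str]:
    """Extract the text of every page with pdfplumber (imported lazily)."""
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        return [page.extract_text() or "" for page in pdf.pages]
```

With both libraries installed, `diff_pages(page_texts("old.pdf"), page_texts("new.pdf"))` returns only the pages that changed, which keeps a report short for long documents.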
Common Pitfalls
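- Never use Unicode subscripts/superscripts (₂, ⁰) in reportlab: use `<sub>` / `<super>` XML tags instead, or they render as black boxes.
- pdfplumber, not pypdf, for text extraction: pypdf's `extract_text()` loses layout; pdfplumber is layout-aware.
- Encrypted PDFs: pass `password=` to `PdfReader()` and INLINECODE81.
- pip in sandbox: always add the `--break-system-packages` flag.
- qpdf for speed: for large batch jobs, prefer the qpdf CLI over Python loops.
- macOS OCR language packs: `brew install tesseract` includes English only; non-English OCR additionally requires `brew install tesseract-lang`.
- macOS system dependencies: OCR and CLI operations require the system tools to be installed first with `brew install qpdf poppler tesseract`.
- Testing form filling: when no fillable PDF is available, first run `python scripts/create_test_form.py` to generate a test form.
- OCR vs pdfplumber: OCR is only suitable for scanned (image-based) PDFs. To extract content from native-text PDFs, use `pdfplumber` (faster and more accurate).
- Chinese form filling: pypdf's built-in fonts do not support CJK characters, so Chinese values may render as boxes. When Chinese form filling is needed, use the pdf-lib (JS) approach (see FORMS.md).
- Rotating pages: there is no standalone rotate script; use `scripts/batch_convert.py rotate`.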
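The first pitfall (Unicode subscripts and superscripts in reportlab) can be handled by converting those characters before the text reaches reportlab. The sketch below is one possible approach, assuming only the digit variants need converting; `to_reportlab_markup` is a hypothetical helper name, not part of reportlab.

```python
import re

# Unicode digit variants mapped back to plain digits (digits only, by assumption).
_SUBSCRIPTS = dict(zip("₀₁₂₃₄₅₆₇₈₉", "0123456789"))
_SUPERSCRIPTS = dict(zip("⁰¹²³⁴⁵⁶⁷⁸⁹", "0123456789"))


def _wrap_runs(text: str, table: dict, tag: str) -> str:
    """Replace each run of characters found in table with <tag>plain digits</tag>."""
    pattern = "[" + "".join(table) + "]+"

    def replace(match: re.Match) -> str:
        plain = "".join(table[ch] for ch in match.group(0))
        return f"<{tag}>{plain}</{tag}>"

    return re.sub(pattern, replace, text)


def to_reportlab_markup(text: str) -> str:
    """Convert Unicode sub/superscript digits to <sub>/<super> tags."""
    text = _wrap_runs(text, _SUBSCRIPTS, "sub")
    return _wrap_runs(text, _SUPERSCRIPTS, "super")
```

The result is meant to be passed to a Platypus `Paragraph`, which interprets the tags; plain canvas text drawing does not. Other special characters such as `&` or `<` in the input would still need XML escaping before the conversion.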
⛔ Limitations (Not Suitable For)
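| Scenario | Reason | Alternative |
|---|---|---|
| PDFs with complex layouts (magazines, posters) | Extraction loses the formatting and layout | Use professional layout tools |
| Table extraction from scanned documents | OCR table accuracy is limited | Use a dedicated table-recognition tool such as Camelot |
| Form filling with CJK characters | pypdf's built-in fonts do not include CJK | Use pdf-lib (JS), see FORMS.md |
| Very large PDFs (>500 MB) | May run out of memory | Use the qpdf CLI or process in batches |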
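For the very-large-PDF case, batch processing can be sketched as cutting the file into fixed-size page ranges with qpdf. The helper below only builds the command lines, using the `qpdf input.pdf --pages . 1-5 -- out.pdf` form from this guide's qpdf quick reference. The function name, the chunk size, and the `part_001.pdf` naming pattern are illustrative assumptions.

```python
def qpdf_chunk_commands(
    src: str,
    total_pages: int,
    chunk_size: int,
    out_pattern: str = "part_{:03d}.pdf",
) -> list[list[str]]:
    """Build one qpdf command per chunk of at most chunk_size pages."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    commands: list[list[str]] = []
    starts = range(1, total_pages + 1, chunk_size)
    for part, start in enumerate(starts, start=1):
        end = min(start + chunk_size - 1, total_pages)
        commands.append([
            "qpdf", src,
            "--pages", ".", f"{start}-{end}", "--",
            out_pattern.format(part),
        ])
    return commands
```

Each command can then be run with `subprocess.run(cmd, check=True)`. The page count can be obtained up front with `qpdf --show-npages big.pdf`, so the large file never has to be loaded into Python.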
PDF Skill
Complete guide for PDF operations using Python libraries and CLI tools.
⚡ Feature Cheat Sheet
One-line lookup for every supported operation: find the right tool instantly.
| What you want to do | Command / Script | One-liner example |
|---|---|---|
| 📖 Extract text | `scripts/extract_text.py` | `python scripts/extract_text.py doc.pdf` |
| 📊 Extract tables → Excel | `scripts/extract_tables.py` | `python scripts/extract_tables.py report.pdf -o tables.xlsx` |
| 🔗 Merge PDFs | `scripts/merge_pdfs.py` | `python scripts/merge_pdfs.py "*.pdf" -o merged.pdf` |
| ✂️ Split PDF | `scripts/split_pdf.py` | `python scripts/split_pdf.py big.pdf --each` |
| 🔄 Rotate pages | `scripts/batch_convert.py rotate` | `python scripts/batch_convert.py rotate input.pdf -d 90` |
| 🔀 Reorder pages | `scripts/reorder_pdf.py` | `python scripts/reorder_pdf.py input.pdf --order "3,1,2,4-" -o reordered.pdf` |
| 💧 Add text watermark | `scripts/watermark.py` | `python scripts/watermark.py doc.pdf -t "CONFIDENTIAL"` |
| 🖼️ Add image watermark | `scripts/watermark.py` | `python scripts/watermark.py doc.pdf --image logo.png --alpha 0.3` |
| 🔒 Encrypt PDF | pypdf (inline) | see Password Protect below |
| 📝 Fill PDF form | `scripts/fill_pdf_form.py` | `python scripts/fill_pdf_form.py form.pdf -o filled.pdf --set name="Alice"` |
| 🔍 Check form fields | `scripts/check_fillable_fields.py` | `python scripts/check_fillable_fields.py form.pdf` |
| 🖼️ OCR scanned PDF | `scripts/ocr_pdf.py` | `python scripts/ocr_pdf.py scan.pdf --lang eng` |
| 📄 Create PDF from scratch | reportlab (inline) | see `references/create.md` |
| 📦 Batch operations | `scripts/batch_convert.py` | `python scripts/batch_convert.py merge --help` |
| 📏 Compress / optimize | `scripts/compress_pdf.py` | `python scripts/compress_pdf.py input.pdf -o output.pdf --quality medium` |
| ℹ️ View PDF info | `scripts/pdf_info.py` | `python scripts/pdf_info.py input.pdf` |
| 🖼️→📄 Images to PDF | `scripts/images_to_pdf.py` | `python scripts/images_to_pdf.py "photos/*.jpg" -o album.pdf --page-size A4` |
| 📄→🖼️ PDF to images | `scripts/pdf_to_images.py` | `python scripts/pdf_to_images.py input.pdf -o pages/ --format png --dpi 200` |
| 🔎 Compare two PDFs | `scripts/compare_pdf.py` | `python scripts/compare_pdf.py old.pdf new.pdf -o diff_report.html` |
| 🔧 Repair corrupted PDF | `scripts/repair_pdf.py` | `python scripts/repair_pdf.py broken.pdf -o fixed.pdf` |
| 🔤 List fonts | `scripts/list_fonts.py` | `python scripts/list_fonts.py input.pdf` |
💡 Run any script with `--help` to see all available options.
Quick Decision Guide
```
What do you need?
├── Create a new PDF from scratch → reportlab (see references/create.md)
├── Extract text/tables → pdfplumber (see references/extract.md)
├── Merge/split/rotate pages → pypdf or qpdf CLI
├── Reorder pages → scripts/reorder_pdf.py
├── Add watermark/encrypt/protect → pypdf
├── Fill PDF forms → pdf-lib (JS) or pypdf (see FORMS.md)
├── Extract images from a PDF → pdfimages CLI or pypdf
├── OCR a scanned PDF → pdf2image + pytesseract
├── Compress/reduce file size → scripts/compress_pdf.py (qpdf + pypdf)
├── View PDF info/metadata → scripts/pdf_info.py
├── Images to PDF → scripts/images_to_pdf.py (reportlab)
├── PDF to images → scripts/pdf_to_images.py (pdf2image)
├── Compare/diff two PDFs → scripts/compare_pdf.py
├── Repair a corrupted PDF → scripts/repair_pdf.py (qpdf + pypdf)
└── List fonts in a PDF → scripts/list_fonts.py
```
Installation
Linux (Ubuntu/Debian)
```bash
# Python libraries
pip install pypdf pdfplumber reportlab pdf2image pytesseract Pillow --break-system-packages
# System tools
sudo apt-get install -y poppler-utils tesseract-ocr qpdf
# Chinese OCR
sudo apt-get install -y tesseract-ocr-chi-sim tesseract-ocr-chi-tra
# Node.js (form filling)
npm install pdf-lib
```
macOS (Homebrew)
```bash
# System tools (required for OCR and CLI operations)
brew install qpdf poppler tesseract
# Important: language packs for non-English OCR must be installed separately
brew install tesseract-lang
# Python libraries
pip install pypdf pdfplumber reportlab pdf2image pytesseract Pillow --break-system-packages
# Node.js (form filling)
npm install pdf-lib
```
⚠️ macOS note: `tesseract-lang` must be installed separately; otherwise OCR for non-English languages such as Chinese or Japanese will fail. After installing, run `tesseract --list-langs` to confirm the available languages.
Verify Installation
```bash
# Check Python libraries
python3 -c "import pypdf, pdfplumber, reportlab, PIL; print('✓ Python libs OK')"
# Check system tools
which qpdf && echo "✓ qpdf OK" || echo "✗ qpdf not installed"
which tesseract && echo "✓ tesseract OK" || echo "✗ tesseract not installed"
which pdftotext && echo "✓ poppler OK" || echo "✗ poppler not installed"
# Check OCR languages
tesseract --list-langs 2>/dev/null | head -5
```
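The same checks can be run from Python, which is convenient when a script should fail early with a clear message. This is a sketch using only the standard library. The module and tool lists mirror the install commands in this guide, and `check_environment` is a hypothetical name.

```python
import importlib.util
import shutil

# Import names of the Python libraries and executables this guide relies on.
PYTHON_MODULES = ["pypdf", "pdfplumber", "reportlab", "PIL", "pdf2image", "pytesseract"]
SYSTEM_TOOLS = ["qpdf", "tesseract", "pdftotext"]


def check_environment(modules=PYTHON_MODULES, tools=SYSTEM_TOOLS) -> dict[str, bool]:
    """Report which Python modules are importable and which CLI tools are on PATH."""
    status: dict[str, bool] = {}
    for name in modules:
        # find_spec locates the module without importing it.
        status[f"python:{name}"] = importlib.util.find_spec(name) is not None
    for name in tools:
        status[f"tool:{name}"] = shutil.which(name) is not None
    return status


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(("OK       " if ok else "MISSING  ") + name)
```

Unlike the shell one-liner, this reports every missing piece in one pass instead of stopping at the first failed import.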
Core Operations
Read & Extract Text
```python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        print(page.extract_text())
```
→ For advanced extraction options, see references/extract.md
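Extract Tables → DataFrame
```python
import pdfplumber
import pandas as pd  # not in the install list above; install with pip if missing

with pdfplumber.open("report.pdf") as pdf:
    for page in pdf.pages:
        for table in page.extract_tables():
            # The first row is treated as the header, the rest as data
            df = pd.DataFrame(table[1:], columns=table[0])
            print(df)
```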
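Using the first row as the header works for clean tables, but extracted tables can contain empty (`None`) or repeated header cells. The sketch below shows one way to normalise a table before building a DataFrame. `table_to_records` and the `column_N` fallback names are illustrative assumptions, not part of pdfplumber.

```python
def table_to_records(table):
    """Convert a table (list of rows, first row = header) into a list of dicts."""
    if not table or len(table) < 2:
        return []  # no data rows to convert
    header = []
    seen = {}
    for position, cell in enumerate(table[0]):
        # Empty header cells get a positional fallback name.
        name = (cell or "").strip() or f"column_{position + 1}"
        count = seen.get(name, 0)
        seen[name] = count + 1
        # Repeated names get a numeric suffix so no column is silently overwritten.
        header.append(name if count == 0 else f"{name}_{count + 1}")
    return [dict(zip(header, row)) for row in table[1:]]
```

The resulting records can be passed straight to `pd.DataFrame(records)`, and tables without data rows come back as an empty list instead of raising.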
Merge PDFs
```python
from pypdf import PdfReader, PdfWriter

writer = PdfWriter()
for path in ["a.pdf", "b.pdf", "c.pdf"]:
    writer.append(PdfReader(path))  # appends every page of each file in order
with open("merged.pdf", "wb") as f:
    writer.write(f)
```
Split PDF
```python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
for i, page in enumerate(reader.pages):
    w = PdfWriter()  # one writer per page, so each page becomes its own file
    w.add_page(page)
    with open(f"page_{i+1}.pdf", "wb") as f:
        w.write(f)
```
Rotate Pages
```python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("scan.pdf")
writer = PdfWriter()
for page in reader.pages:
    page.rotate(90)  # 90 / 180 / 270
    writer.add_page(page)
with open("rotated.pdf", "wb") as f:
    writer.write(f)
```
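Password Protect
```python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("doc.pdf")
writer = PdfWriter()
for page in reader.pages:
    writer.add_page(page)
# algorithm="AES-256" selects AES-256; use_128bit=False would select 40-bit RC4 instead.
# AES support needs an optional crypto backend such as the "cryptography" package.
writer.encrypt("userpass", "ownerpass", algorithm="AES-256")
with open("encrypted.pdf", "wb") as f:
    writer.write(f)
```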
CLI Quick Reference (qpdf)
```bash
# Merge
qpdf --empty --pages a.pdf b.pdf -- merged.pdf
# Extract pages 1-5
qpdf input.pdf --pages . 1-5 -- out.pdf
```