PDF Skill

Complete guide for PDF operations using Python libraries and CLI tools.

⚡ Feature Cheat Sheet

One-line lookup for every supported operation — find the right tool instantly.

| What you want to do | Command / Script | One-liner example |
| --- | --- | --- |
| 📖 Extract text | `scripts/extract_text.py` | `python scripts/extract_text.py doc.pdf` |
| 📊 Extract tables → Excel | `scripts/extract_tables.py` | `python scripts/extract_tables.py report.pdf -o tables.xlsx` |
| 🔗 Merge PDFs | `scripts/merge_pdfs.py` | `python scripts/merge_pdfs.py "*.pdf" -o merged.pdf` |
| ✂️ Split PDF | `scripts/split_pdf.py` | `python scripts/split_pdf.py big.pdf --each` |
| 🔄 Rotate pages | `scripts/batch_convert.py rotate` | `python scripts/batch_convert.py rotate input.pdf -d 90` |
| 🔀 Reorder pages | `scripts/reorder_pdf.py` | `python scripts/reorder_pdf.py input.pdf --order "3,1,2,4-" -o reordered.pdf` |
| 💧 Add text watermark | `scripts/watermark.py` | `python scripts/watermark.py doc.pdf -t "CONFIDENTIAL"` |
| 🖼️ Add image watermark | `scripts/watermark.py` | `python scripts/watermark.py doc.pdf --image logo.png --alpha 0.3` |
| 🔒 Encrypt PDF | pypdf (inline) | see Password Protect below |
| 📝 Fill PDF form | `scripts/fill_pdf_form.py` | `python scripts/fill_pdf_form.py form.pdf -o filled.pdf --set name="Alice"` |
| 🔍 Check form fields | `scripts/check_fillable_fields.py` | `python scripts/check_fillable_fields.py form.pdf` |
| 🖼️ OCR scanned PDF | `scripts/ocr_pdf.py` | `python scripts/ocr_pdf.py scan.pdf --lang eng` |
| 📄 Create PDF from scratch | reportlab (inline) | see references/create.md |
| 📦 Batch operations | `scripts/batch_convert.py` | `python scripts/batch_convert.py merge --help` |
| 📏 Compress / optimize | `scripts/compress_pdf.py` | `python scripts/compress_pdf.py input.pdf -o output.pdf --quality medium` |
| ℹ️ View PDF info | `scripts/pdf_info.py` | `python scripts/pdf_info.py input.pdf` |
| 🖼️→📄 Images to PDF | `scripts/images_to_pdf.py` | `python scripts/images_to_pdf.py "photos/*.jpg" -o album.pdf --page-size A4` |
| 📄→🖼️ PDF to images | `scripts/pdf_to_images.py` | `python scripts/pdf_to_images.py input.pdf -o pages/ --format png --dpi 200` |
| 🔎 Compare two PDFs | `scripts/compare_pdf.py` | `python scripts/compare_pdf.py old.pdf new.pdf -o diff_report.html` |
| 🔧 Repair corrupted PDF | `scripts/repair_pdf.py` | `python scripts/repair_pdf.py broken.pdf -o fixed.pdf` |
| 🔤 List fonts | `scripts/list_fonts.py` | `python scripts/list_fonts.py input.pdf` |

💡 Run any script with --help to see all available options.
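The `--order "3,1,2,4-"` value in the reorder example reads like a compact page specification. The sketch below shows one plausible way such a string could be expanded into page indexes. The grammar (1-based pages, inclusive ranges, a trailing `-` meaning "through the last page") and the function name `parse_page_order` are assumptions made for illustration; check `python scripts/reorder_pdf.py --help` for the actual behavior.

```python
def parse_page_order(spec: str, page_count: int) -> list[int]:
    """Expand a spec such as "3,1,2,4-" into zero-based page indexes.

    Assumed grammar (not confirmed by the script itself):
      "N"    a single 1-based page number
      "A-B"  an inclusive range of pages
      "A-"   page A through the last page
    """
    indexes: list[int] = []
    for token in spec.split(","):
        token = token.strip()
        if not token:
            continue  # tolerate stray commas
        if "-" in token:
            start_text, end_text = token.split("-", 1)
            start = int(start_text)
            end = int(end_text) if end_text else page_count
        else:
            start = end = int(token)
        if not 1 <= start <= end <= page_count:
            raise ValueError(f"page range {token!r} is outside 1..{page_count}")
        # Convert the inclusive 1-based range to zero-based indexes.
        indexes.extend(range(start - 1, end))
    return indexes
```

With a six-page document, `parse_page_order("3,1,2,4-", 6)` yields the indexes for pages 3, 1, 2, 4, 5, 6, which could then be fed to `writer.add_page(reader.pages[i])` in pypdf.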

Quick Decision Guide

CODEBLOCK0

Installation

Linux (Ubuntu/Debian)

CODEBLOCK1

macOS (Homebrew)

CODEBLOCK2

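⚠️ macOS note: tesseract-lang must be installed separately; without it, OCR for non-English languages such as Chinese or Japanese fails. After installing, run tesseract --list-langs to confirm which languages are available.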

Verify Installation

CODEBLOCK3

Core Operations

Read & Extract Text

CODEBLOCK4

→ For advanced extraction options, see references/extract.md

Extract Tables → DataFrame

CODEBLOCK5

Merge PDFs

CODEBLOCK6

Split PDF

CODEBLOCK7

Rotate Pages

CODEBLOCK8

Password Protect

CODEBLOCK9

CLI Quick Reference (qpdf)

CODEBLOCK10

Available Scripts

Use these scripts directly — no need to rewrite from scratch:

Script	Purpose
INLINECODE41	Extract all text, page by page, to .txt
INLINECODE42

Run any script with --help to see its options.

Reference Files

Load these when you need deeper guidance:

- references/create.md: Building PDFs from scratch with reportlab (Platypus, Canvas, styles, tables, headers/footers)
- references/extract.md: Advanced text/table/image extraction, coordinate-based cropping, word-level data
- references/security.md: Watermarks, encryption, permissions, digital signatures
- references/ocr.md: OCR pipeline, language packs, image preprocessing, quality tuning
- FORMS.md: Complete guide to PDF form filling (AcroForm + XFA, pdf-lib JS)

Quick Reference Table

Task	Best Tool	Key Method
Extract text	pdfplumber	INLINECODE60
Extract tables

Common Pitfalls

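- Never use Unicode subscripts/superscripts (₂, ⁰) in reportlab: use <sub> / <super> XML tags instead, or they render as black boxes
- Use pdfplumber, not pypdf, for text extraction: pypdf's extract_text() loses layout; pdfplumber is layout-aware
- Encrypted PDFs: pass password= to PdfReader() and INLINECODE81
- pip in sandbox: always add the --break-system-packages flag
- qpdf for speed: for large batch jobs, prefer the qpdf CLI over Python loops
- macOS OCR language packs: brew install tesseract ships English only; non-English OCR also requires `brew install tesseract-lang`
- macOS system dependencies: OCR and CLI operations require `brew install qpdf poppler tesseract` first
- Testing form filling: if you have no fillable PDF, run python scripts/create_test_form.py first to generate a test form
- OCR vs pdfplumber: OCR is only for scanned (image-only) PDFs. To extract content from native-text PDFs, use pdfplumber, which is faster and more accurate
- Chinese form filling: pypdf's built-in fonts do not support CJK characters, so Chinese values may render as boxes. For Chinese form filling, use the pdf-lib (JS) approach (see FORMS.md)
- Rotating pages: there is no standalone rotate script; use `scripts/batch_convert.py rotate`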
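The first pitfall (Unicode subscripts and superscripts in reportlab) can be handled mechanically before text reaches a reportlab `Paragraph`. The following is a minimal sketch using only the standard library; the helper name `to_reportlab_markup` is hypothetical, and it covers digits and plus/minus signs only.

```python
import re

# Unicode subscript/superscript digits and signs mapped to plain ASCII.
_SUB_CHARS = "₀₁₂₃₄₅₆₇₈₉₊₋"
_SUP_CHARS = "⁰¹²³⁴⁵⁶⁷⁸⁹⁺⁻"
_ASCII = "0123456789+-"
_SUB_TABLE = str.maketrans(_SUB_CHARS, _ASCII)
_SUP_TABLE = str.maketrans(_SUP_CHARS, _ASCII)


def to_reportlab_markup(text: str) -> str:
    """Wrap runs of Unicode sub/superscripts in <sub>/<super> tags."""
    text = re.sub(
        f"[{_SUB_CHARS}]+",
        lambda m: f"<sub>{m.group().translate(_SUB_TABLE)}</sub>",
        text,
    )
    text = re.sub(
        f"[{_SUP_CHARS}]+",
        lambda m: f"<super>{m.group().translate(_SUP_TABLE)}</super>",
        text,
    )
    return text
```

For example, `to_reportlab_markup("H₂O")` returns `H<sub>2</sub>O`, which can be passed to `Paragraph(...)`. Text that contains `&` or `<` would still need XML escaping before this step.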

⛔ Limitations (Not Suitable For)

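Scenario	Reason	Alternative
Complex-layout PDFs (magazines, posters)	Extraction loses the formatting and layout	Use professional layout tools
Table extraction from scanned documents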

PDF Skill

Complete guide for PDF operations using Python libraries and CLI tools.

⚡ Feature Cheat Sheet

One-line lookup for every supported operation: find the right tool quickly.

| What you want to do | Command / Script | One-liner example |
| --- | --- | --- |
| 📖 Extract text | `scripts/extract_text.py` | `python scripts/extract_text.py doc.pdf` |
| 📊 Extract tables → Excel | `scripts/extract_tables.py` | `python scripts/extract_tables.py report.pdf -o tables.xlsx` |
| 🔗 Merge PDFs | `scripts/merge_pdfs.py` | `python scripts/merge_pdfs.py "*.pdf" -o merged.pdf` |
| ✂️ Split PDF | `scripts/split_pdf.py` | `python scripts/split_pdf.py big.pdf --each` |
| 🔄 Rotate pages | `scripts/batch_convert.py rotate` | `python scripts/batch_convert.py rotate input.pdf -d 90` |
| 🔀 Reorder pages | `scripts/reorder_pdf.py` | `python scripts/reorder_pdf.py input.pdf --order "3,1,2,4-" -o reordered.pdf` |
| 💧 Add text watermark | `scripts/watermark.py` | `python scripts/watermark.py doc.pdf -t "CONFIDENTIAL"` |
| 🖼️ Add image watermark | `scripts/watermark.py` | `python scripts/watermark.py doc.pdf --image logo.png --alpha 0.3` |
| 🔒 Encrypt PDF | pypdf (inline) | see Password Protect below |
| 📝 Fill PDF form | `scripts/fill_pdf_form.py` | `python scripts/fill_pdf_form.py form.pdf -o filled.pdf --set name="Alice"` |
| 🔍 Check form fields | `scripts/check_fillable_fields.py` | `python scripts/check_fillable_fields.py form.pdf` |
| 🖼️ OCR scanned PDF | `scripts/ocr_pdf.py` | `python scripts/ocr_pdf.py scan.pdf --lang eng` |
| 📄 Create PDF from scratch | reportlab (inline) | see references/create.md |
| 📦 Batch operations | `scripts/batch_convert.py` | `python scripts/batch_convert.py merge --help` |
| 📏 Compress / optimize | `scripts/compress_pdf.py` | `python scripts/compress_pdf.py input.pdf -o output.pdf --quality medium` |
| ℹ️ View PDF info | `scripts/pdf_info.py` | `python scripts/pdf_info.py input.pdf` |
| 🖼️→📄 Images to PDF | `scripts/images_to_pdf.py` | `python scripts/images_to_pdf.py "photos/*.jpg" -o album.pdf --page-size A4` |
| 📄→🖼️ PDF to images | `scripts/pdf_to_images.py` | `python scripts/pdf_to_images.py input.pdf -o pages/ --format png --dpi 200` |
| 🔎 Compare two PDFs | `scripts/compare_pdf.py` | `python scripts/compare_pdf.py old.pdf new.pdf -o diff_report.html` |
| 🔧 Repair corrupted PDF | `scripts/repair_pdf.py` | `python scripts/repair_pdf.py broken.pdf -o fixed.pdf` |
| 🔤 List fonts | `scripts/list_fonts.py` | `python scripts/list_fonts.py input.pdf` |

💡 Run any script with --help to see all available options.

Quick Decision Guide

What do you need?
├── Create a new PDF from scratch → reportlab (see references/create.md)
├── Extract text/tables → pdfplumber (see references/extract.md)
├── Merge/split/rotate pages → pypdf or qpdf CLI
├── Reorder pages → scripts/reorder_pdf.py
├── Add watermark/encrypt/protect → pypdf
├── Fill PDF forms → pdf-lib (JS) or pypdf (see FORMS.md)
├── Extract images from a PDF → pdfimages CLI or pypdf
├── OCR a scanned PDF → pdf2image + pytesseract
├── Compress/reduce file size → scripts/compress_pdf.py (qpdf + pypdf)
├── View PDF info/metadata → scripts/pdf_info.py
├── Images to PDF → scripts/images_to_pdf.py (reportlab)
├── PDF to images → scripts/pdf_to_images.py (pdf2image)
├── Compare/diff two PDFs → scripts/compare_pdf.py
├── Repair a corrupted PDF → scripts/repair_pdf.py (qpdf + pypdf)
└── List fonts in a PDF → scripts/list_fonts.py
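The guide separates native-text extraction (pdfplumber) from OCR of scanned files, but it does not say how to tell the two apart. One rough heuristic, sketched below, is to try text extraction first and fall back to OCR only when most pages come back nearly empty. The function name `needs_ocr` and the 20-character threshold are assumptions chosen for illustration, not part of this skill's scripts.

```python
from typing import Iterable, Optional


def needs_ocr(page_texts: Iterable[Optional[str]], min_chars_per_page: int = 20) -> bool:
    """Return True when more than half of the pages have almost no extractable text.

    page_texts holds one entry per page, for example the result of
    page.extract_text() from pdfplumber (None is treated like an empty string).
    """
    texts = [(text or "").strip() for text in page_texts]
    if not texts:
        return False  # nothing to judge, so do not trigger OCR
    sparse_pages = sum(1 for text in texts if len(text) < min_chars_per_page)
    return sparse_pages > len(texts) / 2
```

A typical call would be `needs_ocr(page.extract_text() for page in pdf.pages)` inside a `pdfplumber.open(...)` block. Mixed documents (some scanned pages, some native) may need a per-page decision instead.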

Installation

Linux (Ubuntu/Debian)

```bash
# Python libraries
pip install pypdf pdfplumber reportlab pdf2image pytesseract Pillow --break-system-packages

# System tools
sudo apt-get install -y poppler-utils tesseract-ocr qpdf

# Chinese OCR
sudo apt-get install -y tesseract-ocr-chi-sim tesseract-ocr-chi-tra

# Node.js (form filling)
npm install pdf-lib
```

macOS (Homebrew)

```bash
# System tools (required for OCR and CLI operations)
brew install qpdf poppler tesseract

# Important: language packs for non-English OCR must be installed separately
brew install tesseract-lang

# Python libraries
pip install pypdf pdfplumber reportlab pdf2image pytesseract Pillow --break-system-packages

# Node.js (form filling)
npm install pdf-lib
```

⚠️ macOS note: tesseract-lang must be installed separately; without it, OCR for non-English languages such as Chinese or Japanese fails. After installing, run tesseract --list-langs to confirm which languages are available.

Verify Installation

```bash
# Check Python libraries
python3 -c "import pypdf, pdfplumber, reportlab, PIL; print('✓ Python libs OK')"

# Check system tools
which qpdf && echo "✓ qpdf OK" || echo "✗ qpdf not installed"
which tesseract && echo "✓ tesseract OK" || echo "✗ tesseract not installed"
which pdftotext && echo "✓ poppler OK" || echo "✗ poppler not installed"

# Check OCR languages
tesseract --list-langs 2>/dev/null | head -5
```
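The same checks can be run from Python, which may be convenient when a script wants to report missing dependencies before doing any work. This is a minimal sketch using only the standard library; the function name `check_environment` is hypothetical, and the module and tool lists simply mirror the shell checks above.

```python
import importlib.util
import shutil

PYTHON_MODULES = ["pypdf", "pdfplumber", "reportlab", "PIL"]
CLI_TOOLS = ["qpdf", "tesseract", "pdftotext"]


def check_environment(modules=PYTHON_MODULES, tools=CLI_TOOLS) -> dict[str, bool]:
    """Map each module or tool name to True if it can be found, else False."""
    status: dict[str, bool] = {}
    for name in modules:
        # find_spec returns None for a top-level module that is not installed.
        status[name] = importlib.util.find_spec(name) is not None
    for name in tools:
        # shutil.which returns None when the executable is not on PATH.
        status[name] = shutil.which(name) is not None
    return status


if __name__ == "__main__":
    for name, found in check_environment().items():
        print("✓" if found else "✗", name)
```

This only confirms that the modules and executables can be located; it does not verify versions or installed OCR language packs.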

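Core Operations

Read & Extract Text

```python
import pdfplumber

# Print the text of every page in order.
with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        print(page.extract_text())
```

→ For advanced extraction options, see references/extract.md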

Extract Tables → DataFrame

```python
import pdfplumber
import pandas as pd

with pdfplumber.open("report.pdf") as pdf:
    for page in pdf.pages:
        for table in page.extract_tables():
            # Treat the first row as the header and the rest as data rows.
            df = pd.DataFrame(table[1:], columns=table[0])
            print(df)
```
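Tables extracted this way are lists of rows, and empty cells typically come back as `None`, sometimes with line breaks inside a cell. A small cleanup step before building the DataFrame can make the result easier to work with. The sketch below uses only the standard library; the helper name `clean_table` is hypothetical, and padding short rows is a precaution rather than something the extraction is known to require.

```python
def clean_table(table):
    """Split a raw table (list of rows, cells may be None) into (header, rows)."""
    if not table:
        return [], []
    width = max(len(row) for row in table)

    def normalize(row):
        # Replace None with "", trim whitespace, and flatten in-cell line breaks.
        cells = [(cell or "").strip().replace("\n", " ") for cell in row]
        # Pad short rows so every row has the same number of columns.
        return cells + [""] * (width - len(cells))

    header, *body = [normalize(row) for row in table]
    # Drop rows that are completely empty after cleaning.
    body = [row for row in body if any(row)]
    return header, body
```

The result plugs into the snippet above as `header, rows = clean_table(table)` followed by `pd.DataFrame(rows, columns=header)`.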

Merge PDFs

```python
from pypdf import PdfWriter, PdfReader

writer = PdfWriter()
# Append every page of each input file, in the order listed.
for path in ["a.pdf", "b.pdf", "c.pdf"]:
    writer.append(PdfReader(path))
with open("merged.pdf", "wb") as f:
    writer.write(f)
```

Split PDF

```python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
# Write each page to its own file: page_1.pdf, page_2.pdf, ...
for i, page in enumerate(reader.pages):
    w = PdfWriter()
    w.add_page(page)
    with open(f"page_{i+1}.pdf", "wb") as f:
        w.write(f)
```

Rotate Pages

```python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("scan.pdf")
writer = PdfWriter()
for page in reader.pages:
    page.rotate(90)  # clockwise, in multiples of 90: 90 / 180 / 270
    writer.add_page(page)
with open("rotated.pdf", "wb") as f:
    writer.write(f)
```

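Password Protect

```python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("doc.pdf")
writer = PdfWriter()
for page in reader.pages:
    writer.add_page(page)
# Select AES-256 explicitly; use_128bit=False would fall back to 40-bit RC4.
# AES encryption in pypdf needs the cryptography (or PyCryptodome) package.
writer.encrypt("userpass", "ownerpass", algorithm="AES-256")
with open("encrypted.pdf", "wb") as f:
    writer.write(f)
```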

CLI Quick Reference (qpdf)

```bash
# Merge
qpdf --empty --pages a.pdf b.pdf -- merged.pdf

# Extract pages 1-5
qpdf input.pdf --pages . 1-5 -- out.pdf
```
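For batch jobs driven from Python, the two qpdf commands above can be assembled as argument lists and passed to `subprocess`. The sketch below is one way to do that; the helper names (`qpdf_merge_cmd`, `qpdf_extract_cmd`, `run_qpdf`) are hypothetical, and treating exit status 3 as "succeeded with warnings" follows qpdf's documented exit codes as I understand them, so verify against your installed version.

```python
import subprocess


def qpdf_merge_cmd(inputs, output):
    """Build: qpdf --empty --pages a.pdf b.pdf ... -- merged.pdf"""
    return ["qpdf", "--empty", "--pages", *inputs, "--", output]


def qpdf_extract_cmd(source, page_range, output):
    """Build: qpdf input.pdf --pages . 1-5 -- out.pdf"""
    return ["qpdf", source, "--pages", ".", page_range, "--", output]


def run_qpdf(cmd):
    """Run a qpdf command; accept exit status 0 (ok) and 3 (warnings only)."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode not in (0, 3):
        raise RuntimeError(f"qpdf failed ({result.returncode}): {result.stderr.strip()}")
    return result.returncode
```

Passing arguments as a list avoids shell quoting problems with file names that contain spaces, for example `run_qpdf(qpdf_merge_cmd(["part one.pdf", "b.pdf"], "merged.pdf"))`.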
