PDF Processing Guide

Overview

This guide covers essential PDF processing operations using Python libraries and command-line tools. For advanced features, JavaScript libraries, and detailed examples, see REFERENCE.md. If you need to fill out a PDF form, read FORMS.md and follow its instructions.

Quick Start

CODEBLOCK0

Python Libraries

pypdf - Basic Operations

Merge PDFs

CODEBLOCK1

Split PDF

CODEBLOCK2

Extract Metadata

CODEBLOCK3

Rotate Pages

CODEBLOCK4

pdfplumber - Text and Table Extraction

Extract Text with Layout

CODEBLOCK5

Extract Tables

CODEBLOCK6

Advanced Table Extraction

CODEBLOCK7

reportlab - Create PDFs

Basic PDF Creation

CODEBLOCK8

Create PDF with Multiple Pages

CODEBLOCK9

Subscripts and Superscripts

IMPORTANT: Never use Unicode subscript/superscript characters (₀₁₂₃₄₅₆₇₈₉, ⁰¹²³⁴⁵⁶⁷⁸⁹) in ReportLab PDFs. The built-in fonts do not include these glyphs, causing them to render as solid black boxes.

Instead, use ReportLab's XML markup tags in Paragraph objects:
CODEBLOCK10

For canvas-drawn text (not Paragraph objects), manually adjust the font size and position rather than using Unicode subscripts/superscripts.

Command-Line Tools

pdftotext (poppler-utils)

CODEBLOCK11

qpdf

CODEBLOCK12

pdftk (if available)

CODEBLOCK13

Common Tasks

Extract Text from Scanned PDFs

CODEBLOCK14

Add Watermark

CODEBLOCK15

Extract Images

CODEBLOCK16

Password Protection

CODEBLOCK17

Quick Reference

Task	Best Tool	Command/Code
Merge PDFs	pypdf	writer.add_page(page)
Split PDFs

Next Steps

- For advanced pypdfium2 usage, see REFERENCE.md
- For JavaScript libraries (pdf-lib), see REFERENCE.md
- If you need to fill out a PDF form, follow the instructions in FORMS.md
- For troubleshooting guides, see REFERENCE.md

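PDF Processing Guide

Overview

This guide covers essential PDF processing operations using Python libraries and command-line tools. For advanced features, JavaScript libraries, and detailed examples, see REFERENCE.md. If you need to fill out a PDF form, read FORMS.md and follow its instructions.

Quick Start

```python
from pypdf import PdfReader, PdfWriter

# Read a PDF
reader = PdfReader("document.pdf")
print(f"Pages: {len(reader.pages)}")

# Extract text
text = ""
for page in reader.pages:
    text += page.extract_text()
```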

Python Libraries

pypdf - Basic Operations

Merge PDFs

```python
from pypdf import PdfWriter, PdfReader

# Append every page of each input file to a single writer
writer = PdfWriter()
for pdf_file in ["doc1.pdf", "doc2.pdf", "doc3.pdf"]:
    reader = PdfReader(pdf_file)
    for page in reader.pages:
        writer.add_page(page)

with open("merged.pdf", "wb") as output:
    writer.write(output)
```

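Split PDF

```python
from pypdf import PdfReader, PdfWriter

# Write each page to its own file: page_1.pdf, page_2.pdf, ...
reader = PdfReader("input.pdf")
for i, page in enumerate(reader.pages):
    writer = PdfWriter()
    writer.add_page(page)
    with open(f"page_{i+1}.pdf", "wb") as output:
        writer.write(output)
```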

Extract Metadata

```python
from pypdf import PdfReader

reader = PdfReader("document.pdf")
meta = reader.metadata
print(f"Title: {meta.title}")
print(f"Author: {meta.author}")
print(f"Subject: {meta.subject}")
print(f"Creator: {meta.creator}")
```

Rotate Pages

```python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
writer = PdfWriter()

page = reader.pages[0]
page.rotate(90)  # Rotate 90 degrees clockwise
writer.add_page(page)

with open("rotated.pdf", "wb") as output:
    writer.write(output)
```

pdfplumber - Text and Table Extraction

Extract Text with Layout

```python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        print(text)
```

Extract Tables

```python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for i, page in enumerate(pdf.pages):
        tables = page.extract_tables()
        for j, table in enumerate(tables):
            print(f"Table {j+1} on page {i+1}:")
            for row in table:
                print(row)
```

Advanced Table Extraction

```python
import pandas as pd
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    all_tables = []
    for page in pdf.pages:
        tables = page.extract_tables()
        for table in tables:
            if table:  # Skip empty tables
                # Treat the first row as the header
                df = pd.DataFrame(table[1:], columns=table[0])
                all_tables.append(df)

# Combine all tables (writing .xlsx needs an Excel engine such as openpyxl)
if all_tables:
    combined_df = pd.concat(all_tables, ignore_index=True)
    combined_df.to_excel("extracted_tables.xlsx", index=False)
```

reportlab - Create PDFs

Basic PDF Creation

```python
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas

c = canvas.Canvas("hello.pdf", pagesize=letter)
width, height = letter

# Add text
c.drawString(100, height - 100, "Hello World!")
c.drawString(100, height - 120, "This is a PDF created with reportlab")

# Add a line
c.line(100, height - 140, 400, height - 140)

# Save
c.save()
```

Create PDF with Multiple Pages

```python
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, PageBreak
from reportlab.lib.styles import getSampleStyleSheet

doc = SimpleDocTemplate("report.pdf", pagesize=letter)
styles = getSampleStyleSheet()
story = []

# Add content
title = Paragraph("Report Title", styles["Title"])
story.append(title)
story.append(Spacer(1, 12))

body = Paragraph("This is the body of the report. " * 20, styles["Normal"])
story.append(body)
story.append(PageBreak())

# Page 2
story.append(Paragraph("Page 2", styles["Heading1"]))
story.append(Paragraph("Content for page 2", styles["Normal"]))

# Build the PDF
doc.build(story)
```

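Subscripts and Superscripts

IMPORTANT: Never use Unicode subscript/superscript characters (₀₁₂₃₄₅₆₇₈₉, ⁰¹²³⁴⁵⁶⁷⁸⁹) in ReportLab PDFs. The built-in fonts do not include these glyphs, so they render as solid black boxes.

Instead, use ReportLab's XML markup tags in Paragraph objects:

```python
from reportlab.platypus import Paragraph
from reportlab.lib.styles import getSampleStyleSheet

styles = getSampleStyleSheet()

# Subscripts: use the <sub> tag
chemical = Paragraph("H<sub>2</sub>O", styles["Normal"])

# Superscripts: use the <super> tag
squared = Paragraph("x<super>2</super> + y<super>2</super>", styles["Normal"])
```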
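
When the source text already contains Unicode subscript or superscript digits, it may be convenient to convert them to markup before building a Paragraph. The following is a minimal sketch, not part of ReportLab: the helper name `to_reportlab_markup` is hypothetical, it handles digits only, and it assumes the input contains no literal `<` or `&` that would need escaping first.

```python
import re

# Unicode digits that the built-in fonts cannot render
SUBSCRIPTS = "₀₁₂₃₄₅₆₇₈₉"
SUPERSCRIPTS = "⁰¹²³⁴⁵⁶⁷⁸⁹"

# Translation tables back to plain ASCII digits
_SUB_TABLE = str.maketrans(SUBSCRIPTS, "0123456789")
_SUP_TABLE = str.maketrans(SUPERSCRIPTS, "0123456789")

_SUB_RUN = re.compile(f"[{SUBSCRIPTS}]+")
_SUP_RUN = re.compile(f"[{SUPERSCRIPTS}]+")


def to_reportlab_markup(text: str) -> str:
    """Wrap each run of Unicode sub/superscript digits in <sub>/<super> tags."""
    text = _SUB_RUN.sub(
        lambda m: f"<sub>{m.group().translate(_SUB_TABLE)}</sub>", text
    )
    text = _SUP_RUN.sub(
        lambda m: f"<super>{m.group().translate(_SUP_TABLE)}</super>", text
    )
    return text
```

The returned string can then be passed to `Paragraph(...)` in place of the original text.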

For canvas-drawn text (not Paragraph objects), manually adjust the font size and position rather than using Unicode subscripts/superscripts.
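
One way to do this on a canvas is sketched below. The helper name `draw_with_subscript` and the 0.65 size ratio and 0.25 baseline drop are illustrative assumptions, not ReportLab defaults; only `setFont`, `drawString`, and `stringWidth` are canvas methods.

```python
def draw_with_subscript(c, x, y, base, sub, font="Helvetica", size=12):
    """Draw `base`, then `sub` smaller and lowered; return the x after the text.

    `c` is a reportlab.pdfgen.canvas.Canvas (or any object with the same
    setFont / drawString / stringWidth methods).
    """
    sub_size = size * 0.65  # assumed ratio: subscript is smaller than the base
    drop = size * 0.25      # assumed ratio: shift below the baseline

    # Base text at the normal size and baseline
    c.setFont(font, size)
    c.drawString(x, y, base)
    x += c.stringWidth(base, font, size)

    # Subscript: smaller font, lowered baseline
    c.setFont(font, sub_size)
    c.drawString(x, y - drop, sub)
    x += c.stringWidth(sub, font, sub_size)

    # Restore the base font so later drawing is unaffected
    c.setFont(font, size)
    return x
```

For example, calling `draw_with_subscript(c, 100, 700, "H", "2")` and then drawing "O" at the returned x position writes H2O with a lowered 2. A superscript works the same way with the offset added to `y` instead of subtracted.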

Command-Line Tools

pdftotext (poppler-utils)

```bash
# Extract text
pdftotext input.pdf output.txt

# Extract text, preserving layout
pdftotext -layout input.pdf output.txt

# Extract specific pages
pdftotext -f 1 -l 5 input.pdf output.txt  # Pages 1-5
```
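
If these commands need to be driven from a Python script, one option is a thin wrapper around `subprocess`. This is a sketch under the assumption that `pdftotext` is on the PATH; the function names `pdftotext_command` and `run_pdftotext` are hypothetical, and only the `-layout`, `-f`, and `-l` flags shown above are used.

```python
import subprocess


def pdftotext_command(pdf, txt, first=None, last=None, layout=False):
    """Build the pdftotext argument list for the options shown above."""
    cmd = ["pdftotext"]
    if layout:
        cmd.append("-layout")       # preserve the page layout
    if first is not None:
        cmd += ["-f", str(first)]   # first page to convert
    if last is not None:
        cmd += ["-l", str(last)]    # last page to convert
    return cmd + [str(pdf), str(txt)]


def run_pdftotext(pdf, txt, **options):
    """Run pdftotext; raises CalledProcessError if the tool exits non-zero."""
    subprocess.run(pdftotext_command(pdf, txt, **options), check=True)
```

Building the argument list separately from running it keeps the command easy to inspect or log before execution.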

qpdf

```bash
# Merge PDFs
qpdf --empty --pages file1.pdf file2.pdf -- merged.pdf

# Split pages
qpdf input.pdf --pages . 1-5 -- pages1-5.pdf
qpdf input.pdf --pages . 6-10 -- pages6-10.pdf

# Rotate pages
qpdf input.pdf output.pdf --rotate=+90:1  # Rotate page 1 by 90 degrees

# Remove password
qpdf --password=mypassword --decrypt encrypted.pdf decrypted.pdf
```
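
For a document with many pages, the split commands above can be generated rather than typed by hand. The sketch below only builds the argument lists; the helper names `page_chunks` and `qpdf_split_commands` are hypothetical, and the output file naming simply mirrors the `pages1-5.pdf` pattern used above. The page count is assumed to be known already, for example from `len(PdfReader("input.pdf").pages)`.

```python
def page_chunks(total_pages, chunk_size):
    """Yield (first, last) page ranges, 1-based and inclusive."""
    for first in range(1, total_pages + 1, chunk_size):
        yield first, min(first + chunk_size - 1, total_pages)


def qpdf_split_commands(pdf, total_pages, chunk_size):
    """Return one qpdf argument list per chunk of pages."""
    return [
        ["qpdf", pdf, "--pages", ".", f"{first}-{last}", "--",
         f"pages{first}-{last}.pdf"]
        for first, last in page_chunks(total_pages, chunk_size)
    ]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` to perform the split.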

pdftk (if available)

```bash
# Merge
pdftk file1.pdf file2.pdf cat output merged.pdf

# Split
pdftk input.pdf burst

# Rotate
pdftk input.pdf rotate 1east output rotated.pdf
```

Common Tasks

Extract Text from Scanned PDFs

```python
# Requires: pip install pytesseract pdf2image
import pytesseract
from pdf2image import convert_from_path

# Convert the PDF to images
images = convert_from_path("scanned.pdf")

# OCR each page
text = ""
for i, image in enumerate(images):
    text += f"Page {i+1}:\n"
    text += pytesseract.image_to_string(image)
    text += "\n\n"

print(text)
```

Add Watermark

```python
from pypdf import PdfReader, PdfWriter

# Create a watermark (or load an existing one)
watermark = PdfReader("watermark.pdf").pages[0]

# Apply it to all pages
reader = PdfReader("document.pdf")
writer = PdfWriter()

for page in reader.pages:
    page.merge_page(watermark)
    writer.add_page(page)

with open("watermarked.pdf", "wb") as output:
    writer.write(output)
```

Extract Images

```bash
# Using pdfimages (poppler-utils)
pdfimages -j input.pdf output_prefix

# This extracts all images as output_prefix-000.jpg, output_prefix-001.jpg, etc.
```
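
After extraction, a script usually needs the list of files that were written. The sketch below collects them in numeric order; the helper name `extracted_images` is hypothetical. It matches any extension rather than only `.jpg`, because with `-j` pdfimages saves only JPEG-encoded images as `.jpg` and writes the others in PBM/PGM/PPM format.

```python
import re
from pathlib import Path


def extracted_images(prefix, directory="."):
    """Return files named '<prefix>-<number>.<ext>' sorted by their number."""
    pattern = re.compile(re.escape(prefix) + r"-(\d+)\.\w+$")
    found = []
    for path in Path(directory).iterdir():
        match = pattern.match(path.name)
        if match:
            # Sort on the numeric index, not the string, so 1000 follows 999
            found.append((int(match.group(1)), path))
    return [path for _, path in sorted(found)]
```

For the command above, `extracted_images("output_prefix")` returns the extracted files in the current directory.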

Password Protection

```python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
writer = PdfWriter()

for page in reader.pages:
    writer.add_page(page)

# Add a password (user password first, then owner password)
writer.encrypt("userpassword", "ownerpassword")

with open("encrypted.pdf", "wb") as output:
    writer.write(output)
```

Quick Reference

Task	Best Tool	Command/Code
Merge PDFs	pypdf	writer.add_page(page)
Split PDFs
