pdf: PDF Processing

Use this skill whenever the user wants to do anything with PDF files. This includes reading or extracting text/tables from PDFs, combining or merging multiple PDFs into one, splitting PDFs apart, rotating pages, adding watermarks, creating new PDFs, filling PDF forms, encrypting/decrypting PDFs, extracting images, and OCR on scanned PDFs to make them searchable. If the user mentions a .pdf file or asks to produce one, use this skill.

Author: admin | Source: ClawHub | Version: 1.0.0 | Security check: passed

pdf

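PDF Processing Guide

Overview

This guide covers essential PDF processing operations using Python libraries and command-line tools. For advanced features, JavaScript libraries, and detailed examples, see REFERENCE.md. If you need to fill out a PDF form, read FORMS.md and follow its instructions.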

Quick Start

python
from pypdf import PdfReader, PdfWriter

# Read a PDF
reader = PdfReader("document.pdf")
print(f"Pages: {len(reader.pages)}")

# Extract text
text = ""
for page in reader.pages:
    text += page.extract_text()

Python Libraries

pypdf - Basic Operations

Merge PDFs

python
from pypdf import PdfWriter, PdfReader

writer = PdfWriter()
for pdf_file in ["doc1.pdf", "doc2.pdf", "doc3.pdf"]:
    reader = PdfReader(pdf_file)
    for page in reader.pages:
        writer.add_page(page)

with open("merged.pdf", "wb") as output:
    writer.write(output)

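Split PDF

python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
for i, page in enumerate(reader.pages):
    # Write each page to its own file: page_1.pdf, page_2.pdf, ...
    writer = PdfWriter()
    writer.add_page(page)
    with open(f"page_{i+1}.pdf", "wb") as output:
        writer.write(output)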
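
Splitting is often done by page range rather than one file per page. The sketch below is a hypothetical helper (`parse_page_ranges` is not part of pypdf) that turns a 1-based specification such as "1-3,5" into the zero-based indices that could be used with `reader.pages`. It uses only the standard library and assumes the page count is already known.

```python
def parse_page_ranges(spec: str, page_count: int) -> list[int]:
    """Turn a 1-based spec such as "1-3,5" into zero-based page indices."""
    indices = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        start_text, sep, end_text = part.partition("-")
        start = int(start_text)
        end = int(end_text) if sep else start
        if start < 1 or end > page_count or start > end:
            raise ValueError(f"invalid page range: {part!r}")
        # Pages are 1-based for people, 0-based for reader.pages
        indices.extend(range(start - 1, end))
    return indices


print(parse_page_ranges("1-3,5", page_count=10))
```

Each returned index could then be passed through `writer.add_page(reader.pages[index])` to build one output file per requested range.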

Extract Metadata

python
from pypdf import PdfReader

reader = PdfReader("document.pdf")
meta = reader.metadata
print(f"Title: {meta.title}")
print(f"Author: {meta.author}")
print(f"Subject: {meta.subject}")
print(f"Creator: {meta.creator}")

Rotate Pages

python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
writer = PdfWriter()

page = reader.pages[0]
page.rotate(90)  # Rotate 90 degrees clockwise
writer.add_page(page)

with open("rotated.pdf", "wb") as output:
    writer.write(output)
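
pypdf documents the rotation angle as an increment of 90 degrees. When the angle comes from user input, it may help to validate it first. The helper below is a hypothetical sketch (`normalize_rotation` is not a pypdf function) that rejects other angles and folds the rest into the 0 to 270 range.

```python
def normalize_rotation(angle: int) -> int:
    """Validate a clockwise rotation and reduce it to 0, 90, 180, or 270."""
    if angle % 90 != 0:
        raise ValueError(f"rotation must be a multiple of 90, got {angle}")
    # Python's % always returns a non-negative result for a positive modulus,
    # so -90 (counter-clockwise) becomes 270 (clockwise).
    return angle % 360


print(normalize_rotation(450))
print(normalize_rotation(-90))
```

The normalized value could then be passed to `page.rotate(...)`.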

pdfplumber - Text and Table Extraction

Extract Text with Layout

python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        print(text)

Extract Tables

python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for i, page in enumerate(pdf.pages):
        tables = page.extract_tables()
        for j, table in enumerate(tables):
            print(f"Table {j+1} on page {i+1}:")
            for row in table:
                print(row)

Advanced Table Extraction

python
import pandas as pd
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    all_tables = []
    for page in pdf.pages:
        tables = page.extract_tables()
        for table in tables:
            if table:  # Skip empty tables
                # First row is treated as the header
                df = pd.DataFrame(table[1:], columns=table[0])
                all_tables.append(df)

# Combine all tables
if all_tables:
    combined_df = pd.concat(all_tables, ignore_index=True)
    combined_df.to_excel("extracted_tables.xlsx", index=False)
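
Extracted tables arrive as lists of rows whose cells are strings or `None`, so a cleaning step before building a DataFrame is often useful. The following is a minimal standard-library sketch; `table_to_records` and the `column_N` fallback names are assumptions made for illustration, not part of pdfplumber or pandas.

```python
def table_to_records(table):
    """Convert one extracted table (first row = header) into a list of dicts."""
    if not table or len(table) < 2:
        return []
    # Replace missing or blank header cells with a positional placeholder
    header = [
        (cell or "").strip() or f"column_{i + 1}"
        for i, cell in enumerate(table[0])
    ]
    records = []
    for row in table[1:]:
        cells = [(cell or "").strip() for cell in row]
        if not any(cells):
            continue  # skip rows that are entirely empty
        records.append(dict(zip(header, cells)))
    return records


sample = [
    ["Name", None, "Qty"],
    ["Widget", "blue", " 3 "],
    [None, None, None],
    ["Gadget", None, "7"],
]
for record in table_to_records(sample):
    print(record)
```

The resulting list of dicts can be handed to `pd.DataFrame(records)` in place of the raw rows.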

reportlab - Create PDFs

Basic PDF Creation

python
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas

c = canvas.Canvas("hello.pdf", pagesize=letter)
width, height = letter

# Add text
c.drawString(100, height - 100, "Hello World!")
c.drawString(100, height - 120, "This is a PDF created with reportlab")

# Add a line
c.line(100, height - 140, 400, height - 140)

# Save
c.save()

Create a Multi-Page PDF

python
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, PageBreak
from reportlab.lib.styles import getSampleStyleSheet

doc = SimpleDocTemplate("report.pdf", pagesize=letter)
styles = getSampleStyleSheet()
story = []

# Add content
title = Paragraph("Report Title", styles['Title'])
story.append(title)
story.append(Spacer(1, 12))

body = Paragraph("This is the body of the report. " * 20, styles['Normal'])
story.append(body)
story.append(PageBreak())

# Page 2
story.append(Paragraph("Page 2", styles['Heading1']))
story.append(Paragraph("Content for page 2", styles['Normal']))

# Build the PDF
doc.build(story)

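Subscripts and Superscripts

Important: never use Unicode subscript/superscript characters (₀₁₂₃₄₅₆₇₈₉, ⁰¹²³⁴⁵⁶⁷⁸⁹) in ReportLab PDFs. The built-in fonts do not include these glyphs, so they render as solid black boxes.

Use ReportLab's XML markup tags in Paragraph objects instead:

python
from reportlab.platypus import Paragraph
from reportlab.lib.styles import getSampleStyleSheet

styles = getSampleStyleSheet()

# Subscripts: use the <sub> tag
chemical = Paragraph("H<sub>2</sub>O", styles['Normal'])

# Superscripts: use the <super> tag
squared = Paragraph("x<super>2</super> + y<super>2</super>", styles['Normal'])

For canvas-drawn text (not Paragraph objects), adjust the font size and position manually rather than using Unicode subscripts/superscripts.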
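
When input text already contains Unicode subscript or superscript digits, it can be converted to markup before it reaches a Paragraph. The sketch below is a hypothetical standard-library helper (`unicode_to_markup` is not a ReportLab function) that handles digits only. It does not escape other XML-special characters such as `&` or `<`.

```python
import re

SUBSCRIPTS = "₀₁₂₃₄₅₆₇₈₉"
SUPERSCRIPTS = "⁰¹²³⁴⁵⁶⁷⁸⁹"
_SUB_TABLE = str.maketrans(SUBSCRIPTS, "0123456789")
_SUPER_TABLE = str.maketrans(SUPERSCRIPTS, "0123456789")


def unicode_to_markup(text: str) -> str:
    """Replace runs of Unicode sub/superscript digits with <sub>/<super> tags."""
    text = re.sub(
        f"[{SUBSCRIPTS}]+",
        lambda m: f"<sub>{m.group().translate(_SUB_TABLE)}</sub>",
        text,
    )
    text = re.sub(
        f"[{SUPERSCRIPTS}]+",
        lambda m: f"<super>{m.group().translate(_SUPER_TABLE)}</super>",
        text,
    )
    return text


print(unicode_to_markup("H₂O"))
print(unicode_to_markup("x² + y²"))
```

Consecutive digits are grouped into a single tag, so "C₁₀" becomes `C<sub>10</sub>` rather than two adjacent tags.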

Command-Line Tools

pdftotext (poppler-utils)

bash
# Extract text
pdftotext input.pdf output.txt

# Extract text, preserving layout
pdftotext -layout input.pdf output.txt

# Extract specific pages
pdftotext -f 1 -l 5 input.pdf output.txt  # Pages 1-5

qpdf

bash
# Merge PDFs
qpdf --empty --pages file1.pdf file2.pdf -- merged.pdf

# Split pages
qpdf input.pdf --pages . 1-5 -- pages1-5.pdf
qpdf input.pdf --pages . 6-10 -- pages6-10.pdf

# Rotate pages
qpdf input.pdf output.pdf --rotate=+90:1  # Rotate page 1 by 90 degrees

# Remove password
qpdf --password=mypassword --decrypt encrypted.pdf decrypted.pdf
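
To call the qpdf merge command from Python, one option is to build the argument list separately from running it, which keeps the command easy to inspect. This is a sketch under assumptions: `qpdf_merge_command` and `run_merge` are hypothetical helper names, and only the `--empty --pages ... --` form shown above is used.

```python
import shutil
import subprocess


def qpdf_merge_command(inputs, output):
    """Build the argv for: qpdf --empty --pages <inputs...> -- <output>."""
    if not inputs:
        raise ValueError("at least one input PDF is required")
    return ["qpdf", "--empty", "--pages", *inputs, "--", output]


def run_merge(inputs, output):
    """Run the merge, failing early if qpdf is not on PATH."""
    if shutil.which("qpdf") is None:
        raise RuntimeError("qpdf is not installed or not on PATH")
    # An argument list (no shell) avoids quoting issues with file names
    subprocess.run(qpdf_merge_command(inputs, output), check=True)


print(qpdf_merge_command(["file1.pdf", "file2.pdf"], "merged.pdf"))
```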

pdftk (if available)

bash
# Merge
pdftk file1.pdf file2.pdf cat output merged.pdf

# Split
pdftk input.pdf burst

# Rotate
pdftk input.pdf rotate 1east output rotated.pdf
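
Because pdftk may not be installed, a script can check which command-line tool is present before choosing one. The sketch below assumes a preference order of qpdf first, then pdftk; that order and the `pick_pdf_cli` name are illustrative choices, not something this guide prescribes.

```python
import shutil


def pick_pdf_cli(preferred=("qpdf", "pdftk"), which=shutil.which):
    """Return the first tool from `preferred` found on PATH, or None."""
    for tool in preferred:
        if which(tool):
            return tool
    return None


tool = pick_pdf_cli()
print(tool if tool else "no supported PDF command-line tool found")
```

The `which` parameter exists only so the lookup can be replaced in tests.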

Common Tasks

Extract Text from Scanned PDFs

python
# Requires: pip install pytesseract pdf2image
import pytesseract
from pdf2image import convert_from_path

# Convert the PDF to images
images = convert_from_path("scanned.pdf")

# OCR each page
text = ""
for i, image in enumerate(images):
    text += f"Page {i+1}:\n"
    text += pytesseract.image_to_string(image)
    text += "\n\n"

print(text)
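
OCR is slow, so it can be worth checking first whether a page already has a usable text layer. The helper below is a rough heuristic sketch, not a documented method: `needs_ocr` and the 20-character threshold are assumptions, and the input is whatever a text extractor such as `page.extract_text()` returned for the page.

```python
def needs_ocr(extracted_text, min_chars=20):
    """Guess whether a page is scanned: little or no extractable text."""
    text = extracted_text or ""
    # Ignore whitespace so a page of blank lines does not count as text
    visible = "".join(text.split())
    return len(visible) < min_chars


print(needs_ocr(None))
print(needs_ocr("This page has a real text layer."))
```

Pages for which the check returns True could be routed to the pytesseract path above, and the rest read directly.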

Add a Watermark

python
from pypdf import PdfReader, PdfWriter

# Create a watermark (or load an existing one)
watermark = PdfReader("watermark.pdf").pages[0]

# Apply to all pages
reader = PdfReader("document.pdf")
writer = PdfWriter()

for page in reader.pages:
    page.merge_page(watermark)
    writer.add_page(page)

with open("watermarked.pdf", "wb") as output:
    writer.write(output)

Extract Images

bash
# Using pdfimages (poppler-utils)
pdfimages -j input.pdf output_prefix

# This extracts all images, named output_prefix-000.jpg, output_prefix-001.jpg, etc.
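
After `pdfimages` runs, the extracted files can be collected by prefix. With `-j`, only images stored in DCT format are written as JPEG and the others keep PBM/PPM extensions, so the sketch below matches on the prefix rather than on `.jpg`. The `extracted_images` helper is hypothetical, and the demo uses stand-in files in a temporary directory instead of real pdfimages output.

```python
import tempfile
from pathlib import Path


def extracted_images(directory, prefix):
    """Return the files written for `prefix`, sorted by name (and so by index)."""
    return sorted(Path(directory).glob(f"{prefix}-*"))


# Demo with stand-in files; a real run would point at the pdfimages output folder.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("output_prefix-001.jpg", "output_prefix-000.jpg", "notes.txt"):
        (Path(tmp) / name).touch()
    names = [path.name for path in extracted_images(tmp, "output_prefix")]

print(names)
```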

Password Protection

python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
writer = PdfWriter()

for page in reader.pages:
    writer.add_page(page)

# Add passwords (user password first, then owner password)
writer.encrypt("userpassword", "ownerpassword")

with open("encrypted.pdf", "wb") as output:
    writer.write(output)

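Quick Reference

| Task | Best Tool | Command/Code |
|------|-----------|--------------|
| Merge PDFs | pypdf | writer.add_page(page) |
| Split PDFs | pypdf | One page per file |
| Extract text | pdfplumber | page.extract_text() |
| Extract tables | pdfplumber | page.extract_tables() |
| Create PDFs | reportlab | Canvas or Platypus |
| Command-line merge | qpdf | qpdf --empty --pages ... |
| OCR scanned PDFs | pytesseract | Convert to images first |
| Fill PDF forms | pdf-lib or pypdf (see FORMS.md) | See FORMS.md |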

Next Steps

-

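Tags

skill, ai

Install via Chat

This skill can be installed through conversation on the following platforms:

OpenClaw, WorkBuddy, QClaw, Kimi, Claude

Option 1: Install SkillHub and the skill

Help me install SkillHub and the pdf-anthropic-1776089284 skill

Option 2: Set SkillHub as the preferred skill source

Set SkillHub as my preferred skill installation source, then help me install the pdf-anthropic-1776089284 skill

Install via Command Line

skillhub install pdf-anthropic-1776089284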

Download

pdf v1.0.0 (free)

File size: 21.85 KB | Published: 2026-4-15 13:51

v1.0.0 (latest), 2026-4-15 13:51
- Initial release of the PDF skill with comprehensive support for common PDF operations.
- Coverage includes extracting text/tables, merging/splitting PDFs, rotating pages, watermarking, form filling, encryption, and OCR.
- Provides concise Python and command-line tool examples for each task using pypdf, pdfplumber, reportlab, qpdf, pdftk, and poppler-utils.
- Special instructions for handling subscripts/superscripts in PDF creation using reportlab.
- Quick reference table summarizes the best tools and commands for frequent PDF actions.
- Directions to further resources (REFERENCE.md and FORMS.md) for advanced or form-specific tasks.
