
Extract PDF Text

Extract text from PDF files using PyMuPDF. Parse tables, forms, and complex layouts. Supports OCR for scanned documents.

Author: admin | Source: ClawHub
Version: V 1.0.2
Security check: Passed
Downloads: 1,283
Favorites: 0


## When to Use

The agent needs to extract text from PDFs. Use PyMuPDF (fitz) for fast local extraction. Works with text-based documents, scanned pages with OCR, forms, and complex layouts.

## Quick Reference

| Topic | File |
|-------|------|
| Code examples | `examples.md` |
| OCR setup | `ocr.md` |
| Troubleshooting | `troubleshooting.md` |

## Core Rules

### 1. Install PyMuPDF First

```bash
pip install PyMuPDF
```

Import as `fitz` (historical name):

```python
import fitz  # PyMuPDF
```

### 2. Basic Text Extraction

```python
import fitz

doc = fitz.open("document.pdf")
text = ""
for page in doc:
    text += page.get_text()
doc.close()
```

### 3. Pick the Right Method

| PDF Type | Method |
|----------|--------|
| Text-based | `page.get_text()`: fast, accurate |
| Scanned | OCR with pytesseract: slower |
| Mixed | Check each page, use OCR when needed |

### 4. Check for Text Before OCR

```python
def needs_ocr(page):
    text = page.get_text().strip()
    return len(text) < 50  # Likely scanned if very little text
```

### 5. Handle Errors Gracefully

```python
try:
    doc = fitz.open(path)
except fitz.FileDataError:
    print("Invalid or corrupted PDF")
else:
    # Encrypted PDFs open without raising; check needs_pass and authenticate
    if doc.needs_pass and not doc.authenticate("secret"):
        print("Wrong password")
```

## Extraction Traps

| Trap | What Happens | Fix |
|------|--------------|-----|
| OCR on text PDF | Slow + worse accuracy | Check `get_text()` first |
| Forget to close doc | Memory leak | Use `with` or `doc.close()` |
| Assume page order | Wrong reading flow | Use `sort=True` in `get_text()` |
| Ignore encoding | Garbled characters | PyMuPDF handles UTF-8 |

## Scope

This skill provides instructions for using PyMuPDF to extract PDF text.

This skill ONLY:
- Gives code examples for PyMuPDF
- Explains OCR setup when needed
- Troubleshoots common issues

This skill NEVER:
- Accesses files without user request
- Sends data externally
- Modifies original PDFs

## Security & Privacy

**All processing is local:**
- PyMuPDF runs entirely on your machine
- No external API calls
- No data leaves your system

## Output Formats

### Plain Text

```python
text = page.get_text()
```

### Structured (dict)

```python
blocks = page.get_text("dict")["blocks"]
for b in blocks:
    if b["type"] == 0:  # text block
        for line in b["lines"]:
            for span in line["spans"]:
                print(span["text"], span["size"])
```

### JSON

```python
import json

data = page.get_text("json")
parsed = json.loads(data)
```

## Full Example

```python
import fitz


def extract_pdf(path):
    """Extract text from PDF, flagging scanned pages that need an OCR fallback."""
    doc = fitz.open(path)
    results = []

    for i, page in enumerate(doc):
        text = page.get_text()
        method = "text"

        # If very little text, might be scanned
        if len(text.strip()) < 50:
            # OCR would go here (see ocr.md)
            method = "needs_ocr"

        results.append({
            "page": i + 1,
            "text": text,
            "method": method
        })

    doc.close()

    return {
        "pages": len(results),
        "content": results,
        "word_count": sum(len(r["text"].split()) for r in results)
    }


# Usage
result = extract_pdf("document.pdf")
print(f"Extracted {result['word_count']} words from {result['pages']} pages")
```

## Feedback

- Useful? `clawhub star extract-pdf-text`
- Stay updated: `clawhub sync`
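
## Sketch: Per-Page Method Selection

The "Mixed" row in the method table and two fixes from the Extraction Traps table (`with` and `sort=True`) could be combined roughly as below. This is a minimal sketch, not part of the skill itself: the names `classify_pages`, `classify_pdf`, and `min_chars` are hypothetical, and the 50-character threshold is simply the value used in `needs_ocr` above. `fitz` is imported inside `classify_pdf` so that the page-level logic can be exercised without PyMuPDF installed.

```python
def classify_pages(pages, min_chars=50):
    """Return one record per page, flagging pages with little embedded text.

    `pages` is any iterable of objects exposing a PyMuPDF-style
    get_text() method (a fitz.Document yields such pages when iterated).
    """
    records = []
    for number, page in enumerate(pages, start=1):
        # sort=True asks PyMuPDF to order text top-left to bottom-right
        text = page.get_text("text", sort=True)
        # Assumption: fewer than min_chars characters suggests a scanned page
        method = "text" if len(text.strip()) >= min_chars else "needs_ocr"
        records.append({"page": number, "text": text, "method": method})
    return records


def classify_pdf(path, min_chars=50):
    """Open a PDF with a context manager and classify every page."""
    import fitz  # PyMuPDF, imported lazily (illustrative choice)

    # The with-block closes the document even if extraction raises
    with fitz.open(path) as doc:
        return classify_pages(doc, min_chars)
```

Pages marked `needs_ocr` can then be routed to the OCR path described in `ocr.md`, while the rest keep the fast `get_text()` output.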

Tags

skill ai

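Install via Chat

This skill can be installed through conversation on the following platforms:

OpenClaw, WorkBuddy, QClaw, Kimi, Claude

Option 1: Install SkillHub and the skill

Help me install SkillHub and the extract-pdf-text-1776420028 skill

Option 2: Set SkillHub as the preferred skill installation source

Set SkillHub as my preferred skill installation source, then help me install the extract-pdf-text-1776420028 skill

Install via Command Line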

skillhub install extract-pdf-text-1776420028

Download Zip Package

Download Extract PDF Text v1.0.2

File size: 5.65 KB | Published: 2026-4-17 19:28

v1.0.2 (latest) 2026-4-17 19:28
Remove internal build file that was accidentally included
