
pdf-ocr-extractor

Extract text from image-based or scanned PDFs using Tesseract OCR.

Author: admin | Source: ClawHub
Version: v1.0.3
Security check: Passed
Downloads: 370
Favorites: 1


# PDF OCR Extractor

Use this skill to extract text from scanned or image-based PDFs that lack a native text layer. It is free, uses no third-party APIs, and has no usage limits. It renders each PDF page to an image and runs optical character recognition (OCR) on it.

## Dependencies

This skill requires:

1. **System Binary**: `tesseract` (along with the required language data packs, such as `chi_sim` or `eng`).
2. **Python Packages**: `pypdfium2`, `pytesseract`, and `Pillow`.

*Note: Do not run automated `pip install` commands at runtime. Rely on the user or the environment to pre-install the dependencies defined in the metadata block.*

## Quick Start

Create a Python script (e.g., `extract.py`) in a temporary directory to handle the extraction safely:

```python
import os
import sys

import pypdfium2 as pdfium
import pytesseract
from PIL import Image


def extract(pdf_path):
    doc = pdfium.PdfDocument(pdf_path)
    full_text = []
    for i, page in enumerate(doc):
        # Render the page at 2x scale for a higher-resolution image
        bitmap = page.render(scale=2)
        tmp_img = f"/tmp/page_{i}.png"
        bitmap.to_pil().save(tmp_img)
        try:
            # Run OCR (assumes the English and Simplified Chinese packs are installed)
            with Image.open(tmp_img) as img:
                text = pytesseract.image_to_string(img, lang="chi_sim+eng")
            full_text.append(text)
        finally:
            # Remove the temporary file even if OCR fails
            os.remove(tmp_img)
    return "\n".join(full_text)


if __name__ == "__main__":
    if len(sys.argv) > 1:
        print(extract(sys.argv[1]))
    else:
        sys.exit("Usage: python3 extract.py /path/to/document.pdf")
```

Then execute the script:

```bash
python3 extract.py /path/to/document.pdf
```

## Security & Sandbox Constraints

- Write temporary images only to `/tmp/` and clean them up immediately after extraction.
- Do not attempt to dynamically download or install language packs via shell commands; notify the user if a specific language is missing.
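
The rule about notifying the user when a language pack is missing could be handled with a preflight check before any OCR runs. The sketch below is only an illustration: `parse_langs`, `missing_languages`, and `preflight` are hypothetical helper names, and it assumes that `tesseract --list-langs` prints one header line followed by one language code per line.

```python
import shutil
import subprocess


def parse_langs(list_langs_output):
    """Parse the output of `tesseract --list-langs` into a set of language codes."""
    lines = [line.strip() for line in list_langs_output.splitlines()]
    # Assumption: the first line is a header such as 'List of available languages ... (3):'
    return {line for line in lines[1:] if line}


def missing_languages(requested, available):
    """Return the codes from a 'chi_sim+eng' style string that are not installed."""
    return [code for code in requested.split("+") if code not in available]


def preflight(requested="chi_sim+eng"):
    """Hypothetical helper: return a message for the user, or None if all is well."""
    if shutil.which("tesseract") is None:
        return "The tesseract binary was not found on PATH; please install it."
    result = subprocess.run(
        ["tesseract", "--list-langs"], capture_output=True, text=True, check=False
    )
    # Fall back to stderr in case the list is written there instead of stdout
    available = parse_langs(result.stdout or result.stderr)
    missing = missing_languages(requested, available)
    if missing:
        return "Missing Tesseract language data: " + ", ".join(missing)
    return None
```

Calling `preflight()` before `extract()` and showing any returned message to the user keeps the skill from attempting an install on its own, in line with the constraints above.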

Tags

skill, ai

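Install via chat

This skill can be installed via chat on the following platforms:

OpenClaw, WorkBuddy, QClaw, Kimi, Claude

Option 1: Install SkillHub and the skill

Help me install SkillHub and the pdf-ocr-extraction-1776105248 skill

Option 2: Set SkillHub as the preferred skill installation source

Set SkillHub as my preferred skill installation source, then help me install the pdf-ocr-extraction-1776105248 skill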

Install via command line

skillhub install pdf-ocr-extraction-1776105248

Download Zip package

Download pdf-ocr-extractor v1.0.3

File size: 1.88 KB | Published: 2026-4-17 15:43

v1.0.3 (latest) 2026-4-17 15:43
**Minor update for clarity and metadata improvements.**

- Improved documentation with clear separation of dependencies and quick start instructions.
- Added detailed metadata block defining required binaries and installation steps for both system and Python dependencies.
- Updated guidance to avoid running automated installations at runtime; users/environment must pre-install prerequisites.
- Enhanced security section: instructs storing temporary images only in `/tmp/` and immediate cleanup.
- Provided a full, copy-pasteable Python extraction script for better usability.
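
The `/tmp/`-only, clean-up-immediately rule mentioned in this changelog could be enforced with a small context manager, so that each page image gets a unique name and is removed even when OCR raises. This is a minimal sketch using only the standard library; `temp_png_path` is a hypothetical name and is not part of the published skill.

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def temp_png_path(directory="/tmp"):
    """Yield a unique .png path inside `directory` and delete it on exit."""
    fd, path = tempfile.mkstemp(suffix=".png", dir=directory)
    os.close(fd)  # only the path is needed; the caller writes the image itself
    try:
        yield path
    finally:
        # Runs on success and on error, so no image is left behind
        if os.path.exists(path):
            os.remove(path)
```

Inside an extraction loop, the rendered page would be saved to the yielded path and passed to OCR within the `with temp_png_path() as tmp_img:` block. Unique names also avoid collisions if two extractions run at the same time.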
