PaddleOCR Text Recognition Skill

When to Use This Skill

Trigger keywords (routing): Bilingual trigger terms (Chinese and English) are listed in the YAML description above—use that field for discovery and routing.

Use this skill for:

- Extract text from images (screenshots, photos, scans)
- Extract text from PDFs or document images when the goal is line/box-level text, not recovering table grids, formulas, or full reading-order layout
- Extract text from URLs or local files that point to images/PDFs

Do not use for:

- Plain text files, code files, or markdown documents that can be read directly as text
- Documents with tables, formulas, charts, or complex layouts; use Document Parsing instead
- Tasks that do not involve image-to-text conversion

Installation

Scripts declare their dependencies inline (PEP 723). No separate install step is needed — uv resolves dependencies automatically:

    uv run scripts/ocr_caller.py --help

How to Use This Skill

Working directory: All uv run scripts/... commands below should be run from this skill's root directory (the directory containing this SKILL.md file).

Basic Workflow

1. Identify the input source:

- User provides URL: Use the --file-url parameter
- User provides local file path: Use the --file-path parameter

2. Execute OCR:

    uv run scripts/ocr_caller.py --file-url "URL provided by user" --pretty

Or for local files:

    uv run scripts/ocr_caller.py --file-path "file path" --pretty

> Performance note: Parsing time scales with document complexity. Single-page images typically complete in 1-3 seconds; large PDFs (50+ pages) may take several minutes. Allow adequate time before assuming a timeout.

Default behavior: save raw JSON to a temp file:
- If --output is omitted, the script saves automatically under the system temp directory
- Default path pattern: <system-temp>/paddleocr/text-recognition/results/result_<timestamp>_<id>.json
- If --output is provided, it overrides the default temp-file destination
- If --stdout is provided, JSON is printed to stdout and no file is saved
- In save mode, the script prints the absolute saved path on stderr: Result saved to: /absolute/path/...
- In default/custom save mode, read and parse the saved JSON file before responding
- Use --stdout only when you explicitly want to skip file persistence
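The save-mode behavior above can be sketched in Python. This is a minimal illustration, not part of the skill: the helper names `parse_saved_path` and `run_ocr_on_url` are hypothetical, and the only assumption about the script is the stderr line format quoted above ("Result saved to: ...").

```python
import re
import subprocess
from typing import Optional

# Assumed stderr format, taken from the text above: "Result saved to: <absolute path>"
SAVED_PATH_PATTERN = re.compile(r"^Result saved to:\s*(.+?)\s*$", re.MULTILINE)


def parse_saved_path(stderr_text: str) -> Optional[str]:
    """Return the saved JSON path announced on stderr, or None if absent."""
    match = SAVED_PATH_PATTERN.search(stderr_text)
    return match.group(1) if match else None


def run_ocr_on_url(file_url: str) -> Optional[str]:
    """Hypothetical wrapper: run the script in save mode and return the saved path."""
    proc = subprocess.run(
        ["uv", "run", "scripts/ocr_caller.py", "--file-url", file_url],
        capture_output=True,
        text=True,
    )
    return parse_saved_path(proc.stderr)
```

If `parse_saved_path` returns None, the script was probably run with --stdout or failed before saving, so the caller should inspect stdout and stderr instead.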

3. Parse JSON response:

- In default/custom save mode, load JSON from the saved file path shown by the script
- Check the ok field: true means success, false means error
- Extract text: the text field contains all recognized text
- If --stdout is used, parse the stdout JSON directly
- Handle errors: If ok is false, display error.message
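The parsing step can be sketched as follows. The function and class names (`load_envelope`, `extract_text`, `OcrError`) are illustrative assumptions; only the ok, text, and error.message fields come from this document.

```python
import json


class OcrError(RuntimeError):
    """Raised when the envelope reports ok: false (hypothetical helper class)."""


def load_envelope(path: str) -> dict:
    """Load the JSON envelope from the saved result file."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def extract_text(envelope: dict) -> str:
    """Return the recognized text, or raise OcrError carrying error.message."""
    if envelope.get("ok") is True:
        return envelope.get("text") or ""
    error = envelope.get("error") or {}
    raise OcrError(error.get("message", "unknown error"))
```

When --stdout is used, the same `extract_text` call applies to `json.loads(stdout_text)` instead of a loaded file.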

4. Present results to user:

- Display extracted text in a readable format
- If the text is empty, the image may contain no text
- In save mode, always tell the user the saved file path and that full raw JSON is available there

What to Do After Extraction

Common next steps once you have the recognized text:

- Save to file: Write the text field to a .txt or .md file
- Search the content: Search the saved output file for keywords
- Feed to another pipeline: The text field is clean plain text, ready for downstream processing
- Poor results: See "Tips for Better Results" below before retrying
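As a rough sketch of the first two follow-up steps, the text field can be written to a file and then searched. The helper names `save_text` and `find_keyword_lines` are hypothetical, and the case-insensitive match is a design choice of this example rather than anything the skill prescribes.

```python
from pathlib import Path
from typing import List, Tuple


def save_text(envelope: dict, destination: str) -> Path:
    """Write the envelope's text field to a .txt or .md file."""
    path = Path(destination)
    path.write_text(envelope.get("text") or "", encoding="utf-8")
    return path


def find_keyword_lines(path: str, keyword: str) -> List[Tuple[int, str]]:
    """Return (line number, line) pairs that contain the keyword, ignoring case."""
    hits = []
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    for number, line in enumerate(lines, start=1):
        if keyword.lower() in line.lower():
            hits.append((number, line))
    return hits
```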

Complete Output Display

Always display the COMPLETE recognized text to the user. The user typically needs the full content for downstream use — truncation silently loses data they may not notice is missing.

- Display the entire text field, no matter how long
- Do not use phrases like "Here's a summary" or "The text begins with..."
- Do not truncate with "..." unless the text truly exceeds reasonable display limits (>10,000 chars)

Example - Correct:

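    User: "Extract the text from this image"
    Assistant: "I've extracted the text from the image. Here is the complete content:"

    [Display the entire text here]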

Example - Incorrect:

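    User: "Extract the text from this image"
    Assistant: "I found some text in the image. Here is a preview:"
    "The quick brown fox..." (truncated)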

Understanding the Output

The script returns a JSON envelope with ok, text, result, and error fields. Use text for the recognized content; result contains the raw API response for debugging.

For the full schema and field-level details, see references/output_schema.md.

Raw result location (default): the temp-file path printed by the script on stderr

Usage Examples

Example 1: URL OCR

    uv run scripts/ocr_caller.py --file-url https://example.com/invoice.jpg --pretty

Example 2: Local File OCR

    uv run scripts/ocr_caller.py --file-path ./document.pdf --pretty

Example 3: OCR With Explicit File Type

    uv run scripts/ocr_caller.py --file-url https://example.com/input --file-type 1 --pretty

- --file-type 0: PDF
- --file-type 1: image
- If omitted, the type is auto-detected from the file extension. For local files, a recognized extension (.pdf, .png, .jpg, .jpeg, .bmp, .tiff, .tif, .webp) is required; otherwise pass --file-type explicitly. For URLs with unrecognized extensions, the service attempts inference.
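The extension rule for local files can be illustrated with a small sketch. `infer_file_type` is a hypothetical helper, not the script's actual implementation; it only encodes the mapping stated above (0 for PDF, 1 for image) and returns None when --file-type would have to be passed explicitly.

```python
from pathlib import PurePath
from typing import Optional

# Extensions listed in the text above; the mapping to 0/1 follows --file-type.
PDF_EXTENSIONS = {".pdf"}
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".bmp", ".tiff", ".tif", ".webp"}


def infer_file_type(path: str) -> Optional[int]:
    """Return 0 for PDF, 1 for image, or None if the extension is not recognized."""
    suffix = PurePath(path).suffix.lower()
    if suffix in PDF_EXTENSIONS:
        return 0
    if suffix in IMAGE_EXTENSIONS:
        return 1
    return None
```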

Example 4: Print JSON Without Saving

CODEBLOCK8

First-Time Configuration

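When the API is not configured, the script outputs a JSON envelope with ok: false, an empty text, a null result, and an error object whose code is CONFIG_ERROR and whose message says that PADDLEOCR_OCR_API_URL is not configured and points to https://paddleocr.com.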

Configuration workflow:

1. Show the exact error message to the user.

2. Guide the user to obtain credentials: Visit the PaddleOCR website, click API, select the PP-OCRv5 model, select the language, then copy the API_URL and Token. They map to these environment variables:

- PADDLEOCR_OCR_API_URL: full endpoint URL ending with /ocr
- PADDLEOCR_ACCESS_TOKEN: 40-character alphanumeric string

Optionally set PADDLEOCR_OCR_TIMEOUT to configure the request timeout. Prefer the host application's standard configuration method over pasting credentials in chat.

3. Apply credentials — one of:

- User configured via the host UI: ask the user to confirm, then retry.
- User pastes credentials in chat: warn that they may be stored in conversation history, help the user persist them using the host's standard configuration method, then retry.
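Before retrying, the two required variables can be sanity-checked locally. The sketch below is an assumption-laden illustration: `find_config_problems` is a hypothetical helper, and it checks only the shapes described above (URL ending with /ocr, 40-character alphanumeric token). It does not contact the API.

```python
import re
from typing import List, Mapping


def find_config_problems(env: Mapping[str, str]) -> List[str]:
    """Return human-readable problems with the PaddleOCR environment variables."""
    problems = []
    url = env.get("PADDLEOCR_OCR_API_URL", "")
    token = env.get("PADDLEOCR_ACCESS_TOKEN", "")
    if not url:
        problems.append("PADDLEOCR_OCR_API_URL is not set")
    elif not url.endswith("/ocr"):
        problems.append("PADDLEOCR_OCR_API_URL should end with /ocr")
    if not token:
        problems.append("PADDLEOCR_ACCESS_TOKEN is not set")
    elif not re.fullmatch(r"[A-Za-z0-9]{40}", token):
        problems.append("PADDLEOCR_ACCESS_TOKEN should be 40 alphanumeric characters")
    return problems
```

In practice this would be called as `find_config_problems(os.environ)`. An empty list means the values look plausible, not that the credentials are valid.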

Error Handling

All errors return JSON with ok: false. Show the error message and stop — do not fall back to your own vision capabilities. Identify the issue from error.code and error.message:

Authentication failed (403) — error.message contains "Authentication failed"

- Token is invalid, reconfigure with correct credentials

Quota exceeded (429) — error.message contains "API rate limit exceeded"

- Daily API quota exhausted, inform user to wait or upgrade

Unsupported format — error.message contains "Unsupported file format"

- File format not supported, convert to PDF/PNG/JPG

No text detected:

- text field is empty
- Image may be blank, corrupted, or contain no text
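The cases above can be mapped to user-facing guidance with a simple lookup. This is a sketch under the assumption that matching on the quoted error.message substrings is sufficient; `suggest_action` and the wording of the returned hints are illustrative, not part of the script.

```python
from typing import Optional

# Message substrings and remedies as listed in the text above.
KNOWN_ERRORS = [
    ("Authentication failed", "Token is invalid; reconfigure with correct credentials."),
    ("API rate limit exceeded", "Daily API quota exhausted; wait or upgrade."),
    ("Unsupported file format", "File format not supported; convert to PDF/PNG/JPG."),
]


def suggest_action(envelope: dict) -> Optional[str]:
    """Return guidance for the user, or None when text was recognized successfully."""
    if envelope.get("ok"):
        if not (envelope.get("text") or "").strip():
            return "No text detected; the image may be blank, corrupted, or contain no text."
        return None
    message = (envelope.get("error") or {}).get("message", "")
    for needle, action in KNOWN_ERRORS:
        if needle in message:
            return action
    # Unknown error: show the message as-is rather than guessing.
    return message or "Unknown error"
```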

Tips for Better Results

If recognition quality is poor:

- Low resolution: Provide a higher resolution image (≥300 DPI works well for most printed text)
- Noisy background: A cleaner scan or screenshot typically yields better results than a phone photo
- Check confidence: The raw JSON (result.result.ocrResults[n].prunedResult.rec_scores) shows per-line confidence scores; low values identify uncertain regions worth reviewing
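The confidence check can be sketched by walking the path named above. `low_confidence_lines` is a hypothetical helper, and the 0.8 threshold is an arbitrary assumption chosen for illustration; pick a value that suits your documents.

```python
from typing import List, Tuple


def low_confidence_lines(envelope: dict, threshold: float = 0.8) -> List[Tuple[int, int, float]]:
    """Return (result index, line index, score) for every score below the threshold."""
    flagged = []
    raw = (envelope.get("result") or {}).get("result") or {}
    for result_index, item in enumerate(raw.get("ocrResults") or []):
        scores = (item.get("prunedResult") or {}).get("rec_scores") or []
        for line_index, score in enumerate(scores):
            if score < threshold:
                flagged.append((result_index, line_index, score))
    return flagged
```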

Reference Documentation

- references/output_schema.md — Full output schema, field descriptions, and command examples

Note: Model version, capabilities, and supported file formats are determined by your API endpoint (PADDLEOCR_OCR_API_URL) and its official API documentation.

Testing the Skill

To verify the skill is working properly:

CODEBLOCK10

The first form tests configuration and API connectivity. --skip-api-test checks configuration only. --test-url overrides the default sample image URL.

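PaddleOCR Text Recognition Skill

When to Use This Skill

Trigger keywords (routing): Bilingual trigger terms (Chinese and English) are listed in the YAML description field; use that field for discovery and routing.

Use this skill for:

- Extracting text from images (screenshots, photos, scans)
- Extracting text from PDFs or document images when the goal is line/box-level text, not recovering table grids, formulas, or full reading-order layout
- Extracting text from URLs or local files that point to images/PDFs

Do not use for:

- Plain text files, code files, or Markdown documents that can be read directly as text
- Documents with tables, formulas, charts, or complex layouts; use Document Parsing instead
- Tasks that do not involve image-to-text conversion

Installation

Scripts declare their dependencies inline (PEP 723). No separate install step is needed; uv resolves dependencies automatically:

    uv run scripts/ocr_caller.py --help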

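How to Use This Skill

Working directory: All uv run scripts/... commands below must be run from this skill's root directory (the directory containing this SKILL.md file).

Basic Workflow

1. Identify the input source:

- User provides URL: Use the --file-url parameter
- User provides local file path: Use the --file-path parameter

2. Execute OCR:

    uv run scripts/ocr_caller.py --file-url "URL provided by user" --pretty

Or for local files:

    uv run scripts/ocr_caller.py --file-path "file path" --pretty

> Performance note: Parsing time scales with document complexity. Single-page images typically complete in 1-3 seconds; large PDFs (50+ pages) may take several minutes. Allow adequate time before assuming a timeout.

Default behavior: save raw JSON to a temp file:
- If --output is omitted, the script saves automatically under the system temp directory
- Default path pattern: <system-temp>/paddleocr/text-recognition/results/result_<timestamp>_<id>.json
- If --output is provided, it overrides the default temp-file destination
- If --stdout is provided, JSON is printed to stdout and no file is saved
- In save mode, the script prints the absolute saved path on stderr: Result saved to: /absolute/path/...
- In default/custom save mode, read and parse the saved JSON file before responding
- Use --stdout only when you explicitly want to skip file persistence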

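3. Parse JSON response:

- In default/custom save mode, load JSON from the saved file path shown by the script
- Check the ok field: true means success, false means error
- Extract text: the text field contains all recognized text
- If --stdout is used, parse the stdout JSON directly
- Handle errors: If ok is false, display error.message

4. Present results to user:

- Display extracted text in a readable format
- If the text is empty, the image may contain no text
- In save mode, always tell the user the saved file path and that full raw JSON is available there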

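What to Do After Extraction

Common next steps once you have the recognized text:

- Save to file: Write the text field to a .txt or .md file
- Search the content: Search the saved output file for keywords
- Feed to another pipeline: The text field is clean plain text, ready for downstream processing
- Poor results: See "Tips for Better Results" below before retrying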

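Complete Output Display

Always display the COMPLETE recognized text to the user. The user typically needs the full content for downstream use; truncation silently loses data they may not notice is missing.

- Display the entire text field, no matter how long
- Do not use phrases like "Here's a summary" or "The text begins with..."
- Do not truncate with "..." unless the text truly exceeds reasonable display limits (>10,000 chars)

Example - Correct:

    User: "Extract the text from this image"
    Assistant: "I've extracted the text from the image. Here is the complete content:"

    [Display the entire text here]

Example - Incorrect:

    User: "Extract the text from this image"
    Assistant: "I found some text in the image. Here is a preview:"
    "The quick brown fox..." (truncated)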

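Understanding the Output

The script returns a JSON envelope with ok, text, result, and error fields. Use text for the recognized content; result contains the raw API response for debugging.

For the full schema and field-level details, see references/output_schema.md.

Raw result location (default): the temp-file path printed by the script on stderr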

Usage Examples

Example 1: URL OCR

    uv run scripts/ocr_caller.py --file-url https://example.com/invoice.jpg --pretty

Example 2: Local File OCR

    uv run scripts/ocr_caller.py --file-path ./document.pdf --pretty

Example 3: OCR With Explicit File Type

    uv run scripts/ocr_caller.py --file-url https://example.com/input --file-type 1 --pretty

- --file-type 0: PDF
- --file-type 1: image
- If omitted, the type is auto-detected from the file extension. For local files, a recognized extension (.pdf, .png, .jpg, .jpeg, .bmp, .tiff, .tif, .webp) is required; otherwise pass --file-type explicitly. For URLs with unrecognized extensions, the service attempts inference.

Example 4: Print JSON Without Saving

    uv run scripts/ocr_caller.py --file-url https://example.com/input --stdout --pretty

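First-Time Configuration

When the API is not configured, the script outputs:

    {
      "ok": false,
      "text": "",
      "result": null,
      "error": {
        "code": "CONFIG_ERROR",
        "message": "PADDLEOCR_OCR_API_URL is not configured. Please visit https://paddleocr.com to get your API"
      }
    }

Configuration workflow:

1. Show the exact error message to the user.

2. Guide the user to obtain credentials: Visit the PaddleOCR website, click API, select the PP-OCRv5 model, select the language, then copy the API_URL and Token. They map to these environment variables:

- PADDLEOCR_OCR_API_URL: full endpoint URL ending with /ocr
- PADDLEOCR_ACCESS_TOKEN: 40-character alphanumeric string

Optionally set PADDLEOCR_OCR_TIMEOUT to configure the request timeout. Prefer the host application's standard configuration method over pasting credentials in chat.

3. Apply credentials, using one of:

- User configured via the host UI: ask the user to confirm, then retry.
- User pastes credentials in chat: warn that they may be stored in conversation history, help the user persist them using the host's standard configuration method, then retry.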

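Error Handling

All errors return JSON with ok: false. Show the error message and stop; do not fall back to your own vision capabilities. Identify the issue from error.code and error.message:

Authentication failed (403): error.message contains "Authentication failed"

- Token is invalid; reconfigure with correct credentials

Quota exceeded (429): error.message contains "API rate limit exceeded"

- Daily API quota exhausted; inform the user to wait or upgrade

Unsupported format: error.message contains "Unsupported file format"

- File format not supported; convert to PDF/PNG/JPG

No text detected:

- text field is empty
- Image may be blank, corrupted, or contain no text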

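Tips for Better Results

If recognition quality is poor:

- Low resolution: Provide a higher resolution image (≥300 DPI works well for most printed text)
- Noisy background: A cleaner scan or screenshot typically yields better results than a phone photo
- Check confidence: The raw JSON (result.result.ocrResults[n].prunedResult.rec_scores) shows per-line confidence scores; low values identify uncertain regions worth reviewing

Reference Documentation

- references/output_schema.md: Full output schema, field descriptions, and command examples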
