OCR Benchmark v2.0.0

Multi-model OCR accuracy comparison with fuzzy line-level scoring, cost tracking, and PPT report generation.

Setup

1. Install dependencies

```bash
cd ~/.openclaw/workspace/skills/ocr-benchmark/ocr-benchmark
pip install -r requirements.txt
```

2. Configure environment variables

Set the variables for the providers you want to use:

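```bash
# Bedrock (Claude models): uses your existing AWS credentials
export AWS_REGION=us-west-2   # or your preferred region

# Gemini (Google AI Studio)
export GOOGLE_API_KEY=your_key_here

# PaddleOCR: optional, skipped if unavailable
export PADDLEOCR_ENDPOINT=https://your-paddle-endpoint
export PADDLEOCR_TOKEN=your_token   # optional auth token
```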

Note on PaddleOCR: This provider requires an external API endpoint.
If PADDLEOCR_ENDPOINT is not set, it is automatically skipped — no error.
If you don't have a PaddleOCR endpoint, simply don't set the env var.

3. Prepare images

Place your images locally (.jpg, .png, .webp). There is no automatic image download — provide local file paths on the command line.
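
Since image paths are passed explicitly, a small helper can gather them from a folder first. The sketch below is not part of the project: `collect_images` is a hypothetical name, and it assumes the images sit directly inside one folder.

```python
from pathlib import Path

# Extensions named in the setup notes above.
IMAGE_EXTENSIONS = {".jpg", ".png", ".webp"}


def collect_images(folder):
    """Return sorted paths of supported image files directly inside folder."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS
    )
```

Joining the returned paths with spaces gives a list that can be pasted after `--images`.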

Quick Start

Run benchmark on images

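```bash
python3 scripts/run_benchmark.py \
  --images img1.jpg img2.jpg img3.jpg \
  --output-dir ./results \
  --ground-truth ground_truth.json
```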

Skip models with missing credentials (no error, just skips)

```bash
python3 scripts/run_benchmark.py \
  --images img1.jpg \
  --auto-skip \
  --output-dir ./results
```

Run only specific models

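```bash
python3 scripts/run_benchmark.py \
  --images img1.jpg \
  --models opus sonnet gemini3pro \
  --output-dir ./results \
  --ground-truth ground_truth.json
```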

Score-only mode (re-score without re-running OCR)

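```bash
python3 scripts/run_benchmark.py \
  --score-only \
  --output-dir ./results \
  --ground-truth ground_truth.json
```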

Generate PPT report from scored results

```bash
python3 scripts/make_report.py \
  --results-dir ./results \
  --images img1.jpg img2.jpg img3.jpg \
  --scores ./results/scores.json \
  --output report.pptx
```

Workflow

1. Prepare images: collect your .jpg / .png files locally
2. Run benchmark: run_benchmark.py calls each model and saves `{image}.{model}.json`
3. Create ground truth: see references/ground-truth-format.md for the format
4. Score: run with --ground-truth to produce scores.json and a terminal table
5. Report: make_report.py generates a shareable `.pptx`

Environment Variables

| Variable | Provider | Required? | Description |
| --- | --- | --- | --- |
| `AWS_REGION` | Bedrock | Optional | Default: `us-west-2` |
| `GOOGLE_API_KEY` | | | |

Missing variables: If a model's required env var is missing, it is automatically skipped with a warning. Use --auto-skip for completely silent skipping.
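
The skip behavior can be pictured with a short sketch. This is an illustration under assumptions, not the project's code: `REQUIRED_ENV`, `select_models`, and the `paddleocr` key are hypothetical names, and the variable-to-model mapping is guessed from the setup section.

```python
import os

# Hypothetical mapping from model key to the env vars it needs.
# The real mapping lives in the project; these entries are illustrative.
REQUIRED_ENV = {
    "opus": [],                          # Bedrock: AWS_REGION is optional
    "gemini3pro": ["GOOGLE_API_KEY"],
    "paddleocr": ["PADDLEOCR_ENDPOINT"],
}


def select_models(requested, env=None, auto_skip=False):
    """Return the requested models whose required variables are all set."""
    env = os.environ if env is None else env
    runnable = []
    for key in requested:
        missing = [v for v in REQUIRED_ENV.get(key, []) if not env.get(v)]
        if not missing:
            runnable.append(key)
        elif not auto_skip:
            # Without auto_skip the model is still skipped, but with a warning.
            print(f"warning: skipping {key}: missing {', '.join(missing)}")
    return runnable
```

With only `GOOGLE_API_KEY` set, requesting `opus`, `gemini3pro`, and `paddleocr` would leave the first two in the returned list.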

Available Models

See references/models.md for full model IDs, pricing, and provider notes.

| Key | Label | Provider |
| --- | --- | --- |
| `opus` | Claude Opus 4.6 | Bedrock |
| `sonnet` | | |

Scoring Logic (v2)

Scoring uses fuzzy line-level matching with Levenshtein edit distance (pure Python stdlib, no extra dependencies).

For each ground truth line, the best-matching model output line is found and classified:

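| Type | Condition | Score |
| --- | --- | --- |
| EXACT | Identical after normalization | 1.0 |
| CLOSE | Edit distance under 20% of length (punctuation/apostrophe differences) | 0.8 |
| PARTIAL | Edit distance under 50% of length (real errors, but mostly correct) | 0.5 |
| MISS | No matching line found | 0.0 |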

Additionally, EXTRA lines are detected: model output lines that don't correspond to any ground truth line.
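
The classification step can be sketched in plain Python. This is a minimal illustration, not the project's implementation. It assumes "length" means the normalized ground-truth length, uses 20% and 50% as the CLOSE and PARTIAL thresholds, uses a simplified version of the normalization described in this section, and leaves out EXTRA-line detection. The function names and the PARTIAL and MISS labels are assumptions.

```python
import re


def normalize(text):
    """Simplified normalization: drop whitespace, quotes, some punctuation; lowercase."""
    return re.sub(r"[\s'’`*✓，、：（）【】]", "", text).lower()


def levenshtein(a, b):
    """Dynamic-programming edit distance, keeping one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def classify(gt_line, ocr_lines):
    """Return (label, score) for one ground-truth line against all OCR lines."""
    gt = normalize(gt_line)
    if not gt or not ocr_lines:
        return "MISS", 0.0
    # Best match is the OCR line with the smallest edit distance.
    best = min(levenshtein(gt, normalize(line)) for line in ocr_lines)
    ratio = best / len(gt)
    if best == 0:
        return "EXACT", 1.0
    if ratio < 0.2:
        return "CLOSE", 0.8
    if ratio < 0.5:
        return "PARTIAL", 0.5
    return "MISS", 0.0
```

Under these assumptions, `classify("浓郁香气", ["浓都香气"])` has distance 1 over length 4 (25%), so it falls in the PARTIAL band.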

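Normalization strips whitespace, apostrophes/quotes (', ', ` `), and common punctuation (*, ✓, ，, 、, ：, （）, 【】, etc.), then lowercases.

Example terminal output

```
========================================================================
OCR Benchmark Results
========================================================================
#   Model                   Score   Details

🥇  Gemini 3.1 Pro          98.7%   Image001: 99% | Image002: 98%
🥈  Claude Opus 4.6         88.3%   Image001: 90% | Image002: 87%
🥉  Claude Sonnet 4.6       85.1%   Image001: 86% | Image002: 84%
4.  Gemini 3.1 Flash-Lite   82.0%   ...
========================================================================

📄 Image001
──────────────────────────────────────────────────────────────────────
┌─ Claude Opus 4.6 (90.0%)
│ ✅ EXACT    │ 小胡鸭
│ 🟡 CLOSE    │ Ground truth: Sams Coffee
│             │ OCR output:   Sams Coffee [dist=2]
│ 🟠 PARTIAL  │ Ground truth: 浓郁香气
│             │ OCR output:   浓都香气 [dist=1]
│ ❌ MISS     │ Ground truth: 净含量580克
│ ⚠️ EXTRA (1 line):
│   + Product of China
└──────────────────────────────────────────────────────────────────────
```

Output Files

Each OCR run produces `{image}.{model}.json`:

```json
{
  "text_extracted": ["line1", "line2", ...],
  "brand": "...",
  "product_name": "...",
  "net_weight": "...",
  "ingredients": [...],
  "other_fields": {},
  "model": "Claude Opus 4.6",
  "model_key": "opus",
  "latency_seconds": 23.5,
  "input_tokens": 800,
  "output_tokens": 500
}
```

Scoring produces `scores.json` with per-image, per-line, per-model results.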
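
The per-run files can also be read directly, for example to total latency and token usage per model. The sketch below is hypothetical and not part of the project: `summarize_results` is an assumed name, and it relies only on the `model_key`, `latency_seconds`, `input_tokens`, and `output_tokens` fields of the per-run JSON.

```python
import json
from collections import defaultdict
from pathlib import Path


def summarize_results(results_dir):
    """Aggregate latency and token counts per model_key from *.json result files."""
    totals = defaultdict(lambda: {"runs": 0, "latency_seconds": 0.0,
                                  "input_tokens": 0, "output_tokens": 0})
    for path in sorted(Path(results_dir).glob("*.json")):
        if path.name == "scores.json":
            continue  # scoring output, not a per-model OCR result
        data = json.loads(path.read_text(encoding="utf-8"))
        if not isinstance(data, dict) or "model_key" not in data:
            continue  # skip files that do not look like OCR results
        row = totals[data["model_key"]]
        row["runs"] += 1
        for field in ("latency_seconds", "input_tokens", "output_tokens"):
            row[field] += data.get(field) or 0
    return dict(totals)
```

Dividing `latency_seconds` by `runs` gives a mean latency per model; turning token totals into cost would need the prices listed in references/models.md.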

Key Findings (2026-03, product packaging)

Human-verified ranking:

- Gemini 3.1 Pro (98.7%): best accuracy, ~$0.006/image
- Claude Opus 4.6 (92.3%): high accuracy; occasional missed details
- Gemini 3.1 Flash (89.7%): best speed/cost ratio, 9.7s
- Claude Sonnet 4.6 (88.5%): stable structured output
- PaddleOCR (67.9%): free, character errors on packaging
- Claude Haiku 4.5 (42.3%): poor Chinese OCR

Lesson: Never assume any model is ground truth. Human verification is essential.
