# OCR Benchmark v2.0.0
Multi-model OCR accuracy comparison with fuzzy line-level scoring, cost tracking, and PPT report generation.
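## Setup
### 1. Install dependencies
```bash
cd ~/.openclaw/workspace/skills/ocr-benchmark/ocr-benchmark
pip install -r requirements.txt
```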
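### 2. Configure environment variables
Set the variables for the providers you want to use:
```bash
# Bedrock (Claude models): uses your existing AWS credentials
export AWS_REGION=us-west-2  # or your preferred region
# Gemini (Google AI Studio)
export GOOGLE_API_KEY=your_key_here
# PaddleOCR: optional, skipped if not available
export PADDLEOCR_ENDPOINT=https://your-paddle-endpoint
export PADDLEOCR_TOKEN=your_token  # optional auth token
```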
Note on PaddleOCR: This provider requires an external API endpoint. If `PADDLEOCR_ENDPOINT` is not set, PaddleOCR is skipped automatically without an error, so leave the variable unset if you have no endpoint.
### 3. Prepare images
Place your images locally (`.jpg`, `.png`, `.webp`). Images are not downloaded automatically; pass local file paths on the command line.
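## Quick Start
### Run benchmark on images
```bash
python3 scripts/run_benchmark.py \
  --images img1.jpg img2.jpg img3.jpg \
  --output-dir ./results \
  --ground-truth ground_truth.json
```
### Skip models with missing credentials (no error, just skips)
```bash
python3 scripts/run_benchmark.py \
  --images img1.jpg \
  --auto-skip \
  --output-dir ./results
```
### Run only specific models
```bash
python3 scripts/run_benchmark.py \
  --images img1.jpg \
  --models opus sonnet gemini3pro \
  --output-dir ./results \
  --ground-truth ground_truth.json
```
### Score-only mode (re-score without re-running OCR)
```bash
python3 scripts/run_benchmark.py \
  --score-only \
  --output-dir ./results \
  --ground-truth ground_truth.json
```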
### Generate PPT report from scored results
```bash
python3 scripts/make_report.py \
  --results-dir ./results \
  --images img1.jpg img2.jpg img3.jpg \
  --scores ./results/scores.json \
  --output report.pptx
```
## Workflow
1. Prepare images: collect your `.jpg` / `.png` files locally.
2. Run benchmark: `run_benchmark.py` calls each model and saves `{image}.{model}.json`.
3. Create ground truth: see `references/ground-truth-format.md` for the format.
4. Score: run with `--ground-truth` to produce `scores.json` and a terminal table.
5. Report: `make_report.py` generates a shareable `.pptx`.
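
The workflow can also be driven from Python. The following is a minimal sketch, assuming the two scripts are invoked from the skill directory and using only flags that appear in this README's commands. The helper names `benchmark_command`, `report_command`, and `run_pipeline` are hypothetical and not part of the project.

```python
import subprocess
import sys


def benchmark_command(images, output_dir, ground_truth=None, models=None):
    """Build the argument list for scripts/run_benchmark.py (hypothetical helper)."""
    cmd = [sys.executable, "scripts/run_benchmark.py",
           "--images", *images,
           "--output-dir", output_dir]
    if models:
        cmd += ["--models", *models]
    if ground_truth:
        cmd += ["--ground-truth", ground_truth]
    return cmd


def report_command(images, results_dir, output="report.pptx"):
    """Build the argument list for scripts/make_report.py (hypothetical helper)."""
    return [sys.executable, "scripts/make_report.py",
            "--results-dir", results_dir,
            "--images", *images,
            "--scores", f"{results_dir}/scores.json",
            "--output", output]


def run_pipeline(images, output_dir, ground_truth):
    """Run OCR plus scoring, then build the report; stops at the first failing step."""
    subprocess.run(benchmark_command(images, output_dir, ground_truth), check=True)
    subprocess.run(report_command(images, output_dir), check=True)
```

Building the argument lists separately from running them keeps the commands easy to inspect or log before anything is executed.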
## Environment Variables
| Variable | Provider | Required? | Description |
|---|---|---|---|
| `AWS_REGION` | Bedrock | Optional | Default: `us-west-2` |
| `GOOGLE_API_KEY` | Gemini | Yes | Google AI Studio API key |
| `PADDLEOCR_ENDPOINT` | PaddleOCR | Optional | Endpoint URL; auto-skipped if unset |
| `PADDLEOCR_TOKEN` | PaddleOCR | Optional | Auth token for PaddleOCR |
Missing variables: If a model's required env var is missing, it is automatically skipped with a warning. Use --auto-skip for completely silent skipping.
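
The skip behaviour can be sketched in a few lines of Python. This is an illustration only, not the code in `run_benchmark.py`: the `REQUIRED_ENV` mapping and the `select_models` helper are hypothetical, and the mapping covers only the variables that the table marks as required or as triggering an auto-skip.

```python
import os
import sys

# Hypothetical mapping from model key to the variables it needs before it can run.
# Bedrock models are left out because the table lists no required variable for them.
REQUIRED_ENV = {
    "gemini3pro": ["GOOGLE_API_KEY"],
    "gemini3flash": ["GOOGLE_API_KEY"],
    "paddleocr": ["PADDLEOCR_ENDPOINT"],
}


def select_models(requested, env=None, silent=False):
    """Return the requested model keys whose required variables are all set."""
    env = os.environ if env is None else env
    runnable = []
    for key in requested:
        missing = [name for name in REQUIRED_ENV.get(key, []) if not env.get(name)]
        if missing:
            if not silent:  # silent=True stands in for the effect of --auto-skip
                print(f"warning: skipping {key}: missing {', '.join(missing)}",
                      file=sys.stderr)
            continue
        runnable.append(key)
    return runnable


if __name__ == "__main__":
    demo_env = {"GOOGLE_API_KEY": "your_key_here"}
    print(select_models(["opus", "gemini3pro", "paddleocr"], env=demo_env))
```

Passing `env` explicitly makes the check easy to exercise without touching the real process environment.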
## Available Models
See `references/models.md` for full model IDs, pricing, and provider notes.
| Key | Label | Provider |
|---|---|---|
| `opus` | Claude Opus 4.6 | Bedrock |
| `sonnet` | Claude Sonnet 4.6 | Bedrock |
| `haiku` | Claude Haiku 4.5 | Bedrock |
| `gemini3pro` | Gemini 3.1 Pro | Google AI Studio |
| `gemini3flash` | Gemini 3.1 Flash-Lite | Google AI Studio |
| `paddleocr` | PaddleOCR | External endpoint |
## Scoring Logic (v2)
Scoring uses fuzzy line-level matching with Levenshtein edit distance (pure Python stdlib, no extra dependencies).
For each ground truth line, the best-matching model output line is found and classified:
| Type | Condition | Score |
|---|---|---|
| EXACT | Identical after normalization | 1.0 |
| CLOSE | Edit distance < 20% of length (punctuation/apostrophe diffs) | 0.8 |
| PARTIAL | Edit distance < 50% of length (real errors but mostly correct) | 0.5 |
| MISS | No matching line found | 0.0 |
Additionally, EXTRA lines are detected: model output lines that don't correspond to any ground truth line.
Normalization strips whitespace, apostrophes/quotes (`'`, `` ` ``), and common punctuation (`*`, `✓`, `,`, `、`, `:`, `()`, `【】`, etc.), then lowercases the text.
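
The classification rules can be illustrated with a short standard-library sketch. It is not the project's scoring code, and it makes several assumptions that the text does not specify: "length" is taken as the longer of the two normalized strings, any best match at or above the 50% threshold counts as MISS, an output line is EXTRA when no ground truth line selected it as its best match, and the per-image score is the mean of the line scores. `STRIP_CHARS` is an illustrative subset of the stripped characters, and all function names are hypothetical.

```python
# Illustrative subset of the characters removed during normalization (assumption).
STRIP_CHARS = set("'’`\"*✓,,、::()()【】")


def normalize(text):
    """Drop whitespace and the characters in STRIP_CHARS, then lowercase."""
    return "".join(ch for ch in text if not ch.isspace() and ch not in STRIP_CHARS).lower()


def levenshtein(a, b):
    """Edit distance by dynamic programming, keeping one row at a time."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def classify(gt_line, ocr_lines):
    """Return (label, score, index of the best OCR line or None) for one ground truth line."""
    target = normalize(gt_line)
    best_index, best_ratio = None, None
    for index, line in enumerate(ocr_lines):
        candidate = normalize(line)
        # Assumption: the distance is measured relative to the longer normalized string.
        length = max(len(target), len(candidate), 1)
        ratio = levenshtein(target, candidate) / length
        if best_ratio is None or ratio < best_ratio:
            best_index, best_ratio = index, ratio
    if best_ratio is None or best_ratio >= 0.5:
        return "MISS", 0.0, None
    if best_ratio == 0:
        return "EXACT", 1.0, best_index
    if best_ratio < 0.2:
        return "CLOSE", 0.8, best_index
    return "PARTIAL", 0.5, best_index


def score_image(gt_lines, ocr_lines):
    """Score every ground truth line and collect unmatched OCR lines as EXTRA."""
    results = [classify(line, ocr_lines) for line in gt_lines]
    matched = {index for _, _, index in results if index is not None}
    extras = [line for i, line in enumerate(ocr_lines) if i not in matched]
    mean = sum(score for _, score, _ in results) / len(results) if results else 0.0
    return results, extras, mean


if __name__ == "__main__":
    ground_truth = ["小胡鸭", "Sams Coffee", "浓郁香气", "净含量580克"]
    model_output = ["小胡鸭", "Sams Cofee", "浓都香气", "Product of China"]
    results, extras, mean = score_image(ground_truth, model_output)
    for gt_line, (label, score, _) in zip(ground_truth, results):
        print(f"{label:8} {score:.1f}  {gt_line}")
    print("EXTRA:", extras, "mean:", round(mean, 3))
```

Working with a ratio instead of a raw distance keeps the thresholds comparable between short lines such as a brand name and long lines such as an ingredient list.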
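### Example terminal output
```
========================================================================
OCR BENCHMARK RESULTS
========================================================================
#   Model                    Score    Details
🥇  Gemini 3.1 Pro           98.7%    Image001: 99% | Image002: 98%
🥈  Claude Opus 4.6          88.3%    Image001: 90% | Image002: 87%
🥉  Claude Sonnet 4.6        85.1%    Image001: 86% | Image002: 84%
4.  Gemini 3.1 Flash-Lite    82.0%    ...
========================================================================
📄 Image001
──────────────────────────────────────────────────────────────────────
┌─ Claude Opus 4.6 (90.0%)
│ ✅ EXACT   │ 小胡鸭
│ 🟡 CLOSE   │ Ground truth: Sams Coffee
│            │ Recognized:   Sams Coffee [distance=2]
│ 🟠 PARTIAL │ Ground truth: 浓郁香气
│            │ Recognized:   浓都香气 [distance=1]
│ ❌ MISS    │ Ground truth: 净含量580克
│ ⚠️ EXTRA (1 line):
│   + Product of China
└──────────────────────────────────────────────────────────────────────
```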
---
## Output Files
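Each OCR run produces `{image}.{model}.json`:
```json
{
  "text_extracted": ["line1", "line2", "..."],
  "brand": "...",
  "product_name": "...",
  "net_weight": "...",
  "ingredients": ["..."],
  "other_fields": {},
  "model": "Claude Opus 4.6",
  "model_key": "opus",
  "latency_seconds": 23.5,
  "input_tokens": 800,
  "output_tokens": 500
}
```
Scoring produces `scores.json` with per-image, per-line, per-model results.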
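
The per-run files lend themselves to quick aggregation. The sketch below assumes that every `*.json` file in the results directory other than `scores.json` is a per-run file carrying the `model_key`, `latency_seconds`, `input_tokens`, and `output_tokens` fields. The `summarize_results` helper is hypothetical and not part of the project.

```python
import json
from collections import defaultdict
from pathlib import Path


def summarize_results(results_dir):
    """Return mean latency and total token counts per model_key."""
    totals = defaultdict(lambda: {"runs": 0, "latency": 0.0,
                                  "input_tokens": 0, "output_tokens": 0})
    for path in sorted(Path(results_dir).glob("*.json")):
        if path.name == "scores.json":  # scoring output, not an OCR run
            continue
        data = json.loads(path.read_text(encoding="utf-8"))
        if not isinstance(data, dict) or "model_key" not in data:
            continue  # ignore files that do not look like a run result
        entry = totals[data["model_key"]]
        entry["runs"] += 1
        entry["latency"] += float(data.get("latency_seconds", 0.0))
        entry["input_tokens"] += int(data.get("input_tokens", 0))
        entry["output_tokens"] += int(data.get("output_tokens", 0))
    return {
        key: {
            "runs": e["runs"],
            "mean_latency_seconds": e["latency"] / e["runs"],
            "input_tokens": e["input_tokens"],
            "output_tokens": e["output_tokens"],
        }
        for key, e in totals.items()
    }


if __name__ == "__main__":
    for model_key, stats in summarize_results("./results").items():
        print(model_key, stats)
```

Token totals from a summary like this can be combined with the pricing in `references/models.md` to estimate the cost per model.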
## Key Findings (2026-03, product packaging)
Human-verified ranking:
1. Gemini 3.1 Pro (98.7%): best accuracy, ~$0.006/image
2. Claude Opus 4.6 (92.3%): high accuracy; occasional missed details
3. Gemini 3.1 Flash (89.7%): best speed/cost ratio, 9.7s
4. Claude Sonnet 4.6 (88.5%): stable structured output
5. PaddleOCR (67.9%): free, character errors on packaging
6. Claude Haiku 4.5 (42.3%): poor Chinese OCR
Lesson: Never assume any model is ground truth. Human verification is essential.
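# OCR Benchmark v2.0.0
Multi-model OCR accuracy comparison with fuzzy line-level scoring, cost tracking, and PPT report generation.
## Setup
### 1. Install dependencies
```bash
cd ~/.openclaw/workspace/skills/ocr-benchmark/ocr-benchmark
pip install -r requirements.txt
```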
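### 2. Configure environment variables
Set the variables for the providers you want to use:
```bash
# Bedrock (Claude models): uses your existing AWS credentials
export AWS_REGION=us-west-2  # or your preferred region
# Gemini (Google AI Studio)
export GOOGLE_API_KEY=your_key_here
# PaddleOCR: optional, skipped if not available
export PADDLEOCR_ENDPOINT=https://your-paddle-endpoint
export PADDLEOCR_TOKEN=your_token  # optional auth token
```
Note on PaddleOCR: This provider requires an external API endpoint. If `PADDLEOCR_ENDPOINT` is not set, PaddleOCR is skipped automatically without an error, so leave the variable unset if you have no endpoint.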
### 3. Prepare images
Place your images locally (`.jpg`, `.png`, `.webp`). Images are not downloaded automatically; pass local file paths on the command line.
## Quick Start
### Run benchmark on images
```bash
python3 scripts/run_benchmark.py \
  --images img1.jpg img2.jpg img3.jpg \
  --output-dir ./results \
  --ground-truth ground_truth.json
```
### Skip models with missing credentials (no error, just skips)
```bash
python3 scripts/run_benchmark.py \
  --images img1.jpg \
  --auto-skip \
  --output-dir ./results
```
### Run only specific models
```bash
python3 scripts/run_benchmark.py \
  --images img1.jpg \
  --models opus sonnet gemini3pro \
  --output-dir ./results \
  --ground-truth ground_truth.json
```
### Score-only mode (re-score without re-running OCR)
```bash
python3 scripts/run_benchmark.py \
  --score-only \
  --output-dir ./results \
  --ground-truth ground_truth.json
```
### Generate PPT report from scored results
```bash
python3 scripts/make_report.py \
  --results-dir ./results \
  --images img1.jpg img2.jpg img3.jpg \
  --scores ./results/scores.json \
  --output report.pptx
```
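## Workflow
1. Prepare images: collect your `.jpg` / `.png` files locally.
2. Run benchmark: `run_benchmark.py` calls each model and saves `{image}.{model}.json`.
3. Create ground truth: see `references/ground-truth-format.md` for the format.
4. Score: run with `--ground-truth` to produce `scores.json` and a terminal table.
5. Report: `make_report.py` generates a shareable `.pptx` file.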
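## Environment Variables
| Variable | Provider | Required? | Description |
|---|---|---|---|
| `AWS_REGION` | Bedrock | Optional | Default: `us-west-2` |
| `GOOGLE_API_KEY` | Gemini | Yes | Google AI Studio API key |
| `PADDLEOCR_ENDPOINT` | PaddleOCR | Optional | Endpoint URL; auto-skipped if unset |
| `PADDLEOCR_TOKEN` | PaddleOCR | Optional | Auth token for PaddleOCR |

Missing variables: If a model's required env var is missing, it is automatically skipped with a warning. Use `--auto-skip` for completely silent skipping.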
## Available Models
See `references/models.md` for full model IDs, pricing, and provider notes.
| Key | Label | Provider |
|---|---|---|
| `opus` | Claude Opus 4.6 | Bedrock |
| `sonnet` | Claude Sonnet 4.6 | Bedrock |
| `haiku` | Claude Haiku 4.5 | Bedrock |
| `gemini3pro` | Gemini 3.1 Pro | Google AI Studio |
| `gemini3flash` | Gemini 3.1 Flash-Lite | Google AI Studio |
| `paddleocr` | PaddleOCR | External endpoint |
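## Scoring Logic (v2)
Scoring uses fuzzy line-level matching with Levenshtein edit distance (pure Python stdlib, no extra dependencies).
For each ground truth line, the best-matching model output line is found and classified:
| Type | Condition | Score |
|---|---|---|
| EXACT | Identical after normalization | 1.0 |
| CLOSE | Edit distance < 20% of length (punctuation/apostrophe diffs) | 0.8 |
| PARTIAL | Edit distance < 50% of length (real errors but mostly correct) | 0.5 |
| MISS | No matching line found | 0.0 |

Additionally, EXTRA lines are detected: model output lines that don't correspond to any ground truth line.
Normalization strips whitespace, apostrophes/quotes, and common punctuation (`*`, `✓`, `,`, `、`, `:`, `()`, `【】`, etc.), then lowercases the text.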
### Example terminal output
```
========================================================================
OCR BENCHMARK RESULTS
========================================================================
#   Model                    Score    Details
🥇  Gemini 3.1 Pro           98.7%    Image001: 99% | Image002: 98%
🥈  Claude Opus 4.6          88.3%    Image001: 90% | Image002: 87%
🥉  Claude Sonnet 4.6        85.1%    Image001: 86% | Image002: 84%
4.  Gemini 3.1 Flash-Lite    82.0%    ...
========================================================================
📄 Image001
──────────────────────────────────────────────────────────────────────
┌─ Claude Opus 4.6 (90.0%)
│ ✅ EXACT   │ 小胡鸭
│ 🟡 CLOSE   │ Ground truth: Sams Coffee
│            │ Recognized:   Sams Coffee [distance=2]
│ 🟠 PARTIAL │ Ground truth: 浓郁香气
│            │ Recognized:   浓都香气 [distance=1]
│ ❌ MISS    │ Ground truth: 净含量580克
│ ⚠️ EXTRA (1 line):
│   + Product of China
└──────────────────────────────────────────────────────────────────────
```
## Output Files
Each OCR run produces `{image}.{model}.json`:
```json
{
  "text_extracted": ["line1", "line2", "..."],
  "brand": "...",
  "product_name": "...",
  "net_weight": "...",
  "ingredients": ["..."],
  "other_fields": {},
  "model": "Claude Opus 4.6",
  "model_key": "opus",
  "latency_seconds": 23.5,
  "input_tokens": 800,
  "output_tokens": 500
}
```
Scoring produces `scores.json` with per-image, per-line, per-model results.
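## Key Findings (2026-03, product packaging)
Human-verified ranking:
1. Gemini 3.1 Pro (98.7%): best accuracy, ~$0.006/image
2. Claude Opus 4.6 (92.3%): high accuracy; occasional missed details
3. Gemini 3.1 Flash (89.7%): best speed/cost ratio, 9.7s
4. Claude Sonnet 4.6 (88.5%): stable structured output
5. PaddleOCR (67.9%): free, character errors on packaging text
6. Claude Haiku 4.5 (42.3%): poor Chinese OCR

Lesson: Never assume any model is ground truth. Human verification is essential.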