Super OCR
Overview
Super OCR is a production-grade optical character recognition tool that intelligently selects the best engine for your needs:
- Tesseract Engine: Lightweight, fast (~200-500ms), perfect for simple text extraction
- PaddleOCR Engine: High accuracy (98%+), optimized for Chinese, ideal for complex documents
Engine Selection Strategy
Auto Mode (Default)
The skill automatically selects the optimal engine:
| Scenario | Selected Engine | Why |
|---|---|---|
| Simple text, English only | Tesseract | Faster, lighter dependency |
| Chinese content, high accuracy needed | PaddleOCR | Better Chinese support, 98%+ accuracy |
| Low confidence from Tesseract | PaddleOCR (fallback) | Quality assurance |
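The fallback row in the table can be sketched as follows. This is a minimal illustration, not the skill's actual code: `OcrResult`, `run_with_fallback`, and the two fake engines are hypothetical names, and the 0.8 threshold simply mirrors the default confidence threshold stated later in this document.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class OcrResult:
    """Hypothetical result type: text, mean confidence in [0, 1], engine name."""
    text: str
    confidence: float
    engine: str


def run_with_fallback(
    image_path: str,
    primary: Callable[[str], OcrResult],
    fallback: Callable[[str], OcrResult],
    threshold: float = 0.8,
) -> OcrResult:
    """Run the fast engine first; retry with the slower one if confidence is low."""
    first = primary(image_path)
    if first.confidence >= threshold:
        return first
    second = fallback(image_path)
    # Keep whichever result is more confident, so the fallback never makes things worse.
    return second if second.confidence > first.confidence else first


# Stand-in engines for demonstration only (assumptions, not real OCR calls).
def fake_tesseract(path: str) -> OcrResult:
    return OcrResult(text="He11o", confidence=0.55, engine="tesseract")


def fake_paddle(path: str) -> OcrResult:
    return OcrResult(text="Hello", confidence=0.97, engine="paddle")


if __name__ == "__main__":
    print(run_with_fallback("scan.png", fake_tesseract, fake_paddle))
```

Passing the engines as callables keeps the fallback policy independent of how each engine is loaded.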
Force Mode
Users can explicitly choose an engine:
- `--engine tesseract` - Use Tesseract only
- `--engine paddle` - Use PaddleOCR only
- `--engine auto` - Auto-select (default)
Quick Start
Installation
This skill requires the following dependencies:
- PaddleOCR (for Chinese text recognition - 98%+ accuracy)
- Tesseract (for fast English text recognition)
- OpenCV (for image preprocessing)
Option 1: Install with pip (all-in-one)
```bash
pip install paddleocr paddlepaddle pytesseract pillow opencv-python numpy
```
Option 2: Install dependencies manually
macOS:
```bash
# Tesseract
brew install tesseract
# PaddleOCR
pip install paddleocr paddlepaddle
```
Ubuntu/Debian:
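```bash
# Tesseract
sudo apt update && sudo apt install tesseract-ocr
# PaddleOCR
pip install paddleocr paddlepaddle
```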
Windows:
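```bash
# Download Tesseract from https://github.com/UB-Mannheim/tesseract/wiki
pip install paddleocr paddlepaddle pytesseract pillow opencv-python numpy
```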
Usage
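```bash
# Auto mode (recommended) - runs all available engines
cd path/to/super-ocr
python scripts/main.py --image path/to/image.png

# Force Tesseract only
python scripts/main.py --image document.jpg --engine tesseract

# Force PaddleOCR (high-accuracy Chinese)
python scripts/main.py --image chinese_menu.png --engine paddle

# Run all engines (macOS only: Tesseract + PaddleOCR + MacVision)
python scripts/main.py --image complex_doc.png --engine all

# Batch process and specify an output directory
python scripts/main.py --images ./images/*.png --output ./results --verbose

# Check dependencies and auto-install
python scripts/dependencies.py --check --install
```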
Structuring This Skill
This skill uses a capabilities-based structure with multiple execution modes:
1. Engine Selection Logic - Intelligent decision making
2. OCR Execution - Unified interface for different engines
3. Post-processing - Standardized output formatting
4. Validation & Fallback - Quality assurance
Core Capabilities
1. Intelligent Engine Selection
The skill includes a decision tree that analyzes:
- Image characteristics (contrast, text size)
- Language patterns (Chinese character detection)
- User requirements (speed vs accuracy)
See scripts/engine_selector.py for implementation details.
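As a rough illustration of the language-pattern branch of such a decision tree, the sketch below checks what share of a text sample falls in the CJK Unified Ideographs range. It does not reproduce `scripts/engine_selector.py`: `chinese_ratio`, `choose_engine`, the `min_ratio` cutoff, and the idea of feeding in a sample from a quick first pass are all assumptions, and the image-characteristics checks (contrast, text size) are left out.

```python
def chinese_ratio(text: str) -> float:
    """Share of non-whitespace characters in the CJK Unified Ideographs block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    cjk = sum(1 for c in chars if "\u4e00" <= c <= "\u9fff")
    return cjk / len(chars)


def choose_engine(sample_text: str, prefer_speed: bool = True, min_ratio: float = 0.1) -> str:
    """Pick an engine name from a text sample and the user's speed/accuracy preference."""
    if chinese_ratio(sample_text) >= min_ratio:
        return "paddle"  # Chinese content: favor the engine optimized for it
    return "tesseract" if prefer_speed else "paddle"


if __name__ == "__main__":
    print(choose_engine("Hello world"))
    print(choose_engine("菜单 menu"))
```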
2. Dual Engine Support
Tesseract Engine (scripts/tesseract_ocr.py):
- Fast preprocessing pipeline
- PSM mode 6 for uniform text blocks
- Confidence scoring per word
- Language detection
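For readers unfamiliar with how PSM 6 and per-word confidence look in practice, here is a hedged sketch using `pytesseract.image_to_data`. It is not the content of `scripts/tesseract_ocr.py`; the function names `words_with_confidence` and `ocr_words` are hypothetical. Tesseract reports confidence on a 0-100 scale and uses -1 for layout entries that are not words, so those are filtered out.

```python
def words_with_confidence(data: dict) -> list:
    """Turn image_to_data's dict output into (word, confidence in [0, 1]) pairs."""
    words = []
    for text, conf in zip(data["text"], data["conf"]):
        conf = float(conf)  # may arrive as int, float, or str depending on version
        if conf < 0 or not text.strip():
            continue  # -1 marks non-word entries (blocks, paragraphs, lines)
        words.append((text, conf / 100.0))
    return words


def ocr_words(image_path: str, lang: str = "eng") -> list:
    """Run Tesseract in PSM 6 (single uniform block of text) on one image."""
    # Imported lazily so the parsing helper above works without Tesseract installed.
    import pytesseract
    from PIL import Image

    data = pytesseract.image_to_data(
        Image.open(image_path),
        lang=lang,
        config="--psm 6",
        output_type=pytesseract.Output.DICT,
    )
    return words_with_confidence(data)
```

Calling `ocr_words("document.jpg")` requires both the `pytesseract` package and the Tesseract binary to be installed.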
PaddleOCR Engine (scripts/paddle_ocr.py):
- State-of-the-art EAST text detection
- CRNN recognition with LSTM
- Confidence scores per character
- Table detection support
3. Output Formats
Supports multiple output formats:
| Format | Content | Use Case |
|---|---|---|
| Text only | Clean extracted text | Simple search/grep |
| Structured | Text + positions | Data extraction |
| JSON | Full metadata + confidence | API integration |
| Verbose | Debug info | Quality assurance |
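To make the first three rows concrete, the sketch below renders one list of recognized words in each format. The word dictionary layout (`text`, `box` as `[x, y, w, h]`, `confidence`) and the `format_result` function are assumptions for illustration; the actual schema used by `output_formatter.py` is not given in this document.

```python
import json


def format_result(words: list, fmt: str = "text") -> str:
    """Render recognized words as plain text, tab-separated positions, or JSON."""
    if fmt == "text":
        return " ".join(w["text"] for w in words)
    if fmt == "structured":
        # One word per line: text, then its bounding box as x,y,w,h.
        return "\n".join(
            "{}\t{}".format(w["text"], ",".join(str(v) for v in w["box"]))
            for w in words
        )
    if fmt == "json":
        mean = sum(w["confidence"] for w in words) / len(words) if words else 0.0
        return json.dumps({"words": words, "mean_confidence": mean}, ensure_ascii=False)
    raise ValueError(f"unknown format: {fmt}")


if __name__ == "__main__":
    demo = [
        {"text": "Hello", "box": [0, 0, 40, 12], "confidence": 0.9},
        {"text": "world", "box": [45, 0, 42, 12], "confidence": 0.7},
    ]
    for fmt in ("text", "structured", "json"):
        print(format_result(demo, fmt))
```

`ensure_ascii=False` keeps Chinese text readable in the JSON output instead of escaping it.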
4. Quality Guarantees
- Confidence thresholds (configurable, default 80%)
- Low-confidence alerts for manual review
- Fallback processing for failed OCR runs
Resources
scripts/
- main.py - Main entry point, CLI interface (supports multi-engine)
- dependencies.py - Auto-install and validation
- output_formatter.py - Multiple output format support
- engine/ - OCR engine implementations
  - selector.py - Intelligent engine selection logic
  - tesseract.py - Tesseract engine wrapper
  - paddle.py - PaddleOCR engine wrapper
  - macvision.py - macOS Vision OCR (macOS only)
- preprocessing/ - Image preprocessing utilities
  - preprocessor.py - Denoising, enhancement, binarization
dependencies.py (Key Feature)
The dependencies.py module handles:
- Dependency detection (paddleocr, paddlepaddle, pytesseract, cv2)
- Auto-install on missing dependencies
- Version checking
- OS-specific installation commands
- Clear error messages with troubleshooting steps
Use this when setting up a new environment with `python scripts/dependencies.py --check --install`.
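The detection step can be approximated with the standard library alone. The sketch below is an assumption about how such a check might look, not the module's real code: `PIP_NAMES`, `missing_modules`, `install_hint`, and `tesseract_binary_found` are hypothetical names. Note that the `paddlepaddle` package is imported as `paddle`, and `opencv-python` as `cv2`.

```python
import importlib.util
import shutil

# Import name -> pip package name for the Python dependencies listed above.
PIP_NAMES = {
    "paddleocr": "paddleocr",
    "paddle": "paddlepaddle",
    "pytesseract": "pytesseract",
    "cv2": "opencv-python",
}


def missing_modules(modules) -> list:
    """Return the top-level module names that cannot be found in this environment."""
    return [m for m in modules if importlib.util.find_spec(m) is None]


def install_hint(missing) -> str:
    """Build a pip command for the missing modules, or an empty string if none."""
    if not missing:
        return ""
    return "pip install " + " ".join(PIP_NAMES.get(m, m) for m in missing)


def tesseract_binary_found() -> bool:
    """pytesseract is only a wrapper; the tesseract executable must be on PATH too."""
    return shutil.which("tesseract") is not None


if __name__ == "__main__":
    missing = missing_modules(PIP_NAMES)
    print(install_hint(missing) or "All Python dependencies found")
    print("tesseract binary on PATH:", tesseract_binary_found())
```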
Advanced Features
Custom Configuration
Create config.yaml for persistent settings:
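```yaml
default_engine: auto
confidence_threshold: 0.8
output_format: json
preprocess:
  denoise: true
  enhance_contrast: true
```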
Batch Processing
Process multiple images:
```bash
python scripts/ocr.py --images ./images/*.png --output ./results
```
API Mode
Use as a Python library:
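```python
from super_ocr import OCRProcessor

# Create the processor once; "auto" lets the skill pick the engine.
processor = OCRProcessor(engine="auto")
result = processor.extract("image.png")
print(result.text)
print(result.confidence)
```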
Anti-Patterns
- ❌ Using PaddleOCR for every image (unnecessary overhead for simple cases)
- ❌ Ignoring confidence scores (quality matters)
- ❌ Engine bias (always preferring one engine)
- ❌ Skipping preprocessing (hurts recognition quality)
Performance Notes
| Engine | Init Time | Per-Image | Memory | Best For |
|---|---|---|---|---|
| Tesseract | ~200ms | ~50ms | ~100MB | Quick extraction |
| PaddleOCR | ~3s | ~500ms | ~500MB | High accuracy |
Initialize once and reuse the processor for batch processing.
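One way to get this behavior is to cache the engine object so the initialization cost in the table is paid once per process. The sketch below uses `functools.lru_cache`; `SlowEngine` is a stand-in class invented for the example, and `get_engine` and `process_batch` are hypothetical helpers rather than part of the skill's API.

```python
from functools import lru_cache


class SlowEngine:
    """Stand-in for an engine whose constructor is expensive (model loading)."""

    instances = 0  # counts how many times the costly init actually ran

    def __init__(self, name: str):
        SlowEngine.instances += 1
        self.name = name

    def extract(self, path: str) -> str:
        return f"{self.name}:{path}"


@lru_cache(maxsize=None)
def get_engine(name: str) -> SlowEngine:
    """Create each named engine once; later calls return the cached instance."""
    return SlowEngine(name)


def process_batch(paths, engine_name: str = "paddle") -> list:
    engine = get_engine(engine_name)  # no re-initialization across batches
    return [engine.extract(p) for p in paths]


if __name__ == "__main__":
    process_batch(["a.png", "b.png"])
    process_batch(["c.png"])
    print("engines created:", SlowEngine.instances)
```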
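Super OCR
Overview
Super OCR is a production-grade optical character recognition tool that intelligently selects the engine best suited to your needs:
- Tesseract engine: lightweight and fast (about 200-500ms), suited to simple text extraction
- PaddleOCR engine: high accuracy (98% and above), optimized for Chinese, suited to complex documents
Engine Selection Strategy
Auto Mode (Default)
The skill automatically selects the optimal engine:
| Scenario | Selected Engine | Why |
|---|---|---|
| Simple text, English only | Tesseract | Faster, lighter dependency |
| Chinese content, high accuracy needed | PaddleOCR | Better Chinese support, 98%+ accuracy |
| Low confidence from Tesseract | PaddleOCR (fallback) | Quality assurance |
Force Mode
Users can explicitly choose an engine:
- --engine tesseract - Use Tesseract only
- --engine paddle - Use PaddleOCR only
- --engine auto - Auto-select (default)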
Quick Start
Installation
This skill requires the following dependencies:
- PaddleOCR (for Chinese text recognition - 98%+ accuracy)
- Tesseract (for fast English text recognition)
- OpenCV (for image preprocessing)
Option 1: Install with pip (all-in-one)
```bash
pip install paddleocr paddlepaddle pytesseract pillow opencv-python numpy
```
Option 2: Install dependencies manually
macOS:
```bash
# Tesseract
brew install tesseract
# PaddleOCR
pip install paddleocr paddlepaddle
```
Ubuntu/Debian:
```bash
# Tesseract
sudo apt update && sudo apt install tesseract-ocr
# PaddleOCR
pip install paddleocr paddlepaddle
```
Windows:
```bash
# Download Tesseract from https://github.com/UB-Mannheim/tesseract/wiki
pip install paddleocr paddlepaddle pytesseract pillow opencv-python numpy
```
Usage
```bash
# Auto mode (recommended) - runs all available engines
cd path/to/super-ocr
python scripts/main.py --image path/to/image.png

# Force Tesseract only
python scripts/main.py --image document.jpg --engine tesseract

# Force PaddleOCR (high-accuracy Chinese)
python scripts/main.py --image chinese_menu.png --engine paddle

# Run all engines (macOS only: Tesseract + PaddleOCR + MacVision)
python scripts/main.py --image complex_doc.png --engine all

# Batch process and specify an output directory
python scripts/main.py --images ./images/*.png --output ./results --verbose

# Check dependencies and auto-install
python scripts/dependencies.py --check --install
```
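Structuring This Skill
This skill uses a capabilities-based structure that supports multiple execution modes:
1. Engine Selection Logic - Intelligent decision making
2. OCR Execution - Unified interface for different engines
3. Post-processing - Standardized output formatting
4. Validation & Fallback - Quality assurance
Core Capabilities
1. Intelligent Engine Selection
The skill includes a decision tree that analyzes:
- Image characteristics (contrast, text size)
- Language patterns (Chinese character detection)
- User requirements (speed vs accuracy)
See scripts/engine_selector.py for implementation details.
2. Dual Engine Support
Tesseract Engine (scripts/tesseract_ocr.py):
- Fast preprocessing pipeline
- PSM mode 6 for uniform text blocks
- Confidence scoring per word
- Language detection
PaddleOCR Engine (scripts/paddle_ocr.py):
- State-of-the-art EAST text detection
- CRNN recognition with LSTM
- Confidence scores per character
- Table detection support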
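3. Output Formats
Supports multiple output formats:
| Format | Content | Use Case |
|---|---|---|
| Text only | Clean extracted text | Simple search/text processing |
| Structured | Text + positions | Data extraction |
| JSON | Full metadata + confidence | API integration |
| Verbose | Debug info | Quality assurance |
4. Quality Guarantees
- Confidence thresholds (configurable, default 80%)
- Low-confidence alerts that prompt manual review
- Fallback processing when OCR fails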
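Resources
scripts/
- main.py - Main entry point, CLI interface (supports multi-engine)
- dependencies.py - Auto-install and validation
- output_formatter.py - Multiple output format support
- engine/ - OCR engine implementations
  - selector.py - Intelligent engine selection logic
  - tesseract.py - Tesseract engine wrapper
  - paddle.py - PaddleOCR engine wrapper
  - macvision.py - macOS Vision OCR (macOS only)
- preprocessing/ - Image preprocessing utilities
  - preprocessor.py - Denoising, enhancement, binarization
dependencies.py (Key Feature)
The dependencies.py module handles:
- Dependency detection (paddleocr, paddlepaddle, pytesseract, cv2)
- Auto-install on missing dependencies
- Version checking
- OS-specific installation commands
- Clear error messages with troubleshooting steps
Use python scripts/dependencies.py --check --install when setting up a new environment.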
Advanced Features
Custom Configuration
Create config.yaml for persistent settings:
```yaml
default_engine: auto
confidence_threshold: 0.8
output_format: json
preprocess:
  denoise: true
  enhance_contrast: true
```
Batch Processing
Process multiple images:
```bash
python scripts/ocr.py --images ./images/*.png --output ./results
```
API Mode
Use as a Python library:
```python
from super_ocr import OCRProcessor

# Create the processor once; "auto" lets the skill pick the engine.
processor = OCRProcessor(engine="auto")
result = processor.extract("image.png")
print(result.text)
print(result.confidence)
```
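Anti-Patterns
- ❌ Using PaddleOCR for every image (too much overhead in simple cases)
- ❌ Ignoring confidence scores (quality matters)
- ❌ Engine bias (always preferring one engine)
- ❌ Skipping preprocessing (hurts quality)
Performance Notes
| Engine | Init Time | Per-Image | Memory | Best For |
|---|---|---|---|---|
| Tesseract | ~200ms | ~50ms | ~100MB | Quick extraction |
| PaddleOCR | ~3s | ~500ms | ~500MB | High accuracy |
Initialize once and reuse the processor during batch processing.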