PDF to OFD High-Fidelity Converter
🎯 Purpose
A specialized skill for converting PDF documents into OFD, the Chinese national standard format (GB/T 33190-2016). It is optimized for electronic invoices (OFD版式发票) and adds rendering capabilities beyond those of standard conversion libraries.
✨ Key Features
- High-Fidelity Text Placement: Uses character-level positioning (DeltaX arrays) and baseline origin data extracted via rawdict so that the text layout matches the source PDF glyph for glyph.
- Advanced Vector Graphics: Directly extracts original stroke colors, fill colors, and line widths. Supports complex path types and fill instructions.
- Transparency Preservation: Supports Alpha and FillOpacity for vector paths, and SMask transparency for images (e.g., electronic seals and signatures).
- Cross-Platform Font Mapping: Maps macOS-specific (STSong, STKaiti) and Windows-specific font names to standardized OFD font names (宋体, 楷体, 黑体).
- In-Memory Packaging: Generates the final OFD zip structure entirely in memory, so no intermediate document data is written to temporary files.
- Color Snapping: Heuristic "Invoice Red" correction (128 0 0) for financial documents that leaves non-standard colors unchanged.
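To make the DeltaX idea concrete, here is a minimal sketch of how per-character offsets could be derived from baseline origins. It assumes character dicts shaped like the "chars" entries of a PyMuPDF rawdict span (a "c" string and an "origin" point in PDF points) and converts to millimetres, the default OFD unit. The function name and the returned dict layout are hypothetical, not the converter's actual internals.

```python
PT_TO_MM = 25.4 / 72.0  # PDF points to millimetres


def _fmt(value):
    """Format a number with up to 3 decimals and no trailing zeros."""
    return f"{value:.3f}".rstrip("0").rstrip(".")


def text_code_attrs(chars):
    """Return X, Y and DeltaX for one run of characters.

    `chars` is a list of dicts like {"c": "A", "origin": (x_pt, y_pt)}.
    DeltaX holds the distance between consecutive character origins,
    so a run of n characters yields n - 1 values.
    """
    if not chars:
        raise ValueError("empty character run")
    xs = [round(ch["origin"][0] * PT_TO_MM, 3) for ch in chars]
    y0 = round(chars[0]["origin"][1] * PT_TO_MM, 3)
    deltas = [round(b - a, 3) for a, b in zip(xs, xs[1:])]
    return {
        "X": xs[0],
        "Y": y0,
        "DeltaX": " ".join(_fmt(d) for d in deltas),
        "text": "".join(ch["c"] for ch in chars),
    }
```

Because each offset comes from the PDF's own glyph origins rather than from font metrics on the viewing machine, the spacing survives even when the viewer substitutes a different font.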
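The color snapping heuristic can be pictured roughly as follows. This is a sketch under stated assumptions: colors arrive as RGB floats in the 0..1 range, and the tolerance of 24 per channel is a made-up value chosen for illustration. The source does not document the real threshold or function names.

```python
INVOICE_RED = (128, 0, 0)  # the "Invoice Red" value named in the feature list


def to_rgb255(color):
    """Convert an (r, g, b) tuple of 0..1 floats to clamped 0..255 ints."""
    return tuple(max(0, min(255, round(c * 255))) for c in color)


def snap_invoice_red(color, tolerance=24):
    """Snap near-red colors to INVOICE_RED; return others unchanged.

    `tolerance` is a hypothetical per-channel distance, not a value
    taken from the converter.
    """
    rgb = to_rgb255(color)
    if all(abs(a - b) <= tolerance for a, b in zip(rgb, INVOICE_RED)):
        return INVOICE_RED
    return rgb


def ofd_color_value(rgb):
    """Render an RGB triple as a space-separated string such as '128 0 0'."""
    return " ".join(str(v) for v in rgb)
```

A per-channel check keeps the rule conservative: a color only snaps when all three channels are close to the target, so blues, blacks, and brighter reds pass through untouched.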
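In-memory packaging needs nothing beyond the Python standard library. The sketch below shows the general technique with io.BytesIO and zipfile. The helper names are hypothetical, and the entry names used in the usage note are placeholders rather than the converter's real file layout.

```python
import io
import zipfile


def pack_ofd(entries):
    """Zip a {archive_name: bytes} mapping into an in-memory package.

    Nothing touches the filesystem; the caller decides whether to
    write the returned bytes to disk or hand them on directly.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name, payload in entries.items():
            archive.writestr(name, payload)
    return buffer.getvalue()


def list_entries(ofd_bytes):
    """Return the archive member names, in the order they were written."""
    with zipfile.ZipFile(io.BytesIO(ofd_bytes)) as archive:
        return archive.namelist()
```

For example, pack_ofd({"OFD.xml": b"...", "Doc_0/Document.xml": b"..."}) returns bytes that can be written once to the final .ofd path, leaving no intermediate files behind.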
🛠️ Usage Instructions
When a user asks to convert a PDF or a "High-Fidelity" invoice to OFD:
1. Direct Execution:
    python3 pdf2ofd.py <input_path.pdf> [output_path.ofd]
2. Plugin Integration: The script implements a PDF2OFDConverter class that can be imported and used in other Python workflows.
Example Output
    Success: /path/to/invoice.ofd
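When the converter has to be driven from another Python program without relying on the class's undocumented methods, one option is to call the script as a subprocess. This sketch assumes the script is saved as pdf2ofd.py in the working directory and takes the input path followed by an optional output path. The helper names are hypothetical.

```python
import subprocess
import sys
from pathlib import Path

SCRIPT = "pdf2ofd.py"  # assumed script location; adjust to your checkout


def build_command(pdf_path, ofd_path=None, script=SCRIPT):
    """Build the argument list for one conversion run."""
    cmd = [sys.executable, str(script), str(pdf_path)]
    if ofd_path is not None:
        cmd.append(str(ofd_path))
    return cmd


def convert(pdf_path, ofd_path=None, script=SCRIPT):
    """Run the converter and return whatever it printed on stdout.

    Raises FileNotFoundError if the input is missing and
    subprocess.CalledProcessError if the script exits non-zero.
    """
    if not Path(pdf_path).is_file():
        raise FileNotFoundError(pdf_path)
    result = subprocess.run(
        build_command(pdf_path, ofd_path, script),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

Using sys.executable instead of a hard-coded python3 keeps the call inside the same interpreter and virtual environment as the caller, which matters because the converter's dependencies must be importable there.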
📦 Requirements
Dependencies required in the environment:
- PyMuPDF (fitz): For advanced PDF parsing and raw character data extraction.
- Pillow: For image processing and transparency handling.
- easyofd: The base library for OFD structure (extended via internal monkey patches).
- xmltodict: For XML manipulation.
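A quick pre-flight check can report missing dependencies before a conversion is attempted. The sketch below uses importlib from the standard library. The import names fitz (PyMuPDF) and PIL (Pillow) are the usual ones for those packages, while the distribution names for easyofd and xmltodict are assumed to match their import names.

```python
from importlib.util import find_spec

# Import name -> distribution name shown to the user.
REQUIRED = {
    "fitz": "PyMuPDF",
    "PIL": "Pillow",
    "easyofd": "easyofd",      # assumed to match the import name
    "xmltodict": "xmltodict",  # assumed to match the import name
}


def missing_packages(required=REQUIRED):
    """Return distribution names whose import module cannot be found."""
    return [dist for module, dist in required.items() if find_spec(module) is None]
```

Calling missing_packages() returns an empty list when everything is importable. Otherwise the returned names can be passed to the package installer of your choice.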
💡 Notes
- This skill applies extensive monkey patches to easyofd to work around known library limitations in character positioning and resource ID tracking.
- The conversion process assumes standard Chinese fonts (SimSun, KaiTi, SimHei) are available on the viewing system.
- Zero-copy resource handling: To preserve quality, images are extracted and re-compressed as PNG/JPG only when necessary.
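Since the output relies on standard font names being present on the viewing system, the font mapping step is worth illustrating. The following is a sketch only: the substring table is an assumption built from the font names mentioned in this document plus a few common variants (STHeiti, Songti, Heiti), and the real converter's table and fallback behavior are not documented here.

```python
import re

# OFD font name -> lowercase substrings that identify the family.
# STSong, STKaiti, SimSun, KaiTi and SimHei come from this document;
# the remaining entries are illustrative additions.
FONT_FAMILIES = [
    ("宋体", ("stsong", "simsun", "songti")),
    ("楷体", ("stkaiti", "kaiti")),
    ("黑体", ("stheiti", "simhei", "heiti")),
]

# Embedded subset fonts carry a six-letter tag, e.g. "ABCDEF+STSong-Light".
_SUBSET_PREFIX = re.compile(r"^[A-Z]{6}\+")


def map_font_name(pdf_font_name):
    """Map a PDF font name to a standardized OFD font name.

    Unknown fonts are returned with only the subset prefix removed;
    choosing a different fallback is left to the caller.
    """
    base = _SUBSET_PREFIX.sub("", pdf_font_name)
    key = base.lower()
    for ofd_name, needles in FONT_FAMILIES:
        if any(needle in key for needle in needles):
            return ofd_name
    return base
```

Matching on substrings rather than exact names tolerates style suffixes such as -Light or _GB2312, which PDF producers commonly append.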