mineru-pdf

# MinerU PDF Parser Parse PDF documents using MinerU MCP to extract structured content including text, tables, and formulas with MLX acceleration on Apple Silicon. ## Installation ### Option 1: Install MinerU MCP (for Claude Code) ```bash claude mcp add --transport stdio --scope user mineru -- \ uvx --from mcp-mineru python -m mcp_mineru.server ``` This installs and configures MinerU for all Claude projects. Models are downloaded on first use. ### Option 2: Use Direct Tool (preserves files) The skill includes a direct parsing tool that saves output to a persistent directory: ```bash python /Users/lwj04/clawd/skills/mineru-pdf/parse.py <pdf_path> <output_dir> [options] ``` **Advantages:** - ✅ Files are saved permanently (not auto-deleted) - ✅ Full control over output location - ✅ No MCP overhead - ✅ Works with any Python environment that has MinerU ## Quick Start ### Method 1: Using the Direct Tool (Recommended) ```bash # Parse entire PDF python /Users/lwj04/clawd/skills/mineru-pdf/parse.py \ "/path/to/document.pdf" \ "/path/to/output" # Parse specific pages python /Users/lwj04/clawd/skills/mineru-pdf/parse.py \ "/path/to/document.pdf" \ "/path/to/output" \ --start-page 0 --end-page 2 # Use Apple Silicon optimization python /Users/lwj04/clawd/skills/mineru-pdf/parse.py \ "/path/to/document.pdf" \ "/path/to/output" \ --backend vlm-mlx-engine # Text only (faster) python /Users/lwj04/clawd/skills/mineru-pdf/parse.py \ "/path/to/document.pdf" \ "/path/to/output" \ --no-table --no-formula ``` ### Method 2: Using MinerU MCP (Temporary Files) ### Parse a PDF document ```bash uvx --from mcp-mineru python -c " import asyncio from mcp_mineru.server import call_tool async def parse_pdf(): result = await call_tool( name='parse_pdf', arguments={ 'file_path': '/path/to/document.pdf', 'backend': 'pipeline', 'formula_enable': True, 'table_enable': True, 'start_page': 0, 'end_page': -1 # -1 for all pages } ) if hasattr(result, 'content'): for item in result.content: if hasattr(item, 'text'): print(item.text) break asyncio.run(parse_pdf()) " ``` ### Check system capabilities ```bash uvx --from mcp-mineru python -c " import asyncio from mcp_mineru.server import call_tool async def list_backends(): result = await call_tool( name='list_backends', arguments={} ) if hasattr(result, 'content'): for item in result.content: if hasattr(item, 'text'): print(item.text) break asyncio.run(list_backends()) " ``` ## Parameters ### parse_pdf **Required:** - `file_path` - Absolute path to the PDF file **Optional:** - `backend` - Processing backend (default: `pipeline`) - `pipeline` - Fast, general-purpose (recommended) - `vlm-mlx-engine` - Fastest on Apple Silicon (M1/M2/M3/M4) - `vlm-transformers` - Slowest but most accurate - `formula_enable` - Enable formula recognition (default: `true`) - `table_enable` - Enable table recognition (default: `true`) - `start_page` - Starting page (0-indexed, default: `0`) - `end_page` - Ending page (default: `-1` for all pages) ### list_backends No parameters required. Returns system information and backend recommendations. ## Usage Examples ### Extract tables from a specific page range ```bash uvx --from mcp-mineru python -c " import asyncio from mcp_mineru.server import call_tool async def parse_pdf(): result = await call_tool( name='parse_pdf', arguments={ 'file_path': '/path/to/document.pdf', 'backend': 'pipeline', 'table_enable': True, 'start_page': 5, 'end_page': 10 } ) if hasattr(result, 'content'): for item in result.content: if hasattr(item, 'text'): print(item.text) break asyncio.run(parse_pdf()) " ``` ### Parse with formula recognition only (faster) ```bash uvx --from mcp-mineru python -c " import asyncio from mcp_mineru.server import call_tool async def parse_pdf(): result = await call_tool( name='parse_pdf', arguments={ 'file_path': '/path/to/document.pdf', 'backend': 'vlm-mlx-engine', 'formula_enable': True, 'table_enable': False # Disable for speed } ) if hasattr(result, 'content'): for item in result.content: if hasattr(item, 'text'): print(item.text) break asyncio.run(parse_pdf()) " ``` ### Parse single page (fastest for testing) ```bash uvx --from mcp-mineru python -c " import asyncio from mcp_mineru.server import call_tool async def parse_pdf(): result = await call_tool( name='parse_pdf', arguments={ 'file_path': '/path/to/document.pdf', 'backend': 'pipeline', 'formula_enable': False, 'table_enable': False, 'start_page': 0, 'end_page': 0 } ) if hasattr(result, 'content'): for item in result.content: if hasattr(item, 'text'): print(item.text) break asyncio.run(parse_pdf()) " ``` ## Performance On Apple Silicon M4 (16GB RAM): - `pipeline`: ~32s/page, CPU-only, good quality - `vlm-mlx-engine`: ~38s/page, Apple Silicon optimized, excellent quality - `vlm-transformers`: ~148s/page, highest quality, slowest **Note:** First run downloads models (can take 5-10 minutes). Models are cached in `~/.cache/uv/` for faster subsequent runs. ## Output Format Returns structured Markdown with: - Document metadata (file, backend, pages, settings) - Extracted text with preserved structure - Tables formatted as Markdown tables - Formulas converted to LaTeX ## Supported Formats - PDF documents (`.pdf`) - JPEG images (`.jpg`, `.jpeg`) - PNG images (`.png`) - Other image formats (WebP, GIF, etc.) ## Troubleshooting ### Module not found error If you get "No module named 'mcp_mineru'", make sure you installed it: ```bash claude mcp add --transport stdio --scope user mineru -- \ uvx --from mcp-mineru python -m mcp_mineru.server ``` ### Slow processing on first run This is normal. MinerU downloads ML models on first use. Subsequent runs will be much faster. ### Timeout errors Increase timeout for large documents or use smaller page ranges for testing. ## Notes - Output is returned as Markdown text - Tables are preserved in Markdown format - Mathematical formulas are converted to LaTeX - Works with scanned documents (OCR built-in) - Optimized for Apple Silicon (M1/M2/M3/M4) with MLX backend ## File Persistence ### Why Files Get Deleted (MCP Method) The MinerU MCP server uses Python's `tempfile.TemporaryDirectory()`, which automatically deletes files when the context exits. This is by design to prevent temporary files from accumulating. ### How to Preserve Files **Method A: Use the Direct Tool (Recommended)** The skill provides `parse.py` which saves files to a persistent directory: ```bash python /Users/lwj04/clawd/skills/mineru-pdf/parse.py \ /path/to/input.pdf \ /path/to/output_dir ``` **Advantages:** - ✅ Files are never auto-deleted - ✅ Full control over output location - ✅ Can be used in batch processing - ✅ No MCP connection needed **Generated Structure:** ``` /path/to/output_dir/ ├── input.pdf_name/ │ └── auto/ # or vlm/ depending on backend │ ├── input.pdf_name.md │ └── images/ │ └── *.jpg └── input.pdf_name_parsed.md # Copy at root for easy access ``` **Method B: Redirect MCP Output** If using the MCP method, capture the output and save it: ```bash # Capture to file claude -p "Parse this PDF: /path/to/file.pdf" > /tmp/output.md # Or use within a script that saves the result ``` ### Comparison | Feature | Direct Tool | MCP Method | |----------|-------------|-------------| | Files persisted | ✅ Yes | ❌ No (auto-deleted) | | Custom output dir | ✅ Yes | ❌ No (temp only) | | Claude Code integration | ⚠️ Manual | ✅ Native | | Speed | ✅ Fast | ⚠️ MCP overhead | | Offline use | ✅ Yes | ⚠️ Needs Claude Code | ### Recommendation - **Use Direct Tool** when you need to keep the files for later use - **Use MCP Method** when working within Claude Code and only need the text content

mineru-pdf

标签

通过对话安装

方式一：安装 SkillHub 和技能

方式二：设置 SkillHub 为优先技能安装源

通过命令行安装

下载 Zip 包

mineru-pdf

mineru-pdf

标签

通过对话安装

方式一：安装 SkillHub 和技能

方式二：设置 SkillHub 为优先技能安装源

通过命令行安装

下载 Zip 包

相关推荐

self-improvement

self-improvement

self-improvement

self-improvement