# MinerU PDF Parser
Parse PDF documents with MinerU MCP to extract structured content (text, tables, and formulas), with MLX acceleration on Apple Silicon.
## Installation
### Option 1: Install MinerU MCP (for Claude Code)
```bash
claude mcp add --transport stdio --scope user mineru -- \
uvx --from mcp-mineru python -m mcp_mineru.server
```
This installs and configures MinerU for all Claude projects. Models are downloaded on first use.
### Option 2: Use Direct Tool (preserves files)
The skill includes a direct parsing tool that saves output to a persistent directory:
```bash
python /Users/lwj04/clawd/skills/mineru-pdf/parse.py <pdf_path> <output_dir> [options]
```
**Advantages:**
- ✅ Files are saved permanently (not auto-deleted)
- ✅ Full control over output location
- ✅ No MCP overhead
- ✅ Works with any Python environment that has MinerU
## Quick Start
### Method 1: Using the Direct Tool (Recommended)
```bash
# Parse entire PDF
python /Users/lwj04/clawd/skills/mineru-pdf/parse.py \
"/path/to/document.pdf" \
"/path/to/output"
# Parse specific pages
python /Users/lwj04/clawd/skills/mineru-pdf/parse.py \
"/path/to/document.pdf" \
"/path/to/output" \
--start-page 0 --end-page 2
# Use Apple Silicon optimization
python /Users/lwj04/clawd/skills/mineru-pdf/parse.py \
"/path/to/document.pdf" \
"/path/to/output" \
--backend vlm-mlx-engine
# Text only (faster)
python /Users/lwj04/clawd/skills/mineru-pdf/parse.py \
"/path/to/document.pdf" \
"/path/to/output" \
--no-table --no-formula
```
### Method 2: Using MinerU MCP (Temporary Files)
### Parse a PDF document
```bash
uvx --from mcp-mineru python -c "
import asyncio
from mcp_mineru.server import call_tool

async def parse_pdf():
    result = await call_tool(
        name='parse_pdf',
        arguments={
            'file_path': '/path/to/document.pdf',
            'backend': 'pipeline',
            'formula_enable': True,
            'table_enable': True,
            'start_page': 0,
            'end_page': -1  # -1 for all pages
        }
    )
    # Print the first text item in the response
    if hasattr(result, 'content'):
        for item in result.content:
            if hasattr(item, 'text'):
                print(item.text)
                break

asyncio.run(parse_pdf())
"
```
### Check system capabilities
```bash
uvx --from mcp-mineru python -c "
import asyncio
from mcp_mineru.server import call_tool

async def list_backends():
    result = await call_tool(
        name='list_backends',
        arguments={}
    )
    # Print the first text item in the response
    if hasattr(result, 'content'):
        for item in result.content:
            if hasattr(item, 'text'):
                print(item.text)
                break

asyncio.run(list_backends())
"
```
## Parameters
### parse_pdf
**Required:**
- `file_path` - Absolute path to the PDF file
**Optional:**
- `backend` - Processing backend (default: `pipeline`)
  - `pipeline` - Fast, general-purpose (recommended)
  - `vlm-mlx-engine` - Optimized for Apple Silicon (M1/M2/M3/M4) via MLX
  - `vlm-transformers` - Slowest but most accurate
- `formula_enable` - Enable formula recognition (default: `true`)
- `table_enable` - Enable table recognition (default: `true`)
- `start_page` - Starting page (0-indexed, default: `0`)
- `end_page` - Ending page (0-indexed, inclusive; default: `-1` for all pages)
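
As a rough illustration of the parameters above, the following sketch builds and checks an `arguments` dictionary before it is passed to `parse_pdf`. The helper name `build_parse_args` and its validation rules are assumptions made for this example; they are not part of `mcp_mineru`.

```python
from pathlib import PurePosixPath

# Backend names as listed in the parameter reference above.
KNOWN_BACKENDS = ("pipeline", "vlm-mlx-engine", "vlm-transformers")


def build_parse_args(file_path, backend="pipeline", formula_enable=True,
                     table_enable=True, start_page=0, end_page=-1):
    """Hypothetical helper: validate inputs and return a parse_pdf arguments dict."""
    # The docs ask for an absolute path; POSIX semantics are assumed (macOS/Linux).
    if not PurePosixPath(file_path).is_absolute():
        raise ValueError(f"file_path must be absolute: {file_path!r}")
    if backend not in KNOWN_BACKENDS:
        raise ValueError(f"unknown backend: {backend!r}")
    if start_page < 0:
        raise ValueError("start_page is 0-indexed and cannot be negative")
    # -1 means all pages; any other value should not precede start_page.
    if end_page != -1 and end_page < start_page:
        raise ValueError("end_page must be -1 or >= start_page")
    return {
        "file_path": file_path,
        "backend": backend,
        "formula_enable": formula_enable,
        "table_enable": table_enable,
        "start_page": start_page,
        "end_page": end_page,
    }
```

Checking the values up front means a typo in the backend name or a relative path fails immediately instead of after the models have loaded.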
### list_backends
No parameters required. Returns system information and backend recommendations.
## Usage Examples
### Extract tables from a specific page range
```bash
uvx --from mcp-mineru python -c "
import asyncio
from mcp_mineru.server import call_tool

async def parse_pdf():
    result = await call_tool(
        name='parse_pdf',
        arguments={
            'file_path': '/path/to/document.pdf',
            'backend': 'pipeline',
            'table_enable': True,
            'start_page': 5,
            'end_page': 10
        }
    )
    # Print the first text item in the response
    if hasattr(result, 'content'):
        for item in result.content:
            if hasattr(item, 'text'):
                print(item.text)
                break

asyncio.run(parse_pdf())
"
```
### Parse with formula recognition only (faster)
```bash
uvx --from mcp-mineru python -c "
import asyncio
from mcp_mineru.server import call_tool

async def parse_pdf():
    result = await call_tool(
        name='parse_pdf',
        arguments={
            'file_path': '/path/to/document.pdf',
            'backend': 'vlm-mlx-engine',
            'formula_enable': True,
            'table_enable': False  # Disable for speed
        }
    )
    # Print the first text item in the response
    if hasattr(result, 'content'):
        for item in result.content:
            if hasattr(item, 'text'):
                print(item.text)
                break

asyncio.run(parse_pdf())
"
```
### Parse single page (fastest for testing)
```bash
uvx --from mcp-mineru python -c "
import asyncio
from mcp_mineru.server import call_tool

async def parse_pdf():
    result = await call_tool(
        name='parse_pdf',
        arguments={
            'file_path': '/path/to/document.pdf',
            'backend': 'pipeline',
            'formula_enable': False,
            'table_enable': False,
            'start_page': 0,
            'end_page': 0
        }
    )
    # Print the first text item in the response
    if hasattr(result, 'content'):
        for item in result.content:
            if hasattr(item, 'text'):
                print(item.text)
                break

asyncio.run(parse_pdf())
"
```
## Performance
On Apple Silicon M4 (16GB RAM):
- `pipeline`: ~32s/page, CPU-only, good quality
- `vlm-mlx-engine`: ~38s/page, Apple Silicon optimized, excellent quality
- `vlm-transformers`: ~148s/page, highest quality, slowest
**Note:** First run downloads models (can take 5-10 minutes). Models are cached in `~/.cache/uv/` for faster subsequent runs.
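
To turn the per-page figures above into a rough time budget, a small calculation like the following may help. The numbers are the M4 (16GB RAM) measurements quoted in this section; the function name `estimate_minutes` is made up for this sketch, and real timings will vary with hardware and document complexity.

```python
# Approximate seconds per page, copied from the M4 figures listed above.
SECONDS_PER_PAGE = {
    "pipeline": 32,
    "vlm-mlx-engine": 38,
    "vlm-transformers": 148,
}


def estimate_minutes(page_count, backend="pipeline"):
    """Hypothetical helper: rough processing time in minutes, excluding model download."""
    return page_count * SECONDS_PER_PAGE[backend] / 60
```

Under these figures, a 20-page document would take roughly 11 minutes with `pipeline` and about 49 minutes with `vlm-transformers`.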
## Output Format
Returns structured Markdown with:
- Document metadata (file, backend, pages, settings)
- Extracted text with preserved structure
- Tables formatted as Markdown tables
- Formulas converted to LaTeX
## Supported Formats
- PDF documents (`.pdf`)
- JPEG images (`.jpg`, `.jpeg`)
- PNG images (`.png`)
- Other image formats (WebP, GIF, etc.)
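
A pre-flight check on the file extension can catch unsupported inputs before any model is loaded. The sketch below is an assumption-based example: the suffix sets only cover the formats named in the list above (the "etc." is not filled in), and `classify_input` is a hypothetical helper, not a MinerU function.

```python
from pathlib import Path

# Suffixes taken from the list above; other image formats may also work.
PDF_SUFFIXES = {".pdf"}
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp", ".gif"}


def classify_input(path):
    """Return 'pdf', 'image', or None based on the (case-insensitive) suffix."""
    suffix = Path(path).suffix.lower()
    if suffix in PDF_SUFFIXES:
        return "pdf"
    if suffix in IMAGE_SUFFIXES:
        return "image"
    return None
```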
## Troubleshooting
### Module not found error
If you see the error `No module named 'mcp_mineru'`, make sure MinerU MCP is installed:
```bash
claude mcp add --transport stdio --scope user mineru -- \
uvx --from mcp-mineru python -m mcp_mineru.server
```
### Slow processing on first run
This is normal. MinerU downloads ML models on first use. Subsequent runs will be much faster.
### Timeout errors
Increase the timeout for large documents, or parse a smaller page range with `start_page` and `end_page` while testing.
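
One way to keep each call short is to split a long document into page ranges and parse them one at a time. The sketch below assumes `end_page` is inclusive, as the single-page example (`start_page: 0`, `end_page: 0`) suggests; `page_ranges` is a hypothetical helper name.

```python
def page_ranges(total_pages, chunk_size):
    """Split a document into 0-indexed, inclusive (start_page, end_page) pairs."""
    if total_pages < 0 or chunk_size < 1:
        raise ValueError("total_pages must be >= 0 and chunk_size >= 1")
    return [
        (start, min(start + chunk_size, total_pages) - 1)
        for start in range(0, total_pages, chunk_size)
    ]


print(page_ranges(12, 5))  # [(0, 4), (5, 9), (10, 11)]
```

Each pair can then be passed as `start_page` and `end_page` in a separate `parse_pdf` call.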
## Notes
- Output is returned as Markdown text
- Tables are preserved in Markdown format
- Mathematical formulas are converted to LaTeX
- Works with scanned documents (OCR built-in)
- Optimized for Apple Silicon (M1/M2/M3/M4) with MLX backend
## File Persistence
### Why Files Get Deleted (MCP Method)
The MinerU MCP server uses Python's `tempfile.TemporaryDirectory()`, which automatically deletes files when the context exits. This is by design to prevent temporary files from accumulating.
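
The cleanup behavior of `tempfile.TemporaryDirectory()` can be seen with the standard library alone. This is a generic demonstration, not the server's actual code; the file name `result.md` is just a placeholder.

```python
import tempfile
from pathlib import Path


def write_in_temp_dir():
    """Write a file inside a temporary directory, then leave the context."""
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp) / "result.md"
        out.write_text("# parsed content\n", encoding="utf-8")
        existed_inside = out.exists()  # the file is present while the context is open
    # Leaving the with-block removes the directory and everything in it.
    return out, existed_inside


path, existed_inside = write_in_temp_dir()
print(existed_inside)  # True
print(path.exists())   # False
```

Anything that should survive has to be copied out, or returned as text, before the context exits.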
### How to Preserve Files
**Method A: Use the Direct Tool (Recommended)**
The skill provides `parse.py` which saves files to a persistent directory:
```bash
python /Users/lwj04/clawd/skills/mineru-pdf/parse.py \
/path/to/input.pdf \
/path/to/output_dir
```
**Advantages:**
- ✅ Files are never auto-deleted
- ✅ Full control over output location
- ✅ Can be used in batch processing
- ✅ No MCP connection needed
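
The batch-processing advantage could look like the following sketch, which runs `parse.py` once per PDF in a folder. It assumes only the positional `<pdf_path> <output_dir>` interface shown above; `build_commands` and `run_batch` are hypothetical names, and the script path should be adjusted to your installation.

```python
import subprocess
import sys
from pathlib import Path

# Path to the direct tool as given in this document; adjust for your setup.
PARSE_SCRIPT = "/Users/lwj04/clawd/skills/mineru-pdf/parse.py"


def build_commands(input_dir, output_dir, script=PARSE_SCRIPT):
    """Return one parse.py command per *.pdf file in input_dir, sorted by name."""
    pdfs = sorted(Path(input_dir).glob("*.pdf"))
    return [[sys.executable, script, str(pdf), str(output_dir)] for pdf in pdfs]


def run_batch(input_dir, output_dir):
    """Run the commands sequentially, stopping at the first failure."""
    for cmd in build_commands(input_dir, output_dir):
        subprocess.run(cmd, check=True)
```

Building the command list separately from running it makes it easy to print and inspect the commands before starting a long job.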
**Generated Structure:**
```
/path/to/output_dir/
├── input.pdf_name/
│ └── auto/ # or vlm/ depending on backend
│ ├── input.pdf_name.md
│ └── images/
│ └── *.jpg
└── input.pdf_name_parsed.md # Copy at root for easy access
```
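
Given the layout above, the generated files can be collected with a few glob patterns. This is a sketch that assumes the directory structure shown here (a `*_parsed.md` copy at the root, and Markdown plus `images/*.jpg` two levels down); `collect_outputs` is a hypothetical helper.

```python
from pathlib import Path


def collect_outputs(output_dir):
    """Gather the Markdown copies and extracted images from an output directory."""
    root = Path(output_dir)
    return {
        # Convenience copies placed at the root of the output directory
        "root_copies": sorted(root.glob("*_parsed.md")),
        # Markdown inside <name>/<auto or vlm>/
        "backend_markdown": sorted(root.glob("*/*/*.md")),
        # Images extracted next to the backend Markdown
        "images": sorted(root.glob("*/*/images/*.jpg")),
    }
```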
**Method B: Redirect MCP Output**
If using the MCP method, capture the output and save it:
```bash
# Capture to file
claude -p "Parse this PDF: /path/to/file.pdf" > /tmp/output.md
# Or use within a script that saves the result
```
### Comparison
| Feature | Direct Tool | MCP Method |
|----------|-------------|-------------|
| Files persisted | ✅ Yes | ❌ No (auto-deleted) |
| Custom output dir | ✅ Yes | ❌ No (temp only) |
| Claude Code integration | ⚠️ Manual | ✅ Native |
| Speed | ✅ Fast | ⚠️ MCP overhead |
| Offline use | ✅ Yes | ⚠️ Needs Claude Code |
### Recommendation
- **Use Direct Tool** when you need to keep the files for later use
- **Use MCP Method** when working within Claude Code and only need the text content