MinerU Extract (official API)
Use MinerU as an upstream “content normalizer”: submit a URL to MinerU, poll for completion, download the result zip, and extract the main Markdown.
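The "poll for completion" step can be sketched as a small helper. This is a minimal sketch under stated assumptions, not the skill's actual code: `fetch_status` is a hypothetical callable that would wrap the MinerU task-status request, and the `state` key with its `done` / `failed` values is an assumption to verify against the MinerU API docs. The defaults mirror the `--timeout 600` / `--poll-interval 2` values listed under Parameters.

```python
import time
from typing import Callable, Dict


def poll_until_done(
    fetch_status: Callable[[], Dict],
    timeout: float = 600.0,
    poll_interval: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict:
    """Call fetch_status() until the task reaches a terminal state or times out."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        state = status.get("state")  # assumed key name
        if state == "done":  # assumed terminal state
            return status
        if state == "failed":  # assumed terminal state
            raise RuntimeError(f"MinerU task failed: {status.get('err_msg')}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task still in state {state!r} after {timeout}s")
        sleep(poll_interval)
```

Passing `sleep` as a parameter keeps the loop testable without real waiting; a real caller would leave the default in place.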
Quick start (MCP-aligned)
We align to the MinerU MCP mental model, but we do not run an MCP server.
- Primary script (MCP-style): `scripts/mineru_parse_documents.py`
  - Input: `--file-sources` (comma/newline-separated)
  - Output: JSON contract on stdout: `{ ok, items, errors }`
- Low-level script (single URL): `scripts/mineru_extract.py`
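To make the input and output conventions concrete, here is a minimal sketch of how a caller might split a `--file-sources` value and check the `{ ok, items, errors }` contract read from stdout. The helper names are hypothetical, treating a literal backslash-n as a separator is an assumption, and the shape of the entries inside `items` and `errors` is deliberately left unchecked because the source does not specify it.

```python
import json
import re
from typing import Dict, List


def split_sources(raw: str) -> List[str]:
    """Split a comma/newline-separated string into a clean list of sources.

    A literal two-character "\\n" is also treated as a separator (assumption).
    Sources that themselves contain commas are not supported by this sketch.
    """
    parts = re.split(r"\\n|[,\n]", raw)
    return [p.strip() for p in parts if p.strip()]


def parse_contract(stdout_text: str) -> Dict:
    """Parse the wrapper's stdout and check the three top-level keys."""
    data = json.loads(stdout_text)
    if not isinstance(data, dict):
        raise ValueError("contract must be a JSON object")
    missing = {"ok", "items", "errors"} - set(data)
    if missing:
        raise ValueError(f"contract is missing keys: {sorted(missing)}")
    return data
```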
Auth:
- Set `MINERU_TOKEN` (Bearer token from mineru.net)
Default model heuristic:
- URLs ending with .pdf/.doc/.ppt/.png/.jpg → `pipeline`
- Otherwise → `MinerU-HTML` (best for HTML pages like WeChat articles)
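The heuristic above can be sketched in a few lines. This is an illustration, not the script's actual code: `default_model` is a hypothetical name, and matching on the URL path (ignoring any query string, case-insensitively) is an assumption about what "ending with" means. Only the five suffixes listed above are included.

```python
from urllib.parse import urlparse

# Suffixes taken from the heuristic described above.
DOCUMENT_SUFFIXES = (".pdf", ".doc", ".ppt", ".png", ".jpg")


def default_model(url: str) -> str:
    """Pick a model name from the URL's file suffix."""
    path = urlparse(url).path.lower()  # assumption: ignore query string and case
    return "pipeline" if path.endswith(DOCUMENT_SUFFIXES) else "MinerU-HTML"
```

An explicit `--model` value would still take precedence over any such default.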
1) Configure token (skill-local)
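Put secrets in the `.env` file at the skill root (do not paste them into chat outputs):
```bash
# In the mineru-extract skill directory: .env
MINERU_TOKEN=your_token_here
MINERU_API_BASE=https://mineru.net
```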
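For readers who want to see how a skill-local `.env` could be consumed, the following is a minimal sketch. It assumes simple `KEY=VALUE` lines and is not necessarily how the bundled scripts load the file; `read_env_file` and `auth_headers` are hypothetical helper names. The `Authorization: Bearer` header follows from the token being a Bearer token.

```python
from pathlib import Path
from typing import Dict


def read_env_file(path: Path) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blanks and # comments."""
    values: Dict[str, str] = {}
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("\"'")
    return values


def auth_headers(token: str) -> Dict[str, str]:
    """Build the request header for a Bearer token; never log the token itself."""
    return {"Authorization": f"Bearer {token}"}
```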
2) Parse URL(s) → Markdown (recommended)
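MCP-style wrapper (returns JSON, optionally includes markdown text):
```bash
python3 mineru-extract/scripts/mineru_parse_documents.py \
  --file-sources "<URL1>\n<URL2>" \
  --language ch \
  --enable-ocr \
  --model-version MinerU-HTML
```
If you want the markdown content inline in the JSON (can be large):
```bash
python3 mineru-extract/scripts/mineru_parse_documents.py \
  --file-sources "<URL>" \
  --model-version MinerU-HTML \
  --emit-markdown --max-chars 20000
```
Low-level (single URL, print markdown to stdout):
```bash
python3 mineru-extract/scripts/mineru_extract.py "<URL>" --model MinerU-HTML --print > /tmp/out.md
```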
Output
The script always downloads and extracts the MinerU result zip to:
`~/.openclaw/workspace/mineru//`
It writes:
- `result.zip`
- extracted files (Markdown + JSON + assets)
It prints a JSON summary to stderr with paths:
- `task_id`, `full_zip_url`, `out_dir`, `markdown_path`
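The download-and-extract step described above might look roughly like this. It is a sketch under assumptions: the function names are hypothetical, the zip bytes are assumed to be already downloaded, and picking the largest `.md` file as the "main" Markdown is a guess about the archive layout rather than documented behavior.

```python
import io
import json
import sys
import zipfile
from pathlib import Path
from typing import Optional, TextIO


def extract_result_zip(zip_bytes: bytes, out_dir: Path) -> Optional[Path]:
    """Save result.zip, unpack it into out_dir, and return the main Markdown path."""
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "result.zip").write_bytes(zip_bytes)
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        archive.extractall(out_dir)
    # Assumption: the largest .md file is the main document.
    candidates = sorted(out_dir.rglob("*.md"), key=lambda p: p.stat().st_size, reverse=True)
    return candidates[0] if candidates else None


def emit_summary(task_id: str, full_zip_url: str, out_dir: Path,
                 markdown_path: Optional[Path], stream: TextIO = sys.stderr) -> None:
    """Write the JSON summary to stderr so stdout stays free for Markdown."""
    summary = {
        "task_id": task_id,
        "full_zip_url": full_zip_url,
        "out_dir": str(out_dir),
        "markdown_path": str(markdown_path) if markdown_path else None,
    }
    print(json.dumps(summary), file=stream)
```

Keeping the summary on stderr is what lets `--print > /tmp/out.md` capture only the Markdown.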
Parameters (common)
- `--model`: pipeline | vlm | MinerU-HTML (HTML requires MinerU-HTML)
- `--ocr/--no-ocr`: enable OCR (effective for pipeline/vlm)
- `--table/--no-table`: table recognition
- `--formula/--no-formula`: formula recognition
- `--language ch|en|...`
- `--page-ranges 2,4-6` (non-HTML)
- `--timeout 600` / `--poll-interval 2`
Failure modes & fallbacks
- MinerU may fail to fetch some URLs (anti-bot protection, geo restrictions, or login walls).
- Fallback: provide an HTML file, a PDF, or a long screenshot; then implement an “upload + parse” flow with the MinerU batch upload endpoints.
- Always report the failing URL and the MinerU `err_msg`, and keep a link to the original source in outputs.
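One way to honor the reporting rule is to carry the failing URL and `err_msg` through to the final `{ ok, items, errors }` object. The sketch below is illustrative only: `build_contract` and the per-entry field name `source` are hypothetical, and treating `ok` as "no errors at all" is an assumption about the contract.

```python
from typing import Dict, List


def build_contract(results: List[Dict]) -> Dict:
    """Split per-URL results into items and errors, keeping the source URL on both."""
    items: List[Dict] = []
    errors: List[Dict] = []
    for result in results:
        if result.get("err_msg"):
            # Keep the failing URL next to MinerU's own error message.
            errors.append({"source": result["source"], "err_msg": result["err_msg"]})
        else:
            items.append({"source": result["source"],
                          "markdown_path": result.get("markdown_path")})
    return {"ok": not errors, "items": items, "errors": errors}
```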
References
- MinerU API docs: https://mineru.net/apiManage/docs
- MinerU output files: https://opendatalab.github.io/MinerU/reference/output_files/
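MinerU Extract (official API)
Use MinerU as an upstream content normalizer: submit a URL to MinerU, poll for completion, download the result zip, and extract the main Markdown.
Quick start (MCP-aligned)
We follow the MinerU MCP mental model, but we do not run an MCP server.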
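- Primary script (MCP-style): `scripts/mineru_parse_documents.py`
  - Input: `--file-sources` (comma/newline-separated)
  - Output: JSON contract on stdout: `{ ok, items, errors }`
- Low-level script (single URL): `scripts/mineru_extract.py`
Auth:
- Set `MINERU_TOKEN` (Bearer token from mineru.net)
Default model heuristic:
- URLs ending with .pdf/.doc/.ppt/.png/.jpg → `pipeline`
- Otherwise → `MinerU-HTML` (best for HTML pages such as WeChat official-account articles)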
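1) Configure token (skill-local)
Put secrets in the `.env` file at the skill root (do not paste them into chat outputs):
```bash
# In the mineru-extract skill directory: .env
MINERU_TOKEN=your_token_here
MINERU_API_BASE=https://mineru.net
```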
2) Parse URL(s) → Markdown (recommended)
MCP-style wrapper (returns JSON, optionally includes markdown text):
```bash
python3 mineru-extract/scripts/mineru_parse_documents.py \
  --file-sources "<URL1>\n<URL2>" \
  --language ch \
  --enable-ocr \
  --model-version MinerU-HTML
```
If you want the markdown content inline in the JSON (can be large):
```bash
python3 mineru-extract/scripts/mineru_parse_documents.py \
  --file-sources "<URL>" \
  --model-version MinerU-HTML \
  --emit-markdown --max-chars 20000
```
Low-level (single URL, print markdown to stdout):
```bash
python3 mineru-extract/scripts/mineru_extract.py "<URL>" --model MinerU-HTML --print > /tmp/out.md
```
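Output
The script always downloads and extracts the MinerU result zip to:
`~/.openclaw/workspace/mineru//`
It writes:
- `result.zip`
- extracted files (Markdown + JSON + assets)
It prints a JSON summary to stderr with paths:
- `task_id`, `full_zip_url`, `out_dir`, `markdown_path`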
Parameters (common)
- `--model`: pipeline | vlm | MinerU-HTML (HTML requires MinerU-HTML)
- `--ocr/--no-ocr`: enable OCR (effective for pipeline/vlm)
- `--table/--no-table`: table recognition
- `--formula/--no-formula`: formula recognition
- `--language ch|en|...`
- `--page-ranges 2,4-6` (non-HTML)
- `--timeout 600` / `--poll-interval 2`
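Failure modes & fallbacks
- MinerU may fail to fetch some URLs (anti-bot protection, geo restrictions, or login walls).
- Fallback: provide an HTML file, a PDF, or a long screenshot; then implement an upload + parse flow with the MinerU batch upload endpoints.
- Always report the failing URL and the MinerU `err_msg`, and keep a link to the original source in outputs.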
References
- MinerU API docs: https://mineru.net/apiManage/docs
- MinerU output files: https://opendatalab.github.io/MinerU/reference/output_files/