mineru-extract: MinerU Extract

Use the official MinerU (mineru.net) parsing API to convert a URL (an HTML page such as a WeChat article, or a direct PDF/Office/image link) into clean Markdown plus structured outputs. Use it when web_fetch or the browser cannot access a page or extracts messy content, and you want higher-fidelity parsing (layout, tables, formulas, OCR).

Author: admin | Source: ClawHub

MinerU Extract (official API)

Use MinerU as an upstream “content normalizer”: submit a URL to MinerU, poll for completion, download the result zip, and extract the main Markdown.
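
The polling step of that sequence can be sketched as a small loop. This is a minimal sketch, not the skill's actual code: `fetch_state` is a hypothetical callable standing in for the MinerU status request, and the terminal state names `done` and `failed` are assumptions about the API response.

```python
import time
from typing import Callable


def wait_for_task(fetch_state: Callable[[], str],
                  timeout: float = 600.0,
                  poll_interval: float = 2.0,
                  sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll fetch_state() until it reports a terminal state or time runs out.

    fetch_state is a hypothetical stand-in for the real status request; the
    "done" / "failed" state names are assumptions, not confirmed API values.
    """
    deadline = time.monotonic() + timeout
    while True:
        state = fetch_state()
        if state in ("done", "failed"):
            return state
        # Stop early if the next sleep would run past the deadline.
        if time.monotonic() + poll_interval > deadline:
            raise TimeoutError(f"task still {state!r} after {timeout} seconds")
        sleep(poll_interval)


if __name__ == "__main__":
    # Simulated status sequence so the sketch runs without network access.
    states = iter(["pending", "running", "done"])
    print(wait_for_task(lambda: next(states), timeout=5, poll_interval=0.01))
```

Passing the status check in as a callable keeps the loop independent of any HTTP client, so the timeout logic can be exercised without touching the network.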

Quick start (MCP-aligned)

We align to the MinerU MCP mental model, but we do not run an MCP server.

- Primary script (MCP-style): `scripts/mineru_parse_documents.py`
  - Input: `--file-sources` (comma- or newline-separated)
  - Output: JSON contract on stdout: `{ ok, items, errors }`
- Low-level script (single URL): `scripts/mineru_extract.py`
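
A caller might consume the `{ ok, items, errors }` contract roughly as sketched below. Only the three top-level keys come from the description above; the fields inside the sample items and the function name `parse_contract` are hypothetical and used purely for illustration.

```python
import json
from typing import List, Tuple


def parse_contract(stdout_text: str) -> Tuple[List, List]:
    """Parse the wrapper's stdout and return (items, errors).

    Only the top-level keys ok / items / errors are taken from the documented
    contract; whatever sits inside items or errors is treated as opaque.
    """
    payload = json.loads(stdout_text)
    missing = {"ok", "items", "errors"} - payload.keys()
    if missing:
        raise ValueError(f"contract keys missing: {sorted(missing)}")
    return payload["items"], payload["errors"]


if __name__ == "__main__":
    # Hypothetical sample output; the field names inside items are made up.
    sample = '{"ok": true, "items": [{"source": "https://example.com/a"}], "errors": []}'
    items, errors = parse_contract(sample)
    print(len(items), "parsed,", len(errors), "failed")
```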

Auth:

- Set MINERU_TOKEN (Bearer token from mineru.net)

Default model heuristic:

- URLs ending with .pdf/.doc/.ppt/.png/.jpg → `pipeline`
- Otherwise → `MinerU-HTML` (best for HTML pages like WeChat articles)
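
The heuristic can be written as a few lines of Python. This is a sketch under stated assumptions: the function name `default_model` is hypothetical, and matching on the URL path only (ignoring the query string and letter case) is a choice made here, not confirmed behavior of the scripts.

```python
from urllib.parse import urlparse

# Extensions listed in the heuristic above.
FILE_EXTENSIONS = (".pdf", ".doc", ".ppt", ".png", ".jpg")


def default_model(url: str) -> str:
    """Pick a default model name from the file extension of the URL path."""
    # Assumption: compare the path only, lowercased, so "?x=1" or ".PDF"
    # does not change the outcome.
    path = urlparse(url).path.lower()
    if path.endswith(FILE_EXTENSIONS):
        return "pipeline"
    return "MinerU-HTML"


if __name__ == "__main__":
    print(default_model("https://example.com/report.pdf"))
    print(default_model("https://mp.weixin.qq.com/s/abc"))
```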

1) Configure token (skill-local)

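Put secrets in the `.env` file in the skill root (do not paste them into chat output):

    # In the mineru-extract skill directory: .env
    MINERU_TOKEN=your_token_here
    MINERU_API_BASE=https://mineru.net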
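
Reading the token back might look like the sketch below. It assumes a plain `KEY=VALUE` file format and lets an already-set environment variable take precedence over the file; both are assumptions, and the helper names `read_token` and `auth_header` are hypothetical rather than part of the skill.

```python
import os
from pathlib import Path
from typing import Mapping


def read_token(env_file: Path, environ: Mapping[str, str] = os.environ) -> str:
    """Return MINERU_TOKEN from the environment, falling back to a .env file.

    Assumes plain KEY=VALUE lines; blank lines and # comments are skipped.
    """
    token = environ.get("MINERU_TOKEN", "")
    if not token and env_file.is_file():
        for line in env_file.read_text(encoding="utf-8").splitlines():
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            if key.strip() == "MINERU_TOKEN":
                token = value.strip().strip("'\"")
    if not token:
        raise RuntimeError("MINERU_TOKEN is not set")
    return token


def auth_header(token: str) -> dict:
    """Build the Bearer authorization header; the token itself is never printed."""
    return {"Authorization": f"Bearer {token}"}
```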

2) Parse URL(s) → Markdown (recommended)

MCP-style wrapper (returns JSON, optionally includes markdown text):

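    python3 mineru-extract/scripts/mineru_parse_documents.py \
      --file-sources "<URL1>\n<URL2>" \
      --language ch \
      --enable-ocr \
      --model-version MinerU-HTML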

If you want the markdown content inline in the JSON (can be large):

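    python3 mineru-extract/scripts/mineru_parse_documents.py \
      --file-sources "<URL>" \
      --model-version MinerU-HTML \
      --emit-markdown --max-chars 20000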

Low-level (single URL, print markdown to stdout):

    python3 mineru-extract/scripts/mineru_extract.py "<URL>" --model MinerU-HTML --print > /tmp/out.md
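
When the wrapper is called from another program rather than a shell, the argument list could be assembled as in this sketch. The flags `--file-sources`, `--model-version`, `--emit-markdown` and `--max-chars` are the wrapper's own; the function name `build_wrapper_command`, taking the script path as a parameter, and joining URLs with newlines are illustrative choices, not part of the skill.

```python
import sys
from typing import List, Optional, Sequence


def build_wrapper_command(script: str,
                          urls: Sequence[str],
                          model_version: str = "MinerU-HTML",
                          max_chars: Optional[int] = None) -> List[str]:
    """Build an argv list for the MCP-style wrapper.

    script is the path to the wrapper script. Passing max_chars also adds
    --emit-markdown so the Markdown text is inlined in the JSON output.
    """
    argv = [sys.executable, script,
            # Newline-separated, since URLs may themselves contain commas.
            "--file-sources", "\n".join(urls),
            "--model-version", model_version]
    if max_chars is not None:
        argv += ["--emit-markdown", "--max-chars", str(max_chars)]
    return argv


if __name__ == "__main__":
    print(build_wrapper_command("wrapper.py", ["https://example.com/a"], max_chars=20000))
```

The resulting list can be handed to `subprocess.run(argv, capture_output=True, text=True)`, which avoids shell quoting issues with URLs.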

Output

The script always downloads and extracts the MinerU result zip into a subdirectory of:

`~/.openclaw/workspace/mineru/`

It writes:

- `result.zip`
- extracted files (Markdown + JSON + assets)

It prints a JSON summary to stderr with paths:

- `task_id`, `full_zip_url`, `out_dir`, `markdown_path`
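
Leaving the download aside, the unpack step might look like the following sketch. It assumes the archive holds at least one `.md` file and treats the largest one as the main document; that selection rule and the function name `extract_main_markdown` are illustrative assumptions, not the script's confirmed behavior.

```python
import zipfile
from pathlib import Path


def extract_main_markdown(zip_path: Path, out_dir: Path) -> Path:
    """Unpack zip_path into out_dir and return the path of the main Markdown file."""
    out_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out_dir)
    # Sort all extracted .md files by size, smallest first.
    candidates = sorted(out_dir.rglob("*.md"), key=lambda p: p.stat().st_size)
    if not candidates:
        raise FileNotFoundError(f"no Markdown file found in {zip_path}")
    # Assumption: the largest .md file is the main document.
    return candidates[-1]
```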

Parameters (common)

- `--model`: `pipeline` | `vlm` | `MinerU-HTML` (HTML pages require `MinerU-HTML`)
- `--ocr` / `--no-ocr`: enable or disable OCR (takes effect for `pipeline` and `vlm`)
- `--table` / `--no-table`: table recognition
- `--formula` / `--no-formula`: formula recognition
- `--language ch|en|...`
- `--page-ranges 2,4-6` (non-HTML only)
- `--timeout 600` / `--poll-interval 2`
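
To make the page-range syntax concrete, the sketch below expands a spec such as `2,4-6` into individual page numbers. It assumes comma-separated entries that are either a single page or an inclusive `start-end` range; any other forms MinerU may accept are not handled, and the function name `expand_page_ranges` is hypothetical.

```python
from typing import List


def expand_page_ranges(spec: str) -> List[int]:
    """Expand a page-range spec into a list of page numbers.

    Assumes comma-separated entries, each a single page or an inclusive
    start-end range.
    """
    pages: List[int] = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        start, sep, end = part.partition("-")
        first = int(start)
        # A bare number is a one-page range.
        last = int(end) if sep else first
        if last < first:
            raise ValueError(f"descending range: {part!r}")
        pages.extend(range(first, last + 1))
    return pages


if __name__ == "__main__":
    print(expand_page_ranges("2,4-6"))
```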

Failure modes & fallbacks

- MinerU may fail to fetch some URLs (anti-bot protection, geo-restrictions, login walls).
  - Fallback: provide an HTML file, a PDF, or a long screenshot, then implement an "upload + parse" flow with MinerU's batch upload endpoints.
  - Always report the failing URL and MinerU's `err_msg`, and keep an original-source link in the outputs.
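
The reporting rule can be sketched as a collector that never drops a failing URL. This is an illustrative sketch: `parse_one` is a hypothetical callable that returns an item dict or raises `RuntimeError` carrying MinerU's `err_msg`, the `source` field name is made up, and treating `ok` as "no errors at all" is an assumption about the contract.

```python
from typing import Callable, Dict, Iterable, List


def parse_all(urls: Iterable[str], parse_one: Callable[[str], dict]) -> dict:
    """Run parse_one on each URL and collect results as { ok, items, errors }.

    parse_one is a hypothetical callable that returns an item dict or raises
    RuntimeError whose message is MinerU's err_msg.
    """
    items: List[dict] = []
    errors: List[Dict[str, str]] = []
    for url in urls:
        try:
            item = parse_one(url)
        except RuntimeError as exc:
            # Keep the failing URL next to the error so both can be reported.
            errors.append({"source": url, "err_msg": str(exc)})
            continue
        # Keep a link back to the original source in every successful item.
        items.append({**item, "source": url})
    # Assumption: ok means that no URL failed.
    return {"ok": not errors, "items": items, "errors": errors}
```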

References

- MinerU API docs: https://mineru.net/apiManage/docs
- MinerU output files: https://opendatalab.github.io/MinerU/reference/output_files/

