Transcribe with podcast-helper
Generate transcript artifacts from a podcast episode, audio file, or raw transcript, with an optional cleanup pass that uses episode-page context.
Default Workflow
1. Choose a dedicated output directory such as ./out/<episode-slug>/.
2. Run npx podcast-helper transcribe <input> --output-dir <dir> --json.
3. Add --progress jsonl only when machine-readable progress is needed.
4. Report the generated artifact paths for audio, .srt, and .txt.
5. Ask whether the user wants cleanup. Do not run cleanup implicitly.
If you are already inside this repository and dist/cli.js exists, node dist/cli.js ... is acceptable. Do not default to repository-local build steps outside this repository.
If you are inside this repository and dist/cli.js is missing, run pnpm run build before using the repo-local entry point.
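The default workflow can be sketched as a small shell helper. This is a minimal sketch under stated assumptions, not part of podcast-helper: the function names slugify and transcribe_cmd are hypothetical, and deriving the slug from an episode title is an assumption. It only prints the command from step 2, so the output directory can be reviewed before anything runs.

```bash
# Hypothetical helper (not part of podcast-helper): turn an episode title
# into a lowercase, hyphen-separated slug for the output directory.
slugify() {
  printf '%s' "$1" \
    | tr '[:upper:]' '[:lower:]' \
    | sed -E 's/[^a-z0-9]+/-/g; s/^-+//; s/-+$//'
}

# Hypothetical helper: print the transcribe command from step 2 instead of
# running it, so the output directory can be checked first.
#   $1 = input (URL or file path), $2 = episode title (assumed slug source)
transcribe_cmd() {
  printf 'npx podcast-helper transcribe "%s" --output-dir "./out/%s" --json\n' \
    "$1" "$(slugify "$2")"
}

# Example: transcribe_cmd episode.mp3 "My Show: Ep. 12!"
```

Printing the command rather than executing it keeps the sketch free of side effects. Cleanup is deliberately left out, since the workflow says to ask before running it.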
Gotchas
- Prefer no-install entry points first: npx, then pnpm dlx, then a globally installed podcast-helper.
- Let the CLI auto-select the engine unless the user explicitly requests a backend or needs offline Apple Silicon transcription.
- Spotify URLs are unsupported because the audio is DRM-protected. Ask for an RSS-backed episode page, Apple Podcasts link, or direct audio URL instead.
- YouTube inputs require yt-dlp.
- Generic episode pages sometimes hide audio metadata. If source resolution fails, download the audio separately and rerun with the file path.
- Hosted transcription failures usually come from a missing or wrong provider API key.
- Local mlx-whisper runs require ffmpeg, python3, and a working runtime from podcast-helper setup mlx-whisper.
- Keep the raw transcript untouched. Cleanup should write a sibling *.cleaned.txt.
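Two of the gotchas above lend themselves to a quick preflight check before calling the CLI. The sketch below is illustrative only: is_spotify_url and pick_entry_point are hypothetical names, and the Spotify host patterns are an assumption, not something podcast-helper documents.

```bash
# Return success when the input looks like a Spotify URL, which is
# unsupported. The host patterns are an assumption for illustration.
is_spotify_url() {
  case "$1" in
    *://open.spotify.com/*|*://spotify.com/*) return 0 ;;
    *) return 1 ;;
  esac
}

# Print the first available entry point in the preferred order:
# npx, then pnpm dlx, then a globally installed podcast-helper.
# Returns non-zero when none of them is on PATH.
pick_entry_point() {
  if command -v npx >/dev/null 2>&1; then
    printf 'npx podcast-helper\n'
  elif command -v pnpm >/dev/null 2>&1; then
    printf 'pnpm dlx podcast-helper\n'
  elif command -v podcast-helper >/dev/null 2>&1; then
    printf 'podcast-helper\n'
  else
    return 1
  fi
}
```

A caller might reject Spotify inputs up front and ask for an RSS-backed episode page, Apple Podcasts link, or direct audio URL, then prefix the transcribe arguments with whatever pick_entry_point prints.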
Command Forms
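Default:
npx podcast-helper transcribe <input> --output-dir ./out/<slug> --json
Fallbacks:
- pnpm dlx podcast-helper transcribe <input> --output-dir ./out/<slug> --json
- podcast-helper transcribe <input> --output-dir ./out/<slug> --json
- node dist/cli.js transcribe <input> --output-dir ./out/<slug> --json (only inside this repository)
For offline Apple Silicon:
npx podcast-helper transcribe <input> --engine mlx-whisper --output-dir ./out/<slug> --json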
Cleanup Branch
Only enter cleanup when the user asks for it or already has a raw transcript.
1. Fetch episode context with curl https://r.jina.ai/<podcast-url>.
2. Use the page as reference context for obvious ASR repairs, especially names and proper nouns.
3. Do not summarize, invent missing content, or overwrite the raw transcript.
4. Write a sibling *.cleaned.txt file.
If no episode URL is available, clean conservatively and explicitly say that external episode context was not used.
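The mechanical parts of the cleanup branch, fetching context and choosing the output path, can be sketched as follows. The helper names cleaned_path and fetch_context and the example paths are hypothetical; only the r.jina.ai prefix and the *.cleaned.txt convention come from the steps above.

```bash
# Derive the sibling output path: transcript.txt -> transcript.cleaned.txt.
# Hypothetical helper; inputs without a .txt suffix just get the suffix appended.
cleaned_path() {
  case "$1" in
    *.txt) printf '%s.cleaned.txt\n' "${1%.txt}" ;;
    *)     printf '%s.cleaned.txt\n' "$1" ;;
  esac
}

# Fetch episode-page context through r.jina.ai, as in step 1.
# -f fails on HTTP errors, -sS hides progress but still shows errors,
# -L follows redirects.
fetch_context() {
  curl -fsSL "https://r.jina.ai/$1"
}

# Example (requires network access; URL and paths are placeholders):
#   fetch_context "https://example.com/episodes/12" > ./out/ep-12/context.md
#   out="$(cleaned_path ./out/ep-12/transcript.txt)"
```

Because cleaned_path always returns a name different from its input, writing to it cannot overwrite the raw transcript.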
References
- Read references/inputs-and-engines.md for supported inputs, engine selection, and dependency notes.
- Read references/output-contract.md for the JSON success and failure envelopes and progress handling.
- Read references/cleanup.md for detailed cleanup rules and conservative editing guidance.
- Read references/verification.md for smoke-test inputs and verification steps.
- Read references/setup.md when installing this skill into Claude Code, OpenClaw, or other agents.
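Skill name: transcribe
Detailed description:
Transcribe with podcast-helper
Generate transcript artifacts from a podcast episode, audio file, or raw transcript, optionally cleaning them up with episode-page context.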
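Default Workflow
1. Choose a dedicated output directory, for example ./out/<episode-slug>/.
2. Run npx podcast-helper transcribe <input> --output-dir <dir> --json.
3. Add --progress jsonl only when machine-readable progress is needed.
4. Report the generated artifact paths for audio, .srt, and .txt.
5. Ask whether the user wants cleanup. Do not run cleanup implicitly.
If you are already inside this repository and dist/cli.js exists, node dist/cli.js ... is acceptable. Do not default to repository-local build steps outside this repository.
If you are inside this repository and dist/cli.js is missing, run pnpm run build before using the repo-local entry point.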
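Gotchas
- Prefer no-install entry points first: npx, then pnpm dlx, then a globally installed podcast-helper.
- Let the CLI auto-select the engine unless the user explicitly requests a backend or needs offline Apple Silicon transcription.
- Spotify URLs are unsupported because the audio is DRM-protected. Use an RSS-backed episode page, Apple Podcasts link, or direct audio URL instead.
- YouTube inputs require yt-dlp.
- Generic episode pages sometimes hide audio metadata. If source resolution fails, download the audio separately and rerun with the file path.
- Hosted transcription failures are usually caused by a missing or wrong provider API key.
- Local mlx-whisper runs require ffmpeg, python3, and a working runtime from podcast-helper setup mlx-whisper.
- Keep the raw transcript untouched. Cleanup should write a sibling *.cleaned.txt file.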
Command Forms
Default:
npx podcast-helper transcribe <input> --output-dir ./out/<slug> --json
Fallbacks:
- pnpm dlx podcast-helper transcribe <input> --output-dir ./out/<slug> --json
- podcast-helper transcribe <input> --output-dir ./out/<slug> --json
- node dist/cli.js transcribe <input> --output-dir ./out/<slug> --json (only inside this repository)
For offline Apple Silicon:
npx podcast-helper transcribe <input> --engine mlx-whisper --output-dir ./out/<slug> --json
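Cleanup Branch
Only enter cleanup when the user asks for it or already has a raw transcript.
1. Fetch episode context with curl https://r.jina.ai/<podcast-url>.
2. Use the page as reference context for obvious ASR repairs, especially names and proper nouns.
3. Do not summarize, invent missing content, or overwrite the raw transcript.
4. Write a sibling *.cleaned.txt file.
If no episode URL is available, clean conservatively and explicitly say that external episode context was not used.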
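References
- Read references/inputs-and-engines.md for supported inputs, engine selection, and dependency notes.
- Read references/output-contract.md for the JSON success and failure envelopes and progress handling.
- Read references/cleanup.md for detailed cleanup rules and conservative editing guidance.
- Read references/verification.md for smoke-test inputs and verification steps.
- Read references/setup.md when installing this skill into Claude Code, OpenClaw, or other agents.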