YouTube Transcribe
Smart YouTube video transcription with automatic fallback:
- 1. Captions first — extracts existing subtitles (manual or auto-generated) via yt-dlp. Fast, free, no compute.
- Whisper fallback — when no captions exist, downloads audio and transcribes locally with the best available Whisper backend.
When to Use
Use this skill when the user wants to:
- - Get a transcript or text version of a YouTube video
- Understand what a YouTube video says without watching it
- Summarize, analyze, or take notes from a YouTube video
- Extract subtitles or captions from a video
Triggers
- - "transcribe this YouTube video"
- "what does this video say"
- "get the transcript of [YouTube URL]"
- "summarize this YouTube video" (transcribe first, then process)
- Any YouTube URL shared with a request to understand its content
Requirements
Required:
- -
yt-dlp — for caption extraction and audio download - INLINECODE1
For Whisper fallback (when no captions available):
- -
ffmpeg — for audio processing - One of these Whisper backends (auto-detected in priority order):
1.
mlx-whisper — Apple Silicon native, fastest on Mac (pip install mlx-whisper)
2.
faster-whisper — CTranslate2 backend, fast on CUDA/CPU (pip install faster-whisper)
3.
openai-whisper — Original Whisper, universal fallback (pip install openai-whisper)
Usage
Basic — transcribe a video
CODEBLOCK0
Specify language for captions
CODEBLOCK1
Force Whisper (skip caption check)
CODEBLOCK2
JSON output
CODEBLOCK3
Save to file
CODEBLOCK4
Options
| Flag | Default | Description |
|---|
| INLINECODE6 | auto | Preferred subtitle/transcription language (e.g. zh, en, ja) |
| INLINECODE10 |
text | Output format:
text,
json,
srt,
vtt |
|
--output | stdout | Save transcript to file |
|
--force-whisper | false | Skip caption extraction, go straight to Whisper |
|
--backend | auto | Whisper backend:
auto,
mlx,
faster-whisper,
whisper |
|
--model | auto | Whisper model size:
auto,
large-v3,
medium,
small,
base,
tiny |
Environment Variables
| Variable | Description |
|---|
| INLINECODE30 | Override Whisper backend selection |
| INLINECODE31 |
Override Whisper model size |
Auto-Detection
Whisper Backend (priority order)
- 1. MLX Whisper — detected via
import mlx_whisper. Best for Apple Silicon. - faster-whisper — detected via
import faster_whisper. Best for CUDA GPU, good on CPU. - OpenAI Whisper — detected via
import whisper. Universal fallback.
Model Size (based on available RAM)
| RAM | Model | VRAM/RAM Usage |
|---|
| ≥16GB | INLINECODE35 | ~6-10GB |
| ≥8GB |
medium | ~5GB |
| ≥4GB |
small | ~2.5GB |
| <4GB |
base | ~1.5GB |
Caption Language Priority
When --language is not specified, captions are searched in this order:
- 1. Video's original language
- Chinese variants:
zh-Hant, zh-Hans, zh-TW, zh-CN, INLINECODE44 - English: INLINECODE45
- Any available language
Output Formats
text (default)
Plain text transcript, one continuous block.
json
CODEBLOCK5
srt / vtt
Standard subtitle formats with timestamps.
YouTube 转录
智能 YouTube 视频转录,自动回退机制:
- 1. 优先字幕 — 通过 yt-dlp 提取现有字幕(手动或自动生成)。快速、免费、无需计算。
- Whisper 回退 — 当无字幕时,下载音频并使用最佳可用 Whisper 后端进行本地转录。
使用场景
当用户需要以下操作时使用此技能:
- - 获取 YouTube 视频的转录文本或文字版本
- 无需观看即可了解 YouTube 视频内容
- 对 YouTube 视频进行总结、分析或做笔记
- 提取视频的字幕或说明文字
触发条件
- - 转录这个 YouTube 视频
- 这个视频说了什么
- 获取 [YouTube URL] 的转录文本
- 总结这个 YouTube 视频 (先转录,再处理)
- 任何附带理解内容请求的 YouTube URL
环境要求
必需:
- - yt-dlp — 用于字幕提取和音频下载
- python3
Whisper 回退(无字幕时):
- - ffmpeg — 用于音频处理
- 以下任一 Whisper 后端(按优先级自动检测):
1. mlx-whisper — Apple Silicon 原生,Mac 上最快(pip install mlx-whisper)
2. faster-whisper — CTranslate2 后端,CUDA/CPU 上快速(pip install faster-whisper)
3. openai-whisper — 原始 Whisper,通用回退(pip install openai-whisper)
使用方法
基础 — 转录视频
bash
python3 {baseDir}/scripts/transcribe.py https://www.youtube.com/watch?v=VIDEO_ID
指定字幕语言
bash
python3 {baseDir}/scripts/transcribe.py URL --language zh
强制使用 Whisper(跳过字幕检查)
bash
python3 {baseDir}/scripts/transcribe.py URL --force-whisper
JSON 输出
bash
python3 {baseDir}/scripts/transcribe.py URL --format json
保存到文件
bash
python3 {baseDir}/scripts/transcribe.py URL --output transcript.txt
选项
| 标志 | 默认值 | 描述 |
|---|
| --language | auto | 首选字幕/转录语言(例如 zh、en、ja) |
| --format |
text | 输出格式:text、json、srt、vtt |
| --output | stdout | 将转录文本保存到文件 |
| --force-whisper | false | 跳过字幕提取,直接使用 Whisper |
| --backend | auto | Whisper 后端:auto、mlx、faster-whisper、whisper |
| --model | auto | Whisper 模型大小:auto、large-v3、medium、small、base、tiny |
环境变量
| 变量 | 描述 |
|---|
| YTWHISPERBACKEND | 覆盖 Whisper 后端选择 |
| YTWHISPERMODEL |
覆盖 Whisper 模型大小 |
自动检测
Whisper 后端(优先级顺序)
- 1. MLX Whisper — 通过 import mlxwhisper 检测。最适合 Apple Silicon。
- faster-whisper — 通过 import fasterwhisper 检测。最适合 CUDA GPU,CPU 上表现良好。
- OpenAI Whisper — 通过 import whisper 检测。通用回退方案。
模型大小(基于可用内存)
| 内存 | 模型 | 显存/内存占用 |
|---|
| ≥16GB | large-v3 | ~6-10GB |
| ≥8GB |
medium | ~5GB |
| ≥4GB | small | ~2.5GB |
| <4GB | base | ~1.5GB |
字幕语言优先级
当未指定 --language 时,按以下顺序搜索字幕:
- 1. 视频原始语言
- 中文变体:zh-Hant、zh-Hans、zh-TW、zh-CN、zh
- 英语:en
- 任何可用语言
输出格式
text(默认)
纯文本转录,连续文本块。
json
json
{
video_id: ZSnYlbIYpjs,
title: 视频标题,
channel: 频道名称,
duration: 708,
language: zh,
method: captions,
transcript: [
{start: 0.0, end: 5.2, text: ...},
...
],
full_text: 完整转录文本作为单个字符串
}
srt / vtt
带时间戳的标准字幕格式。