YouTube Transcribe

Smart YouTube video transcription with automatic fallback:

1. Captions first: extracts existing subtitles (manual or auto-generated) via yt-dlp. Fast, free, and requires no local compute.
2. Whisper fallback: when no captions exist, downloads the audio and transcribes it locally with the best available Whisper backend.
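
The two-step order above can be sketched as a small control-flow function. This is a minimal illustration, not the skill's actual code: `fetch_captions` and `run_whisper` are hypothetical callables standing in for the yt-dlp caption step and the local Whisper step, and the `"whisper"` method label is an assumption (only `captions` appears in this document's sample output).

```python
from typing import Callable, Optional, Tuple


def transcribe_with_fallback(
    url: str,
    fetch_captions: Callable[[str], Optional[str]],  # hypothetical caption step
    run_whisper: Callable[[str], str],               # hypothetical Whisper step
    force_whisper: bool = False,
) -> Tuple[str, str]:
    """Return (method, text), trying captions before falling back to Whisper."""
    if not force_whisper:
        text = fetch_captions(url)
        if text:  # non-empty captions were found, no transcription needed
            return "captions", text
    # No captions (or the caller skipped the check): transcribe locally.
    return "whisper", run_whisper(url)


# Stand-in callables, for illustration only.
def no_captions(url: str) -> Optional[str]:
    return None


def fake_whisper(url: str) -> str:
    return "transcribed locally"


print(transcribe_with_fallback(
    "https://www.youtube.com/watch?v=VIDEO_ID", no_captions, fake_whisper
))
```

Passing the two steps in as callables keeps the ordering logic testable without network access or a Whisper install.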

When to Use

Use this skill when the user wants to:

- Get a transcript or text version of a YouTube video
- Understand what a YouTube video says without watching it
- Summarize, analyze, or take notes from a YouTube video
- Extract subtitles or captions from a video

Triggers

- "transcribe this YouTube video"
- "what does this video say"
- "get the transcript of [YouTube URL]"
- "summarize this YouTube video" (transcribe first, then process)
- Any YouTube URL shared with a request to understand its content
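
Recognizing a shared YouTube URL could look roughly like the sketch below. It is an assumption about how such a check might be written, not the skill's own logic: `extract_video_id` is a hypothetical helper, and it only handles the `watch?v=` form used in this document plus the short `youtu.be` form.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse


def extract_video_id(url: str) -> Optional[str]:
    """Return the video ID for common YouTube URL shapes, else None."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    if host == "youtu.be":
        # Short links carry the ID as the first path segment.
        video_id = parsed.path.lstrip("/").split("/")[0]
        return video_id or None
    if host in ("youtube.com", "m.youtube.com") and parsed.path == "/watch":
        # Standard links carry the ID in the "v" query parameter.
        return parse_qs(parsed.query).get("v", [None])[0]
    return None


print(extract_video_id("https://www.youtube.com/watch?v=ZSnYlbIYpjs"))
```

Other URL shapes (playlists, embeds, shorts) are deliberately left out of this sketch.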

Requirements

Required:

- yt-dlp — for caption extraction and audio download
- python3

For Whisper fallback (when no captions available):

- ffmpeg — for audio processing
- One of these Whisper backends (auto-detected in priority order):

1. mlx-whisper — Apple Silicon native, fastest on Mac (pip install mlx-whisper)
2. faster-whisper — CTranslate2 backend, fast on CUDA/CPU (pip install faster-whisper)
3. openai-whisper — Original Whisper, universal fallback (pip install openai-whisper)

Usage

Basic — transcribe a video

    python3 {baseDir}/scripts/transcribe.py https://www.youtube.com/watch?v=VIDEO_ID

Specify language for captions

    python3 {baseDir}/scripts/transcribe.py URL --language zh

Force Whisper (skip caption check)

    python3 {baseDir}/scripts/transcribe.py URL --force-whisper

JSON output

    python3 {baseDir}/scripts/transcribe.py URL --format json

Save to file

    python3 {baseDir}/scripts/transcribe.py URL --output transcript.txt
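
When the script is called from other Python code, the command line can be assembled programmatically. The helper below is a hypothetical sketch: `build_command` is not part of the skill, and it only uses the flags that appear in this document's usage examples (`--language`, `--force-whisper`, `--format`, `--output`). The script path is passed in by the caller because `{baseDir}` is a placeholder.

```python
from typing import List, Optional


def build_command(
    script_path: str,
    url: str,
    language: Optional[str] = None,
    force_whisper: bool = False,
    output_format: Optional[str] = None,
    output: Optional[str] = None,
) -> List[str]:
    """Assemble an argv list for transcribe.py from optional settings."""
    cmd = ["python3", script_path, url]
    if language:
        cmd += ["--language", language]
    if force_whisper:
        cmd.append("--force-whisper")
    if output_format:
        cmd += ["--format", output_format]
    if output:
        cmd += ["--output", output]
    return cmd


print(build_command("scripts/transcribe.py", "URL", language="zh", output_format="json"))
```

The resulting list can be handed to `subprocess.run(cmd, capture_output=True, text=True, check=True)`; building a list rather than a shell string avoids quoting problems with URLs that contain `&` or `?`.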

Options

Flag	Default	Description
`--language`	auto	Preferred subtitle/transcription language (e.g. `zh`, `en`, `ja`)
`--format`	`text`	Output format: `text`, `json`, `srt`, or `vtt`

Environment Variables

Variable	Description
`YTWHISPERBACKEND`	Override Whisper backend selection
`YTWHISPERMODEL`	Override Whisper model size

Auto-Detection

Whisper Backend (priority order)

1. MLX Whisper: detected via `import mlx_whisper`. Best for Apple Silicon.
2. faster-whisper: detected via `import faster_whisper`. Best for CUDA GPU, good on CPU.
3. OpenAI Whisper: detected via `import whisper`. Universal fallback.
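
The priority order above can be sketched as follows. Note one deliberate swap: the document describes detection via an actual `import`, while this sketch uses `importlib.util.find_spec`, which checks that a module is installed without loading it. The `detect_backend` function, its `override` parameter (standing in for the backend environment variable), and the returned labels are all assumptions for illustration.

```python
import importlib.util
from typing import Callable, Optional, Sequence, Tuple

# (label, importable module name), in the priority order listed above.
BACKENDS: Sequence[Tuple[str, str]] = (
    ("mlx-whisper", "mlx_whisper"),
    ("faster-whisper", "faster_whisper"),
    ("openai-whisper", "whisper"),
)


def module_available(name: str) -> bool:
    """True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(name) is not None


def detect_backend(
    is_available: Callable[[str], bool] = module_available,
    override: Optional[str] = None,  # assumed to mirror the env-var override
) -> Optional[str]:
    """Return the first installed backend label, or None if none is found."""
    if override:
        return override
    for label, module in BACKENDS:
        if is_available(module):
            return label
    return None


print(detect_backend())
```

The availability check is injectable so the ordering can be exercised on a machine with none of the backends installed.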

Model Size (based on available RAM)

RAM	Model	VRAM/RAM Usage
≥16GB	`large-v3`	~6-10GB
≥8GB	`medium`	~5GB
≥4GB	`small`	~2.5GB
<4GB	`base`	~1.5GB
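
The RAM thresholds map onto a simple lookup. This is a sketch of the table only: `pick_model` is a hypothetical name, and how the script actually measures available RAM is not shown in this document, so the amount is taken as a parameter.

```python
def pick_model(ram_gb: float) -> str:
    """Map available RAM (in GB) to a Whisper model size per the table."""
    if ram_gb >= 16:
        return "large-v3"
    if ram_gb >= 8:
        return "medium"
    if ram_gb >= 4:
        return "small"
    return "base"


for ram in (32, 12, 6, 2):
    print(ram, pick_model(ram))
```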

Caption Language Priority

When --language is not specified, captions are searched in this order:

1. Video's original language
2. Chinese variants: zh-Hant, zh-Hans, zh-TW, zh-CN, zh
3. English: en
4. Any available language
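
A minimal sketch of that search order, assuming exact matching on language codes (the real script may match more loosely). `pick_caption_language`, `available`, and `original` are hypothetical names introduced for illustration.

```python
from typing import Optional, Sequence

CHINESE_VARIANTS = ("zh-Hant", "zh-Hans", "zh-TW", "zh-CN", "zh")
ENGLISH = ("en",)


def pick_caption_language(
    available: Sequence[str], original: Optional[str] = None
) -> Optional[str]:
    """Pick a caption language code following the documented priority."""
    candidates = []
    if original:
        candidates.append(original)      # 1. video's original language
    candidates.extend(CHINESE_VARIANTS)  # 2. Chinese variants
    candidates.extend(ENGLISH)           # 3. English
    for code in candidates:
        if code in available:
            return code
    # 4. Otherwise take any available language, if there is one.
    return available[0] if available else None


print(pick_caption_language(["en", "zh-CN"], original="ja"))
```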

Output Formats

text (default)

Plain text transcript, one continuous block.

json

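    {
      "video_id": "ZSnYlbIYpjs",
      "title": "Video title",
      "channel": "Channel name",
      "duration": 708,
      "language": "zh",
      "method": "captions",
      "transcript": [
        {"start": 0.0, "end": 5.2, "text": "..."},
        ...
      ],
      "full_text": "Full transcript text as a single string"
    }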
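
Downstream code can read the JSON output with the standard library. The sketch below assumes the field names shown in this document's sample output and that `duration` is in seconds, which the document does not state; `describe_result` is a hypothetical helper.

```python
import json


def describe_result(raw: str) -> str:
    """Build a one-line summary from the script's JSON output."""
    data = json.loads(raw)
    # Assumption: duration is expressed in seconds.
    minutes, seconds = divmod(int(data["duration"]), 60)
    return (
        f'{data["title"]} ({data["channel"]}), {minutes}:{seconds:02d}, '
        f'{len(data["transcript"])} segment(s) via {data["method"]}'
    )


sample = json.dumps({
    "title": "Video title",
    "channel": "Channel name",
    "duration": 708,
    "method": "captions",
    "transcript": [{"start": 0.0, "end": 5.2, "text": "..."}],
})
print(describe_result(sample))
```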

srt / vtt

Standard subtitle formats with timestamps.
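
To show how timestamped segments relate to SRT, here is a rough conversion from the `transcript` list in the JSON output to SRT text. It is an independent sketch, not the script's own writer; `srt_timestamp` and `to_srt` are hypothetical names.

```python
from typing import Dict, List


def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: List[Dict]) -> str:
    """Render segments with start, end, and text keys as SRT cues."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text']}\n"
        )
    # Cues are separated by a blank line.
    return "\n".join(cues)


print(to_srt([{"start": 0.0, "end": 5.2, "text": "Hello"}]))
```

WebVTT differs mainly in its `WEBVTT` header line and in using a period instead of a comma before the milliseconds.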

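YouTube Transcribe

Smart YouTube video transcription with automatic fallback:

1. Captions first: extracts existing subtitles (manual or auto-generated) via yt-dlp. Fast, free, and requires no local compute.
2. Whisper fallback: when no captions exist, downloads the audio and transcribes it locally with the best available Whisper backend.

When to Use

Use this skill when the user wants to:

- Get a transcript or text version of a YouTube video
- Understand what a YouTube video says without watching it
- Summarize, analyze, or take notes from a YouTube video
- Extract subtitles or captions from a video

Triggers

- "transcribe this YouTube video"
- "what does this video say"
- "get the transcript of [YouTube URL]"
- "summarize this YouTube video" (transcribe first, then process)
- Any YouTube URL shared with a request to understand its content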

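Requirements

Required:

- yt-dlp: for caption extraction and audio download
- python3

For Whisper fallback (when no captions are available):

- ffmpeg: for audio processing
- One of these Whisper backends (auto-detected in priority order):

1. mlx-whisper: Apple Silicon native, fastest on Mac (pip install mlx-whisper)
2. faster-whisper: CTranslate2 backend, fast on CUDA/CPU (pip install faster-whisper)
3. openai-whisper: original Whisper, universal fallback (pip install openai-whisper)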

Usage

Basic: transcribe a video

    python3 {baseDir}/scripts/transcribe.py https://www.youtube.com/watch?v=VIDEO_ID

Specify language for captions

    python3 {baseDir}/scripts/transcribe.py URL --language zh

Force Whisper (skip caption check)

    python3 {baseDir}/scripts/transcribe.py URL --force-whisper

JSON output

    python3 {baseDir}/scripts/transcribe.py URL --format json

Save to file

    python3 {baseDir}/scripts/transcribe.py URL --output transcript.txt

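Options

Flag	Default	Description
`--language`	auto	Preferred subtitle/transcription language (e.g. `zh`, `en`, `ja`)
`--format`	`text`	Output format: `text`, `json`, `srt`, or `vtt`

Environment Variables

Variable	Description
`YTWHISPERBACKEND`	Override Whisper backend selection
`YTWHISPERMODEL`	Override Whisper model size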

Auto-Detection

Whisper Backend (priority order)

1. MLX Whisper: detected via `import mlx_whisper`. Best for Apple Silicon.
2. faster-whisper: detected via `import faster_whisper`. Best for CUDA GPU, good on CPU.
3. OpenAI Whisper: detected via `import whisper`. Universal fallback.

Model Size (based on available RAM)

RAM	Model	VRAM/RAM Usage
≥16GB	`large-v3`	~6-10GB
≥8GB	`medium`	~5GB
≥4GB	`small`	~2.5GB
<4GB	`base`	~1.5GB

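Caption Language Priority

When --language is not specified, captions are searched in this order:

1. Video's original language
2. Chinese variants: zh-Hant, zh-Hans, zh-TW, zh-CN, zh
3. English: en
4. Any available language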

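Output Formats

text (default)

Plain text transcript, one continuous block.

json

    {
      "video_id": "ZSnYlbIYpjs",
      "title": "Video title",
      "channel": "Channel name",
      "duration": 708,
      "language": "zh",
      "method": "captions",
      "transcript": [
        {"start": 0.0, "end": 5.2, "text": "..."},
        ...
      ],
      "full_text": "Full transcript text as a single string"
    }

srt / vtt

Standard subtitle formats with timestamps.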
