Video Subtitle Generator
Multilingual video subtitle generation and translation toolkit built on WhisperX.
Features
- Speech transcription: Extract audio from video and transcribe it into subtitles with automatic source language detection
- Multilingual translation: Translate subtitles from any source language into a configurable target language
- Bilingual subtitles: Generate source + target bilingual subtitles
Prerequisites
- Python 3.9+
- ffmpeg (required by WhisperX for audio extraction)
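```bash
# macOS
brew install ffmpeg

# Ubuntu / Debian
sudo apt install ffmpeg

# Windows (Chocolatey)
choco install ffmpeg

# Windows (Scoop)
scoop install ffmpeg
```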
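A quick preflight check can confirm both prerequisites before anything is installed or downloaded. The sketch below is illustrative only and uses just the standard library; the function name `check_prerequisites` is hypothetical and not part of the toolkit.

```python
import shutil
import sys


def check_prerequisites(min_python=(3, 9)):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if sys.version_info < min_python:
        problems.append(
            f"Python {min_python[0]}.{min_python[1]}+ required, "
            f"found {sys.version_info.major}.{sys.version_info.minor}"
        )
    # shutil.which returns None when the executable is not on PATH.
    if shutil.which("ffmpeg") is None:
        problems.append("ffmpeg not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("Missing prerequisite:", problem)
```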
Resource requirements
Before running, confirm the user is aware of the following costs:
| Resource | Details |
|---|---|
| Disk | ffmpeg ~80 MB; Python packages (torch, whisperx, etc.) 2–5 GB; Whisper model weights 39 MB – 1.5 GB depending on model size |
| CPU / GPU | WhisperX runs model inference locally. A CUDA GPU is strongly recommended for `medium` and `large` models. CPU and Apple MPS also work but are significantly slower |
| Network / API | Translation step calls a remote LLM API and incurs token-based charges. No network is needed for the transcription step once the model is downloaded |
Always confirm with the user before installing packages or downloading models, as these operations consume storage and bandwidth.
Translation requires an LLM API and will incur costs. Before executing the translation step:
1. Ask the user for the API provider, key, and base URL, or present any auto-discovered configuration for review
2. Inform the user that translation calls a remote LLM and will consume tokens (i.e. real money)
3. Do NOT proceed with translation until the user explicitly confirms the provider and acknowledges the cost
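The confirmation steps above could be enforced with a small gate in front of the translation call. This is a minimal sketch under stated assumptions: `confirm_translation` is a hypothetical helper, not part of the toolkit, and the prompt wording is illustrative. It deliberately never prints the API key.

```python
def confirm_translation(provider, base_url, ask=input):
    """Ask for explicit consent before any billable API call is made."""
    print(f"Translation will call a remote LLM via {provider} ({base_url}).")
    print("This consumes tokens and will be billed to your account.")
    answer = ask("Type 'yes' to confirm the provider and accept the cost: ")
    # Only an explicit 'yes' counts; anything else aborts.
    return answer.strip().lower() == "yes"
```

Passing `ask` as a parameter keeps the gate testable without a terminal; in interactive use the default `input` is enough.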
Usage
1. Environment setup
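```bash
# Install dependencies (PyTorch and WhisperX need about 2–5 GB of disk space)
pip install -r requirements.txt

# Set the API key (used for translation)
# macOS / Linux
export OPENAI_API_KEY=your-api-key
export OPENAI_BASE_URL=https://openrouter.ai/api/v1  # Optional, defaults to OpenRouter

# Windows (PowerShell)
$env:OPENAI_API_KEY="your-api-key"
$env:OPENAI_BASE_URL="https://openrouter.ai/api/v1"
```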
On Windows, use python instead of python3 in all commands below.
2. Transcribe video (auto-detect language)
```bash
python3 scripts/transcribe.py /path/to/video.mp4 -o ./output -m small
```
Output: video.{detected_lang}.srt (e.g. video.en.srt, video.ja.srt)
Arguments:
- `-o`: Output directory
- `-m`: Model size (`tiny`, `base`, `small`, `medium`, `large`)
- `-d`: Device (`cuda`, `cpu`, `mps`), auto-detected by default
- `-l`: Force source language code (e.g. `en`, `ja`, `zh`). Auto-detect if omitted
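The output naming described above (`video.{detected_lang}.srt` inside the `-o` directory) can be sketched as follows. This illustrates the naming convention only; `srt_output_path` is a hypothetical helper and not a function from `scripts/transcribe.py`.

```python
from pathlib import Path


def srt_output_path(video_path, output_dir, detected_lang):
    """Build <output_dir>/<video stem>.<lang>.srt for a transcribed video."""
    stem = Path(video_path).stem  # "video.mp4" -> "video"
    return Path(output_dir) / f"{stem}.{detected_lang}.srt"
```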
3. Batch-process a directory
```bash
python3 scripts/transcribe.py /path/to/video/folder -o ./output -m small
```
4. Translate subtitles
Cost warning: This step calls a remote LLM API. Ensure the user has confirmed the API provider, key, and billing awareness before running.
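```bash
# Translate to Chinese (default)
python3 scripts/translate.py ./output -o ./translated

# Translate to Japanese
python3 scripts/translate.py ./output -o ./translated -t ja

# Generate bilingual subtitles only
python3 scripts/translate.py ./output -o ./translated --bilingual
```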
Arguments:
- `-t`, `--target-lang`: Target language code (default: `zh`)
- `--bilingual`: Generate bilingual (source + target) subtitles
- `--target-only`: Generate target-language-only subtitles
- `--model`: Translation model (default: `google/gemini-3-flash-preview`)
- `--batch-size`: Batch size (default: `10`)
When neither --bilingual nor --target-only is specified, both are generated.
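The flag defaults and the "both when neither is given" rule can be expressed with `argparse` roughly as below. This is a sketch, not the actual parser in `scripts/translate.py`: the positional name `input_dir`, the `-o` default, and the handling of both flags together are assumptions.

```python
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Translate subtitles (sketch)")
    parser.add_argument("input_dir")  # assumed positional name
    parser.add_argument("-o", dest="output_dir", default="./translated")  # assumed default
    parser.add_argument("-t", "--target-lang", default="zh")
    parser.add_argument("--bilingual", action="store_true")
    parser.add_argument("--target-only", action="store_true")
    parser.add_argument("--model", default="google/gemini-3-flash-preview")
    parser.add_argument("--batch-size", type=int, default=10)
    args = parser.parse_args(argv)
    # With neither flag set, fall back to producing both outputs.
    if not args.bilingual and not args.target_only:
        args.bilingual = args.target_only = True
    return args
```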
5. Run the full pipeline
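```bash
python3 scripts/run.py

# Customize via environment variables
VIDEO_DIR=/path/to/videos TARGET_LANG=en python3 scripts/run.py
```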
Environment variables for run.py:
- `VIDEO_DIR`: Video source directory (default: `./videos`)
- `OUTPUT_DIR`: Transcription output directory (default: `./output`)
- `TRANSLATED_DIR`: Translation output directory (default: `./translated`)
- `TARGET_LANG`: Target language code (default: `zh`)
- `WHISPER_MODEL`: Whisper model size (default: `medium`)
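Reading these variables with their documented defaults might look like the following. It is a sketch of the pattern only; `DEFAULTS` and `load_config` are hypothetical names and may not match how `run.py` is actually written.

```python
import os

# Defaults as documented for run.py.
DEFAULTS = {
    "VIDEO_DIR": "./videos",
    "OUTPUT_DIR": "./output",
    "TRANSLATED_DIR": "./translated",
    "TARGET_LANG": "zh",
    "WHISPER_MODEL": "medium",
}


def load_config(environ=None):
    """Overlay environment variables on the documented defaults."""
    environ = os.environ if environ is None else environ
    return {name: environ.get(name, default) for name, default in DEFAULTS.items()}
```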
Model selection
| Model | Size | Speed | Accuracy | Best for |
|---|---|---|---|---|
| tiny | 39 MB | Fastest | Fair | Quick tests |
| base | 74 MB | Fast | Good | Real-time usage |
| small | 244 MB | Medium | Good | Recommended |
| medium | 769 MB | Slower | Very good | Higher quality |
| large | 1550 MB | Slow | Best | Professional use |
Output files
For each video, the tool generates:
- `*.{lang}.srt` - Source-language subtitles (language auto-detected, e.g. `video.en.srt`)
- `*.json` - Full transcription data with timestamps
- `*.bilingual.srt` - Bilingual subtitles (source + target) after translation
- `*.{target}.srt` - Target-language-only subtitles after translation (e.g. `video.zh.srt`)
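For readers unfamiliar with the SRT layout, a bilingual cue could be rendered as sketched below. The timestamp form `HH:MM:SS,mmm` is standard SRT; placing the source line above the target line is an assumption, and both function names are hypothetical rather than taken from the scripts.

```python
def format_timestamp(seconds):
    """Convert seconds to the SRT HH:MM:SS,mmm form."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def bilingual_cue(index, start, end, source_text, target_text):
    """Render one SRT cue with the source line above the target line (assumed order)."""
    return (
        f"{index}\n"
        f"{format_timestamp(start)} --> {format_timestamp(end)}\n"
        f"{source_text}\n{target_text}\n"
    )
```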
Script overview
scripts/transcribe.py
Uses WhisperX for transcription and supports:
- Automatic source language detection (or manual override via `-l`)
- Timestamp alignment
- Batch processing with model reuse across files
scripts/translate.py
Uses an LLM API to translate subtitles and supports:
- Configurable target language (`-t`)
- Batch translation for better efficiency
- Bilingual or target-language-only output
- Custom models and API endpoints
- Automatic retry with exponential backoff on API failures
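Batch translation combined with retry and exponential backoff can be sketched as follows. This is not the code in `scripts/translate.py`: `translate_batch` stands in for a hypothetical API wrapper, and the attempt count and base delay are assumed values chosen for illustration.

```python
import time


def batched(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def with_retry(call, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Run call(); on failure wait base_delay * 2**attempt, then retry."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


def translate_all(lines, translate_batch, batch_size=10):
    """Translate lines batch by batch; translate_batch is a hypothetical API wrapper."""
    translated = []
    for batch in batched(lines, batch_size):
        translated.extend(with_retry(lambda: translate_batch(batch)))
    return translated
```

Sending several subtitle lines per request reduces the number of API calls, while the doubling delay gives a rate-limited or briefly unavailable endpoint time to recover.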
scripts/run.py
Cross-platform one-command runner that executes the transcription and translation pipeline automatically.
Paths, target language, and model size are configurable via environment variables.