Video Subtitle Generator

Multilingual video subtitle generation and translation toolkit built on WhisperX.

Features

- Speech transcription: Extract audio from video and transcribe it into subtitles with automatic source language detection
- Multilingual translation: Translate subtitles from any source language into a configurable target language
- Bilingual subtitles: Generate source + target bilingual subtitles

Prerequisites

- Python 3.9+
- ffmpeg (required by WhisperX for audio extraction)

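```bash
# macOS
brew install ffmpeg

# Ubuntu / Debian
sudo apt install ffmpeg

# Windows (Chocolatey)
choco install ffmpeg

# Windows (Scoop)
scoop install ffmpeg
```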

Resource requirements

Before running, confirm the user is aware of the following costs:

| Resource | Details |
|---|---|
| Disk | ffmpeg ~80 MB; Python packages (torch, whisperx, etc.) 2–5 GB; Whisper model weights 39 MB – 1.5 GB depending on model size |
| CPU / GPU | WhisperX runs model inference locally. A CUDA GPU is strongly recommended for medium and large models. CPU and Apple MPS also work but are significantly slower |
| Network / API | Translation step calls a remote LLM API and incurs token-based charges. No network is needed for the transcription step once the model is downloaded |

Always confirm with the user before installing packages or downloading models, as these operations consume storage and bandwidth.

Translation requires an LLM API and will incur costs. Before executing the translation step:

1. Ask the user for the API provider, key, and base URL, or present any auto-discovered configuration for review
2. Inform the user that translation calls a remote LLM and will consume tokens (i.e. real money)
3. Do NOT proceed with translation until the user explicitly confirms the provider and acknowledges the cost
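
The three steps above can be sketched as a small confirmation gate. This is only an illustration, not code from the toolkit: `TranslationConfig`, `mask_key`, and `confirm_translation` are hypothetical names, and the choice to require a literal "yes" is an assumption about what "explicitly confirms" could mean in practice.

```python
from dataclasses import dataclass


@dataclass
class TranslationConfig:
    """Hypothetical container for the settings the user must review."""
    provider: str
    base_url: str
    api_key: str


def mask_key(key: str) -> str:
    """Hide all but the last four characters of a key before displaying it."""
    if len(key) <= 4:
        return "*" * len(key)
    return "*" * (len(key) - 4) + key[-4:]


def confirm_translation(config: TranslationConfig, ask=input) -> bool:
    """Return True only when the user answers with an explicit 'yes'."""
    prompt = (
        f"Provider: {config.provider}\n"
        f"Base URL: {config.base_url}\n"
        f"API key:  {mask_key(config.api_key)}\n"
        "Translation calls a remote LLM and is billed per token.\n"
        "Type 'yes' to proceed: "
    )
    return ask(prompt).strip().lower() == "yes"
```

Passing `ask` as a parameter keeps the gate testable and lets a caller swap in a different prompt mechanism; anything other than "yes" is treated as a refusal.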

Usage

1. Environment setup

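```bash
# Install dependencies (PyTorch and WhisperX need about 2–5 GB of disk space)
pip install -r requirements.txt

# Set the API key (used for translation)
# macOS / Linux
export OPENAI_API_KEY=your-api-key
export OPENAI_BASE_URL=https://openrouter.ai/api/v1  # optional, defaults to OpenRouter

# Windows (PowerShell)
$env:OPENAI_API_KEY="your-api-key"
$env:OPENAI_BASE_URL="https://openrouter.ai/api/v1"
```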

On Windows, use python instead of python3 in all commands below.

2. Transcribe video (auto-detect language)

```bash
python3 scripts/transcribe.py /path/to/video.mp4 -o ./output -m small
```

Output: video.{detected_lang}.srt (e.g. video.en.srt, video.ja.srt)
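
The naming pattern above can be expressed in a few lines. This is a minimal sketch under the assumption that the output name is the video's stem plus the detected language code; `srt_path` is a hypothetical helper, not a function exported by the scripts.

```python
from pathlib import Path


def srt_path(video: str, output_dir: str, lang: str) -> Path:
    """Build <output_dir>/<video stem>.<lang>.srt for a given video file."""
    return Path(output_dir) / f"{Path(video).stem}.{lang}.srt"


if __name__ == "__main__":
    print(srt_path("/path/to/video.mp4", "./output", "ja"))
```

Only the final extension is stripped, so a file named `my.clip.mp4` would map to `my.clip.ja.srt` in this sketch.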

Arguments:

- -o: Output directory
- -m: Model size (tiny, base, small, medium, large)
- -d: Device (cuda, cpu, mps), auto-detected by default
- -l: Force source language code (e.g. en, ja, zh). Auto-detect if omitted
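
For readers who want to see how such a command line could be wired up, here is a hedged `argparse` sketch. It mirrors the documented short flags (`-o`, `-m`, `-d`, `-l`), but the `dest` names and the defaults for `-o` and `-m` are assumptions, and the real `scripts/transcribe.py` may define its options differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented transcribe.py flags."""
    parser = argparse.ArgumentParser(description="Transcribe video into SRT subtitles")
    parser.add_argument("input", help="Video file, or a directory of videos")
    # Defaults for -o and -m are assumed for illustration only.
    parser.add_argument("-o", dest="output_dir", default="./output")
    parser.add_argument("-m", dest="model", default="small",
                        choices=["tiny", "base", "small", "medium", "large"])
    # None means "auto-detect" for both device and language.
    parser.add_argument("-d", dest="device", default=None,
                        choices=["cuda", "cpu", "mps"])
    parser.add_argument("-l", dest="language", default=None)
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args(["video.mp4", "-m", "tiny", "-l", "ja"]))
```

Using `None` as the default for `-d` and `-l` lets later code distinguish "not specified" from an explicit choice.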

3. Batch-process a directory

```bash
python3 scripts/transcribe.py /path/to/video/folder -o ./output -m small
```

4. Translate subtitles

Cost warning: This step calls a remote LLM API. Ensure the user has confirmed the API provider, key, and billing awareness before running.

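```bash
# Translate to Chinese (default)
python3 scripts/translate.py ./output -o ./translated

# Translate to Japanese
python3 scripts/translate.py ./output -o ./translated -t ja

# Generate bilingual subtitles only
python3 scripts/translate.py ./output -o ./translated --bilingual
```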

Arguments:

- -t, --target-lang: Target language code (default: zh)
- --bilingual: Generate bilingual (source + target) subtitles
- --target-only: Generate target-language-only subtitles
- --model: Translation model (default: google/gemini-3-flash-preview)
- --batch-size: Batch size (default: 10)

When neither --bilingual nor --target-only is specified, both are generated.
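
The rule for the two output flags can be written as a small function. This is a sketch of the documented behaviour only; `resolve_outputs` and the mode labels `"bilingual"` and `"target"` are hypothetical names.

```python
def resolve_outputs(bilingual: bool, target_only: bool) -> set[str]:
    """Map the --bilingual / --target-only flags to the outputs to generate."""
    # With neither flag set, both kinds of subtitle file are produced.
    if not bilingual and not target_only:
        return {"bilingual", "target"}
    modes = set()
    if bilingual:
        modes.add("bilingual")
    if target_only:
        modes.add("target")
    return modes


if __name__ == "__main__":
    print(resolve_outputs(False, False))
    print(resolve_outputs(True, False))
```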

5. Run the full pipeline

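```bash
python3 scripts/run.py

# Customize via environment variables
VIDEO_DIR=/path/to/videos TARGET_LANG=en python3 scripts/run.py
```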

Environment variables for run.py:

- VIDEO_DIR: Video source directory (default: ./videos)
- OUTPUT_DIR: Transcription output directory (default: ./output)
- TRANSLATED_DIR: Translation output directory (default: ./translated)
- TARGET_LANG: Target language code (default: zh)
- WHISPER_MODEL: Whisper model size (default: medium)
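
Reading these settings with their documented defaults might look like the following. The variable names and defaults follow the run.py documentation (VIDEO_DIR, OUTPUT_DIR, TRANSLATED_DIR, TARGET_LANG, WHISPER_MODEL); `DEFAULTS` and `load_settings` are hypothetical names, and the real run.py may read its environment differently.

```python
import os

# Defaults as documented for run.py.
DEFAULTS = {
    "VIDEO_DIR": "./videos",
    "OUTPUT_DIR": "./output",
    "TRANSLATED_DIR": "./translated",
    "TARGET_LANG": "zh",
    "WHISPER_MODEL": "medium",
}


def load_settings(env=None) -> dict:
    """Return every setting, falling back to the default when a variable is unset."""
    env = os.environ if env is None else env
    return {name: env.get(name, default) for name, default in DEFAULTS.items()}


if __name__ == "__main__":
    for name, value in load_settings().items():
        print(f"{name}={value}")
```

Accepting an optional mapping instead of always reading `os.environ` makes the function easy to exercise without touching the real environment.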

Model selection

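| Model | Size | Speed | Accuracy | Best for |
|---|---|---|---|---|
| tiny | 39 MB | Fastest | Fair | Quick tests |
| base | 74 MB | Fast | Good | Real-time use |
| small | 244 MB | Medium | Good | Recommended |
| medium | 769 MB | Slower | Very good | Higher quality |
| large | 1550 MB | Slow | Best | Professional use |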

Output files

For each video, the tool generates:

- *.{lang}.srt - Source-language subtitles (language auto-detected, e.g. video.en.srt)
- *.json - Full transcription data with timestamps
- *.bilingual.srt - Bilingual subtitles (source + target) after translation
- *.{target}.srt - Target-language-only subtitles after translation (e.g. video.zh.srt)
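
To make the bilingual format concrete, the sketch below renders one SRT cue with two text lines. The SRT layout itself (index, `HH:MM:SS,mmm --> HH:MM:SS,mmm`, text) is standard; placing the source line above the target line is an assumption about how the bilingual files are laid out, and both function names are hypothetical.

```python
def format_timestamp(seconds: float) -> str:
    """Convert seconds to the SRT timestamp form HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def bilingual_cue(index: int, start: float, end: float, source: str, target: str) -> str:
    """Render one cue with the source text above the target text (assumed order)."""
    return (
        f"{index}\n"
        f"{format_timestamp(start)} --> {format_timestamp(end)}\n"
        f"{source}\n"
        f"{target}\n"
    )


if __name__ == "__main__":
    print(bilingual_cue(1, 0.0, 2.5, "Hello", "你好"))
```

Cues in an SRT file are separated by a blank line, so a full file could be produced by joining the rendered cues with `"\n"`.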

Script overview

scripts/transcribe.py

Uses WhisperX for transcription and supports:

- Automatic source language detection (or manual override via -l)
- Timestamp alignment
- Batch processing with model reuse across files

scripts/translate.py

Uses an LLM API to translate subtitles and supports:

- Configurable target language (-t)
- Batch translation for better efficiency
- Bilingual or target-language-only output
- Custom models and API endpoints
- Automatic retry with exponential backoff on API failures
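
Retry with exponential backoff, the last item above, generally follows the pattern sketched here. This is a generic illustration rather than the code in translate.py: `with_retry`, the attempt count, and the base delay are all assumptions.

```python
import time


def with_retry(call, max_attempts: int = 4, base_delay: float = 1.0, sleep=time.sleep):
    """Run call(); on failure wait base_delay * 2**attempt, then try again."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            # Re-raise once the final attempt has failed.
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


if __name__ == "__main__":
    attempts = {"count": 0}

    def flaky():
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise RuntimeError("temporary failure")
        return "ok"

    print(with_retry(flaky, base_delay=0.01))
```

With the defaults shown, the waits between attempts would be 1, 2, and 4 seconds. A production version would usually catch only the API client's transient error types rather than every `Exception`.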

scripts/run.py

Cross-platform one-command runner that executes the transcription and translation pipeline automatically.
Paths, target language, and model size are configurable via environment variables.

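Video Subtitle Generator

Multilingual video subtitle generation and translation toolkit built on WhisperX.

Features

- Speech transcription: Extract audio from video and transcribe it into subtitles with automatic source language detection
- Multilingual translation: Translate subtitles from any source language into a configurable target language
- Bilingual subtitles: Generate source + target bilingual subtitles

Prerequisites

- Python 3.9+
- ffmpeg (required by WhisperX for audio extraction)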

```bash
# macOS
brew install ffmpeg

# Ubuntu / Debian
sudo apt install ffmpeg

# Windows (Chocolatey)
choco install ffmpeg

# Windows (Scoop)
scoop install ffmpeg
```

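Resource requirements

Before running, confirm the user is aware of the following costs:

| Resource | Details |
|---|---|
| Disk | ffmpeg ~80 MB; Python packages (torch, whisperx, etc.) 2–5 GB; Whisper model weights 39 MB – 1.5 GB depending on model size |
| CPU / GPU | WhisperX runs model inference locally. A CUDA GPU is strongly recommended for medium and large models. CPU and Apple MPS also work but are significantly slower |
| Network / API | Translation step calls a remote LLM API and incurs token-based charges. No network is needed for the transcription step once the model is downloaded |

Always confirm with the user before installing packages or downloading models, as these operations consume storage and bandwidth.

Translation requires an LLM API and will incur costs. Before executing the translation step:

1. Ask the user for the API provider, key, and base URL, or present any auto-discovered configuration for review
2. Inform the user that translation calls a remote LLM and will consume tokens (i.e. real money)
3. Do NOT proceed with translation until the user explicitly confirms the provider and acknowledges the cost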

Usage

1. Environment setup

```bash
# Install dependencies (PyTorch and WhisperX need about 2–5 GB of disk space)
pip install -r requirements.txt

# Set the API key (used for translation)
# macOS / Linux
export OPENAI_API_KEY=your-api-key
export OPENAI_BASE_URL=https://openrouter.ai/api/v1  # optional, defaults to OpenRouter

# Windows (PowerShell)
$env:OPENAI_API_KEY="your-api-key"
$env:OPENAI_BASE_URL="https://openrouter.ai/api/v1"
```

On Windows, use python instead of python3 in all commands below.

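2. Transcribe video (auto-detect language)

```bash
python3 scripts/transcribe.py /path/to/video.mp4 -o ./output -m small
```

Output: video.{detected_lang}.srt (e.g. video.en.srt, video.ja.srt)

Arguments:

- -o: Output directory
- -m: Model size (tiny, base, small, medium, large)
- -d: Device (cuda, cpu, mps), auto-detected by default
- -l: Force source language code (e.g. en, ja, zh). Auto-detect if omitted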
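
Device auto-detection could follow a simple preference order, sketched below. The order (CUDA, then MPS, then CPU) and the `pick_device` helper are assumptions for illustration; the availability flags are passed in as plain booleans so the sketch runs without PyTorch installed.

```python
from typing import Optional


def pick_device(requested: Optional[str] = None,
                cuda_available: bool = False,
                mps_available: bool = False) -> str:
    """Honor an explicit -d choice, otherwise prefer cuda, then mps, then cpu."""
    if requested:
        return requested
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


if __name__ == "__main__":
    print(pick_device(cuda_available=False, mps_available=True))
```

In a real script the two flags would typically come from `torch.cuda.is_available()` and `torch.backends.mps.is_available()`.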

3. Batch-process a directory

```bash
python3 scripts/transcribe.py /path/to/video/folder -o ./output -m small
```

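4. Translate subtitles

Cost warning: This step calls a remote LLM API. Ensure the user has confirmed the API provider, key, and billing awareness before running.

```bash
# Translate to Chinese (default)
python3 scripts/translate.py ./output -o ./translated

# Translate to Japanese
python3 scripts/translate.py ./output -o ./translated -t ja

# Generate bilingual subtitles only
python3 scripts/translate.py ./output -o ./translated --bilingual
```

Arguments:

- -t, --target-lang: Target language code (default: zh)
- --bilingual: Generate bilingual (source + target) subtitles
- --target-only: Generate target-language-only subtitles
- --model: Translation model (default: google/gemini-3-flash-preview)
- --batch-size: Batch size (default: 10)

When neither --bilingual nor --target-only is specified, both are generated.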

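5. Run the full pipeline

```bash
python3 scripts/run.py

# Customize via environment variables
VIDEO_DIR=/path/to/videos TARGET_LANG=en python3 scripts/run.py
```

Environment variables for run.py:

- VIDEO_DIR: Video source directory (default: ./videos)
- OUTPUT_DIR: Transcription output directory (default: ./output)
- TRANSLATED_DIR: Translation output directory (default: ./translated)
- TARGET_LANG: Target language code (default: zh)
- WHISPER_MODEL: Whisper model size (default: medium)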

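Model selection

| Model | Size | Speed | Accuracy | Best for |
|---|---|---|---|---|
| tiny | 39 MB | Fastest | Fair | Quick tests |
| base | 74 MB | Fast | Good | Real-time use |
| small | 244 MB | Medium | Good | Recommended |
| medium | 769 MB | Slower | Very good | Higher quality |
| large | 1550 MB | Slow | Best | Professional use |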

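Output files

For each video, the tool generates:

- *.{lang}.srt - Source-language subtitles (language auto-detected, e.g. video.en.srt)
- *.json - Full transcription data with timestamps
- *.bilingual.srt - Bilingual subtitles (source + target) after translation
- *.{target}.srt - Target-language-only subtitles after translation (e.g. video.zh.srt)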

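Script overview

scripts/transcribe.py

Uses WhisperX for transcription and supports:

- Automatic source language detection (or manual override via -l)
- Timestamp alignment
- Batch processing with model reuse across files

scripts/translate.py

Uses an LLM API to translate subtitles and supports:

- Configurable target language (-t)
- Batch translation for better efficiency
- Bilingual or target-language-only output
- Custom models and API endpoints
- Automatic retry with exponential backoff on API failures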
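
Batch translation usually means grouping subtitle lines so that each API request carries several of them. The sketch below shows one way to split a list into groups of `--batch-size` (default 10); `batched` is a hypothetical helper and not necessarily how translate.py does it.

```python
def batched(items: list, size: int = 10):
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


if __name__ == "__main__":
    lines = [f"subtitle line {i}" for i in range(25)]
    for group in batched(lines):
        print(len(group))
```

Larger batches mean fewer requests, at the cost of longer prompts and more text to retranslate when a single request fails.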

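scripts/run.py

Cross-platform one-command runner that executes the transcription and translation pipeline automatically.
Paths, target language, and model size are configurable via environment variables.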
