Audio Speaker Tools
Tools for speaker separation, voice comparison, and audio processing using Demucs, pyannote, and Resemblyzer.
Overview
This skill provides three main workflows:
1. Speaker separation - Extract per-speaker audio from multi-speaker recordings
2. Voice comparison - Measure speaker similarity between two audio files
3. Audio processing - Segment extraction and voice isolation
Prerequisites
Setup Virtual Environment
Run once to create the venv and install dependencies:
```bash
bash scripts/setup_venv.sh
```
Default venv location: ./.venv
Requirements:
- Python 3.9+
- ffmpeg (brew install ffmpeg)
- HuggingFace token (set as env var HF_TOKEN)
Scripts
1. Speaker Separation: diarize_and_slice_mps.py
Separate speakers from multi-speaker audio:
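```bash
# Basic usage
HF_TOKEN=<your-token> \
  /path/to/venv/bin/python scripts/diarize_and_slice_mps.py \
  --input audio.mp3 \
  --outdir /path/to/output \
  --prefix MyShow

# With speaker count limits
HF_TOKEN=$TOKEN python scripts/diarize_and_slice_mps.py \
  --input audio.mp3 \
  --outdir ./out \
  --min-speakers 2 \
  --max-speakers 5 \
  --pad-ms 100
```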
Process:
1. Converts input to 16kHz mono WAV
2. Runs Demucs vocal/background separation (optional, for cleaner input)
3. Runs pyannote speaker diarization (MPS-accelerated)
4. Extracts concatenated per-speaker WAV files
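Step 4 can be pictured as padding each diarized turn and merging the overlapping spans per speaker before slicing. The sketch below is only an illustration: `pad_and_merge` is a hypothetical helper, and treating `--pad-ms` as symmetric padding around each turn is an assumption, not the script's confirmed behavior.

```python
from collections import defaultdict


def pad_and_merge(turns, pad_s=0.1, total_s=None):
    """Group (start_s, end_s, speaker) turns by speaker, pad them, merge overlaps.

    Hypothetical helper: the real script may pad or merge differently.
    """
    by_speaker = defaultdict(list)
    for start, end, speaker in turns:
        start = max(0.0, start - pad_s)          # never pad before the file start
        end = end + pad_s
        if total_s is not None:
            end = min(total_s, end)              # never pad past the file end
        by_speaker[speaker].append((start, end))

    merged = {}
    for speaker, spans in by_speaker.items():
        spans.sort()
        out = [spans[0]]
        for start, end in spans[1:]:
            last_start, last_end = out[-1]
            if start <= last_end:                # padded spans touch or overlap
                out[-1] = (last_start, max(last_end, end))
            else:
                out.append((start, end))
        merged[speaker] = out
    return merged
```

Each resulting span could then be cut from the source audio and concatenated into that speaker's WAV file.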
Output:
- <prefix>_speaker1.wav, <prefix>_speaker2.wav, etc. (one per detected speaker)
- diarization.rttm (time-stamped speaker segments)
- segments.jsonl (JSON segments metadata)
- meta.json (pipeline info and speaker index)
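The time-stamped segment file can be read back with a few lines of Python. This is a minimal sketch that assumes the standard RTTM column layout (type, file ID, channel, onset, duration, two placeholders, speaker name); `parse_rttm` is a hypothetical helper, not part of the skill's scripts.

```python
def parse_rttm(text):
    """Parse RTTM text into a list of {start, end, speaker} dicts.

    Assumes standard RTTM columns: onset at index 3, duration at index 4,
    speaker name at index 7. Lines that are not SPEAKER records are skipped.
    """
    segments = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue
        start = float(fields[3])
        duration = float(fields[4])
        segments.append(
            {"start": start, "end": start + duration, "speaker": fields[7]}
        )
    return segments
```

Reading the file with `open(path).read()` and passing the text to `parse_rttm` gives one record per speaker turn.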
Important:
- Always pass the HF token via the HF_TOKEN env var, never as a CLI arg
- MPS first, CPU fallback - the script prefers the Metal GPU and falls back to CPU if it is unavailable
- Default output: ./separated/
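Reading the token from the environment rather than from the command line keeps it out of shell history and process listings. A minimal sketch of that check follows; `require_hf_token` and its error message are hypothetical and may not match what the script actually does.

```python
import os


def require_hf_token(env=None):
    """Return the HF token from the environment or raise if it is missing.

    Hypothetical helper; `env` is injectable only to make the function testable.
    """
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError("Missing HF token: export HF_TOKEN before running")
    return token
```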
2. Voice Comparison: compare_voices.py
Measure similarity between two voice samples using Resemblyzer:
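```bash
# Basic comparison
python scripts/compare_voices.py \
  --audio1 sample1.wav \
  --audio2 sample2.wav

# JSON output
python scripts/compare_voices.py \
  --audio1 reference.wav \
  --audio2 clone.wav \
  --threshold 0.85 \
  --json

# Exit code 0 = pass, 1 = fail
```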
Scores:
- < 0.75 = Different speakers
- 0.75-0.84 = Likely same speaker
- 0.85+ = Excellent match (ideal for voice cloning validation)
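The bands above map onto a small lookup. The sketch below also shows a plain cosine similarity, on the assumption that the score is the cosine similarity between two speaker embeddings; that is an assumption about how the comparison works, and both function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def interpret_score(score):
    """Map a similarity score onto the bands listed in this section."""
    if score >= 0.85:
        return "excellent match"
    if score >= 0.75:
        return "likely same speaker"
    return "different speakers"
```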
Use cases:
- Voice clone quality assessment (compare clone vs. original)
- Speaker verification (authenticate speaker identity)
- Validate speaker separation (confirm separated speakers are distinct)
See: references/scoring-guide.md for detailed interpretation
3. Audio Trimming
Use ffmpeg directly for segment extraction:
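```bash
# Extract a 10-second segment starting at 5 seconds
ffmpeg -i input.mp3 -ss 5 -t 10 -c copy output.mp3

# Extract vocals only with Demucs (before diarization)
demucs --two-stems vocals --out ./separated input.mp3
```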
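When trimming many segments, the ffmpeg call can be assembled from Python. A minimal sketch, assuming ffmpeg is on the PATH; `trim_command` is a hypothetical helper and uses only the `-i`, `-ss`, `-t`, and `-c copy` options discussed in this section.

```python
import subprocess


def trim_command(src, dst, start_s, duration_s):
    """Build an ffmpeg argument list that copies a segment without re-encoding."""
    return [
        "ffmpeg",
        "-i", src,
        "-ss", str(start_s),     # segment start, in seconds
        "-t", str(duration_s),   # segment length, in seconds
        "-c", "copy",            # stream copy: no re-encode
        dst,
    ]


def trim(src, dst, start_s, duration_s):
    """Run the trim; raises CalledProcessError if ffmpeg exits non-zero."""
    subprocess.run(trim_command(src, dst, start_s, duration_s), check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.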
Workflows
Workflow 1: Extract Clean Voice Sample for Cloning
Goal: Get a clean, single-speaker sample for ElevenLabs voice cloning
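```bash
# 1. Separate speakers
HF_TOKEN=<your-token> python scripts/diarize_and_slice_mps.py \
  --input podcast.mp3 --outdir ./out --prefix Podcast

# 2. Review the speaker files (out/Podcast_speaker1.wav, etc.)

# 3. Pick the best sample (5-30 s, clear speech)
ffmpeg -i out/Podcast_speaker2.wav -ss 10 -t 20 -c copy sample.wav

# 4. Upload to ElevenLabs as an instant voice clone
```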
See: references/elevenlabs-cloning.md for best practices
Workflow 2: Validate Voice Clone Quality
Goal: Measure how well a cloned voice matches the original
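```bash
# 1. Generate test audio with the ElevenLabs clone
#    (done through the ElevenLabs web UI or API)

# 2. Compare the clone against the reference
python scripts/compare_voices.py \
  --audio1 original_sample.wav \
  --audio2 elevenlabs_clone.wav \
  --threshold 0.85 \
  --json

# 3. Interpret the score:
#    0.85+     = excellent, ready to publish
#    0.80-0.84 = acceptable, may need adjustment
#    < 0.80    = poor, try different samples or settings
```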
See: references/scoring-guide.md for troubleshooting low scores
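Because compare_voices.py signals pass or fail through its exit code (0 = pass, 1 = fail), the check can be scripted without parsing its output. A minimal sketch follows; `compare_command` and `passes` are hypothetical helpers, and the JSON schema is deliberately not assumed.

```python
import subprocess
import sys


def compare_command(reference, clone, threshold=0.85, python=sys.executable):
    """Build the compare_voices.py invocation using the flags shown above."""
    return [
        python, "scripts/compare_voices.py",
        "--audio1", reference,
        "--audio2", clone,
        "--threshold", str(threshold),
        "--json",
    ]


def passes(cmd):
    """Return True when the command exits with status 0."""
    return subprocess.run(cmd).returncode == 0
```

For example, `passes(compare_command("original_sample.wav", "elevenlabs_clone.wav"))` would be True only when the clone meets the threshold.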
Workflow 3: Multi-Speaker Conversation Analysis
Goal: Separate and identify speakers in a conversation
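```bash
# 1. Run diarization
HF_TOKEN=$TOKEN python scripts/diarize_and_slice_mps.py \
  --input meeting.mp3 --outdir ./out --prefix Meeting

# 2. Check the detected speakers (meta.json)
cat out/meta.json

# 3. Compare speaker pairs to confirm separation
python scripts/compare_voices.py \
  --audio1 out/Meeting_speaker1.wav \
  --audio2 out/Meeting_speaker2.wav
# Expected: < 0.75 if the separation is correct
```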
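With more than two speakers, every pair needs checking, not just the first two. The sketch below enumerates the pairs; `speaker_pairs` is a hypothetical helper and the file names only follow the `<prefix>_speakerN.wav` pattern described earlier.

```python
from itertools import combinations


def speaker_pairs(files):
    """Return every unordered pair of speaker files, in a stable order."""
    return list(combinations(sorted(files), 2))


if __name__ == "__main__":
    files = [
        "out/Meeting_speaker1.wav",
        "out/Meeting_speaker2.wav",
        "out/Meeting_speaker3.wav",
    ]
    for a, b in speaker_pairs(files):
        # Each pair would be passed to compare_voices.py as --audio1 / --audio2.
        print(a, b)
```

N speakers produce N(N-1)/2 comparisons, so three speakers need three runs.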
Technical Notes
Device Acceleration
- pyannote diarization: MPS (Metal) by default, CPU fallback
- Resemblyzer: CPU only (no GPU acceleration)
- Demucs: MPS by default when available
To force CPU for diarization: --device cpu
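The MPS-first, CPU-fallback choice could look like the sketch below. `pick_device` is a hypothetical helper, not the script's own code; it relies only on PyTorch's `torch.backends.mps.is_available()` check and degrades to CPU when PyTorch is missing or too old to expose an MPS backend.

```python
def pick_device(force_cpu=False):
    """Return "mps" when Metal acceleration is usable, otherwise "cpu"."""
    if force_cpu:
        return "cpu"
    try:
        import torch
    except ImportError:
        return "cpu"                      # no PyTorch installed
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"
```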
Audio Formats
- Input: Any format supported by ffmpeg (wav, mp3, flac, m4a, etc.)
- Processing: Internally converted to 16kHz mono WAV for diarization
- Output: WAV format (44.1kHz stereo preserved from source)
HuggingFace Token
- Required for: pyannote speaker diarization
- Access: Must accept gated repo pyannote/speaker-diarization-3.1 on HF
- Storage: Any secure secrets manager
- Usage: Always pass via HF_TOKEN env var, never CLI arg
Sample Quality Tips
- Shorter is better: 5-30s clean samples often score higher than 60+ second samples
- Clean audio: Remove background noise with Demucs --two-stems vocals
- Single speaker: Ensure isolated voice, not mixed conversation
- Good recording: Studio mic > phone mic for voice comparison accuracy
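A sample's length and format can be checked before comparison. This sketch uses only the standard library `wave` module, which reads uncompressed PCM WAV files; `wav_info` and `good_sample_length` are hypothetical helpers, and the 5-30 s window comes from the tip above.

```python
import wave


def wav_info(path):
    """Return duration in seconds, sample rate, and channel count of a PCM WAV."""
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
        return {
            "seconds": w.getnframes() / rate,
            "rate": rate,
            "channels": w.getnchannels(),
        }


def good_sample_length(seconds, lo=5.0, hi=30.0):
    """True when the sample falls inside the recommended 5-30 s window."""
    return lo <= seconds <= hi
```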
References
- elevenlabs-cloning.md - Best practices for ElevenLabs instant voice cloning (model settings, sample selection, proven configurations)
- scoring-guide.md - How to interpret Resemblyzer similarity scores (thresholds, use cases, troubleshooting)
Common Issues
"Missing HF token" error
- Export token before running: export HF_TOKEN=<your-token>
- Or pass inline: HF_TOKEN=<your-token> python script.py ...
Low voice comparison scores for same speaker
- Try shorter, cleaner samples (5-30s)
- Use Demucs to isolate vocals: demucs --two-stems vocals input.mp3
- Ensure consistent recording quality (same mic, environment)
- See the troubleshooting section of references/scoring-guide.md
Diarization not detecting all speakers
- Adjust the --min-speakers and --max-speakers flags
- Check audio quality (clear speech, minimal overlap)
- Try longer audio (30+ seconds) for better speaker modeling
MPS/Metal acceleration not working
- Ensure PyTorch with MPS support: python -c "import torch; print(torch.backends.mps.is_available())"
- Fallback to CPU: --device cpu
- Re-run setup_venv.sh to reinstall PyTorch
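Audio Speaker Tools
Tools for speaker separation, voice comparison, and audio processing using Demucs, pyannote, and Resemblyzer.
Overview
This skill provides three main workflows:
1. Speaker separation - Extract per-speaker audio from multi-speaker recordings
2. Voice comparison - Measure speaker similarity between two audio files
3. Audio processing - Segment extraction and voice isolation
Prerequisites
Setup Virtual Environment
Run once to create the venv and install dependencies:
```bash
bash scripts/setup_venv.sh
```
Default venv location: ./.venv
Requirements:
- Python 3.9+
- ffmpeg (brew install ffmpeg)
- HuggingFace token (set as the environment variable HF_TOKEN)
Scripts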
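1. Speaker Separation: diarize_and_slice_mps.py
Separate speakers from multi-speaker audio:
```bash
# Basic usage
HF_TOKEN=<your-token> \
  /path/to/venv/bin/python scripts/diarize_and_slice_mps.py \
  --input audio.mp3 \
  --outdir /path/to/output \
  --prefix MyShow

# With speaker count limits
HF_TOKEN=$TOKEN python scripts/diarize_and_slice_mps.py \
  --input audio.mp3 \
  --outdir ./out \
  --min-speakers 2 \
  --max-speakers 5 \
  --pad-ms 100
```
Process:
1. Converts the input to 16kHz mono WAV
2. Runs Demucs vocal/background separation (optional, for cleaner input)
3. Runs pyannote speaker diarization (MPS-accelerated)
4. Extracts concatenated per-speaker WAV files
Output:
- <prefix>_speaker1.wav, <prefix>_speaker2.wav, etc. (one file per detected speaker)
- diarization.rttm (time-stamped speaker segments)
- segments.jsonl (JSON segment metadata)
- meta.json (pipeline info and speaker index)
Important:
- Always pass the HF token via the HF_TOKEN environment variable, never as a CLI argument
- MPS first, CPU fallback - the script prefers the Metal GPU and falls back to CPU if it is unavailable
- Default output: ./separated/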
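2. Voice Comparison: compare_voices.py
Measure similarity between two voice samples using Resemblyzer:
```bash
# Basic comparison
python scripts/compare_voices.py \
  --audio1 sample1.wav \
  --audio2 sample2.wav

# JSON output
python scripts/compare_voices.py \
  --audio1 reference.wav \
  --audio2 clone.wav \
  --threshold 0.85 \
  --json

# Exit code 0 = pass, 1 = fail
```
Scores:
- < 0.75 = Different speakers
- 0.75-0.84 = Likely same speaker
- 0.85+ = Excellent match (ideal for voice cloning validation)
Use cases:
- Voice clone quality assessment (compare clone vs. original)
- Speaker verification (authenticate speaker identity)
- Validate speaker separation (confirm separated speakers are distinct)
See: references/scoring-guide.md for detailed interpretation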
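3. Audio Trimming
Use ffmpeg directly for segment extraction:
```bash
# Extract a 10-second segment starting at 5 seconds
ffmpeg -i input.mp3 -ss 5 -t 10 -c copy output.mp3

# Extract vocals only with Demucs (before diarization)
demucs --two-stems vocals --out ./separated input.mp3
```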
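Workflows
Workflow 1: Extract Clean Voice Sample for Cloning
Goal: Get a clean, single-speaker sample for ElevenLabs voice cloning
```bash
# 1. Separate speakers
HF_TOKEN=<your-token> python scripts/diarize_and_slice_mps.py \
  --input podcast.mp3 --outdir ./out --prefix Podcast

# 2. Review the speaker files (out/Podcast_speaker1.wav, etc.)

# 3. Pick the best sample (5-30 s, clear speech)
ffmpeg -i out/Podcast_speaker2.wav -ss 10 -t 20 -c copy sample.wav

# 4. Upload to ElevenLabs as an instant voice clone
```
See: references/elevenlabs-cloning.md for best practices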
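Workflow 2: Validate Voice Clone Quality
Goal: Measure how well a cloned voice matches the original
```bash
# 1. Generate test audio with the ElevenLabs clone
#    (done through the ElevenLabs web UI or API)

# 2. Compare the clone against the reference
python scripts/compare_voices.py \
  --audio1 original_sample.wav \
  --audio2 elevenlabs_clone.wav \
  --threshold 0.85 \
  --json

# 3. Interpret the score:
#    0.85+     = excellent, ready to publish
#    0.80-0.84 = acceptable, may need adjustment
#    < 0.80    = poor, try different samples or settings
```
See: references/scoring-guide.md for troubleshooting low scores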
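Workflow 3: Multi-Speaker Conversation Analysis
Goal: Separate and identify speakers in a conversation
```bash
# 1. Run diarization
HF_TOKEN=$TOKEN python scripts/diarize_and_slice_mps.py \
  --input meeting.mp3 --outdir ./out --prefix Meeting

# 2. Check the detected speakers (meta.json)
cat out/meta.json

# 3. Compare speaker pairs to confirm separation
python scripts/compare_voices.py \
  --audio1 out/Meeting_speaker1.wav \
  --audio2 out/Meeting_speaker2.wav
# Expected: < 0.75 if the separation is correct
```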
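Technical Notes
Device Acceleration
- pyannote diarization: MPS (Metal) by default, CPU fallback
- Resemblyzer: CPU only (no GPU acceleration)
- Demucs: MPS by default when available
To force CPU for diarization: --device cpu
Audio Formats
- Input: Any format supported by ffmpeg (wav, mp3, flac, m4a, etc.)
- Processing: Internally converted to 16kHz mono WAV for diarization
- Output: WAV format (44.1kHz stereo preserved from source)
HuggingFace Token
- Required for: pyannote speaker diarization
- Access: Must accept the gated repo pyannote/speaker-diarization-3.1 on HF
- Storage: Any secure secrets manager
- Usage: Always pass via the HF_TOKEN environment variable, never as a CLI argument
Sample Quality Tips
- Shorter is better: 5-30s clean samples often score higher than 60+ second samples
- Clean audio: Remove background noise with Demucs --two-stems vocals
- Single speaker: Ensure an isolated voice, not a mixed conversation
- Good recording: Studio mic > phone mic for voice comparison accuracy
References
- elevenlabs-cloning.md - Best practices for ElevenLabs instant voice cloning (model settings, sample selection, proven configurations)
- scoring-guide.md - How to interpret Resemblyzer similarity scores (thresholds, use cases, troubleshooting)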
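Common Issues
"Missing HF token" error
- Export the token before running: export HF_TOKEN=<your-token>
- Or pass it inline: HF_TOKEN=<your-token> python script.py ...
Low voice comparison scores for same speaker
- Try shorter, cleaner samples (5-30s)
- Use Demucs to isolate vocals: demucs --two-stems vocals input.mp3
- Ensure consistent recording quality (same mic, environment)
- See the troubleshooting section of references/scoring-guide.md
Diarization not detecting all speakers
- Adjust the --min-speakers and --max-speakers flags
- Check audio quality (clear speech, minimal overlap)
- Try longer audio (30+ seconds) for better speaker modeling
MPS/Metal acceleration not working
- Ensure PyTorch with MPS support: python -c "import torch; print(torch.backends.mps.is_available())"
- Fallback to CPU: --device cpu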