Audio Speaker Tools

Tools for speaker separation, voice comparison, and audio processing using Demucs, pyannote, and Resemblyzer.

Overview

This skill provides three main workflows:

1. Speaker separation - Extract per-speaker audio from multi-speaker recordings
2. Voice comparison - Measure speaker similarity between two audio files
3. Audio processing - Segment extraction and voice isolation

Prerequisites

Setup Virtual Environment

Run once to create the venv and install dependencies:

    bash scripts/setup_venv.sh

Default venv location: `./.venv`

Requirements:

- Python 3.9+
- ffmpeg (`brew install ffmpeg`)
- HuggingFace token (set as env var `HF_TOKEN`)

Scripts

1. Speaker Separation: `diarize_and_slice_mps.py`

Separate speakers from multi-speaker audio:

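    # Basic usage
    HF_TOKEN= \
    /path/to/venv/bin/python scripts/diarize_and_slice_mps.py \
      --input audio.mp3 \
      --outdir /path/to/output \
      --prefix MyShow

    # With speaker constraints
    HF_TOKEN=$TOKEN python scripts/diarize_and_slice_mps.py \
      --input audio.mp3 \
      --outdir ./out \
      --min-speakers 2 \
      --max-speakers 5 \
      --pad-ms 100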

Process:

1. Converts input to 16kHz mono WAV
2. Runs Demucs vocal/background separation (optional, for cleaner input)
3. Runs pyannote speaker diarization (MPS-accelerated)
4. Extracts concatenated per-speaker WAV files
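
Step 1 can be approximated with a plain ffmpeg call. The following is a minimal sketch, assuming ffmpeg's standard `-ac` (channel count) and `-ar` (sample rate) options; the helper name `to_16k_mono_cmd` is hypothetical, and the command the script actually runs may differ.

```python
import shlex


def to_16k_mono_cmd(src, dst):
    """Build (but do not run) an ffmpeg command that converts src to 16 kHz mono WAV."""
    return [
        "ffmpeg",
        "-y",            # overwrite dst if it already exists
        "-i", src,       # any input format ffmpeg can decode
        "-ac", "1",      # downmix to a single channel
        "-ar", "16000",  # resample to 16 kHz
        dst,
    ]


cmd = to_16k_mono_cmd("audio.mp3", "audio_16k.wav")
print(shlex.join(cmd))
# prints: ffmpeg -y -i audio.mp3 -ac 1 -ar 16000 audio_16k.wav
# To execute it: subprocess.run(cmd, check=True)
```

Building the command as a list keeps file names with spaces safe when it is later passed to `subprocess.run`.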

Output:

- `<prefix>_speaker1.wav`, `<prefix>_speaker2.wav`, etc. (one per detected speaker)
- `diarization.rttm` (time-stamped speaker segments)
- `segments.jsonl` (JSON segments metadata)
- `meta.json` (pipeline info and speaker index)
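
The time-stamped speaker segments file (RTTM) can be post-processed with the standard library alone. The following is a minimal sketch, assuming the file follows the usual RTTM layout (`SPEAKER` records with the segment duration in the fifth field and the speaker label in the eighth); the `speaking_time` helper and the sample lines are made up for illustration and are not output from the script.

```python
from collections import defaultdict


def speaking_time(rttm_text):
    """Sum segment durations (seconds) per speaker label from RTTM text."""
    totals = defaultdict(float)
    for line in rttm_text.splitlines():
        fields = line.split()
        # Skip blank lines and anything that is not a SPEAKER record.
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue
        duration = float(fields[4])  # fifth field: segment duration
        speaker = fields[7]          # eighth field: speaker label
        totals[speaker] += duration
    return dict(totals)


# Illustrative RTTM lines (not real output).
sample = """\
SPEAKER audio 1 0.50 2.00 <NA> <NA> SPEAKER_00 <NA> <NA>
SPEAKER audio 1 2.75 1.25 <NA> <NA> SPEAKER_01 <NA> <NA>
SPEAKER audio 1 4.00 3.00 <NA> <NA> SPEAKER_00 <NA> <NA>
"""
print(speaking_time(sample))
# prints: {'SPEAKER_00': 5.0, 'SPEAKER_01': 1.25}
```

A speaker with only a second or two of total speech is often a sign that the speaker count limits need adjusting.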

Important:

- Always pass the HF token via the `HF_TOKEN` env var, never as a CLI arg
- MPS first, CPU fallback: the script prefers the Metal GPU and falls back to CPU if it is unavailable
- Default output: `./separated/`

2. Voice Comparison: `compare_voices.py`

Measure similarity between two voice samples using Resemblyzer:

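    # Basic comparison
    python scripts/compare_voices.py \
      --audio1 sample1.wav \
      --audio2 sample2.wav

    # JSON output
    python scripts/compare_voices.py \
      --audio1 reference.wav \
      --audio2 clone.wav \
      --threshold 0.85 \
      --json

    # Exit code: 0 = pass, 1 = fail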

Scores:

- `< 0.75` = Different speakers
- `0.75-0.84` = Likely same speaker
- `0.85+` = Excellent match (ideal for voice cloning validation)
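
The score bands can be applied programmatically when many comparisons are run in a batch. The following is a minimal sketch using the 0.75 and 0.85 thresholds this document gives; the `interpret_score` helper is hypothetical and not part of `compare_voices.py`.

```python
def interpret_score(score):
    """Map a similarity score to the bands used in this document."""
    if score >= 0.85:
        return "excellent match"
    if score >= 0.75:
        # Covers everything from 0.75 up to (not including) 0.85.
        return "likely same speaker"
    return "different speakers"


for value in (0.62, 0.80, 0.91):
    print(value, interpret_score(value))
# prints:
# 0.62 different speakers
# 0.8 likely same speaker
# 0.91 excellent match
```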

Use cases:

- Voice clone quality assessment (compare clone vs. original)
- Speaker verification (authenticate speaker identity)
- Validate speaker separation (confirm separated speakers are distinct)

See: references/scoring-guide.md for detailed interpretation

3. Audio Trimming

Use ffmpeg directly for segment extraction:

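    # Extract a 10-second segment starting at 5 seconds
    ffmpeg -i input.mp3 -ss 5 -t 10 -c copy output.mp3

    # Extract vocals only with Demucs (before diarization)
    demucs --two-stems vocals --out ./separated input.mp3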

Workflows

Workflow 1: Extract Clean Voice Sample for Cloning

Goal: Get a clean, single-speaker sample for ElevenLabs voice cloning

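    # 1. Separate speakers
    HF_TOKEN= python scripts/diarize_and_slice_mps.py \
      --input podcast.mp3 --outdir ./out --prefix Podcast

    # 2. Review speaker files (out/Podcast_speaker1.wav, etc.)

    # 3. Pick the best sample (5-30 seconds, clear speech)
    ffmpeg -i out/Podcast_speaker2.wav -ss 10 -t 20 -c copy sample.wav

    # 4. Upload to ElevenLabs as an instant voice clone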

See: references/elevenlabs-cloning.md for best practices

Workflow 2: Validate Voice Clone Quality

Goal: Measure how well a cloned voice matches the original

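    # 1. Generate test audio with the ElevenLabs clone
    #    (done via the ElevenLabs web UI or API)

    # 2. Compare clone against reference
    python scripts/compare_voices.py \
      --audio1 original_sample.wav \
      --audio2 elevenlabs_clone.wav \
      --threshold 0.85 \
      --json

    # 3. Interpret scores:
    #    0.85+     = excellent, ready to publish
    #    0.80-0.84 = acceptable, may need tuning
    #    < 0.80    = poor, try different samples or settings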

See: references/scoring-guide.md for troubleshooting low scores

Workflow 3: Multi-Speaker Conversation Analysis

Goal: Separate and identify speakers in a conversation

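    # 1. Run diarization
    HF_TOKEN=$TOKEN python scripts/diarize_and_slice_mps.py \
      --input meeting.mp3 --outdir ./out --prefix Meeting

    # 2. Check detected speakers (meta.json)
    cat out/meta.json

    # 3. Compare speaker pairs to confirm separation
    python scripts/compare_voices.py \
      --audio1 out/Meeting_speaker1.wav \
      --audio2 out/Meeting_speaker2.wav

    # Expected: < 0.75 if separation is correct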
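
With more than two detected speakers, every pair needs checking. The following is a minimal sketch of that loop, assuming a scoring callable that returns the similarity for two files; `overlapping_pairs` and `score_fn` are hypothetical names, and the scores below are stand-in values for illustration only (in practice `score_fn` would wrap a call to `compare_voices.py`).

```python
from itertools import combinations


def overlapping_pairs(files, score_fn, threshold=0.75):
    """Return (file_a, file_b, score) for every pair scoring at or above threshold."""
    flagged = []
    for a, b in combinations(files, 2):
        score = score_fn(a, b)
        if score >= threshold:
            # At or above 0.75 the two files may contain the same speaker.
            flagged.append((a, b, score))
    return flagged


# Stand-in scores, not real measurements.
fake_scores = {
    ("Meeting_speaker1.wav", "Meeting_speaker2.wav"): 0.62,
    ("Meeting_speaker1.wav", "Meeting_speaker3.wav"): 0.81,
    ("Meeting_speaker2.wav", "Meeting_speaker3.wav"): 0.58,
}
files = ["Meeting_speaker1.wav", "Meeting_speaker2.wav", "Meeting_speaker3.wav"]
print(overlapping_pairs(files, lambda a, b: fake_scores[(a, b)]))
# prints: [('Meeting_speaker1.wav', 'Meeting_speaker3.wav', 0.81)]
```

An empty result suggests the separated speakers are distinct; any flagged pair is a candidate for re-running diarization with a lower `--max-speakers`.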

Technical Notes

Device Acceleration

- pyannote diarization: MPS (Metal) by default, CPU fallback
- Resemblyzer: CPU only (no GPU acceleration)
- Demucs: MPS by default when available

To force CPU for diarization: `--device cpu`
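
The MPS-first, CPU-fallback behaviour can be sketched as follows. This is an illustration only, assuming PyTorch's `torch.backends.mps.is_available()` check; the `pick_device` helper is hypothetical and the script's own selection logic may differ.

```python
def pick_device(requested="auto"):
    """Return "mps" when PyTorch reports Metal support, otherwise "cpu"."""
    if requested == "cpu":
        return "cpu"  # explicit override, mirrors --device cpu
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch is not installed, so there is nothing to accelerate
    mps = getattr(torch.backends, "mps", None)  # absent on old PyTorch builds
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


print(pick_device())       # "mps" on Apple Silicon with MPS support, else "cpu"
print(pick_device("cpu"))  # always "cpu"
```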

Audio Formats

- Input: Any format supported by ffmpeg (wav, mp3, flac, m4a, etc.)
- Processing: Internally converted to 16kHz mono WAV for diarization
- Output: WAV format (44.1kHz stereo preserved from source)

HuggingFace Token

- Required for: pyannote speaker diarization
- Access: Must accept gated repo `pyannote/speaker-diarization-3.1` on HF
- Storage: Any secure secrets manager
- Usage: Always pass via `HF_TOKEN` env var, never CLI arg
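
Reading the token from the environment keeps it out of shell history and process listings. The following is a minimal sketch of such a check; the `require_hf_token` helper and its error message are hypothetical and not taken from the script.

```python
import os


def require_hf_token(env=None):
    """Return the HF token from the environment or raise a clear error."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError("Missing HF token: export HF_TOKEN before running diarization")
    return token


# Demonstrate the failure path with an empty environment mapping.
try:
    require_hf_token({})
except RuntimeError as exc:
    print(exc)
# prints: Missing HF token: export HF_TOKEN before running diarization
```

Accepting an optional mapping makes the check easy to exercise without touching the real environment.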

Sample Quality Tips

- Shorter is better: 5-30s clean samples often score higher than 60+ second samples
- Clean audio: Remove background noise with Demucs `--two-stems vocals`
- Single speaker: Ensure isolated voice, not mixed conversation
- Good recording: Studio mic > phone mic for voice comparison accuracy
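
Sample length can be checked before running a comparison. The following is a minimal sketch using Python's standard `wave` module, assuming the sample is an uncompressed WAV file; the helper names are hypothetical, and the 5-30 second range is the one recommended above.

```python
import os
import tempfile
import wave


def wav_duration_seconds(path):
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def in_recommended_range(seconds, low=5.0, high=30.0):
    """True when the sample length falls inside the recommended 5-30 s window."""
    return low <= seconds <= high


# Self-contained demo: write 10 s of 16 kHz mono silence, then measure it.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.wav")
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 16-bit samples
        wav.setframerate(16000)
        wav.writeframes(b"\x00\x00" * 16000 * 10)
    seconds = wav_duration_seconds(path)

print(seconds, in_recommended_range(seconds))
# prints: 10.0 True
```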

References

- elevenlabs-cloning.md - Best practices for ElevenLabs instant voice cloning (model settings, sample selection, proven configurations)
- scoring-guide.md - How to interpret Resemblyzer similarity scores (thresholds, use cases, troubleshooting)

Common Issues

"Missing HF token" error

- Export token before running: `export HF_TOKEN=`
- Or pass inline: `HF_TOKEN= python script.py ...`

Low voice comparison scores for same speaker

- Try shorter, cleaner samples (5-30s)
- Use Demucs to isolate vocals: `demucs --two-stems vocals input.mp3`
- Ensure consistent recording quality (same mic, environment)
- See references/scoring-guide.md troubleshooting section

Diarization not detecting all speakers

- Adjust `--min-speakers` and `--max-speakers` flags
- Check audio quality (clear speech, minimal overlap)
- Try longer audio (30+ seconds) for better speaker modeling

MPS/Metal acceleration not working

- Ensure PyTorch with MPS support: `python -c "import torch; print(torch.backends.mps.is_available())"`
- Fallback to CPU: `--device cpu`
- Re-run `setup_venv.sh` to reinstall PyTorch

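Audio Speaker Tools

Tools for speaker separation, voice comparison, and audio processing using Demucs, pyannote, and Resemblyzer.

Overview

This skill provides three main workflows:

1. Speaker separation - Extract per-speaker audio from multi-speaker recordings
2. Voice comparison - Measure speaker similarity between two audio files
3. Audio processing - Segment extraction and voice isolation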

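Prerequisites

Setup Virtual Environment

Run once to create the venv and install dependencies:

    bash scripts/setup_venv.sh

Default venv location: `./.venv`

Requirements:

- Python 3.9+
- ffmpeg (`brew install ffmpeg`)
- HuggingFace token (set as env var `HF_TOKEN`)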

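Scripts

1. Speaker Separation: `diarize_and_slice_mps.py`

Separate speakers from multi-speaker audio:

    # Basic usage
    HF_TOKEN= \
    /path/to/venv/bin/python scripts/diarize_and_slice_mps.py \
      --input audio.mp3 \
      --outdir /path/to/output \
      --prefix MyShow

    # With speaker constraints
    HF_TOKEN=$TOKEN python scripts/diarize_and_slice_mps.py \
      --input audio.mp3 \
      --outdir ./out \
      --min-speakers 2 \
      --max-speakers 5 \
      --pad-ms 100

Process:

1. Converts input to 16kHz mono WAV
2. Runs Demucs vocal/background separation (optional, for cleaner input)
3. Runs pyannote speaker diarization (MPS-accelerated)
4. Extracts concatenated per-speaker WAV files

Output:

- `<prefix>_speaker1.wav`, `<prefix>_speaker2.wav`, etc. (one file per detected speaker)
- `diarization.rttm` (time-stamped speaker segments)
- `segments.jsonl` (JSON segments metadata)
- `meta.json` (pipeline info and speaker index)

Important:

- Always pass the HF token via the `HF_TOKEN` env var, never as a CLI arg
- MPS first, CPU fallback: the script prefers the Metal GPU and falls back to CPU if it is unavailable
- Default output: `./separated/`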

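2. Voice Comparison: `compare_voices.py`

Measure similarity between two voice samples using Resemblyzer:

    # Basic comparison
    python scripts/compare_voices.py \
      --audio1 sample1.wav \
      --audio2 sample2.wav

    # JSON output
    python scripts/compare_voices.py \
      --audio1 reference.wav \
      --audio2 clone.wav \
      --threshold 0.85 \
      --json

    # Exit code: 0 = pass, 1 = fail

Scores:

- `< 0.75` = Different speakers
- `0.75-0.84` = Likely same speaker
- `0.85+` = Excellent match (ideal for voice cloning validation)

Use cases:

- Voice clone quality assessment (compare clone vs. original)
- Speaker verification (authenticate speaker identity)
- Validate speaker separation (confirm separated speakers are distinct)

See: references/scoring-guide.md for detailed interpretation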

3. Audio Trimming

Use ffmpeg directly for segment extraction:

    # Extract a 10-second segment starting at 5 seconds
    ffmpeg -i input.mp3 -ss 5 -t 10 -c copy output.mp3

    # Extract vocals only with Demucs (before diarization)
    demucs --two-stems vocals --out ./separated input.mp3

Workflows

Workflow 1: Extract Clean Voice Sample for Cloning

Goal: Get a clean, single-speaker sample for ElevenLabs voice cloning

    # 1. Separate speakers
    HF_TOKEN= python scripts/diarize_and_slice_mps.py \
      --input podcast.mp3 --outdir ./out --prefix Podcast

    # 2. Review speaker files (out/Podcast_speaker1.wav, etc.)

    # 3. Pick the best sample (5-30 seconds, clear speech)
    ffmpeg -i out/Podcast_speaker2.wav -ss 10 -t 20 -c copy sample.wav

    # 4. Upload to ElevenLabs as an instant voice clone

See: references/elevenlabs-cloning.md for best practices

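Workflow 2: Validate Voice Clone Quality

Goal: Measure how well a cloned voice matches the original

    # 1. Generate test audio with the ElevenLabs clone
    #    (done via the ElevenLabs web UI or API)

    # 2. Compare clone against reference
    python scripts/compare_voices.py \
      --audio1 original_sample.wav \
      --audio2 elevenlabs_clone.wav \
      --threshold 0.85 \
      --json

    # 3. Interpret scores:
    #    0.85+     = excellent, ready to publish
    #    0.80-0.84 = acceptable, may need tuning
    #    < 0.80    = poor, try different samples or settings

See: references/scoring-guide.md for troubleshooting low scores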

Workflow 3: Multi-Speaker Conversation Analysis

Goal: Separate and identify speakers in a conversation

    # 1. Run diarization
    HF_TOKEN=$TOKEN python scripts/diarize_and_slice_mps.py \
      --input meeting.mp3 --outdir ./out --prefix Meeting

    # 2. Check detected speakers (meta.json)
    cat out/meta.json

    # 3. Compare speaker pairs to confirm separation
    python scripts/compare_voices.py \
      --audio1 out/Meeting_speaker1.wav \
      --audio2 out/Meeting_speaker2.wav

    # Expected: < 0.75 if separation is correct

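Technical Notes

Device Acceleration

- pyannote diarization: MPS (Metal) by default, CPU fallback
- Resemblyzer: CPU only (no GPU acceleration)
- Demucs: MPS by default when available

To force CPU for diarization: `--device cpu`

Audio Formats

- Input: Any format supported by ffmpeg (wav, mp3, flac, m4a, etc.)
- Processing: Internally converted to 16kHz mono WAV for diarization
- Output: WAV format (44.1kHz stereo preserved from source)

HuggingFace Token

- Required for: pyannote speaker diarization
- Access: Must accept gated repo `pyannote/speaker-diarization-3.1` on HF
- Storage: Any secure secrets manager
- Usage: Always pass via `HF_TOKEN` env var, never CLI arg

Sample Quality Tips

- Shorter is better: 5-30s clean samples often score higher than 60+ second samples
- Clean audio: Remove background noise with Demucs `--two-stems vocals`
- Single speaker: Ensure isolated voice, not mixed conversation
- Good recording: Studio mic > phone mic for voice comparison accuracy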

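References

- elevenlabs-cloning.md - Best practices for ElevenLabs instant voice cloning (model settings, sample selection, proven configurations)
- scoring-guide.md - How to interpret Resemblyzer similarity scores (thresholds, use cases, troubleshooting)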

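Common Issues

"Missing HF token" error

- Export token before running: `export HF_TOKEN=`
- Or pass inline: `HF_TOKEN= python script.py ...`

Low voice comparison scores for same speaker

- Try shorter, cleaner samples (5-30s)
- Use Demucs to isolate vocals: `demucs --two-stems vocals input.mp3`
- Ensure consistent recording quality (same mic, environment)
- See references/scoring-guide.md troubleshooting section

Diarization not detecting all speakers

- Adjust `--min-speakers` and `--max-speakers` flags
- Check audio quality (clear speech, minimal overlap)
- Try longer audio (30+ seconds) for better speaker modeling

MPS/Metal acceleration not working

- Ensure PyTorch with MPS support: `python -c "import torch; print(torch.backends.mps.is_available())"`
- Fallback to CPU: `--device cpu`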
