Qwen-Audio
Overview
Qwen-Audio is a high-performance audio processing library optimized for fast, efficient text-to-speech (TTS) and speech-to-text (STT), with support for multiple models, languages, and audio formats.
Prerequisites
Environment checks
Before using any capability, verify that all items in ./references/env-check-list.md are complete.
Capabilities
Voice Management
Voices are stored in the ./voices/ directory at the skill root level. Each voice has its own folder containing:
- ref_audio.wav - Reference audio file
- ref_text.txt - Reference text transcript
- ref_instruct.txt - Voice style description
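Before relying on a voice folder, it can help to check that it is complete. The following is a minimal sketch, not part of the skill itself: `missing_voice_files` is a hypothetical helper, and the three file names (`ref_audio.wav`, `ref_text.txt`, `ref_instruct.txt`) are assumed from the layout described above.

```python
from pathlib import Path

# Assumed file names for one voice folder; confirm them against your ./voices/ directory.
EXPECTED_FILES = ("ref_audio.wav", "ref_text.txt", "ref_instruct.txt")


def missing_voice_files(voice_dir: Path) -> list[str]:
    """Return the expected file names that are absent from voice_dir."""
    return [name for name in EXPECTED_FILES if not (voice_dir / name).is_file()]


if __name__ == "__main__":
    # "my-voice-id" is the example ID used elsewhere in this document.
    print(missing_voice_files(Path("./voices/my-voice-id")))
```

An empty list means the folder holds all three files; any returned names are the ones to regenerate.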
Create a Voice
Create a reusable voice profile using the VoiceDesign model. The --instruct parameter is required and describes the voice style:
uv run --project "/<qwen-audio-skill-path>" python "<qwen-audio-skill-path>/scripts/qwen-audio.py" voice create --text "This is a sample voice reference text." --instruct "A warm, friendly female voice with a professional tone." --id "my-voice-id"
Optional: --id "my-voice-id" to specify a custom voice ID.
Returns (JSON):
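{
  "id": "my-voice-id",
  "ref_audio": "/<qwen-audio-skill-path>/voices/my-voice-id/ref_audio.wav",
  "ref_text": "This is a sample voice reference text.",
  "instruct": "A warm, friendly female voice with a professional tone.",
  "duration": 3.456,
  "sample_rate": 24000,
  "success": true
}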
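The voice create command can also be assembled programmatically. The sketch below is an illustration under stated assumptions, not the skill's own code: `build_voice_create_cmd` is a hypothetical helper, and `skill_path` stands in for `<qwen-audio-skill-path>` (passed unchanged to both `--project` and the script path). It only builds the argument list and enforces the rule that `--instruct` is required while `--id` is optional.

```python
from typing import Optional


def build_voice_create_cmd(
    skill_path: str,
    text: str,
    instruct: str,
    voice_id: Optional[str] = None,
) -> list[str]:
    """Build the argv list for `voice create` without running it."""
    if not instruct.strip():
        raise ValueError("--instruct is required to describe the voice style")
    cmd = [
        "uv", "run", "--project", skill_path,
        "python", f"{skill_path}/scripts/qwen-audio.py",
        "voice", "create",
        "--text", text,
        "--instruct", instruct,
    ]
    # --id is optional, so it is appended only when the caller supplies one.
    if voice_id is not None:
        cmd += ["--id", voice_id]
    return cmd
```

Passing the arguments as a list (for example to `subprocess.run`) avoids shell quoting problems when the text or style description contains spaces or quotes.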
List Voices
List all created voice profiles:
uv run --project "/<qwen-audio-skill-path>" python "<qwen-audio-skill-path>/scripts/qwen-audio.py" voice list
Returns (JSON):
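[
  {
    "id": "my-voice-id",
    "ref_audio": "/<qwen-audio-skill-path>/voices/my-voice-id/ref_audio.wav",
    "ref_text": "This is a sample voice reference text.",
    "instruct": "A warm, friendly female voice with a professional tone.",
    "duration": 3.456,
    "sample_rate": 24000
  }
]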
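If the output of voice list is consumed by a script, indexing the profiles by ID makes later lookups simple. This is a small sketch assuming the command prints a JSON array whose entries each carry an `id` field; `index_voices` is a hypothetical helper and the sample string is illustrative data, not real output.

```python
import json


def index_voices(voice_list_json: str) -> dict[str, dict]:
    """Parse the JSON array of voice profiles and key each profile by its id."""
    voices = json.loads(voice_list_json)
    if not isinstance(voices, list):
        raise ValueError("expected a JSON array of voice profiles")
    return {voice["id"]: voice for voice in voices}


# Illustrative input shaped like the documented return value.
sample = '[{"id": "my-voice-id", "duration": 3.456, "sample_rate": 24000}]'
voices_by_id = index_voices(sample)
print(sorted(voices_by_id))  # ['my-voice-id']
```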
Text to Speech
TTS Voice Pre-check (Required)
Before any tts generation, always confirm the available voices first:
1. Run voice list to check the current voice profiles.
2. If the returned list is empty, stop and ask the user what kind of voice they want to create first. Offer style choices, for example:
   - Warm and friendly female narrator
   - Deep and steady male broadcast voice
   - Young and energetic neutral voice
   - Calm and professional customer-service voice
   Then run voice create only after the user confirms a style.
3. If the returned list is not empty, show the available voice id values and ask the user to confirm which one should be used as the --ref_voice reference id for generation.
Only run tts after this confirmation step is complete.
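The pre-check above amounts to a small decision rule. The sketch below expresses it as a function; `precheck_next_step` and the `action` labels are hypothetical names chosen for illustration, and the input is assumed to be the already parsed result of voice list.

```python
# Style suggestions taken from the pre-check steps above.
STYLE_CHOICES = [
    "Warm and friendly female narrator",
    "Deep and steady male broadcast voice",
    "Young and energetic neutral voice",
    "Calm and professional customer-service voice",
]


def precheck_next_step(voices: list[dict]) -> dict:
    """Decide what to ask the user before any tts run."""
    if not voices:
        # No profiles yet: ask for a style, then run voice create.
        return {"action": "ask_style", "choices": STYLE_CHOICES}
    # Profiles exist: ask which id to pass as --ref_voice.
    return {"action": "ask_voice_id", "choices": [voice["id"] for voice in voices]}
```

In both branches the function returns a question for the user rather than a command to run, which mirrors the rule that tts only runs after confirmation.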
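uv run --project "/<qwen-audio-skill-path>" python "<qwen-audio-skill-path>/scripts/qwen-audio.py" tts --text "hello world" --output "/path/to/save.wav"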
Returns (JSON):
{
  "audio_path": "/path/to/save.wav",
  "duration": 1.234,
  "sample_rate": 24000,
  "success": true
}
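When tts is driven from a script, the returned JSON should be checked before the output file is used. The following sketch assumes the command prints a single JSON object containing a `success` field on stdout, which this document implies but does not state explicitly. `run_cli` is a hypothetical wrapper, and the `runner` parameter exists only so the function can be exercised without the real CLI.

```python
import json
import subprocess
from typing import Callable, Sequence


def run_cli(argv: Sequence[str], runner: Callable = subprocess.run) -> dict:
    """Run a command, parse its stdout as JSON, and fail if success is not true."""
    completed = runner(list(argv), capture_output=True, text=True, check=True)
    result = json.loads(completed.stdout)
    if not result.get("success", False):
        raise RuntimeError(f"command reported failure: {result}")
    return result
```

This wrapper fits commands that return an object. The voice list command returns an array, so it would need its own parsing.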
Voice Cloning
Clone any voice using a reference audio sample. Provide the wav file and its transcript:
uv run --project "/<qwen-audio-skill-path>" python "<qwen-audio-skill-path>/scripts/qwen-audio.py" tts --text "hello world" --output "/path/to/save.wav" --ref_audio "sample_audio.wav" --ref_text "This is what my voice sounds like."
--ref_audio: reference audio to clone
--ref_text: transcript of the reference audio
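Since cloning depends on a usable reference recording, a quick sanity check of the wav file before calling tts can save a failed run. This sketch uses Python's standard `wave` module, which reads uncompressed PCM WAV only; `wav_info` is a hypothetical helper and is not part of the skill.

```python
import wave


def wav_info(path: str) -> tuple[float, int]:
    """Return (duration in seconds, sample rate in Hz) for a PCM WAV file."""
    with wave.open(path, "rb") as wav_file:
        sample_rate = wav_file.getframerate()
        duration = wav_file.getnframes() / sample_rate
    return duration, sample_rate
```

A file that `wave` cannot open raises `wave.Error`, which is a reasonable signal to re-export the reference audio as plain PCM WAV before cloning.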
Use a Created Voice
After creating a voice, use it for TTS with the --ref_voice parameter. The instruct stored with the voice is loaded automatically:
uv run --project "/<qwen-audio-skill-path>" python "<qwen-audio-skill-path>/scripts/qwen-audio.py" tts --text "New text to speak" --output "/path/to/save.wav" --ref_voice "my-voice-id" --instruct "Very happy and excited."
Optional: --instruct for emotion control.
Automatic Speech Recognition (STT)
uv run --project "/<qwen-audio-skill-path>" python "<qwen-audio-skill-path>/scripts/qwen-audio.py" stt --audio "/sample_audio.wav" --output "/path/to/save.txt" --output-format txt
Test audio: https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_en.wav
--output-format: "txt" | "ass" | "srt" | "all"
Returns (JSON):
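{
  "text": "Transcribed text content",
  "duration": 10.5,
  "sample_rate": 16000,
  "files": ["/path/to/save.txt", "/path/to/save.srt"],
  "success": true
}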
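Because --output-format accepts only four values, validating it before launching stt gives a clearer error than a failed run. A minimal sketch, with `output_format_args` as a hypothetical helper that returns the flag and value ready to append to an argument list:

```python
# The four values listed for --output-format in this document.
ALLOWED_FORMATS = ("txt", "ass", "srt", "all")


def output_format_args(output_format: str) -> list[str]:
    """Return ["--output-format", value] after checking the value is allowed."""
    if output_format not in ALLOWED_FORMATS:
        allowed = ", ".join(ALLOWED_FORMATS)
        raise ValueError(f"output format must be one of: {allowed}")
    return ["--output-format", output_format]
```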
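Qwen-Audio
Overview
Qwen-Audio is a high-performance audio processing library optimized for fast, efficient text-to-speech (TTS) and speech-to-text (STT), with support for multiple models, languages, and audio formats.
Prerequisites
Environment checks
Before using any capability, confirm that every item in ./references/env-check-list.md is in place.
Capabilities
Voice Management
Voices are stored in the ./voices/ folder under the skill root directory. Each voice has its own folder containing:
- ref_audio.wav - Reference audio file
- ref_text.txt - Reference text transcript
- ref_instruct.txt - Voice style description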
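Create a Voice
Create a reusable voice profile using the VoiceDesign model. The --instruct parameter is required and describes the voice style:
uv run --project "/<qwen-audio-skill-path>" python "<qwen-audio-skill-path>/scripts/qwen-audio.py" voice create --text "This is a sample voice reference text." --instruct "A warm, friendly female voice with a professional tone." --id "my-voice-id"
Optional: --id "my-voice-id" to specify a custom voice ID.
Returns (JSON):
{
  "id": "my-voice-id",
  "ref_audio": "/<qwen-audio-skill-path>/voices/my-voice-id/ref_audio.wav",
  "ref_text": "This is a sample voice reference text.",
  "instruct": "A warm, friendly female voice with a professional tone.",
  "duration": 3.456,
  "sample_rate": 24000,
  "success": true
}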
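List Voices
List all created voice profiles:
uv run --project "/<qwen-audio-skill-path>" python "<qwen-audio-skill-path>/scripts/qwen-audio.py" voice list
Returns (JSON):
[
  {
    "id": "my-voice-id",
    "ref_audio": "/<qwen-audio-skill-path>/voices/my-voice-id/ref_audio.wav",
    "ref_text": "This is a sample voice reference text.",
    "instruct": "A warm, friendly female voice with a professional tone.",
    "duration": 3.456,
    "sample_rate": 24000
  }
]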
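Text to Speech
TTS Voice Pre-check (Required)
Before any tts generation, always confirm the available voices first:
1. Run voice list to check the current voice profiles.
2. If the returned list is empty, stop and ask the user what kind of voice they want to create. Offer style choices, for example:
   - Warm and friendly female narrator
   - Deep and steady male broadcast voice
   - Young and energetic neutral voice
   - Calm and professional customer-service voice
   Run voice create only after the user confirms a style.
3. If the returned list is not empty, show the available voice id values and ask the user to confirm which one to use as the --ref_voice reference ID for generation.
Only run tts after this confirmation step is complete.
uv run --project "/<qwen-audio-skill-path>" python "<qwen-audio-skill-path>/scripts/qwen-audio.py" tts --text "hello world" --output "/path/to/save.wav"
Returns (JSON):
{
  "audio_path": "/path/to/save.wav",
  "duration": 1.234,
  "sample_rate": 24000,
  "success": true
}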
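Voice Cloning
Clone any voice using a reference audio sample. Provide the wav file and its transcript:
uv run --project "/<qwen-audio-skill-path>" python "<qwen-audio-skill-path>/scripts/qwen-audio.py" tts --text "hello world" --output "/path/to/save.wav" --ref_audio "sample_audio.wav" --ref_text "This is what my voice sounds like."
--ref_audio: reference audio to clone
--ref_text: transcript of the reference audio
Use a Created Voice
After creating a voice, use it for TTS with the --ref_voice parameter. The instruct stored with the voice is loaded automatically:
uv run --project "/<qwen-audio-skill-path>" python "<qwen-audio-skill-path>/scripts/qwen-audio.py" tts --text "New text to speak" --output "/path/to/save.wav" --ref_voice "my-voice-id" --instruct "Very happy and excited."
Optional: --instruct for emotion control.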
Automatic Speech Recognition (STT)
uv run --project "/<qwen-audio-skill-path>" python "<qwen-audio-skill-path>/scripts/qwen-audio.py" stt --audio "/sample_audio.wav" --output "/path/to/save.txt" --output-format txt
Test audio: https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_en.wav
--output-format: "txt" | "ass" | "srt" | "all"
Returns (JSON):
{
  "text": "Transcribed text content",
  "duration": 10.5,
  "sample_rate": 16000,
  "files": ["/path/to/save.txt", "/path/to/save.srt"],
  "success": true
}
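The files array in the stt result can list more than one output, so a caller usually wants to pick a path by format. The sketch below assumes the result has the shape shown above, with a `files` array of POSIX-style paths; `files_by_extension` is a hypothetical helper and the sample string is illustrative data.

```python
import json
from pathlib import PurePosixPath


def files_by_extension(stt_result_json: str) -> dict[str, str]:
    """Map each output file's extension (without the dot) to its path."""
    result = json.loads(stt_result_json)
    return {
        PurePosixPath(path).suffix.lstrip("."): path
        for path in result.get("files", [])
    }


# Illustrative input shaped like the documented return value.
sample_result = '{"files": ["/path/to/save.txt", "/path/to/save.srt"], "success": true}'
print(files_by_extension(sample_result)["srt"])  # /path/to/save.srt
```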