Qwen-Audio

Overview

Qwen-Audio is a high-performance audio processing library optimized for fast, efficient text-to-speech (TTS) and speech-to-text (STT). It supports multiple models, languages, and audio formats.

Prerequisites

- Python 3.10+

Environment checks

Before using any capability, verify that all items in ./references/env-check-list.md are complete.

Capabilities

Voice Management

Voices are stored in the ./voices/ directory at the skill root level. Each voice has its own folder containing:

- ref_audio.wav - Reference audio file
- ref_text.txt - Reference text transcript
- ref_instruct.txt - Voice style description
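The folder layout above can be checked before a voice is used. The following is a minimal sketch, assuming the three files are named ref_audio.wav, ref_text.txt, and ref_instruct.txt; the helper name missing_voice_files is hypothetical and not part of the skill.

```python
from pathlib import Path

# File names assumed from the voice folder layout described above.
REQUIRED_FILES = ("ref_audio.wav", "ref_text.txt", "ref_instruct.txt")


def missing_voice_files(voices_root: Path, voice_id: str) -> list[str]:
    """Return the required files that are absent from voices_root/voice_id."""
    voice_dir = voices_root / voice_id
    return [name for name in REQUIRED_FILES if not (voice_dir / name).is_file()]
```

An empty result means the folder looks complete; a non-empty result names the files still missing, which is usually enough to tell the user which voice needs to be recreated.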

Create a Voice

Create a reusable voice profile with the VoiceDesign model. The --instruct parameter is required and describes the voice style:

uv run --project "/<qwen-audio-skill-path>" python "<qwen-audio-skill-path>/scripts/qwen-audio.py" voice create --text "This is a sample voice reference text." --instruct "A warm, friendly female voice with a professional tone." --id "my-voice-id"

Optional: --id "my-voice-id" to specify a custom voice ID.

Returns (JSON):
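{
  "id": "my-voice-id",
  "ref_audio": "/<qwen-audio-skill-path>/voices/my-voice-id/ref_audio.wav",
  "ref_text": "This is a sample voice reference text.",
  "instruct": "A warm, friendly female voice with a professional tone.",
  "duration": 3.456,
  "sample_rate": 24000,
  "success": true
}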

List Voices

List all created voice profiles:

uv run --project "/<qwen-audio-skill-path>" python "<qwen-audio-skill-path>/scripts/qwen-audio.py" voice list

Returns (JSON):
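[
  {
    "id": "my-voice-id",
    "ref_audio": "/<qwen-audio-skill-path>/voices/my-voice-id/ref_audio.wav",
    "ref_text": "This is a sample voice reference text.",
    "instruct": "A warm, friendly female voice with a professional tone.",
    "duration": 3.456,
    "sample_rate": 24000
  }
]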

Text to Speech

TTS Voice Pre-check (Required)

Before any tts generation, always confirm the available voices first:

1. Run voice list to check the current voice profiles.
2. If the returned list is empty, stop and ask the user what kind of voice they want to create first. Offer style choices, for example:

   - Warm and friendly female narrator
   - Deep and steady male broadcast voice
   - Young and energetic neutral voice
   - Calm and professional customer-service voice

   Then run voice create only after the user confirms a style.

3. If the returned list is not empty, show the available voice id values and ask the user to confirm which one should be used as the --ref_voice reference id for generation.

Only run tts after this confirmation step is complete.
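The branching in the pre-check can be sketched as a small decision function. It assumes voice list prints a JSON array of objects that each carry an id field; the function name precheck_action and the two action labels are hypothetical, chosen only for illustration.

```python
import json


def precheck_action(voice_list_json: str) -> tuple[str, list[str]]:
    """Decide the next step from the JSON printed by `voice list`.

    Returns ("ask_style", []) when no voices exist, otherwise
    ("ask_voice_id", ids) with the ids to offer to the user.
    """
    voices = json.loads(voice_list_json)
    ids = [voice["id"] for voice in voices]
    if not ids:
        return "ask_style", []
    return "ask_voice_id", ids
```

Neither branch runs tts itself: both end with a question to the user, which keeps the confirmation step mandatory.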

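uv run --project "/<qwen-audio-skill-path>" python "<qwen-audio-skill-path>/scripts/qwen-audio.py" tts --text "hello world" --output "/path/to/save.wav"

Returns (JSON):
{
  "audio_path": "/path/to/save.wav",
  "duration": 1.234,
  "sample_rate": 24000,
  "success": true
}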
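A caller will usually want to check the returned JSON before using the audio file. The sketch below assumes the command prints a single JSON object on stdout and that a failed run reports success as false or omits the field; the source only documents the success case, so that failure shape is an assumption. QwenAudioError and parse_result are hypothetical names.

```python
import json


class QwenAudioError(RuntimeError):
    """Hypothetical error type for a command that did not report success."""


def parse_result(stdout: str) -> dict:
    """Parse the JSON object a command prints and fail loudly on failure."""
    result = json.loads(stdout)
    # Assumption: a missing or false "success" field means the call failed.
    if not result.get("success", False):
        raise QwenAudioError(f"command reported failure: {result}")
    return result
```

With the tts response documented above, parse_result(stdout)["audio_path"] gives the path of the generated file.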

Voice Cloning

Clone any voice from a reference audio sample. Provide the WAV file and its transcript:

uv run --project "/<qwen-audio-skill-path>" python "<qwen-audio-skill-path>/scripts/qwen-audio.py" tts --text "hello world" --output "/path/to/save.wav" --ref_audio "sample_audio.wav" --ref_text "This is what my voice sounds like."

- ref_audio: reference audio to clone
- ref_text: transcript of the reference audio

Use a Created Voice

After creating a voice, use it for TTS with the --ref_voice parameter. The instruct text saved with the voice is loaded automatically:

uv run --project "/<qwen-audio-skill-path>" python "<qwen-audio-skill-path>/scripts/qwen-audio.py" tts --text "New text to speak" --output "/path/to/save.wav" --ref_voice "my-voice-id" --instruct "Very happy and excited."

Optional: --instruct for emotion control.

Automatic Speech Recognition (STT)

uv run --project "/<qwen-audio-skill-path>" python "<qwen-audio-skill-path>/scripts/qwen-audio.py" stt --audio "/sample_audio.wav" --output "/path/to/save.txt" --output-format txt

Test audio: https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_en.wav

output-format: "txt" | "ass" | "srt" | "all"

Returns (JSON):
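{
  "text": "Transcribed text content",
  "duration": 10.5,
  "sample_rate": 16000,
  "files": ["/path/to/save.txt", "/path/to/save.srt"],
  "success": true
}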
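Since --output-format accepts only four values, a wrapper can reject anything else before launching the command. This is a minimal sketch under that assumption; stt_command is a hypothetical helper, and skill_path stands in for the <qwen-audio-skill-path> placeholder.

```python
# Values listed for --output-format in the section above.
OUTPUT_FORMATS = ("txt", "ass", "srt", "all")


def stt_command(skill_path: str, audio: str, output: str,
                output_format: str) -> list[str]:
    """Assemble the argv for `stt`, rejecting unsupported output formats."""
    if output_format not in OUTPUT_FORMATS:
        raise ValueError(
            f"unsupported output format {output_format!r}; "
            f"expected one of {', '.join(OUTPUT_FORMATS)}"
        )
    return [
        "uv", "run", "--project", skill_path,
        "python", f"{skill_path}/scripts/qwen-audio.py",
        "stt",
        "--audio", audio,
        "--output", output,
        "--output-format", output_format,
    ]
```

Validating up front gives a clear error message without waiting for the audio model to load.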

