Text-to-Speech (TTS)
This skill supports Gitee AI TTS plus CosyVoice voice feature extraction workflows.
It supports fifteen user-facing model choices for TTS:
- audiofly
- chattts
- cosyvoice2
- cosyvoice3
- cosyvoice-300m
- fish-speech-1.2-sft
- index-tts-1.5
- index-tts-2
- glm-tts
- megatts3
- moss-ttsd-v0.5
- qwen-tts
- spark-tts-0.5b
- step-audio-tts-3b
- vibevoice-large
When the user does not specify a model, ask them to choose one. After the model is chosen, only ask for parameters that
are relevant to that model.
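The selection rule above can be sketched as a small lookup. This is a minimal illustration and not part of the bundled scripts: the names SUPPORTED_MODELS and resolve_model are hypothetical, and the case-insensitive matching is an assumption.

```python
# Hypothetical helper, not part of the bundled scripts.
SUPPORTED_MODELS = (
    "audiofly", "chattts", "cosyvoice2", "cosyvoice3", "cosyvoice-300m",
    "fish-speech-1.2-sft", "index-tts-1.5", "index-tts-2", "glm-tts",
    "megatts3", "moss-ttsd-v0.5", "qwen-tts", "spark-tts-0.5b",
    "step-audio-tts-3b", "vibevoice-large",
)


def resolve_model(requested):
    """Return the model key to pass to --model, or None if the user must be asked."""
    if not requested:
        return None
    key = requested.strip().lower()  # assumption: match case-insensitively
    return key if key in SUPPORTED_MODELS else None
```

A return value of None signals that the user should be asked to pick one of the fifteen models before any model-specific parameters are collected.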
Usage
Use the bundled script to generate speech.
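```bash
python {baseDir}/scripts/perform_tts.py --model cosyvoice2 --text "你好,我是模力方舟。" --voice alloy --api-key YOUR_API_KEY
```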
For CosyVoice-300M voice feature extraction (voice cloning prep), use:
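```bash
python {baseDir}/scripts/perform_voice_feature_extraction.py --model FunAudioLLM-CosyVoice-300M --prompt "提供用于声纹提取的提示文本" --file-url https://example.com/sample.mp3 --api-key YOUR_API_KEY
```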
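When the command is assembled programmatically rather than typed, passing the arguments as a list avoids shell-quoting problems with text that contains spaces or punctuation. The sketch below is illustrative only: build_tts_command is a hypothetical helper, the script path is a placeholder, and the keyword-to-flag conversion (underscores to hyphens, booleans to true/false) is an assumption based on the flag names documented here.

```python
import shlex


def build_tts_command(script_path, model, text, **options):
    """Build an argv list for perform_tts.py (hypothetical helper)."""
    argv = ["python", script_path, "--model", model, "--text", text]
    for name, value in options.items():
        if value is None:
            continue  # skip options the user did not supply
        flag = "--" + name.replace("_", "-")
        if isinstance(value, bool):
            value = "true" if value else "false"  # boolean flags here take true/false
        argv += [flag, str(value)]
    return argv


argv = build_tts_command(
    "scripts/perform_tts.py", "cosyvoice2", "Hello from the TTS skill.",
    voice="alloy", api_key="YOUR_API_KEY",
)
# Shell-safe rendering for logging; the list itself can go to subprocess.run(argv).
print(shlex.join(argv))
```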
Options
- --model required: audiofly, chattts, cosyvoice2, cosyvoice3, cosyvoice-300m, fish-speech-1.2-sft, index-tts-1.5, index-tts-2, glm-tts, megatts3, moss-ttsd-v0.5, qwen-tts, spark-tts-0.5b, step-audio-tts-3b, or vibevoice-large
- --text required in general: text to synthesize. For Qwen3-TTS multi-input mode (--qwen-inputs-json), --text is optional
- --mode optional: auto, sync, or async
- --prompt optional: model-specific style prompt such as ChatTTS tags
- --prompt-text optional: reference transcript for style-conditioned models
- --prompt-audio-url optional: reference audio URL for style-conditioned models
- --qwen-inputs-json optional: structured Qwen3-TTS inputs JSON (array/object). Supports mixed built-in and custom voice items
- --speaker optional: Qwen3-TTS built-in speaker for single input (Vivian, Serena, Uncle_Fu, Dylan, Eric, Ryan, Aiden, Ono_Anna, Sohee)
- --language optional: Qwen3-TTS language for single input (Chinese or English)
- --instruction optional: Qwen3-TTS style instruction for single input
- --prompt-audio-urls optional: vibevoice-large reference audio; supports one URL or JSON array string such as ["https://a.wav","https://b.wav"]
- --emo-audio-prompt-url optional: emotion reference audio URL for IndexTTS-2
- --emo-alpha optional: emotion mixing weight for IndexTTS-2 audio emotion control
- --emo-text optional: emotion control text for IndexTTS-2
- --use-emo-text optional: enable or disable emo_text for IndexTTS-2 (true/false)
- --prompt-wav-url optional: reference prompt WAV URL for CosyVoice2 or CosyVoice3
- --voice-url optional: reference voice audio URL for ChatTTS or fish-speech-1.2-sft cloning
- --instruct-text optional: model-specific instruction text such as CosyVoice2 or CosyVoice3 speaking style guidance
- --seed optional: model-specific seed value such as CosyVoice2 or CosyVoice3
- --audio-mode optional: single or role for moss-ttsd-v0.5 (required when mode cannot be inferred from fields)
- --prompt-audio-single-url optional: single-speaker reference audio URL for moss-ttsd-v0.5 single mode
- --prompt-text-single optional: single-speaker reference transcript for moss-ttsd-v0.5 single mode
- --prompt-audio-1-url optional: speaker-1 reference audio URL for moss-ttsd-v0.5 role mode
- --prompt-text-1 optional: speaker-1 reference transcript for moss-ttsd-v0.5 role mode
- --prompt-audio-2-url optional: speaker-2 reference audio URL for moss-ttsd-v0.5 role mode
- --prompt-text-2 optional: speaker-2 reference transcript for moss-ttsd-v0.5 role mode
- --use-normalize optional: enable or disable use_normalize for moss-ttsd-v0.5 (true/false)
- --prompt-language optional: prompt language hint for models such as MegaTTS3
- --intelligibility-weight optional: pronunciation intelligibility weight for models such as MegaTTS3
- --similarity-weight optional: timbre similarity weight for models such as MegaTTS3
- --temperature optional: model-specific sampling temperature
- --top-p optional: model-specific top-p sampling value
- --top-k optional: model-specific top-k sampling value
- --gender optional: async TTS gender hint
- --pitch optional: async TTS pitch hint
- --speed optional: async TTS speed hint (for example CosyVoice3, Spark-TTS-0.5B, or Qwen3-TTS)
- --num-inference-steps optional: AudioFly generation step count
- --guidance-scale optional: AudioFly classifier-free guidance scale
- --output-format optional: AudioFly or Qwen3-TTS output format such as mp3 or wav
- --voice optional: OpenAI-compatible voice field when supported by the target model
- --extra-body-json optional: JSON object for explicitly requested undocumented fields
- --response-data-format optional: url or blob for sync TTS
- --output optional: output file path when sync TTS returns binary audio
- --failover-enabled optional: request header X-Failover-Enabled, defaults to true
- perform_voice_feature_extraction.py options: --prompt, --file-url (URL only), --model (default FunAudioLLM-CosyVoice-300M), --failover-enabled, --output, --api-key
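The reference-audio option for vibevoice-large accepts either one URL or a JSON array string. One way such a value could be normalized into a list is sketched below. parse_prompt_audio_urls is a hypothetical name, and the bundled script may handle this differently.

```python
import json


def parse_prompt_audio_urls(raw):
    """Normalize a single URL or a JSON array string into a list of URLs."""
    raw = raw.strip()
    if raw.startswith("["):
        urls = json.loads(raw)  # e.g. '["https://a.wav","https://b.wav"]'
        if not isinstance(urls, list) or not all(isinstance(u, str) for u in urls):
            raise ValueError("expected a JSON array of URL strings")
        return urls
    return [raw]  # a bare URL becomes a one-element list
```

Returning a list in both cases lets the rest of the request-building code treat single and multiple reference clips uniformly.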
Workflow
1. Determine whether the user wants speech synthesis or CosyVoice voice-feature extraction.
2. For speech synthesis: ask the user to choose one of audiofly, chattts, cosyvoice2, cosyvoice3, cosyvoice-300m, fish-speech-1.2-sft, index-tts-1.5, index-tts-2, glm-tts, megatts3, moss-ttsd-v0.5, qwen-tts, spark-tts-0.5b, step-audio-tts-3b, or vibevoice-large if not specified.
3. For speech synthesis: read references/models.md, gather missing model-specific params, and execute perform_tts.py.
4. For voice-feature extraction: execute perform_voice_feature_extraction.py with --prompt and URL-only --file-url.
5. Parse script output.
   - For TTS output, prioritize AUDIO_URL: then AUDIO_FILE: then TTS_RESULT:.
   - For voice feature output, prioritize VOICE_URL: (if present), otherwise return VOICE_FEATURE_FILE: and summarize VOICE_FEATURE_RESULT:.
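The priority order in step 5 could be implemented as a scan over the script's standard output. This sketch assumes each marker appears at the start of a line followed by its value. That line format is an assumption, as is the helper name pick_tts_result.

```python
TTS_MARKERS = ("AUDIO_URL:", "AUDIO_FILE:", "TTS_RESULT:")  # highest priority first


def pick_tts_result(stdout):
    """Return (marker, value) for the highest-priority marker found, else None."""
    found = {}
    for line in stdout.splitlines():
        line = line.strip()
        for marker in TTS_MARKERS:
            # Keep only the first occurrence of each marker.
            if line.startswith(marker) and marker not in found:
                found[marker] = line[len(marker):].strip()
    for marker in TTS_MARKERS:
        if marker in found:
            return marker, found[marker]
    return None
```

The same pattern would apply to voice-feature output by swapping in VOICE_URL:, VOICE_FEATURE_FILE:, and VOICE_FEATURE_RESULT: as the marker tuple.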
Notes
- Keep the answer language consistent with the user's language.
- This script is standard-library only and is intended to run directly with python; do not require uv for moark-tts.
- If GITEEAI_API_KEY is missing, remind the user to provide --api-key.
- By default, all TTS requests send X-Failover-Enabled: true. Only set --failover-enabled false when the user explicitly needs to disable failover.
- audiofly is mapped to the official model name AudioFly. Use async mode only. When the user shows an OpenAI SDK example that puts num_inference_steps, guidance_scale, or output_format under extra_body, map them to --num-inference-steps, --guidance-scale, and --output-format.
- chattts is mapped to the official model name ChatTTS. When the user shows an OpenAI SDK example that puts prompt, temperature, top_P, top_K, or voice_url under extra_body, map them to --prompt, --temperature, --top-p, --top-k, and --voice-url.
- cosyvoice2 is mapped to the official model name CosyVoice2. When the user shows an OpenAI SDK example that puts prompt_wav_url, prompt_text, instruct_text, or seed under extra_body, map them to --prompt-wav-url, --prompt-text, --instruct-text, and --seed.
- cosyvoice3 is mapped to the official model name CosyVoice3. Use async mode only. When the user shows an OpenAI SDK example that puts prompt_wav_url, prompt_text, instruct_text, speed, or seed under extra_body, map them to --prompt-wav-url, --prompt-text, --instruct-text, --speed, and --seed.
- cosyvoice-300m is mapped to FunAudioLLM-CosyVoice-300M for sync /audio/speech. Map the OpenAI SDK voice_url field to --voice-url.
- CosyVoice voice-feature extraction uses /audio/voice-feature-extraction and is handled by perform_voice_feature_extraction.py; --file-url must be an http(s) URL (no local file path support).
- fish-speech-1.2-sft uses sync /audio/speech. When the user shows an OpenAI SDK example that puts voice_url under extra_body, map it to --voice-url.
- index-tts-1.5 currently uses the sync /audio/speech endpoint. When the user shows an OpenAI SDK example that puts prompt_audio_url under extra_body, map it to the script's --prompt-audio-url.
- index-tts-2 supports four emotion-control patterns: sync or async, each combined with audio-emotion or text-emotion control. Map emo_audio_prompt_url + emo_alpha to --emo-audio-prompt-url + --emo-alpha; map emo_text + use_emo_text to --emo-text + --use-emo-text. In auto mode it defaults to sync; when the user asks for async, force --mode async.
- megatts3 is mapped to the official model name MegaTTS3. When the user shows an OpenAI SDK example that puts prompt_language, intelligibility_weight, or similarity_weight under extra_body, map them to --prompt-language, --intelligibility-weight, and --similarity-weight.
- step-audio-tts-3b is mapped to the official model name Step-Audio-TTS-3B. When the user shows an OpenAI SDK example that puts prompt_audio_url and prompt_text under extra_body, map them to --prompt-audio-url and --prompt-text.
- spark-tts-0.5b is mapped to the official model name Spark-TTS-0.5B. Use async mode only. For plain synthesis, just pass text. For voice cloning, map prompt_audio_url and prompt_text to --prompt-audio-url and --prompt-text; gender/pitch/speed can be passed when explicitly requested.
- qwen-tts is mapped to the official model name Qwen3-TTS. Use async mode only. Prefer structured inputs items:
  - Built-in speaker item: prompt + speaker + optional language (Chinese/English) + optional instruction.
  - Custom voice item: prompt + prompt_audio_url + prompt_text + optional language + optional instruction.
  - Use --qwen-inputs-json for multiple items in one request; use --speaker/--language/--instruction for single-item mode.
  - Built-in speakers: Vivian, Serena, Uncle_Fu, Dylan, Eric, Ryan, Aiden, Ono_Anna, Sohee.
- moss-ttsd-v0.5 is mapped to the official model name MOSS-TTSD-v0.5. Use async mode only. Map single mode fields prompt_audio_single_url + prompt_text_single to --prompt-audio-single-url + --prompt-text-single, and role mode fields (prompt_audio_1_url, prompt_text_1, prompt_audio_2_url, prompt_text_2) to the matching CLI options. Pass audio_mode through --audio-mode and use_normalize through --use-normalize.
- vibevoice-large is mapped to the official model name VibeVoice-Large. Use async mode only. Map prompt_audio_urls to --prompt-audio-urls, and accept both a single URL string and a JSON array string. When the user provides only prompt_audio_url, map it into prompt_audio_urls automatically for compatibility.
- glm-tts currently exposes only the basic sync request in the official OpenAPI spec.
- Do not invent model parameters. If a field is not documented for that model, pass it only when the user explicitly asks for it, and do so through --extra-body-json.
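The per-model mapping rules above amount to a lookup from documented extra_body fields to CLI flags, with anything undocumented routed through --extra-body-json only on explicit request. The sketch below covers three models as an illustration. FIELD_FLAGS and map_extra_body are hypothetical names, and the table is transcribed from the notes above rather than taken from the script itself.

```python
import json

# Documented extra_body field -> CLI flag, transcribed from the notes above.
FIELD_FLAGS = {
    "audiofly": {
        "num_inference_steps": "--num-inference-steps",
        "guidance_scale": "--guidance-scale",
        "output_format": "--output-format",
    },
    "chattts": {
        "prompt": "--prompt",
        "temperature": "--temperature",
        "top_P": "--top-p",
        "top_K": "--top-k",
        "voice_url": "--voice-url",
    },
    "megatts3": {
        "prompt_language": "--prompt-language",
        "intelligibility_weight": "--intelligibility-weight",
        "similarity_weight": "--similarity-weight",
    },
}


def map_extra_body(model, extra_body, allow_undocumented=False):
    """Translate an OpenAI SDK extra_body dict into CLI arguments."""
    known = FIELD_FLAGS.get(model, {})
    argv, leftover = [], {}
    for field, value in extra_body.items():
        if field in known:
            argv += [known[field], str(value)]
        else:
            leftover[field] = value
    if leftover:
        if not allow_undocumented:
            raise ValueError(f"undocumented fields for {model}: {sorted(leftover)}")
        # Only reached when the user explicitly asked for these fields.
        argv += ["--extra-body-json", json.dumps(leftover)]
    return argv
```

Raising by default, rather than silently forwarding unknown fields, mirrors the rule that parameters should not be invented on the user's behalf.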
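Text-to-Speech (TTS)
This skill supports Gitee AI TTS plus CosyVoice voice feature extraction workflows.
It supports fifteen user-facing model choices for TTS:
- audiofly
- chattts
- cosyvoice2
- cosyvoice3
- cosyvoice-300m
- fish-speech-1.2-sft
- index-tts-1.5
- index-tts-2
- glm-tts
- megatts3
- moss-ttsd-v0.5
- qwen-tts
- spark-tts-0.5b
- step-audio-tts-3b
- vibevoice-large
When the user does not specify a model, ask them to choose one. After the model is chosen, ask only for the parameters relevant to that model.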
Usage
Use the bundled script to generate speech.
```bash
python {baseDir}/scripts/perform_tts.py --model cosyvoice2 --text "你好,我是模力方舟。" --voice alloy --api-key YOUR_API_KEY
```
For CosyVoice-300M voice feature extraction (voice cloning prep), use:
```bash
python {baseDir}/scripts/perform_voice_feature_extraction.py --model FunAudioLLM-CosyVoice-300M --prompt "提供用于声纹提取的提示文本" --file-url https://example.com/sample.mp3 --api-key YOUR_API_KEY
```
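Options
- --model required: audiofly, chattts, cosyvoice2, cosyvoice3, cosyvoice-300m, fish-speech-1.2-sft, index-tts-1.5, index-tts-2, glm-tts, megatts3, moss-ttsd-v0.5, qwen-tts, spark-tts-0.5b, step-audio-tts-3b, or vibevoice-large
- --text required in general: text to synthesize. For Qwen3-TTS multi-input mode (--qwen-inputs-json), --text is optional
- --mode optional: auto, sync, or async
- --prompt optional: model-specific style prompt such as ChatTTS tags
- --prompt-text optional: reference transcript for style-conditioned models
- --prompt-audio-url optional: reference audio URL for style-conditioned models
- --qwen-inputs-json optional: structured Qwen3-TTS inputs JSON (array/object). Supports mixed built-in and custom voice items
- --speaker optional: Qwen3-TTS built-in speaker for single input (Vivian, Serena, Uncle_Fu, Dylan, Eric, Ryan, Aiden, Ono_Anna, Sohee)
- --language optional: Qwen3-TTS language for single input (Chinese or English)
- --instruction optional: Qwen3-TTS style instruction for single input
- --prompt-audio-urls optional: vibevoice-large reference audio; supports one URL or JSON array string such as ["https://a.wav","https://b.wav"]
- --emo-audio-prompt-url optional: emotion reference audio URL for IndexTTS-2
- --emo-alpha optional: emotion mixing weight for IndexTTS-2 audio emotion control
- --emo-text optional: emotion control text for IndexTTS-2
- --use-emo-text optional: enable or disable emo_text for IndexTTS-2 (true/false)
- --prompt-wav-url optional: reference prompt WAV URL for CosyVoice2 or CosyVoice3
- --voice-url optional: reference voice audio URL for ChatTTS or fish-speech-1.2-sft cloning
- --instruct-text optional: model-specific instruction text such as CosyVoice2 or CosyVoice3 speaking style guidance
- --seed optional: model-specific seed value such as CosyVoice2 or CosyVoice3
- --audio-mode optional: single or role for moss-ttsd-v0.5 (required when mode cannot be inferred from fields)
- --prompt-audio-single-url optional: single-speaker reference audio URL for moss-ttsd-v0.5 single mode
- --prompt-text-single optional: single-speaker reference transcript for moss-ttsd-v0.5 single mode
- --prompt-audio-1-url optional: speaker-1 reference audio URL for moss-ttsd-v0.5 role mode
- --prompt-text-1 optional: speaker-1 reference transcript for moss-ttsd-v0.5 role mode
- --prompt-audio-2-url optional: speaker-2 reference audio URL for moss-ttsd-v0.5 role mode
- --prompt-text-2 optional: speaker-2 reference transcript for moss-ttsd-v0.5 role mode
- --use-normalize optional: enable or disable use_normalize for moss-ttsd-v0.5 (true/false)
- --prompt-language optional: prompt language hint for models such as MegaTTS3
- --intelligibility-weight optional: pronunciation intelligibility weight for models such as MegaTTS3
- --similarity-weight optional: timbre similarity weight for models such as MegaTTS3
- --temperature optional: model-specific sampling temperature
- --top-p optional: model-specific top-p sampling value
- --top-k optional: model-specific top-k sampling value
- --gender optional: async TTS gender hint
- --pitch optional: async TTS pitch hint
- --speed optional: async TTS speed hint (for example CosyVoice3, Spark-TTS-0.5B, or Qwen3-TTS)
- --num-inference-steps optional: AudioFly generation step count
- --guidance-scale optional: AudioFly classifier-free guidance scale
- --output-format optional: AudioFly or Qwen3-TTS output format such as mp3 or wav
- --voice optional: OpenAI-compatible voice field when supported by the target model
- --extra-body-json optional: JSON object for explicitly requested undocumented fields
- --response-data-format optional: url or blob for sync TTS
- --output optional: output file path when sync TTS returns binary audio
- --failover-enabled optional: request header X-Failover-Enabled, defaults to true
- perform_voice_feature_extraction.py options: --prompt, --file-url (URL only), --model (default FunAudioLLM-CosyVoice-300M), --failover-enabled, --output, --api-key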
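Workflow
1. Determine whether the user wants speech synthesis or CosyVoice voice-feature extraction.
2. For speech synthesis: if the user has not specified a model, ask them to choose one of audiofly, chattts, cosyvoice2, cosyvoice3, cosyvoice-300m, fish-speech-1.2-sft, index-tts-1.5, index-tts-2, glm-tts, megatts3, moss-ttsd-v0.5, qwen-tts, spark-tts-0.5b, step-audio-tts-3b, or vibevoice-large.
3. For speech synthesis: read references/models.md, gather missing model-specific params, and execute perform_tts.py.
4. For voice-feature extraction: execute perform_voice_feature_extraction.py with --prompt and URL-only --file-url.
5. Parse script output.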
- 对于 T