Text-to-Speech (TTS)

This skill supports Gitee AI TTS plus CosyVoice voice feature extraction workflows.
It supports fifteen user-facing model choices for TTS:

- audiofly
- chattts
- cosyvoice2
- cosyvoice3
- cosyvoice-300m
- fish-speech-1.2-sft
- index-tts-1.5
- index-tts-2
- glm-tts
- megatts3
- moss-ttsd-v0.5
- qwen-tts
- spark-tts-0.5b
- step-audio-tts-3b
- vibevoice-large

When the user does not specify a model, ask them to choose one. After the model is chosen, only ask for parameters that
are relevant to that model.

Usage

Use the bundled script to generate speech.

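    python {baseDir}/scripts/perform_tts.py --model cosyvoice2 --text "你好，我是模力方舟。" --voice alloy --api-key YOUR_API_KEY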

For CosyVoice-300M voice feature extraction (voice cloning prep), use:

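    python {baseDir}/scripts/perform_voice_feature_extraction.py --model FunAudioLLM-CosyVoice-300M --prompt "Prompt text for voiceprint extraction" --file-url https://example.com/sample.mp3 --api-key YOUR_API_KEY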

Options

- --model (required): audiofly, chattts, cosyvoice2, cosyvoice3, cosyvoice-300m, fish-speech-1.2-sft, index-tts-1.5, index-tts-2, glm-tts, megatts3, moss-ttsd-v0.5, qwen-tts, spark-tts-0.5b, step-audio-tts-3b, or vibevoice-large
- --text (generally required): text to synthesize. In Qwen3-TTS multi-input mode (--qwen-inputs-json), --text is optional
- --mode (optional): auto, sync, or async
- --prompt (optional): model-specific style prompt such as ChatTTS tags
- --prompt-text (optional): reference transcript for style-conditioned models
- --prompt-audio-url (optional): reference audio URL for style-conditioned models
- --qwen-inputs-json (optional): structured Qwen3-TTS inputs JSON (array/object). Supports mixed built-in and custom voice items
- --speaker (optional): Qwen3-TTS built-in speaker for single input (Vivian, Serena, Uncle_Fu, Dylan, Eric, Ryan, Aiden, Ono_Anna, Sohee)
- --language (optional): Qwen3-TTS language for single input (Chinese or English)
- --instruction (optional): Qwen3-TTS style instruction for single input
- --prompt-audio-urls (optional): vibevoice-large reference audio; supports one URL or a JSON array string such as ["https://a.wav","https://b.wav"]
- --emo-audio-prompt-url (optional): emotion reference audio URL for IndexTTS-2
- --emo-alpha (optional): emotion mixing weight for IndexTTS-2 audio emotion control
- --emo-text (optional): emotion control text for IndexTTS-2
- --use-emo-text (optional): enable or disable emo_text for IndexTTS-2 (true/false)
- --prompt-wav-url (optional): reference prompt WAV URL for CosyVoice2 or CosyVoice3
- --voice-url (optional): reference voice audio URL for ChatTTS or fish-speech-1.2-sft cloning
- --instruct-text (optional): model-specific instruction text such as CosyVoice2 or CosyVoice3 speaking style guidance
- --seed (optional): model-specific seed value such as CosyVoice2 or CosyVoice3
- --audio-mode (optional): single or role for moss-ttsd-v0.5 (required when the mode cannot be inferred from the fields)
- --prompt-audio-single-url (optional): single-speaker reference audio URL for moss-ttsd-v0.5 single mode
- --prompt-text-single (optional): single-speaker reference transcript for moss-ttsd-v0.5 single mode
- --prompt-audio-1-url (optional): speaker-1 reference audio URL for moss-ttsd-v0.5 role mode
- --prompt-text-1 (optional): speaker-1 reference transcript for moss-ttsd-v0.5 role mode
- --prompt-audio-2-url (optional): speaker-2 reference audio URL for moss-ttsd-v0.5 role mode
- --prompt-text-2 (optional): speaker-2 reference transcript for moss-ttsd-v0.5 role mode
- --use-normalize (optional): enable or disable use_normalize for moss-ttsd-v0.5 (true/false)
- --prompt-language (optional): prompt language hint for models such as MegaTTS3
- --intelligibility-weight (optional): pronunciation intelligibility weight for models such as MegaTTS3
- --similarity-weight (optional): timbre similarity weight for models such as MegaTTS3
- --temperature (optional): model-specific sampling temperature
- --top-p (optional): model-specific top-p sampling value
- --top-k (optional): model-specific top-k sampling value
- --gender (optional): async TTS gender hint
- --pitch (optional): async TTS pitch hint
- --speed (optional): async TTS speed hint (for example CosyVoice3, Spark-TTS-0.5B, or Qwen3-TTS)
- --num-inference-steps (optional): AudioFly generation step count
- --guidance-scale (optional): AudioFly classifier-free guidance scale
- --output-format (optional): AudioFly or Qwen3-TTS output format such as mp3 or wav
- --voice (optional): OpenAI-compatible voice field when supported by the target model
- --extra-body-json (optional): JSON object for explicitly requested undocumented fields
- --response-data-format (optional): url or blob for sync TTS
- --output (optional): output file path when sync TTS returns binary audio
- --failover-enabled (optional): request header X-Failover-Enabled, defaults to true
- perform_voice_feature_extraction.py options: --prompt, --file-url (URL only), --model (default FunAudioLLM-CosyVoice-300M), --failover-enabled, --output, --api-key
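To illustrate the --qwen-inputs-json option, here is a minimal Python sketch that builds a two-item payload and the matching argument list. The item keys (prompt, speaker, language, prompt_audio_url, prompt_text) follow the Qwen3-TTS note in this document. The helper name build_qwen_args, the sample sentences, and the example.com URL are hypothetical, and the exact schema the script accepts should be confirmed in references/models.md.

```python
import json


def build_qwen_args(items):
    """Return perform_tts.py arguments for a multi-item Qwen3-TTS request.

    Hypothetical helper: it only assembles arguments and does not call the script.
    """
    # ensure_ascii=False keeps non-ASCII text readable inside the JSON string.
    payload = json.dumps(items, ensure_ascii=False)
    return ["--model", "qwen-tts", "--mode", "async", "--qwen-inputs-json", payload]


items = [
    # Built-in speaker item: prompt + speaker + optional language.
    {"prompt": "Hello from a built-in voice.", "speaker": "Vivian", "language": "English"},
    # Custom voice item: prompt + reference audio URL + reference transcript.
    {
        "prompt": "Hello from a cloned voice.",
        "prompt_audio_url": "https://example.com/ref.wav",  # placeholder URL
        "prompt_text": "Transcript of the reference audio.",
    },
]

args = build_qwen_args(items)
print(args)
```

Passing the arguments as a list (for example to subprocess.run) avoids shell quoting problems with the embedded JSON string.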

Workflow

1. Determine whether the user wants speech synthesis or CosyVoice voice-feature extraction.
2. For speech synthesis: if no model is specified, ask the user to choose one of audiofly, chattts, cosyvoice2, cosyvoice3, cosyvoice-300m, fish-speech-1.2-sft, index-tts-1.5, index-tts-2, glm-tts, megatts3, moss-ttsd-v0.5, qwen-tts, spark-tts-0.5b, step-audio-tts-3b, or vibevoice-large.
3. For speech synthesis: read references/models.md, gather any missing model-specific parameters, and execute perform_tts.py.
4. For voice-feature extraction: execute perform_voice_feature_extraction.py with --prompt and a URL-only --file-url.
5. Parse the script output.
   - For TTS output, prioritize AUDIO_URL:, then AUDIO_FILE:, then TTS_RESULT:.
   - For voice-feature output, prioritize VOICE_URL: (if present); otherwise return VOICE_FEATURE_FILE: and summarize VOICE_FEATURE_RESULT:.
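The TTS output priority in step 5 can be sketched in Python as follows. This assumes that each marker starts its own line of the script's standard output with the value on the same line; the document does not specify that layout. The function name pick_tts_result and the sample output are hypothetical.

```python
# Markers in the priority order given in the workflow.
PRIORITY = ("AUDIO_URL:", "AUDIO_FILE:", "TTS_RESULT:")


def pick_tts_result(stdout):
    """Return (marker, value) for the highest-priority marker found, or None."""
    found = {}
    for line in stdout.splitlines():
        stripped = line.strip()
        for marker in PRIORITY:
            # Keep only the first occurrence of each marker.
            if stripped.startswith(marker) and marker not in found:
                found[marker] = stripped[len(marker):].strip()
    for marker in PRIORITY:
        if marker in found:
            return marker, found[marker]
    return None


# Made-up output for illustration only.
sample = 'TTS_RESULT: {"status": "ok"}\nAUDIO_URL: https://example.com/out.mp3\n'
print(pick_tts_result(sample))
```

The same pattern would apply to the voice-feature markers by swapping in VOICE_URL:, VOICE_FEATURE_FILE:, and VOICE_FEATURE_RESULT:.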

Notes

- Keep the answer language consistent with the user's language.
- The script is standard-library only and is intended to run directly with python; do not require uv for moark-tts.
- If GITEEAI_API_KEY is missing, remind the user to provide --api-key.
- By default, all TTS requests send X-Failover-Enabled: true. Only set --failover-enabled false when the user explicitly needs to disable failover.
- audiofly is mapped to the official model name AudioFly. Use async mode only. When the user shows an OpenAI SDK example that puts num_inference_steps, guidance_scale, or output_format under extra_body, map them to --num-inference-steps, --guidance-scale, and --output-format.
- chattts is mapped to the official model name ChatTTS. When the user shows an OpenAI SDK example that puts prompt, temperature, top_P, top_K, or voice_url under extra_body, map them to --prompt, --temperature, --top-p, --top-k, and --voice-url.
- cosyvoice2 is mapped to the official model name CosyVoice2. When the user shows an OpenAI SDK example that puts prompt_wav_url, prompt_text, instruct_text, or seed under extra_body, map them to --prompt-wav-url, --prompt-text, --instruct-text, and --seed.
- cosyvoice3 is mapped to the official model name CosyVoice3. Use async mode only. When the user shows an OpenAI SDK example that puts prompt_wav_url, prompt_text, instruct_text, speed, or seed under extra_body, map them to --prompt-wav-url, --prompt-text, --instruct-text, --speed, and --seed.
- cosyvoice-300m is mapped to FunAudioLLM-CosyVoice-300M and uses sync /audio/speech. Map the corresponding OpenAI SDK voice field to --voice-url.
- CosyVoice voice-feature extraction uses /audio/voice-feature-extraction and is handled by perform_voice_feature_extraction.py; --file-url must be an http(s) URL (local file paths are not supported).
- fish-speech-1.2-sft uses sync /audio/speech. When the user shows an OpenAI SDK example that puts voice_url under extra_body, map it to --voice-url.
- index-tts-1.5 currently uses the sync /audio/speech endpoint. When the user shows an OpenAI SDK example that puts prompt_audio_url under extra_body, map it to the script's --prompt-audio-url.
- index-tts-2 supports four emotion-control patterns: sync or async, each with audio emotion or text emotion. Map emo_audio_prompt_url + emo_alpha to --emo-audio-prompt-url + --emo-alpha; map emo_text + use_emo_text to --emo-text + --use-emo-text. In auto mode it defaults to sync; when the user asks for async, force --mode async.
- megatts3 is mapped to the official model name MegaTTS3. When the user shows an OpenAI SDK example that puts prompt_language, intelligibility_weight, or similarity_weight under extra_body, map them to --prompt-language, --intelligibility-weight, and --similarity-weight.
- step-audio-tts-3b is mapped to the official model name Step-Audio-TTS-3B. When the user shows an OpenAI SDK example that puts prompt_audio_url and prompt_text under extra_body, map them to --prompt-audio-url and --prompt-text.
- spark-tts-0.5b is mapped to the official model name Spark-TTS-0.5B. Use async mode only. For plain synthesis, just pass text. For voice cloning, map prompt_audio_url and prompt_text to --prompt-audio-url and --prompt-text; gender, pitch, and speed can be passed when explicitly requested.
- qwen-tts is mapped to the official model name Qwen3-TTS. Use async mode only. Prefer structured inputs items:
  - Built-in speaker item: prompt + speaker + optional language (Chinese/English) + optional instruction.
  - Custom voice item: prompt + prompt_audio_url + prompt_text + optional language + optional instruction.
  - Use --qwen-inputs-json for multiple items in one request; use --speaker/--language/--instruction for single-item mode.
  - Built-in speakers: Vivian, Serena, Uncle_Fu, Dylan, Eric, Ryan, Aiden, Ono_Anna, Sohee.
- moss-ttsd-v0.5 is mapped to the official model name MOSS-TTSD-v0.5. Use async mode only. Map the single mode fields prompt_audio_single_url + prompt_text_single to --prompt-audio-single-url + --prompt-text-single, and the role mode fields (prompt_audio_1_url, prompt_text_1, prompt_audio_2_url, prompt_text_2) to the matching CLI options. Pass audio_mode through --audio-mode and use_normalize through --use-normalize.
- vibevoice-large is mapped to the official model name VibeVoice-Large. Use async mode only. Map prompt_audio_urls to --prompt-audio-urls, and accept both a single URL string and a JSON array string. When the user provides only prompt_audio_url, map it into prompt_audio_urls automatically for compatibility.
- glm-tts currently exposes only the basic sync request in the official OpenAPI spec.
- Do not invent model parameters. If a field is not documented for that model, only pass it when the user explicitly asked for it, and use --extra-body-json.
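The extra_body mappings in the notes above can be expressed as a small lookup table. The following Python sketch covers only ChatTTS and MegaTTS3, with field and flag names copied from those notes. The table, the function name extra_body_to_args, and the sample values are hypothetical and are not part of the bundled script.

```python
# Field-to-flag pairs copied from the ChatTTS and MegaTTS3 notes.
FIELD_TO_FLAG = {
    "chattts": {
        "prompt": "--prompt",
        "temperature": "--temperature",
        "top_P": "--top-p",
        "top_K": "--top-k",
        "voice_url": "--voice-url",
    },
    "megatts3": {
        "prompt_language": "--prompt-language",
        "intelligibility_weight": "--intelligibility-weight",
        "similarity_weight": "--similarity-weight",
    },
}


def extra_body_to_args(model, extra_body):
    """Split extra_body into documented CLI arguments and undocumented leftovers."""
    mapping = FIELD_TO_FLAG.get(model, {})
    args, leftover = [], {}
    for key, value in extra_body.items():
        if key in mapping:
            args += [mapping[key], str(value)]
        else:
            # Undocumented for this model: never forwarded automatically.
            leftover[key] = value
    return args, leftover


args, leftover = extra_body_to_args("chattts", {"temperature": 0.3, "top_P": 0.7, "foo": 1})
print(args)
print(leftover)
```

Anything that lands in leftover should go to --extra-body-json only when the user explicitly asked for that field, in line with the last note.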
