Speech Generation Skill
Generate spoken audio for the current project (narration, product demo voiceover, IVR prompts, accessibility reads). Defaults to gpt-4o-mini-tts-2025-12-15 and built-in voices, and prefers the bundled CLI for deterministic, reproducible runs.
When to use
- Generate a single spoken clip from text
- Generate a batch of prompts (many lines, many files)
Decision tree (single vs batch)
- If the user provides multiple lines/prompts or wants many outputs -> batch
- Else -> single
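The decision rule above is small enough to state as a predicate. This is a minimal sketch, assuming prompts arrive as a list of strings; `choose_mode` is a hypothetical helper for illustration and is not part of the bundled CLI.

```python
def choose_mode(prompts: list[str], wants_many_outputs: bool = False) -> str:
    """Return "batch" or "single" following the decision tree above."""
    # Ignore blank entries so a trailing empty line does not force batch mode.
    non_empty = [p for p in prompts if p.strip()]
    if len(non_empty) > 1 or wants_many_outputs:
        return "batch"
    return "single"
```

A single prompt that should still produce several files (for example one line rendered in two voices) can be routed to batch by passing `wants_many_outputs=True`.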
Workflow
1. Decide intent: single or batch (see the decision tree above).
2. Collect inputs up front: exact text (verbatim), desired voice, delivery style, format, and any constraints.
3. If batch: write a temporary JSONL under tmp/ (one job per line), run once, then delete the JSONL.
4. Augment instructions into a short labeled spec without rewriting the input text.
5. Run the bundled CLI (scripts/text_to_speech.py) with sensible defaults (see references/cli.md).
6. For important clips, validate intelligibility, pacing, pronunciation, and adherence to constraints.
7. Iterate with a single targeted change (voice, speed, or instructions), then re-check.
8. Save/return final outputs and note the final text, instructions, and flags used.
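The batch step in the workflow above (write a temporary JSONL, run once, delete it) might look like the sketch below. The job keys `input`, `voice`, and `out` are taken from this skill's batch example; the file name `jobs.jsonl` and the helper `write_batch_jsonl` are assumptions for illustration. The actual CLI invocation is deliberately left as a comment, since its subcommands and flags are documented in references/cli.md rather than here.

```python
import json
from pathlib import Path


def write_batch_jsonl(jobs: list[dict], path: Path) -> Path:
    """Write one JSON object per line; each job keeps its input text verbatim."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as fh:
        for job in jobs:
            # ensure_ascii=False keeps non-ASCII text readable in the file.
            fh.write(json.dumps(job, ensure_ascii=False) + "\n")
    return path


jobs = [
    {"input": "Thank you for calling. Please hold.", "voice": "cedar", "out": "hold.mp3"},
    {"input": "For sales, press 1.", "voice": "marin"},
]
batch_path = write_batch_jsonl(jobs, Path("tmp/speech/jobs.jsonl"))
try:
    # Run scripts/text_to_speech.py once against batch_path here
    # (see references/cli.md for the exact command).
    line_count = len(batch_path.read_text(encoding="utf-8").splitlines())
finally:
    batch_path.unlink()  # delete the temporary JSONL when done
```

The `try`/`finally` ensures the intermediate file is removed even if the run fails partway.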
Temp and output conventions
- Use tmp/speech/ for intermediate files (for example JSONL batches); delete them when done.
- Write final artifacts under output/speech/ when working in this repo.
- Use --out or --out-dir to control output paths; keep filenames stable and descriptive.
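One way to keep filenames stable and descriptive is to derive them from a short label and the voice, so that re-running the same job overwrites the same file. This is only a sketch: `stable_filename` and its naming scheme are hypothetical, not a convention the CLI enforces.

```python
import re


def stable_filename(label: str, voice: str, fmt: str = "mp3") -> str:
    """Build a repeatable filename such as 'welcome-intro_cedar.mp3'."""
    # Lowercase, then collapse every run of non-alphanumerics into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", label.lower()).strip("-")
    # Fall back to a generic name if the label had no usable characters.
    return f"{slug or 'clip'}_{voice}.{fmt}"
```

The result can be joined onto output/speech/ and passed to --out.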
Dependencies (install if missing)
Prefer uv for dependency management.
Python packages:
    uv pip install openai
If uv is unavailable:
    python3 -m pip install openai
Environment
- OPENAI_API_KEY must be set for live API calls.
If the key is missing, give the user these steps:
1. Create an API key in the OpenAI platform UI: https://platform.openai.com/api-keys
2. Set OPENAI_API_KEY as an environment variable on their system.
3. Offer to guide them through setting the environment variable for their OS/shell if needed.
- Never ask the user to paste the full key in chat. Ask them to set it locally and confirm when ready.
If installation isn't possible in this environment, tell the user which dependency is missing and how to install it locally.
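A pre-flight check for the key can be written so that the key's value is never printed or returned. The sketch below assumes the key is read from the process environment; `api_key_status` is a hypothetical helper, not a function of the bundled CLI.

```python
import os


def api_key_status(env=os.environ) -> str:
    """Report whether OPENAI_API_KEY is set, without ever echoing its value."""
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        return "missing: set OPENAI_API_KEY locally, then confirm when ready"
    # Only presence is reported; the key itself never reaches logs or chat.
    return "present"
```

Passing `env` explicitly makes the check easy to exercise without touching the real environment.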
Defaults & rules
- Use gpt-4o-mini-tts-2025-12-15 unless the user requests another model.
- Default voice: cedar. If the user wants a brighter tone, prefer marin.
- Built-in voices only. Custom voices are out of scope for this skill.
- The instructions parameter is supported for GPT-4o mini TTS models, but not for tts-1 or tts-1-hd.
- Input length must be <= 4096 characters per request. Split longer text into chunks.
- Enforce 50 requests/minute. The CLI caps --rpm at 50.
- Require OPENAI_API_KEY before any live API call.
- Provide a clear disclosure to end users that the voice is AI-generated.
- Use the OpenAI Python SDK (openai package) for all API calls; do not use raw HTTP.
- Prefer the bundled CLI (scripts/text_to_speech.py) over writing new one-off scripts.
- Never modify scripts/text_to_speech.py. If something is missing, ask the user before doing anything else.
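The 4096-character rule above means long scripts have to be split before they are sent. The sketch below shows one possible splitter that prefers sentence boundaries, then whitespace, and hard-cuts only as a last resort. `split_text` is a hypothetical helper; whether the bundled CLI chunks input on its own is not stated here, so check references/cli.md first.

```python
MAX_CHARS = 4096  # per-request input limit stated above


def split_text(text: str, limit: int = MAX_CHARS) -> list[str]:
    """Split text into chunks of at most `limit` characters."""
    chunks = []
    rest = text.strip()
    while len(rest) > limit:
        window = rest[:limit]
        # Prefer the last sentence end inside the window.
        cut = max(window.rfind(". "), window.rfind("! "), window.rfind("? "))
        if cut != -1:
            cut += 1  # keep the punctuation with the chunk
        else:
            # Otherwise break at the last space, or hard-cut if there is none.
            cut = window.rfind(" ")
            if cut <= 0:
                cut = limit
        chunks.append(rest[:cut].strip())
        rest = rest[cut:].strip()
    if rest:
        chunks.append(rest)
    return chunks
```

Each chunk becomes its own request (or its own JSONL job), so the 50 requests/minute cap still applies to the resulting batch.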
Instruction augmentation
Reformat user direction into a short, labeled spec. Only make implicit details explicit; do not invent new requirements.
Quick clarification (augmentation vs invention):
- If the user says "narration for a demo", you may add implied delivery constraints (clear, steady pacing, friendly tone).
- Do not introduce a new persona, accent, or emotional style the user did not request.
Template (include only relevant lines):
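    Voice Affect: <overall character and texture of the voice>
    Tone: <attitude, formality, warmth>
    Pacing: <slow, steady, brisk>
    Emotion: <key emotions to convey>
    Pronunciation: <words to enunciate or emphasize>
    Pauses: <where to add intentional pauses>
    Emphasis: <key words or phrases to stress>
    Delivery: <cadence or rhythm notes>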
Augmentation rules:
- Keep it short; add only details the user already implied or provided elsewhere.
- Do not rewrite the input text.
- If any critical detail is missing and blocks success, ask a question; otherwise proceed.
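The augmentation rules above can be made mechanical: render only the labels that were actually supplied, in a fixed order, and reject anything outside the template. This is a sketch under assumptions. The label names follow this skill's template, and `build_instructions` is a hypothetical helper rather than part of the CLI.

```python
# Label order follows the structure this skill recommends:
# affect -> tone -> pacing -> emotion -> pronunciation/pauses -> emphasis.
LABEL_ORDER = ["Voice Affect", "Tone", "Pacing", "Emotion",
               "Pronunciation", "Pauses", "Emphasis", "Delivery"]


def build_instructions(fields: dict[str, str]) -> str:
    """Render only the labels the user supplied or clearly implied."""
    unknown = set(fields) - set(LABEL_ORDER)
    if unknown:
        # Refuse labels outside the template instead of inventing new ones.
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    lines = [f"{label}: {fields[label].strip()}"
             for label in LABEL_ORDER if fields.get(label, "").strip()]
    return "\n".join(lines)
```

Because the function never touches the input text, the "do not rewrite the input text" rule holds by construction; only the delivery spec is assembled.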
Examples
Single example (narration)
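    Input text: Welcome to the demo. Today we'll show how it works.
    Instructions:
    Voice Affect: Warm and composed.
    Tone: Friendly and confident.
    Pacing: Steady and moderate.
    Emphasis: Stress "demo" and "show".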
Batch example (IVR prompts)
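    {"input":"Thank you for calling. Please hold.","voice":"cedar","response_format":"mp3","out":"hold.mp3"}
    {"input":"For sales, press 1. For support, press 2.","voice":"marin","instructions":"Tone: Clear and neutral. Pacing: Slow.","response_format":"wav"}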
Instruction best practices (short list)
- Structure directions as: affect -> tone -> pacing -> emotion -> pronunciation/pauses -> emphasis.
- Keep 4 to 8 short lines; avoid conflicting guidance.
- For names/acronyms, add pronunciation hints (e.g., "enunciate A-I") or supply a phonetic spelling in the text.
- For edits/iterations, repeat invariants (e.g., "keep pacing steady") to reduce drift.
- Iterate with single-change follow-ups.
More principles: references/prompting.md. Copy/paste specs: references/sample-prompts.md.
Guidance by use case
Use these modules when the request is for a specific delivery style. They provide targeted defaults and templates.
- Narration / explainer: references/narration.md
- Product demo / voiceover: references/voiceover.md
- IVR / phone prompts: references/ivr.md
- Accessibility reads: references/accessibility.md
CLI + environment notes
- CLI commands + examples: references/cli.md
- API parameter quick reference: references/audio-api.md
- Instruction patterns + examples: references/voice-directions.md
- If network approvals / sandbox settings are getting in the way: references/codex-network.md
Reference map
- references/cli.md: how to run speech generation/batches via scripts/text_to_speech.py (commands, flags, recipes).
- references/audio-api.md: API parameters, limits, voice list.
- references/voice-directions.md: instruction patterns and examples.
- references/prompting.md: instruction best practices (structure, constraints, iteration patterns).
- references/sample-prompts.md: copy/paste instruction recipes (examples only; no extra theory).
- references/narration.md: templates + defaults for narration and explainers.
- references/voiceover.md: templates + defaults for product demo voiceovers.
- references/ivr.md: templates + defaults for IVR/phone prompts.
- references/accessibility.md: templates + defaults for accessibility reads.
- references/codex-network.md: environment/sandbox/network-approval troubleshooting.
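Speech Generation Skill
Generate spoken audio for the current project (narration, product demo voiceover, IVR prompts, accessibility reads). Defaults to gpt-4o-mini-tts-2025-12-15 and built-in voices, and prefers the bundled CLI for deterministic, reproducible runs.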
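When to use
- Generate a single spoken clip from text
- Generate a batch of prompts (many lines, many files)
Decision tree (single vs batch)
- If the user provides multiple lines/prompts or wants many outputs -> batch
- Else -> single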
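Workflow
1. Decide intent: single or batch (see the decision tree above).
2. Collect inputs up front: exact text (verbatim), desired voice, delivery style, format, and any constraints.
3. If batch: write a temporary JSONL file under tmp/ (one job per line), run once, then delete the JSONL file.
4. Augment instructions into a short labeled spec without rewriting the input text.
5. Run the bundled CLI (scripts/text_to_speech.py) with sensible defaults (see references/cli.md).
6. For important clips, validate intelligibility, pacing, pronunciation, and adherence to constraints.
7. Iterate with a single targeted change (voice, speed, or instructions), then re-check.
8. Save/return final outputs and note the final text, instructions, and flags used.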
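Temp and output conventions
- Use tmp/speech/ for intermediate files (for example JSONL batches); delete them when done.
- Write final artifacts under output/speech/ when working in this repo.
- Use --out or --out-dir to control output paths; keep filenames stable and descriptive.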
Dependencies (install if missing)
Prefer uv for dependency management.
Python packages:
    uv pip install openai
If uv is unavailable:
    python3 -m pip install openai
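Environment
- OPENAI_API_KEY must be set for live API calls.
If the key is missing, give the user these steps:
1. Create an API key in the OpenAI platform UI: https://platform.openai.com/api-keys
2. Set OPENAI_API_KEY as an environment variable on their system.
3. Offer to guide them through setting the environment variable for their OS/shell if needed.
- Never ask the user to paste the full key in chat. Ask them to set it locally and confirm when ready.
If installation isn't possible in this environment, tell the user which dependency is missing and how to install it locally.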
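Defaults & rules
- Use gpt-4o-mini-tts-2025-12-15 unless the user requests another model.
- Default voice: cedar. If the user wants a brighter tone, prefer marin.
- Built-in voices only. Custom voices are out of scope for this skill.
- The instructions parameter is supported for GPT-4o mini TTS models, but not for tts-1 or tts-1-hd.
- Input length must be <= 4096 characters per request. Split longer text into chunks.
- Enforce 50 requests/minute. The CLI caps --rpm at 50.
- Require OPENAI_API_KEY before any live API call.
- Provide a clear disclosure to end users that the voice is AI-generated.
- Use the OpenAI Python SDK (openai package) for all API calls; do not use raw HTTP.
- Prefer the bundled CLI (scripts/text_to_speech.py) over writing new one-off scripts.
- Never modify scripts/text_to_speech.py. If something is missing, ask the user before doing anything else.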
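Instruction augmentation
Reformat user direction into a short, labeled spec. Only make implicit details explicit; do not invent new requirements.
Quick clarification (augmentation vs invention):
- If the user says "narration for a demo", you may add implied delivery constraints (clear, steady pacing, friendly tone).
- Do not introduce a new persona, accent, or emotional style the user did not request.
Template (include only relevant lines):
    Voice Affect: <overall character and texture of the voice>
    Tone: <attitude, formality, warmth>
    Pacing: <slow, steady, brisk>
    Emotion: <key emotions to convey>
    Pronunciation: <words to enunciate or emphasize>
    Pauses: <where to add intentional pauses>
    Emphasis: <key words or phrases to stress>
    Delivery: <cadence or rhythm notes>
Augmentation rules:
- Keep it short; add only details the user already implied or provided elsewhere.
- Do not rewrite the input text.
- If any critical detail is missing and blocks success, ask a question; otherwise proceed.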
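Examples
Single example (narration)
    Input text: Welcome to the demo. Today we'll show how it works.
    Instructions:
    Voice Affect: Warm and composed.
    Tone: Friendly and confident.
    Pacing: Steady and moderate.
    Emphasis: Stress "demo" and "show".
Batch example (IVR prompts)
    {"input":"Thank you for calling. Please hold.","voice":"cedar","response_format":"mp3","out":"hold.mp3"}
    {"input":"For sales, press 1. For support, press 2.","voice":"marin","instructions":"Tone: Clear and neutral. Pacing: Slow.","response_format":"wav"}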
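Instruction best practices (short list)
- Structure directions as: affect -> tone -> pacing -> emotion -> pronunciation/pauses -> emphasis.
- Keep 4 to 8 short lines; avoid conflicting guidance.
- For names/acronyms, add pronunciation hints (e.g., "enunciate A-I") or supply a phonetic spelling in the text.
- For edits/iterations, repeat invariants (e.g., "keep pacing steady") to reduce drift.
- Iterate with single-change follow-ups.
More principles: references/prompting.md. Copy/paste specs: references/sample-prompts.md.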
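Guidance by use case
Use these modules when the request is for a specific delivery style. They provide targeted defaults and templates.
- Narration / explainer: references/narration.md
- Product demo / voiceover: references/voiceover.md
- IVR / phone prompts: references/ivr.md
- Accessibility reads: references/accessibility.md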
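CLI + environment notes
- CLI commands + examples: references/cli.md
- API parameter quick reference: references/audio-api.md
- Instruction patterns + examples: references/voice-directions.md
- If network approvals / sandbox settings are getting in the way: references/codex-network.md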
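Reference map
- references/cli.md: how to run speech generation/batches via scripts/text_to_speech.py (commands, flags, recipes).
- references/audio-api.md: API parameters, limits, voice list.
- references/voice-directions.md: instruction patterns and examples.
- references/prompting.md: instruction best practices (structure, constraints, iteration patterns).
- references/sample-prompts.md: copy/paste instruction recipes (examples only; no extra theory).
- references/narration.md: templates + defaults for narration and explainers.
- references/voiceover.md: templates + defaults for product demo voiceovers.
- references/ivr.md: templates + defaults for IVR/phone prompts.
- references/accessibility.md: templates + defaults for accessibility reads.
- references/codex-network.md: environment/sandbox/network-approval troubleshooting.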