Deepgram Voice Workflow
Overview
Use this skill for a complete speech workflow:
- transcribe audio to text with Deepgram STT
- optionally synthesize a spoken reply with Deepgram TTS
- return structured outputs that can feed chat or agent pipelines
This skill is the right choice when the task is broader than plain transcription and needs an input-audio to output-audio pipeline.
Quick Start
Transcribe only
{baseDir}/scripts/deepgram-transcribe.sh /path/to/audio.ogg
Generate speech from text
{baseDir}/scripts/deepgram-tts.sh "Hello, I am Neko."
Run the full pipeline
{baseDir}/scripts/neko-voice-pipeline.sh /path/to/audio.ogg --reply "Got it, this is a voice reply test."
Environment
Set DEEPGRAM_API_KEY before use.
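A wrapper can check for the key before calling any of the scripts, so a missing key fails fast with a clear message. The following is a minimal sketch; the function name `require_deepgram_key` is hypothetical, and it checks only the environment variable, not any fallback location the bundled scripts may consult.

```shell
# Hypothetical helper (not part of the bundled scripts): stop early when the
# key is absent instead of failing later inside an API call.
require_deepgram_key() {
  if [ -z "${DEEPGRAM_API_KEY:-}" ]; then
    echo "DEEPGRAM_API_KEY is not set" >&2
    return 1
  fi
}

# Example: run the check first, and call a bundled script only if it passes.
#   require_deepgram_key || exit 1
```

The `${DEEPGRAM_API_KEY:-}` form keeps the check safe even when the calling script runs with `set -u`.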
The bundled scripts also fall back to reading it from:
Workflow Decision
Use deepgram-transcribe.sh when
- only text transcription is needed
- the downstream system will generate its own reply
- the task is speech-to-text only
Use deepgram-tts.sh when
- text already exists
- only an MP3 spoken response is needed
- the workflow is text-to-speech only
Use neko-voice-pipeline.sh when
- the task begins with an audio file
- a transcript is needed
- an optional spoken reply should be generated in the same flow
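The decision rules above can be condensed into a small dispatcher. This is only a sketch of the selection logic: the function name `choose_deepgram_script` and its two arguments (input kind, and whether a spoken reply is wanted) are assumptions made for illustration, not part of the skill.

```shell
# Hypothetical helper: map "what the caller has" and "whether a spoken reply
# is wanted" to one of the three bundled script names.
#   $1 = input kind: "audio" or "text"
#   $2 = spoken reply wanted: "yes" or "no"
choose_deepgram_script() {
  case "$1:$2" in
    text:*)    echo "deepgram-tts.sh" ;;          # text already exists
    audio:yes) echo "neko-voice-pipeline.sh" ;;   # transcript plus spoken reply
    audio:no)  echo "deepgram-transcribe.sh" ;;   # speech-to-text only
    *)         echo "unknown input kind: $1" >&2; return 1 ;;
  esac
}
```

For example, `choose_deepgram_script audio yes` prints `neko-voice-pipeline.sh`.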
Outputs
STT output
deepgram-transcribe.sh writes:
- transcript text file
- raw API JSON file next to it
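To make the two STT outputs concrete, here is a rough sketch of a transcription call against Deepgram's `/v1/listen` endpoint that saves the raw JSON and then extracts the transcript into a text file beside it. It is not the source of the bundled script. The function names, the choice to place both files next to the audio file, the `audio/ogg` content type, and the use of `jq` are all assumptions.

```shell
# Build the request URL; nova-2 and zh mirror the defaults named in this document.
#   $1 = model (optional), $2 = language (optional)
deepgram_listen_url() {
  printf 'https://api.deepgram.com/v1/listen?model=%s&language=%s\n' \
    "${1:-nova-2}" "${2:-zh}"
}

# Hypothetical stand-in for the bundled script (assumes curl and jq are installed).
# Usage: stt_sketch /path/to/audio.ogg
stt_sketch() {
  audio=$1
  json="${audio%.*}.json"   # raw API response (output location is an assumption)
  txt="${audio%.*}.txt"     # plain transcript next to the JSON file
  curl -sS -X POST "$(deepgram_listen_url)" \
    -H "Authorization: Token ${DEEPGRAM_API_KEY}" \
    -H "Content-Type: audio/ogg" \
    --data-binary "@${audio}" \
    -o "$json" || return 1
  # Pull the first alternative of the first channel out of the response.
  jq -r '.results.channels[0].alternatives[0].transcript' "$json" > "$txt"
}
```

Keeping the raw JSON on disk means an API error body or a low-confidence result can be inspected afterwards without repeating the request.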
TTS output
deepgram-tts.sh writes:
Pipeline output
neko-voice-pipeline.sh prints JSON with:
- outdir
- transcriptpath
- transcript
- replyaudiopath
This makes it easy to wire into scripts or adapters.
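As an illustration of that wiring, the sketch below reads single fields from the pipeline's JSON with `jq`. The helper name `pipeline_field` is hypothetical. The key names follow the spelling that appears in this document, and the exact spelling in real output is an assumption. The sample values are made up rather than captured from a run.

```shell
# Hypothetical helper: print one top-level field of the pipeline JSON, or
# nothing when the key is missing or null (requires jq).
#   $1 = JSON text, $2 = key name
pipeline_field() {
  printf '%s' "$1" | jq -r --arg k "$2" '.[$k] // empty'
}

# Illustrative sample only: paths and text are invented, not real output.
sample='{"outdir":"/tmp/voice","transcriptpath":"/tmp/voice/in.txt","transcript":"hello","replyaudiopath":"/tmp/voice/reply.mp3"}'

# Example:
#   pipeline_field "$sample" transcript
#   pipeline_field "$sample" replyaudiopath
```

In a bot adapter, the first argument would be the captured standard output of the pipeline script rather than a literal string.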
Typical Uses
Prefer this skill for:
- transcribing Telegram/QQ/OneBot voice messages
- generating MP3 replies to short voice prompts
- building bot-side voice input/output automation
- testing speech pipelines from shell without introducing a full SDK
Notes
- Defaults are tuned for lightweight practical use, not maximal configurability.
- deepgram-transcribe.sh defaults to model=nova-2 and language=zh.
- deepgram-tts.sh defaults to model=aura-2-luna-en; override the model when a different voice is preferred.
- Inspect the raw JSON transcript response when debugging recognition quality or API errors.
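The TTS default and the idea of overriding the model can be sketched as a direct call to Deepgram's `/v1/speak` endpoint. This is not the bundled script, and how that script actually accepts a model override is not stated here. The function names and the optional third argument are assumptions, and `jq` is used only to build a correctly escaped JSON body.

```shell
# Build the TTS URL; aura-2-luna-en mirrors the default named in this document.
#   $1 = model (optional; empty falls back to the default)
deepgram_speak_url() {
  printf 'https://api.deepgram.com/v1/speak?model=%s&encoding=mp3\n' \
    "${1:-aura-2-luna-en}"
}

# Hypothetical stand-in for the bundled script (assumes curl and jq are installed).
# Usage: tts_sketch "text to speak" reply.mp3 [model]
tts_sketch() {
  text=$1
  out=$2
  # jq escapes quotes and newlines in the text so the body stays valid JSON.
  body=$(jq -n --arg text "$text" '{text: $text}') || return 1
  curl -sS -X POST "$(deepgram_speak_url "${3:-}")" \
    -H "Authorization: Token ${DEEPGRAM_API_KEY}" \
    -H "Content-Type: application/json" \
    -d "$body" \
    -o "$out"
}
```

Passing a third argument selects a different voice; leaving it out keeps the default model.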
References
Read these files when needed:
- references/stt-notes.md for transcription details
- references/tts-notes.md for speech synthesis details
- references/pipeline-notes.md for end-to-end pipeline behavior
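Deepgram Voice Workflow
Overview
Use this skill for a complete speech workflow:
- transcribe audio to text with Deepgram STT
- optionally synthesize a spoken reply with Deepgram TTS
- return structured outputs that can feed chat or agent pipelines
This skill is the right choice when the task goes beyond plain transcription and needs an input-audio to output-audio pipeline.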
Quick Start
Transcribe only
{baseDir}/scripts/deepgram-transcribe.sh /path/to/audio.ogg
Generate speech from text
{baseDir}/scripts/deepgram-tts.sh "Hello, I am Neko."
Run the full pipeline
{baseDir}/scripts/neko-voice-pipeline.sh /path/to/audio.ogg --reply "Got it, this is a voice reply test."
Environment
Set DEEPGRAM_API_KEY before use.
The bundled scripts also fall back to reading it from:
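Workflow Decision
Use deepgram-transcribe.sh when
- only text transcription is needed
- the downstream system will generate its own reply
- the task is speech-to-text only
Use deepgram-tts.sh when
- text already exists
- only an MP3 spoken response is needed
- the workflow is text-to-speech only
Use neko-voice-pipeline.sh when
- the task begins with an audio file
- a transcript is needed
- an optional spoken reply should be generated in the same flow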
Outputs
STT output
deepgram-transcribe.sh writes:
- transcript text file
- raw API JSON file (in the same directory)
TTS output
deepgram-tts.sh writes:
Pipeline output
neko-voice-pipeline.sh prints JSON with:
- outdir
- transcriptpath
- transcript
- replyaudiopath
This makes it easy to wire into scripts or adapters.
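Typical Uses
Prefer this skill for:
- transcribing Telegram/QQ/OneBot voice messages
- generating MP3 replies to short voice prompts
- building bot-side voice input/output automation
- testing speech pipelines from shell without introducing a full SDK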
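Notes
- Defaults are tuned for lightweight practical use, not maximal configurability.
- deepgram-transcribe.sh defaults to model=nova-2 and language=zh.
- deepgram-tts.sh defaults to model=aura-2-luna-en; override the model when a different voice is preferred.
- Inspect the raw JSON transcript response when debugging recognition quality or API errors.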
References
Read these files when needed:
- references/stt-notes.md for transcription details
- references/tts-notes.md for speech synthesis details
- references/pipeline-notes.md for end-to-end pipeline behavior