Local Voice Reply
Use this skill to turn text into a cloned/custom-voice audio reply and deliver it reliably to Feishu or Discord.
Structured skill definition
- Purpose: local low-latency voice replies in Opus/Ogg.
- Channels: Feishu + Discord.
- Default voice: juno (reference file: voice/juno_ref.wav).
- Custom voice modes:
  1) File-based: replace/update voice/juno_ref.wav.
  2) Registry-based: upload/register voices via POST /voice/register, then call by voice_name.
- Output: .opus (Ogg container) under .openclaw/media/outbound/voice-server-v3/ (or TARVIS_VOICE_OUTPUT_DIR).
- Control scripts:
  - scripts/send_voice_reply.ps1 (server API path)
  - scripts/generate_cuda_voice.ps1 (stable local CUDA generation path)
Server implementation is kept with the skill (not workspace root):
- server/voice_server_v3.py (FastAPI routes)
- server/voice_engine.py (generation and cache engine)
Voice assets are also colocated with the skill:
Runtime requirements
- ffmpeg must be installed and available on PATH (required for Opus encoding).
- Python packages required by the server:
  - fastapi
  - uvicorn
  - python-multipart
  - chatterbox-tts
  - torch
  - torchaudio
  - numpy
- On first startup, ChatterboxTTS.from_pretrained() may download model assets, so the initial run can require network access and additional disk space.
- Optional env vars:
  - TARVIS_VOICE_OUTPUT_DIR to override where generated Opus files are written.
  - TARVIS_VOICE_DEVICE to force device selection (cuda/gpu, mps, or cpu).
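The requirements above can be checked before starting the server. The following is a minimal preflight sketch, not part of the skill itself: the function names are hypothetical, and it only verifies that ffmpeg is discoverable on PATH and that the listed distributions are installed. It does not check versions or GPU availability.

```python
import shutil
from importlib import metadata

# Distribution names exactly as listed in the runtime requirements.
REQUIRED_DISTRIBUTIONS = [
    "fastapi",
    "uvicorn",
    "python-multipart",
    "chatterbox-tts",
    "torch",
    "torchaudio",
    "numpy",
]


def missing_distributions(names=REQUIRED_DISTRIBUTIONS):
    """Return the distributions from `names` that are not installed."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


def ffmpeg_on_path():
    """Return True if an ffmpeg executable is discoverable on PATH."""
    return shutil.which("ffmpeg") is not None


if __name__ == "__main__":
    print("ffmpeg on PATH:", ffmpeg_on_path())
    print("missing packages:", missing_distributions())
```

Running the file directly prints both results, so it can serve as a quick sanity check before launching uvicorn.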
Persistence behavior
- Uploaded voice samples from POST /voice/register are persisted under server/voices/.
- Cache and registry data are persisted under server/voice_cache/.
- Generated Opus outputs are written under .openclaw/media/outbound/voice-server-v3/ by default (or TARVIS_VOICE_OUTPUT_DIR when set).
- POST /output/cleanup only deletes staged .opus files inside the configured output directory and their .json sidecar files.
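The output-directory rule above can be sketched as a small resolver. This is an illustration only, not the server's code: resolve_output_dir is a hypothetical helper, and it assumes the default path is relative to the user's home directory, as the example path in the workflow section suggests.

```python
import os
from pathlib import Path

# Default location described above, assumed to be relative to the home directory.
DEFAULT_SUBDIR = Path(".openclaw") / "media" / "outbound" / "voice-server-v3"


def resolve_output_dir(env=None, base=None):
    """Return TARVIS_VOICE_OUTPUT_DIR when set, otherwise the default directory."""
    env = os.environ if env is None else env
    override = env.get("TARVIS_VOICE_OUTPUT_DIR", "").strip()
    if override:
        return Path(override)
    base = Path.home() if base is None else Path(base)
    return base / DEFAULT_SUBDIR
```

Passing env and base explicitly keeps the helper easy to test without touching the real environment.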
Use this workflow
1. Ensure the local v3.3 TTS server is running from this skill folder:
   - python -m uvicorn --app-dir server voice_server_v3:app --host 127.0.0.1 --port 8000
2. Call /speak with text (and optional speed, exaggeration, cfg).
   - voice_name defaults to juno.
3. Receive Opus directly from the server (audio/ogg) in the Juno voice.
4. Save the final media into the allowed path:
   - C:\Users\hanli\.openclaw\media\outbound\
5. Send with the message tool:
   - action=send
   - filePath=<allowed-path>
   - asVoice=true
   - For Feishu: channel=feishu
   - For Discord: channel=discord
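Steps 2 to 5 can be sketched as a small Python client. Treat this as a sketch under stated assumptions: this document does not specify the request shape of /speak, so the JSON POST body and its field names are guesses based on the parameter names above, and all three helper names are hypothetical. The message-tool arguments mirror step 5.

```python
import json
import urllib.request
from pathlib import Path

SERVER = "http://127.0.0.1:8000"  # host and port from step 1


def build_speak_payload(text, voice_name="juno", speed=None, exaggeration=None, cfg=None):
    """Build the request body; optional parameters are omitted when not given."""
    payload = {"text": text, "voice_name": voice_name}
    for key, value in (("speed", speed), ("exaggeration", exaggeration), ("cfg", cfg)):
        if value is not None:
            payload[key] = value
    return payload


def fetch_voice_reply(text, out_path, **options):
    """POST to /speak (assumed JSON body) and write the returned Ogg/Opus bytes."""
    body = json.dumps(build_speak_payload(text, **options)).encode("utf-8")
    request = urllib.request.Request(
        SERVER + "/speak",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(request, timeout=120) as response:
        out_path.write_bytes(response.read())
    return out_path


def build_send_args(file_path, channel):
    """Arguments for the message tool, as listed in step 5."""
    if channel not in ("feishu", "discord"):
        raise ValueError("channel must be 'feishu' or 'discord'")
    return {"action": "send", "filePath": str(file_path), "asVoice": True, "channel": channel}
```

A typical call would be fetch_voice_reply("Hello", out_path) with out_path inside the allowed outbound folder, followed by build_send_args(out_path, "feishu") to prepare the send.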
Voice customization guide
A) Replace default Juno reference
1. Replace voice/juno_ref.wav with your target reference voice sample.
2. Keep the sample clean (single speaker, low noise, clear pronunciation).
3. Restart the server and test with voice_name=juno.
B) Register additional named voices
1. Call POST /voice/register with a reference sample and target voice_name.
2. Confirm registration under server/voices/.
3. Generate with that voice_name in /speak or /speak_stream.
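Registration is a file upload, so a client needs to send a multipart/form-data request. The sketch below builds one with the standard library only. The form field names ("voice_name" for the name, "file" for the sample) are assumptions, since this document does not state them; check server/voice_server_v3.py for the real ones. Both helper names are hypothetical.

```python
import urllib.request
import uuid
from pathlib import Path

SERVER = "http://127.0.0.1:8000"


def encode_multipart(fields, file_field, filename, file_bytes, content_type="audio/wav"):
    """Encode text fields plus one file as a multipart/form-data body."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n{value}\r\n'.encode("utf-8")
        )
    parts.append(
        (
            f'--{boundary}\r\nContent-Disposition: form-data; name="{file_field}"; '
            f'filename="{filename}"\r\nContent-Type: {content_type}\r\n\r\n'
        ).encode("utf-8")
    )
    parts.append(file_bytes)
    parts.append(f"\r\n--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def register_voice(sample_path, voice_name):
    """Upload a reference sample to POST /voice/register (field names assumed)."""
    sample_path = Path(sample_path)
    body, header = encode_multipart(
        {"voice_name": voice_name}, "file", sample_path.name, sample_path.read_bytes()
    )
    request = urllib.request.Request(
        SERVER + "/voice/register",
        data=body,
        headers={"Content-Type": header},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.read()
```

After a successful upload, the sample should appear under server/voices/ as described in step 2.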
Defaults
- voice_name: juno
- speed: 1.2
- Output format: Opus in Ogg container from server /speak (no post-conversion)
- Discord compatibility: Ogg/Opus is supported and can be sent as voice/audio with asVoice=true
Speed Improvements In This Version
- Caches model capability lookups once at startup.
- Uses torch.inference_mode() during synthesis to reduce overhead.
- Reuses phrase cache for both /speak and /speak_stream.
- Improves chunking behavior for long CJK text to avoid oversized chunks.
- Keeps latency metrics for benchmarking and tuning.
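To illustrate the CJK chunking point, here is one possible way to keep chunks bounded. It is not the engine's actual algorithm: the chunk_text name, the punctuation set, and the max_chars value are all assumptions. It splits on sentence-ending punctuation, packs sentences into chunks up to a limit, and hard-splits any single sentence that exceeds the limit, which matters for CJK text because it has no spaces to break on.

```python
import re

# A run of non-terminator characters followed by any trailing terminators.
# The terminator set (CJK and ASCII, excluding ".") is an assumption.
_SENTENCE = re.compile(r"[^。！？!?；;\n]+[。！？!?；;\n]*")


def chunk_text(text, max_chars=80):
    """Split text into chunks of at most max_chars characters."""
    chunks = []
    current = ""

    def flush():
        nonlocal current
        if current.strip():
            chunks.append(current.strip())
        current = ""

    for sentence in _SENTENCE.findall(text):
        # Hard-split a sentence that is longer than the limit on its own.
        while len(sentence) > max_chars:
            flush()
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        # Start a new chunk if this sentence would overflow the current one.
        if len(current) + len(sentence) > max_chars:
            flush()
        current += sentence
    flush()
    return chunks
```

Smaller chunks generally mean the first audio arrives sooner, at the cost of more synthesis calls, so the limit is worth tuning against the latency metrics mentioned above.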
Common failure and fix
- Error: LocalMediaAccessError ... path-not-allowed
- Fix: copy the file into .openclaw/media/outbound before sending.
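The fix can be automated with a small helper that copies a file into the allowed folder only when it is not already there. This is a sketch: ensure_in_allowed_dir is a hypothetical name, and the allowed directory is passed in by the caller rather than discovered.

```python
import shutil
from pathlib import Path


def ensure_in_allowed_dir(src, allowed_dir):
    """Return a path inside allowed_dir, copying src there if needed."""
    src = Path(src).resolve()
    allowed_dir = Path(allowed_dir).resolve()
    # Already somewhere under the allowed directory: nothing to do.
    if allowed_dir in src.parents:
        return src
    allowed_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(src, allowed_dir / src.name))
```

Pass the returned path as filePath when sending, so the message tool only ever sees files under the allowed directory.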
Script
Use scripts/send_voice_reply.ps1 to generate Opus directly with defaults (voice_name=juno, speed=1.2).
It auto-selects /speak_stream for longer text (or when -Stream is passed) for better throughput.
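The endpoint selection described above can be expressed as a one-line rule. The sketch below is in Python rather than PowerShell for illustration; choose_endpoint is a hypothetical name and the 200-character threshold is a placeholder, since this document does not state the script's actual cutoff.

```python
def choose_endpoint(text, stream=False, stream_threshold=200):
    """Pick /speak_stream for long text or when streaming is forced."""
    if stream or len(text) > stream_threshold:
        return "/speak_stream"
    return "/speak"
```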
For stable CUDA generation command patterns under stricter exec approval policies, use:
- scripts/generate_cuda_voice.ps1 -Text ...
This keeps the outer command shape fixed, so an allow-always approval is more reusable.