Local Voice Reply

Use this skill to turn text into a cloned/custom-voice audio reply and deliver it reliably to Feishu or Discord.

Structured skill definition

- Purpose: local low-latency voice replies in Opus/Ogg.
- Channels: Feishu + Discord.
- Default voice: juno (reference file: voice/juno_ref.wav).
- Custom voice modes:
  1) File-based: replace/update voice/juno_ref.wav.
  2) Registry-based: upload/register voices via POST /voice/register, then call by voice_name.
- Output: .opus (Ogg container) under .openclaw/media/outbound/voice-server-v3/ (or TARVIS_VOICE_OUTPUT_DIR).
- Control scripts:
  - scripts/send_voice_reply.ps1 (server API path)
  - scripts/generate_cuda_voice.ps1 (stable local CUDA generation path)

Server implementation is kept with the skill (not workspace root):

- server/voice_server_v3.py (FastAPI routes)
- server/voice_engine.py (generation and cache engine)

Voice assets are also colocated with the skill:

- voice/

Runtime requirements

- ffmpeg must be installed and available on PATH (required for Opus encoding).
- Python packages required by the server:
  - fastapi
  - uvicorn
  - python-multipart
  - chatterbox-tts
  - torch
  - torchaudio
  - numpy
- On first startup, ChatterboxTTS.from_pretrained() may download model assets, so the initial run can require network access and additional disk space.
- Optional env vars:
  - TARVIS_VOICE_OUTPUT_DIR to override where generated Opus files are written.
  - TARVIS_VOICE_DEVICE to force device selection (cuda/gpu, mps, or cpu).
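The requirements above can be checked before starting the server. The following is a minimal preflight sketch, not part of the skill itself. The module names mapped to each pip package are assumptions (pip names and import names do not always match), and treating gpu as an alias for cuda is an assumed reading of the "cuda/gpu" wording.

```python
import importlib.util
import shutil

# Assumed import names for each pip package listed above.
REQUIRED = {
    "fastapi": ("fastapi",),
    "uvicorn": ("uvicorn",),
    "python-multipart": ("python_multipart", "multipart"),
    "chatterbox-tts": ("chatterbox",),
    "torch": ("torch",),
    "torchaudio": ("torchaudio",),
    "numpy": ("numpy",),
}


def missing_packages(required=REQUIRED):
    """Return pip package names for which no candidate module can be found."""
    return [
        package
        for package, modules in required.items()
        if not any(importlib.util.find_spec(m) is not None for m in modules)
    ]


def missing_tools(tools=("ffmpeg",)):
    """Return the executables that are not available on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def requested_device(env):
    """Read TARVIS_VOICE_DEVICE; None means no device was forced.

    Assumption: "gpu" is treated as an alias for "cuda".
    """
    value = (env.get("TARVIS_VOICE_DEVICE") or "").strip().lower()
    if not value:
        return None
    if value == "gpu":
        return "cuda"
    if value in ("cuda", "mps", "cpu"):
        return value
    raise ValueError(f"unsupported TARVIS_VOICE_DEVICE: {value!r}")
```

Calling missing_packages() and missing_tools() with no arguments reports what still needs to be installed, and requested_device(os.environ) shows which device, if any, has been forced.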

Persistence behavior

- Uploaded voice samples from POST /voice/register are persisted under server/voices/.
- Cache and registry data are persisted under server/voice_cache/.
- Generated Opus outputs are written under .openclaw/media/outbound/voice-server-v3/ by default (or TARVIS_VOICE_OUTPUT_DIR when set).
- POST /output/cleanup only deletes staged .opus files inside the configured output directory and their .json sidecar files.
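To illustrate the output-directory override and the narrow scope of the cleanup, here is a hedged sketch. It is not the server's actual code. The function names are hypothetical, the base directory for the relative default path is not stated in this document, and the sidecar naming (reply.opus paired with reply.json) is an assumption.

```python
import os
from pathlib import Path

# Documented default; which base directory it is relative to is not stated here.
DEFAULT_OUTPUT_DIR = Path(".openclaw/media/outbound/voice-server-v3")


def resolve_output_dir(env=os.environ):
    """Use TARVIS_VOICE_OUTPUT_DIR when set, otherwise the documented default."""
    override = env.get("TARVIS_VOICE_OUTPUT_DIR")
    return Path(override) if override else DEFAULT_OUTPUT_DIR


def cleanup_staged_outputs(output_dir):
    """Delete *.opus files directly inside output_dir plus their .json sidecars.

    Assumption: a sidecar shares the audio file's stem (reply.opus -> reply.json).
    Anything else in the directory is left untouched. Returns the deleted paths.
    """
    output_dir = Path(output_dir)
    deleted = []
    if not output_dir.is_dir():
        return deleted
    for audio in sorted(output_dir.glob("*.opus")):
        sidecar = audio.with_suffix(".json")
        for target in (audio, sidecar):
            if target.is_file():
                target.unlink()
                deleted.append(target)
    return deleted
```

Because the glob is not recursive and only matches .opus, files under server/voices/ and server/voice_cache/ are never candidates for deletion in this sketch.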

Use this workflow

1. Ensure the local v3.3 TTS server is running from this skill folder:

- python -m uvicorn --app-dir server voice_server_v3:app --host 127.0.0.1 --port 8000

2. Call /speak with text (and optional speed, exaggeration, cfg).

- voice_name defaults to juno.

3. Receive Opus directly from the server (audio/ogg) in the Juno voice.
4. Save the final media into the allowed path:

- C:\Users\hanli\.openclaw\media\outbound\

5. Send with the message tool:

- action=send
- filePath=<allowed-path>
- asVoice=true
- For Feishu: channel=feishu
- For Discord: channel=discord
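Steps 2 to 4 of this workflow can be sketched as a small Python client. This is an illustration under assumptions, not the skill's own client (that is scripts/send_voice_reply.ps1). In particular, sending the parameters as a JSON body is assumed, since this document names the parameters but not their encoding. The helper names and the reply.opus file name are hypothetical.

```python
import json
import urllib.request
from pathlib import Path

SERVER = "http://127.0.0.1:8000"  # host and port from the uvicorn command in step 1


def build_speak_request(text, voice_name="juno", speed=1.2, exaggeration=None, cfg=None):
    """Build a POST /speak request (assumed JSON body, documented field names)."""
    payload = {"text": text, "voice_name": voice_name, "speed": speed}
    if exaggeration is not None:
        payload["exaggeration"] = exaggeration
    if cfg is not None:
        payload["cfg"] = cfg
    return urllib.request.Request(
        f"{SERVER}/speak",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def save_reply(audio_bytes, outbound_dir, name="reply.opus"):
    """Write the Ogg/Opus bytes into the allowed outbound directory."""
    outbound_dir = Path(outbound_dir)
    outbound_dir.mkdir(parents=True, exist_ok=True)
    target = outbound_dir / name
    target.write_bytes(audio_bytes)
    return target


def speak_to_file(text, outbound_dir, timeout=120):
    """Call /speak and save the audio; requires the local server to be running."""
    with urllib.request.urlopen(build_speak_request(text), timeout=timeout) as resp:
        return save_reply(resp.read(), outbound_dir)
```

The path returned by speak_to_file is what would be passed as filePath in step 5, together with action=send and asVoice=true.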

Voice customization guide

A) Replace default Juno reference

1. Replace voice/juno_ref.wav with your target reference voice sample.
2. Keep the sample clean (single speaker, low noise, clear pronunciation).
3. Restart the server and test with voice_name=juno.

B) Register additional named voices

1. Call POST /voice/register with a reference sample and target voice_name.
2. Confirm registration under server/voices/.
3. Generate with that voice_name in /speak or /speak_stream.
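The registration call in option B could look roughly like the sketch below, which uses only the Python standard library. The multipart layout is an assumption: this document does not specify the upload field names, so the file field name ("file") is hypothetical, and voice_name is taken from the parameter named above. Check server/voice_server_v3.py for the real form fields before relying on it.

```python
import urllib.request
import uuid
from pathlib import Path


def encode_multipart(fields, file_field, filename, file_bytes, content_type="audio/wav"):
    """Encode text fields plus one file as multipart/form-data.

    Returns (body_bytes, content_type_header_value).
    """
    boundary = uuid.uuid4().hex
    parts = []
    for key, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{key}"\r\n\r\n{value}\r\n'.encode("utf-8")
        )
    parts.append(
        (
            f'--{boundary}\r\nContent-Disposition: form-data; name="{file_field}"; '
            f'filename="{filename}"\r\nContent-Type: {content_type}\r\n\r\n'
        ).encode("utf-8")
    )
    parts.append(file_bytes)
    parts.append(f"\r\n--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def build_register_request(voice_name, sample_path, server="http://127.0.0.1:8000"):
    """Build a POST /voice/register request for a local reference sample."""
    sample_path = Path(sample_path)
    body, header = encode_multipart(
        {"voice_name": voice_name},
        "file",  # hypothetical field name
        sample_path.name,
        sample_path.read_bytes(),
    )
    return urllib.request.Request(
        f"{server}/voice/register",
        data=body,
        headers={"Content-Type": header},
        method="POST",
    )
```

Passing the result to urllib.request.urlopen sends the upload. After that, the same voice_name can be used in /speak or /speak_stream.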

Defaults

- voice_name: juno
- speed: 1.2
- Output format: Opus in Ogg container from server /speak (no post-conversion)
- Discord compatibility: Ogg/Opus is supported and can be sent as voice/audio with asVoice=true

Speed Improvements In This Version

- Caches model capability lookups once at startup.
- Uses torch.inference_mode() during synthesis to reduce overhead.
- Reuses phrase cache for both /speak and /speak_stream.
- Improves chunking behavior for long CJK text to avoid oversized chunks.
- Keeps latency metrics for benchmarking and tuning.
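Two of these items, CJK-aware chunking and a phrase cache shared across endpoints, can be illustrated with a short sketch. This is one possible approach and not the code in server/voice_engine.py. The 60-character limit, the cache-key layout, and all function names are assumptions made for the example.

```python
import hashlib
import re

# Split after CJK or ASCII sentence/clause punctuation, keeping the punctuation.
_SPLIT = re.compile(r"(?<=[。！？；，、.!?;,])")


def chunk_text(text, max_chars=60):
    """Split text into chunks of at most max_chars, preferring punctuation breaks."""
    pieces = []
    for part in _SPLIT.split(text):
        # Hard-split any punctuation-free run that is still too long.
        for i in range(0, len(part), max_chars):
            pieces.append(part[i:i + max_chars])
    chunks, current = [], ""
    for piece in pieces:
        if current and len(current) + len(piece) > max_chars:
            chunks.append(current)
            current = piece
        else:
            current += piece
    if current:
        chunks.append(current)
    return chunks


def phrase_cache_key(text, voice_name="juno", speed=1.2):
    """Derive a stable key so identical phrases map to the same cache entry."""
    raw = f"{voice_name}|{speed}|{text}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


def synthesize_cached(chunk, cache, synth, voice_name="juno", speed=1.2):
    """Return cached audio for a chunk, calling synth(chunk) only on a miss."""
    key = phrase_cache_key(chunk, voice_name, speed)
    if key not in cache:
        cache[key] = synth(chunk)
    return cache[key]
```

Since CJK text has no spaces, splitting on whitespace alone would leave a long passage as one oversized chunk. Breaking on full-width punctuation and then hard-splitting any remainder keeps every chunk within the limit. Passing the same cache dictionary from both the /speak and /speak_stream handlers is what lets a phrase synthesized by one be reused by the other.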

Common failure and fix

- Error: LocalMediaAccessError ... path-not-allowed
- Fix: copy the file into .openclaw/media/outbound before sending.
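The fix can be automated with a small helper that copies a file into the allowed directory only when it is not already there. This is a sketch with a hypothetical function name. It assumes that any path underneath the outbound directory counts as allowed, which this document does not state explicitly.

```python
import shutil
from pathlib import Path


def ensure_in_outbound(src, outbound_dir):
    """Return a path to src that lives under outbound_dir, copying if needed."""
    src = Path(src).resolve()
    outbound_dir = Path(outbound_dir).resolve()
    if outbound_dir in src.parents:
        return src  # already inside the allowed directory tree
    outbound_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(src, outbound_dir / src.name))
```

Using the returned path as filePath avoids the path-not-allowed error for files generated elsewhere, such as those written to a custom TARVIS_VOICE_OUTPUT_DIR.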

Script

Use scripts/send_voice_reply.ps1 to generate Opus directly with defaults (voice_name=juno, speed=1.2).
It auto-selects /speak_stream for longer text (or when -Stream is passed) for better throughput.
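The endpoint selection described above amounts to a length check plus an override. The script itself is PowerShell; the sketch below restates the idea in Python for illustration only. The 200-character threshold is a hypothetical value, since the script's real cutoff is not given in this document.

```python
STREAM_THRESHOLD_CHARS = 200  # hypothetical cutoff, not taken from the script


def pick_endpoint(text, force_stream=False, threshold=STREAM_THRESHOLD_CHARS):
    """Choose /speak_stream for long text or when streaming is forced."""
    if force_stream or len(text) > threshold:
        return "/speak_stream"
    return "/speak"
```

Here force_stream plays the role of the -Stream switch.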

For stable CUDA generation command patterns under stricter exec approval policies, use:

- scripts/generate_cuda_voice.ps1 -Text ...

This keeps the outer command shape fixed so allow-always is more reusable.
