Turn Spoken Words Into Embedded, Searchable Captions
Most captioning tools hand you a raw transcript and leave the placement, timing, and styling entirely up to you. The video-caption-generator skill takes a different approach — you describe what you want in plain language, and the system handles transcription, alignment, and rendering in one continuous step.
The skill works through a back-and-forth conversation. You can specify caption style, font size, line length, speaker labels, or language preferences simply by typing your intent. If the first pass doesn't match your expectations, you refine it with a follow-up message rather than digging through export menus or re-uploading the file. This iterative model means the output improves with each exchange rather than locking you into a single automated decision.
Under the hood, the OpenClaw agent coordinates audio extraction, speech-to-text processing, timing synchronization, and caption rendering into a single pipeline. The agent interprets your conversational instructions and translates them into precise rendering parameters, so the gap between what you ask for and what you receive stays narrow. Whether you are adding accessibility subtitles to a lecture recording or styling captions for a social media reel, the same conversational interface handles both without requiring separate tools or manual track editing.
Environment Variables
| Variable | Required | Default |
|---|---|---|
| NEMO_TOKEN | No | Auto-generated (100 free credits, expires in 7 days, revocable via Settings → API Tokens) |
| NEMO_API_URL | No | https://mega-api-prod.nemovideo.ai |
| NEMO_WEB_URL | No | https://nemovideo.com |
| NEMO_CLIENT_ID | No | Auto-generated UUID, persisted to ~/.config/nemovideo/client_id (UUID only, no secrets) |
| SKILL_SOURCE | No | Auto-detected from install path, fallback unknown |
If NEMO_TOKEN is not set, get one (requires X-Client-Id header):
# Generate or read persisted Client-Id
CLIENT_ID="${NEMO_CLIENT_ID:-$(cat ~/.config/nemovideo/client_id 2>/dev/null)}"
if [ -z "$CLIENT_ID" ]; then
  CLIENT_ID=$(uuidgen 2>/dev/null || echo "client-$(date +%s)-$RANDOM")
  # Use && (not &) so the file is written only after the directory exists
  mkdir -p ~/.config/nemovideo && echo "$CLIENT_ID" > ~/.config/nemovideo/client_id
fi
curl -s -X POST "https://mega-api-prod.nemovideo.ai/api/auth/anonymous-token" -H "X-Client-Id: $CLIENT_ID"
# → {"code":0,"data":{"token":"nmv_usr_xxx","credits":100,...}}
Save token as NEMO_TOKEN, CLIENT_ID as NEMO_CLIENT_ID. Anonymous: 1 token per client per 7 days; token expires in 7 days and can be revoked at any time via Settings → API Tokens on nemovideo.com. If your token expires, request a new one with the same X-Client-Id.
Local persistence: This skill writes ~/.config/nemovideo/client_id to persist the Client-Id across sessions. This avoids generating a new ID on every request, which would hit the per-IP rate limit quickly (default 10 tokens per 7 days per IP). The file contains only a UUID — no credentials are stored locally.
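The save step can be sketched as follows. This is a minimal sketch that assumes the response is compact JSON shaped like the example above; extract_token is a hypothetical helper name, not part of the NemoVideo API, and a real JSON parser would be more robust than sed.

```shell
# Sketch: pull the token out of the anonymous-token response without jq.
# extract_token is a hypothetical helper, not part of the NemoVideo API.
extract_token() {
  # Reads the JSON response on stdin and prints the value of "token" (empty if absent).
  sed -n 's/.*"token"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Usage (commented out so the snippet makes no network call when sourced):
# RESPONSE=$(curl -s -X POST "https://mega-api-prod.nemovideo.ai/api/auth/anonymous-token" \
#   -H "X-Client-Id: $CLIENT_ID")
# export NEMO_TOKEN="$(printf '%s' "$RESPONSE" | extract_token)"
# export NEMO_CLIENT_ID="$CLIENT_ID"
```

An empty result means the response carried no token, in which case the code field of the response is worth checking before retrying.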
2. Routing Incoming Requests
Use the table below to determine which endpoint handles each type of user request.
| User says... | Action | Skip SSE |
|---|---|---|
| "export" / "导出" / "download" / "send me the video" | → §3.5 Export | ✅ |
| "credits" / "积分" / "balance" / "余额" | → §3.3 Credits | ✅ |
| "status" / "状态" / "show tracks" | → §3.4 State | ✅ |
| "upload" / "上传" / user sends file | → §3.2 Upload | ✅ |
| Everything else (generate, edit, add BGM…) | → §3.1 SSE | ❌ |
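The routing table can be approximated with a keyword match. The sketch below is illustrative only: route_request and its return labels are hypothetical names, matching is plain substring matching, and the "user sends file" case has to be detected by the caller because it is not visible in the message text.

```shell
# Sketch: map a user message to a route label, following the table row by row.
# route_request and the labels it prints are hypothetical names.
route_request() {
  local msg
  # Lowercase ASCII letters only, so Chinese keywords pass through untouched.
  msg=$(printf '%s' "$1" | LC_ALL=C tr 'A-Z' 'a-z')
  case "$msg" in
    *export*|*导出*|*download*|*"send me the video"*) echo "export" ;;   # §3.5
    *credits*|*积分*|*balance*|*余额*)                echo "credits" ;;  # §3.3
    *status*|*状态*|*"show tracks"*)                  echo "state" ;;    # §3.4
    *upload*|*上传*)                                  echo "upload" ;;   # §3.2
    *)                                                echo "sse" ;;      # §3.1
  esac
}
```

Rows are checked top to bottom, so a message that mentions both "export" and "status" is routed to export.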
3. Primary Workflow Sequences
$API = ${NEMO_API_URL:-https://mega-api-prod.nemovideo.ai}, $TOKEN = ${NEMO_TOKEN}, $WEB = ${NEMO_WEB_URL:-https://nemovideo.com}.
Skill attribution variables — do NOT hardcode:
- $SKILL_NAME = this file's YAML frontmatter name
- $SKILL_VERSION = this file's YAML frontmatter version
- $SKILL_SOURCE = detect at runtime using the first match:
  1. Env SKILL_SOURCE if set
  2. If this file's path contains a known platform directory (e.g. ~/.clawhub/ → clawhub, ~/.cursor/skills/ → cursor)
  3. Fallback: unknown
CRITICAL: ALL API requests (including render/export/upload/state/credits) MUST include these headers. Missing them will cause export to fail with 402.
X-Skill-Source: $SKILL_NAME
X-Skill-Version: $SKILL_VERSION
X-Skill-Platform: $SKILL_SOURCE
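The runtime detection of the platform value could look like the sketch below. It assumes only the two platform directories named above; detect_skill_source and skill_headers are hypothetical helper names, and the path argument is whatever location the skill file was actually installed to.

```shell
# Sketch: resolve the platform label using the first-match order described above.
# detect_skill_source and skill_headers are hypothetical helper names.
detect_skill_source() {
  # $1: path to this skill file
  if [ -n "${SKILL_SOURCE:-}" ]; then
    printf '%s\n' "$SKILL_SOURCE"            # 1. explicit env override
    return 0
  fi
  case "$1" in
    */.clawhub/*)       echo "clawhub" ;;    # 2. known platform directories
    */.cursor/skills/*) echo "cursor" ;;
    *)                  echo "unknown" ;;    # 3. fallback
  esac
}

skill_headers() {
  # $1: skill name, $2: skill version, $3: platform. Prints one header per line.
  printf 'X-Skill-Source: %s\nX-Skill-Version: %s\nX-Skill-Platform: %s\n' "$1" "$2" "$3"
}
```

Keeping the header construction in one place makes it harder to forget a header on a single endpoint, which matters because every request is expected to carry all three.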
3.0 Initialize a Session
curl -s -X POST "https://mega-api-prod.nemovideo.ai/api/tasks/me/with-session/nemo_agent" \
-H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
-H "X-Skill-Source: $SKILL_NAME" -H "X-Skill-Version: $SKILL_VERSION" -H "X-Skill-Platform: $SKILL_SOURCE" \
-d '{"task_name":"project","language":"<lang>"}'
# → {"code":0,"data":{"task_id":"...","session_id":"..."}}
Before any other operation, a session must be established to obtain a valid session identifier. All subsequent requests depend on this identifier being present and active.
Open in browser: After creating a session, give the user a link to view/edit the task in NemoVideo:
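$WEB/workspace/claim?task={task_id}&session={session_id}&skill_name=$SKILL_NAME&skill_version=$SKILL_VERSION&skill_source=$SKILL_SOURCE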
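Every later call needs the identifiers returned by the create-session request. A minimal sketch for extracting them, assuming a compact JSON response shaped like the example in this section; json_string_field is a hypothetical helper and a proper JSON parser is preferable where one is available.

```shell
# Sketch: read a string field from a compact JSON response without jq.
# json_string_field is a hypothetical helper name.
json_string_field() {
  # $1: field name. Reads JSON on stdin and prints that field's string value.
  sed -n 's/.*"'"$1"'"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Usage, with RESPONSE holding the create-session output:
# TASK_ID=$(printf '%s' "$RESPONSE" | json_string_field task_id)
# SESSION_ID=$(printf '%s' "$RESPONSE" | json_string_field session_id)
```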
3.1 Transmit Messages Using SSE
curl -s -X POST "https://mega-api-prod.nemovideo.ai/run_sse" \
-H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
-H "Accept: text/event-stream" -H "X-Skill-Source: $SKILL_NAME" -H "X-Skill-Version: $SKILL_VERSION" -H "X-Skill-Platform: $SKILL_SOURCE" --max-time 900 \
-d '{"app_name":"nemo_agent","user_id":"me","session_id":"<sid>","new_message":{"parts":[{"text":"<msg>"}]}}'
All conversational messages are delivered to the backend through a Server-Sent Events channel.
SSE Handling
| Event | Action |
|---|---|
| Text response | Apply GUI translation (§4), present to user |
| Tool call/result | Wait silently, don't forward |
| heartbeat / empty data: | Keep waiting. Every 2 min: "⏳ Still working..." |
| Stream closes | Process final response |
Typical durations: text 5-15s, video generation 100-300s, editing 10-30s.
Timeout: 10 min heartbeats-only → assume timeout. Never re-send during generation (duplicates + double-charge).
Ignore trailing "I encountered a temporary issue" if prior responses were normal.
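The "keep waiting on heartbeat or empty data" rule can be sketched as a stream filter. This assumes the standard event-stream framing in which payload lines start with data:; the exact shape of the backend's heartbeat events is not documented here, so the sketch simply drops every line that is not a non-empty data: line. sse_payloads is a hypothetical helper name, and the two-minute progress message is left to the caller.

```shell
# Sketch: print only non-empty data: payloads from an SSE stream on stdin.
# sse_payloads is a hypothetical helper name.
sse_payloads() {
  # Strip carriage returns first so CRLF-terminated streams parse the same way.
  tr -d '\r' | while IFS= read -r line || [ -n "$line" ]; do
    case "$line" in
      data:*)
        payload=${line#data:}
        payload=${payload# }        # drop the single optional space after the colon
        if [ -n "$payload" ]; then
          printf '%s\n' "$payload"
        fi
        ;;
    esac
  done
}

# Usage: pipe the curl command from this section through the filter,
# e.g. curl -s -N ... "https://mega-api-prod.nemovideo.ai/run_sse" | sse_payloads
```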
Silent Response Fallback (CRITICAL)
Approximately 30% of editing operations return no visible text in the response stream. When this occurs: (1) do not treat the absence of text as a failure, (2) immediately invoke the state query endpoint to retrieve the current job status, (3) surface the resulting status information to the user as confirmation that the operation is progressing or complete.
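A small guard makes the fallback decision explicit. The sketch assumes the caller has already collected all visible text from the stream into one string; needs_state_fallback is a hypothetical helper name.

```shell
# Sketch: decide whether a finished stream counts as a silent response.
# needs_state_fallback is a hypothetical helper name.
needs_state_fallback() {
  # $1: all visible text collected from the stream.
  # Succeeds (exit 0) when that text is empty or whitespace only.
  [ -z "$(printf '%s' "$1" | tr -d '[:space:]')" ]
}

# Usage:
# if needs_state_fallback "$COLLECTED_TEXT"; then
#   ...query the state endpoint (§3.4) and report the result to the user...
# fi
```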
Two-stage generation: When raw video is submitted, the backend automatically triggers a two-stage enrichment pipeline. Stage one processes the raw footage, and stage two appends background music and a title overlay without any additional instruction from the client. Both stages must reach a completed state before the final output is considered ready.
3.2 File Upload Handling
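File upload: curl -s -X POST "https://mega-api-prod.nemovideo.ai/api/upload-video/nemo_agent/me/…" -H "Authorization: Bearer $TOKEN" -H "X-Skill-Source: $SKILL_NAME" -H "X-Skill-Version: $SKILL_VERSION" -H "X-Skill-Platform: $SKILL_SOURCE" …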
URL upload: INLINECODE35
Use me in the path; backend resolves user from token.
Supported: mp4, mov, avi, webm, mkv, jpg, png, gif, webp, mp3, wav, m4a, aac.
The upload endpoint accepts any of the supported video, image, and audio formats listed above and returns a file reference identifier to be used in subsequent captioning requests.
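Checking the extension locally avoids a round trip for files the backend would reject. A minimal sketch based only on the extension list above; is_supported_upload is a hypothetical helper name, and the check looks at the file name only, not the file contents.

```shell
# Sketch: succeed (exit 0) when the file extension is in the supported list.
# is_supported_upload is a hypothetical helper name.
is_supported_upload() {
  local ext
  case "$1" in
    *.*) ;;             # has an extension, keep going
    *)   return 1 ;;    # no extension at all
  esac
  ext=$(printf '%s' "${1##*.}" | LC_ALL=C tr 'A-Z' 'a-z')
  case "$ext" in
    mp4|mov|avi|webm|mkv|jpg|png|gif|webp|mp3|wav|m4a|aac) return 0 ;;
    *) return 1 ;;
  esac
}
```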
3.3 Credit Balance Verification
curl -s "https://mega-api-prod.nemovideo.ai/api/credits/balance/simple" -H "Authorization: Bearer $TOKEN" \
-H "X-Skill-Source: $SKILL_NAME" -H "X-Skill-Version: $SKILL_VERSION" -H "X-Skill-Platform: $SKILL_SOURCE"
# → {"code":0,"data":{"available":XXX,"frozen":XX,"total":XXX}}
Query the credits endpoint prior to processing to confirm the user has a sufficient balance for the requested operation.
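The balance check can be sketched as below. The sketch assumes available is returned as a whole number in compact JSON; the required amount is a caller-supplied estimate, since per-operation costs are not stated here. available_credits and has_enough_credits are hypothetical helper names.

```shell
# Sketch: compare the available balance against an estimated cost.
# Both helper names are hypothetical.
available_credits() {
  # Reads the balance JSON on stdin and prints the integer "available" value.
  sed -n 's/.*"available"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}

has_enough_credits() {
  # $1: balance JSON, $2: credits the caller expects to need (an estimate).
  local available
  available=$(printf '%s' "$1" | available_credits)
  [ -n "$available" ] && [ "$available" -ge "$2" ]
}
```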
3.4 Retrieve Job Status
curl -s "https://mega-api-prod.nemovideo.ai/api/state/nemo_agent/me/<sid>/latest" -H "Authorization: Bearer $TOKEN" \
-H "X-Skill-Source: $SKILL_NAME" -H "X-Skill-Version: $SKILL_VERSION" -H "X-Skill-Platform: $SKILL_SOURCE"
Use me for user in path; backend resolves from token.
Key fields: data.state.draft, data.state.video_infos, data.state.canvas_config, data.state.generated_media.
Draft field mapping: t=tracks, tt=track type (0=video, 1=audio, 7=text), sg=segments, d=duration(ms), m=metadata.
Draft ready for export when draft.t exists with at least one track with non-empty sg.
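The field mapping and the readiness rule can be sketched as follows. The readiness check is a crude text match that assumes compact, single-line JSON in which sg appears only inside tracks; a real JSON parser is the safer choice. track_type_name and draft_has_segments are hypothetical helper names.

```shell
# Sketch: decode the tt codes and test the "ready for export" condition.
# Both helper names are hypothetical.
track_type_name() {
  case "$1" in
    0) echo "video" ;;
    1) echo "audio" ;;
    7) echo "text" ;;
    *) echo "unknown($1)" ;;
  esac
}

draft_has_segments() {
  # Reads the draft JSON on stdin. Succeeds when a "t" key exists and at least
  # one "sg" array has something other than whitespace before its closing bracket.
  local json
  json=$(cat)
  printf '%s' "$json" | grep -q '"t"[[:space:]]*:' &&
    printf '%s' "$json" | grep -q '"sg"[[:space:]]*:[[:space:]]*\[[[:space:]]*[^][:space:]]'
}
```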
Track summary format:
CODEBLOCK6
3.5 Export and Deliver Output
Export does NOT cost credits. Only generation/editing consumes credits.
To deliver the output: (a) confirm the job status is complete, (b) call the export endpoint with the job identifier, (c) specify the desired subtitle format, (d) receive the download URL in the response, (e) present the URL or file directly to the user.
b) Submit: INLINECODE47
Note: sessionId is camelCase (exception). On failure → new id, retry once.
c) Poll (every 30s, max 10 polls): INLINECODE50
Status at top-level status: pending → processing → completed / failed. Download URL at output.url.
d) Download from output.url → send to user. Fallback: https://mega-api-prod.nemovideo.ai/api/render/proxy/<id>/download.
e) When delivering the video, always also give the task detail link: INLINECODE55
Progress messages: start "⏳ Rendering ~30s" → "⏳ 50%" → "✅ Video ready!" + file + task detail link.
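The polling rule (every 30s, at most 10 polls) can be sketched as a bounded loop. The fetch command is supplied by the caller and is hypothetical here: it should wrap the poll request and print its JSON response. The sketch assumes that response is compact JSON containing a single status string; render_status and poll_render are hypothetical helper names.

```shell
# Sketch: poll until the render completes, fails, or the poll budget runs out.
# render_status and poll_render are hypothetical helper names.
render_status() {
  sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

poll_render() {
  # $1: command that prints the poll response JSON (caller-supplied)
  # $2: seconds between polls (default 30), $3: maximum polls (default 10)
  local fetch interval max i status
  fetch=$1
  interval=${2:-30}
  max=${3:-10}
  i=1
  while [ "$i" -le "$max" ]; do
    status=$("$fetch" | render_status)
    case "$status" in
      completed) echo "completed"; return 0 ;;
      failed)    echo "failed";    return 1 ;;
    esac
    if [ "$i" -lt "$max" ]; then
      sleep "$interval"
    fi
    i=$((i + 1))
  done
  echo "timeout"
  return 2
}
```

On completed, the download URL is read from output.url in the same response; on failed, the guidance in this section is to submit once more with a new id.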
3.6 Handling SSE Connection Loss
If the SSE connection drops, follow these five steps to recover: (1) detect the disconnection event and pause any pending UI updates, (2) wait a minimum of two seconds before attempting to reconnect to avoid flooding the server, (3) re-establish the SSE connection using the original session identifier, (4) immediately query the job state endpoint to reconcile any events that were missed during the outage, (5) resume normal operation and notify the user only if the job state has changed in a meaningful way.
4. Interpreting GUI-Side Behavior
The backend operates under the assumption that a graphical interface is present on the client side, so GUI-specific instructions must never be forwarded through the API.
| Backend says | You do |
|---|---|
| "click [button]" / "点击" | Execute via API |
| "open [panel]" / "打开" | Show state via §3.4 |
| "drag/drop" / "拖拽" | Send edit via SSE |
| "preview in timeline" | Show track summary |
| "Export button" / "导出" | Execute §3.5 |
| "check account/billing" | Check §3.3 |
Keep content descriptions. Strip GUI actions.
5. Recommended Interaction Patterns
• Always confirm a session is active before dispatching any message or file upload request.
• When a user asks about progress, poll the state endpoint rather than relying solely on the SSE stream.
• If a silent response is received, proactively fetch job status and relay it to the user without waiting for them to ask.
• Present credit balance information before starting long operations so users can make informed decisions.
• After export, offer the download link immediately and suggest a format if the user has not specified one.
6. Known Constraints and Limitations
• Caption generation is not instantaneous; processing time scales with video length and server load.
• Only supported video formats may be uploaded; unsupported file types will be rejected at the upload stage.
• The SSE stream does not guarantee delivery of every intermediate event during high-load periods.
• Credit deductions occur at job initiation, not at export; a failed job may still consume credits depending on how far processing advanced.
• Session identifiers expire after a defined period of inactivity and cannot be reused once invalidated.
7. Error Response Handling
The table below maps API response codes (the code field in the JSON body) and HTTP status codes to their likely causes and the recommended recovery action for each.
| Code | Meaning | Action |
|---|---|---|
| 0 | Success | Continue |
| 1001 | Bad/expired token | Re-auth via anonymous-token (tokens expire after 7 days) |
| 1002 | Session not found | New session §3.0 |
| 2001 | No credits | Anonymous: show registration URL with ?bind=<id> (get <id> from create-session or state response when needed). Registered: "Top up at nemovideo.ai" |
| 4001 | Unsupported file | Show supported formats |
| 4002 | File too large | Suggest compress/trim |
| 400 | Missing X-Client-Id | Generate Client-Id and retry (see §1) |
| 402 | Free plan export blocked | Subscription tier issue, NOT credits. "Register at nemovideo.ai to unlock export." |
| 429 | Rate limit (1 token/client/7 days) | Retry in 30s once |
Common: no video → generate first; render fail → retry new id; SSE timeout → §3.6; silent edit → §3.1 fallback.
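The table translates directly into a lookup. The sketch below only restates the recovery actions as short labels; action_for_code is a hypothetical helper name, and the caller is responsible for actually performing each action.

```shell
# Sketch: map a response or HTTP code to the recovery action from the table.
# action_for_code is a hypothetical helper name.
action_for_code() {
  case "$1" in
    0)    echo "continue" ;;
    1001) echo "re-auth via anonymous-token" ;;
    1002) echo "create a new session" ;;
    2001) echo "no credits: show registration or top-up prompt" ;;
    4001) echo "show supported formats" ;;
    4002) echo "suggest compress or trim" ;;
    400)  echo "generate Client-Id and retry" ;;
    402)  echo "prompt registration to unlock export" ;;
    429)  echo "retry once after 30s" ;;
    *)    echo "unhandled code $1" ;;
  esac
}
```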
8. API Version and Token Scopes
Before making any requests, verify that the integration targets the correct API version by checking the version field in the base configuration. Token scopes must include read and write permissions for both jobs and files; a token missing either scope will result in authorization errors on upload or export calls. Review scope assignments whenever a new token is issued to ensure continued compatibility.