OpenClaw Ops
Operational health diagnostics and design patterns for OpenClaw agents. This skill helps you diagnose why an agent stopped responding, fix the root cause, and install preventive guardrails so it doesn't happen again.
It covers two complementary areas: ops diagnostics (find and fix failures now) and design patterns (prevent failures structurally). It is opinionated toward the safest reliable path first, with break-glass recovery reserved for true gateway lockups.
Quick Start: Agent Not Responding?
Run through this triage in order. Most outages are caused by the top 3.
1. Lane deadlock: a cron job on `agent:main:main` blocks all interactive messages. Check `~/.openclaw/cron/jobs.json` for any job with `"sessionKey": "agent:main:main"`. Fix: change it to `agent:main:cron:<job-name>` or `agent:main:isolated`.
2. Session bloat: a session file over 5MB causes compaction timeouts (600s limit). Check `~/.openclaw/agents/main/sessions/` for large `.jsonl` files. Fix: archive the session and remove it from `sessions.json`.
3. Bootstrap truncation: `AGENTS.md` exceeds the 20,000-character limit, causing compaction to time out on the bloated bootstrap. Check: `wc -c < ~/.openclaw/workspace/AGENTS.md`. Fix: move verbose sections to `AGENTS-REFERENCE.md`.
4. Auth failure: a provider key is missing from the agent-level `auth-profiles.json`. Check `gateway.err.log` for `FailoverError: No API key found`. Fix: add the profile to `~/.openclaw/agents/main/agent/auth-profiles.json`.
5. Gateway heap OOM: Node.js runs out of memory processing oversized sessions. Check `gateway.err.log` for `FATAL ERROR: Reached heap limit`. Fix: clear the bloated session first, then restart the gateway.
If none of these match, read references/failure-patterns.md for the full catalog of 10 failure categories extracted from real gateway logs.
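The session-bloat check from the triage list can be sketched in Python. This is a minimal sketch, not the bundled watchdog: the function name `find_large_sessions` is hypothetical, and only the 5MB threshold, the `.jsonl` extension, and the sessions directory come from the text above.

```python
from pathlib import Path

WARN_BYTES = 5 * 1024 * 1024  # 5MB threshold from the triage list


def find_large_sessions(sessions_dir, limit_bytes=WARN_BYTES):
    """Return (path, size_in_bytes) pairs for .jsonl files above limit_bytes, largest first."""
    root = Path(sessions_dir).expanduser()
    if not root.is_dir():
        return []  # nothing to report if the directory does not exist
    hits = []
    for path in root.glob("*.jsonl"):
        size = path.stat().st_size
        if size > limit_bytes:
            hits.append((path, size))
    return sorted(hits, key=lambda item: item[1], reverse=True)
```

Calling `find_large_sessions("~/.openclaw/agents/main/sessions")` returns an empty list when no session exceeds the threshold, so it can be used as a quick first check before archiving anything.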
Bundled Scripts
Two battle-tested bash scripts are included in scripts/. They can be run standalone or registered as OpenClaw cron jobs.
session-health-watchdog.sh
Comprehensive health check that monitors five areas in a single pass:
- Session file sizes (warn at 5MB, critical at 10MB)
- Stale session locks (warn at 10min, critical at 30min)
- Cron jobs routing to `agent:main:main` (always critical)
- AGENTS.md bootstrap budget utilization
- Recent stuck-session warnings in gateway.err.log (last 15 min)
Run standalone: `bash scripts/session-health-watchdog.sh`
Register as cron (recommended every 30 min):
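```bash
openclaw cron add \
  --name "Session Health Watchdog" \
  --prompt "Run the session health watchdog and report results" \
  --cron "*/30 * * * *" \
  --session-key "agent:main:cron:health-watchdog" \
  --model "openai-codex/gpt-5.4"
```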
The watchdog outputs either "all clear" with stats, or an alert summary with severity levels. It exits non-zero when problems are found, making it suitable for monitoring pipelines.
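The warn/critical thresholds and the non-zero exit behaviour can be illustrated with a short Python sketch. It is not the bundled script: `classify` and `summarize` are hypothetical names, and treating a value exactly at a threshold as having reached it is an assumption.

```python
MB = 1024 * 1024
SEVERITY_ORDER = {"ok": 0, "warn": 1, "critical": 2}


def classify(value, warn_at, critical_at):
    """Map a measurement to ok / warn / critical (thresholds assumed inclusive)."""
    if value >= critical_at:
        return "critical"
    if value >= warn_at:
        return "warn"
    return "ok"


def summarize(findings):
    """findings: list of (label, severity). Returns (exit_code, report_lines)."""
    problems = [(label, sev) for label, sev in findings if sev != "ok"]
    if not problems:
        return 0, ["all clear"]
    # Most severe findings first so they lead the alert summary.
    problems.sort(key=lambda item: SEVERITY_ORDER[item[1]], reverse=True)
    return 1, [f"[{sev.upper()}] {label}" for label, sev in problems]


findings = [
    ("session file size", classify(6 * MB, 5 * MB, 10 * MB)),  # warn
    ("lock age in minutes", classify(4, 10, 30)),              # ok
]
exit_code, report = summarize(findings)
```

A wrapper script would print `report` and pass `exit_code` to `sys.exit`, which is what lets a monitoring pipeline react to the result.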
bootstrap-budget-check.sh
Detailed AGENTS.md size analysis with section-by-section breakdown showing which sections consume the most budget. Uses visual bars and threshold alerts.
Thresholds:
- Green: below 75%, plenty of headroom
- Yellow: 75-85%, watch growth
- Orange: 85-95%, consolidate soon
- Red: 95%+, will truncate, trim immediately
Run standalone: `bash scripts/bootstrap-budget-check.sh`
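As a rough illustration of how the thresholds map onto the 20,000-character budget, here is a minimal Python sketch. The function name `budget_status` is hypothetical, and assigning a boundary value (exactly 75%, 85%, or 95%) to the higher band is an assumption, since the listed ranges share their endpoints.

```python
LIMIT_CHARS = 20_000  # bootstrap limit stated in this document


def budget_status(char_count, limit=LIMIT_CHARS):
    """Return (percent_used, color) for a given AGENTS.md character count."""
    percent = 100.0 * char_count / limit
    if percent >= 95:
        color = "red"
    elif percent >= 85:
        color = "orange"
    elif percent >= 75:
        color = "yellow"
    else:
        color = "green"
    return percent, color
```

For example, a 17,500-character AGENTS.md sits at 87.5% and lands in the orange band. Note that `wc -c` counts bytes, so a file with many multi-byte characters will report a larger number than a character count would.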
Session Lane Architecture
Understanding session lanes is the single most important concept for OpenClaw reliability. Every message, cron job, and subagent runs in a "session lane" — a keyed slot that can only process one task at a time.
The golden rule: agent:main:main is the interactive lane. Never put cron jobs on it.
When a cron job runs on the main lane, every interactive message (Telegram, Discord, CLI) queues behind it. If the cron runs for 2+ minutes or gets stuck, the agent appears dead.
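The queueing effect can be made concrete with a toy model in Python. This is only a simulation of the "one task at a time per lane" rule described above, not OpenClaw's scheduler; `completion_times` and the task tuples are hypothetical.

```python
from collections import defaultdict


def completion_times(tasks):
    """tasks: list of (arrival_s, lane, duration_s, name), sorted by arrival time.
    Each lane runs one task at a time. Returns {name: (start_s, finish_s)}."""
    lane_free_at = defaultdict(int)  # time at which each lane becomes idle
    result = {}
    for arrival, lane, duration, name in tasks:
        start = max(arrival, lane_free_at[lane])  # wait if the lane is busy
        finish = start + duration
        lane_free_at[lane] = finish
        result[name] = (start, finish)
    return result


# A 3-minute cron on the interactive lane delays a chat message sent at t=5s.
shared = completion_times([
    (0, "agent:main:main", 180, "cron"),
    (5, "agent:main:main", 2, "chat"),
])
# The same cron on its own lane leaves the interactive lane free.
isolated = completion_times([
    (0, "agent:main:cron:report", 180, "cron"),
    (5, "agent:main:main", 2, "chat"),
])
```

In the shared case the chat message starts at 180s and finishes at 182s; in the isolated case it starts at 5s and finishes at 7s.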
Lane naming conventions
| Use case | Session key pattern | Example or note |
|---|---|---|
| Interactive (Telegram, CLI) | `agent:main:main` | Reserved; never use for crons |
| Cron jobs | `agent:main:cron:<job-name>` | `agent:main:cron:recall-archiver` |
| Isolated one-shots | `agent:main:isolated` | Disposable tasks |
| Channel-specific | `agent:main:telegram:group:<id>` | Per-chat sessions |
Validating cron lanes
Before deploying any cron job, verify its session key:
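```bash
python3 -c "
import json
with open('$HOME/.openclaw/cron/jobs.json') as f:
    jobs = json.load(f)['jobs']
bad = [j['name'] for j in jobs if j.get('enabled', True) and j.get('sessionKey') == 'agent:main:main']
if bad:
    print(f'BLOCKED: {len(bad)} crons on main lane: {bad}')
else:
    print('All crons properly isolated.')
"
```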
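Once a job is flagged, the fix is to give it a key of the form `agent:main:cron:<job-name>`. The sketch below shows one way to compute the corrected keys on an in-memory copy of the job list. It assumes the `jobs` / `name` / `sessionKey` layout used by the check above; the slug rule (lowercase, hyphen-separated) and both function names are hypothetical. Applying the result is best done through the normal cron commands, since direct edits to jobs.json are treated as emergency recovery elsewhere in this document.

```python
import re


def cron_session_key(job_name):
    """Build agent:main:cron:<job-name> from a human-readable job name."""
    slug = re.sub(r"[^a-z0-9]+", "-", job_name.lower()).strip("-")
    return f"agent:main:cron:{slug or 'unnamed'}"


def reroute_main_lane_jobs(config):
    """Return (new_config, changed_names). The input dict is left unmodified."""
    changed = []
    new_jobs = []
    for job in config.get("jobs", []):
        job = dict(job)  # shallow copy so the caller's data is untouched
        if job.get("sessionKey") == "agent:main:main":
            job["sessionKey"] = cron_session_key(job.get("name", ""))
            changed.append(job.get("name", ""))
        new_jobs.append(job)
    return {**config, "jobs": new_jobs}, changed
```

Returning the list of changed names makes it easy to review what would be rerouted before anything is written back.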
Auth Configuration
OpenClaw uses a two-layer auth system. Global config (openclaw.json) declares available providers, but the agent needs its own auth-profiles.json to actually authenticate.
Common mistake: Adding a provider to global config but forgetting the agent-level profile. The gateway silently tries and fails, logging FailoverError: No API key found for provider "X".
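Fix pattern (in `~/.openclaw/agents/main/agent/auth-profiles.json`):
```json
{
  "profiles": {
    "provider-name:default": {
      "keyRef": "ENV_VAR_NAME"
    }
  },
  "lastGood": {
    "provider-name": "provider-name:default"
  }
}
```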
After adding, verify: grep "FailoverError.*API key" ~/.openclaw/logs/gateway.err.log | tail -5
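The "declared globally but missing at agent level" mistake can also be caught before the gateway logs an error. The following Python sketch assumes profile IDs of the form `provider-name:default` under a top-level `profiles` key in auth-profiles.json. How providers are listed in openclaw.json is not shown in this document, so the caller supplies them as a plain list; `missing_profiles` is a hypothetical helper.

```python
def missing_profiles(declared_providers, auth_profiles):
    """Return declared providers that have no '<provider>:<label>' profile entry."""
    profile_ids = auth_profiles.get("profiles", {})
    covered = {profile_id.split(":", 1)[0] for profile_id in profile_ids}
    return sorted(p for p in declared_providers if p not in covered)
```

An empty result means every declared provider has at least one agent-level profile; any name returned is a candidate for the `No API key found` failure.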
Multi-Model Failover
OpenClaw routes through model providers in a failover chain. When the primary model rate-limits or times out, it cascades to the next. Understanding this chain prevents false alarms.
Common failover errors (from real logs):
- `FailoverError: LLM request timed out`: provider slow, failover kicked in (238 occurrences in 3 weeks)
- `FailoverError: API rate limit reached`: 429 from provider (68 occurrences)
- `FailoverError: The AI service is temporarily overloaded`: 503/529 from provider (106 occurrences)
- `FailoverError: Request was aborted`: client-side cancellation (58 occurrences)
These are expected in a multi-model setup. They only become problems when ALL providers in the chain fail simultaneously, or when cooldown windows overlap causing no provider to be available.
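The overlapping-cooldown case is an interval intersection problem: the agent is unavailable only during time spans in which every provider is cooling down. The Python sketch below illustrates that idea. The input format (per-provider lists of non-overlapping `(start_s, end_s)` windows) is purely hypothetical, since this document does not describe how OpenClaw records cooldowns.

```python
def intersect(a, b):
    """Intersect two sorted lists of non-overlapping (start, end) intervals."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def outage_windows(cooldowns):
    """cooldowns: {provider: [(start_s, end_s), ...]}.
    Returns the intervals during which no provider is available."""
    lists = [sorted(windows) for windows in cooldowns.values()]
    if not lists:
        return []
    common = lists[0]
    for other in lists[1:]:
        common = intersect(common, other)
    return common
```

If the result is empty, at least one provider was always available and the individual failover errors were handled as designed.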
Diagnostic: Check if a specific provider is consistently failing:
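```bash
grep "FailoverError" ~/.openclaw/logs/gateway.err.log | \
  grep -oE 'provider=[^ ]+' | sort | uniq -c | sort -rn | head -10
```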
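To see how the errors split across the categories listed earlier, the log can also be tallied by message text. This is a minimal Python sketch: the message strings are the ones quoted in this document, while the category labels and the function name `count_failover_errors` are hypothetical. It matches on substrings only and makes no assumption about the rest of the log line format.

```python
from collections import Counter

CATEGORIES = {
    "LLM request timed out": "timeout",
    "API rate limit reached": "rate_limit",
    "The AI service is temporarily overloaded": "overloaded",
    "Request was aborted": "aborted",
    "No API key found": "auth",
}


def count_failover_errors(lines):
    """Count FailoverError lines by category; unrecognized ones go to 'other'."""
    counts = Counter()
    for line in lines:
        if "FailoverError" not in line:
            continue
        for needle, label in CATEGORIES.items():
            if needle in line:
                counts[label] += 1
                break
        else:  # no known message matched
            counts["other"] += 1
    return counts
```

Passing an open file handle for gateway.err.log as `lines` works, since the function only iterates over it once. A large `auth` count points to a configuration problem, whereas `timeout` and `rate_limit` are the expected background noise described above.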
AGENTS.md Management
The bootstrap file (AGENTS.md) loads into every new session. It has a hard 20,000 character limit. Exceeding it causes silent truncation, which breaks instructions and makes compaction unreliable.
Budget strategy:
1. Keep AGENTS.md lean: identity, core rules, critical workflows only
2. Move verbose details to `AGENTS-REFERENCE.md`, which the agent can read on demand
3. Use `scripts/bootstrap-budget-check.sh` to monitor section-by-section usage
4. Target 50-70% utilization to leave room for organic growth
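Deciding which sections to move first is easier with a per-section size breakdown. The Python sketch below shows one way to compute it by splitting on Markdown headings. It is not the bundled script's method, `section_sizes` is a hypothetical name, and a line starting with `#` inside a code fence would be miscounted as a heading.

```python
import re


def section_sizes(markdown_text):
    """Split on Markdown headings and return [(heading, char_count)], largest first."""
    sections = []
    current, size = "(preamble)", 0
    for line in markdown_text.splitlines(keepends=True):
        if re.match(r"#{1,6}\s", line):
            if size:
                sections.append((current, size))
            current, size = line.strip(), 0
        size += len(line)  # the heading line counts toward its own section
    if size:
        sections.append((current, size))
    return sorted(sections, key=lambda item: item[1], reverse=True)
```

The sizes sum to the length of the input text, so the largest entries show directly how much of the 20,000-character budget would be freed by moving them to AGENTS-REFERENCE.md.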
Red flags that AGENTS.md is too big:
- Compaction timeouts on sessions that aren't particularly large
- Agent "forgets" instructions from later sections of AGENTS.md
- New sessions start with truncated or garbled context
Design Patterns Applied
This skill applies six patterns from the Agentic Design Patterns framework (Gulli, 2025). Understanding why each guardrail exists helps you adapt them to your setup.
For the full pattern descriptions, read references/design-patterns.md.
| Pattern | Where applied | What it prevents |
|---|---|---|
| Routing (P2) | Lane architecture | Cron/interactive deadlocks |
| Exception Handling (P12) | Watchdog alerts | Silent failures going unnoticed |
| Goal Monitoring (P11) | Bootstrap budget check | Instruction truncation |
| Resource-Aware Optimization (P16) | Session size limits | OOM crashes and compaction timeouts |
| Evaluation & Monitoring (P19) | Log scanning | Pattern detection across error categories |
| Prioritization (P20) | Triage order | Fix highest-impact issue first |
Runbook: Common Scenarios
For the full diagnostic runbook with step-by-step commands for each failure category, read references/failure-patterns.md. Here are the most frequent:
Gateway won't accept CLI connections
The gateway is alive but locked processing a stuck task. Check for stuck sessions, then use the safest available recovery path:
CODEBLOCK4
Break-glass recovery (last resort only): if the gateway is too wedged for normal cron commands and a bad cron on agent:main:main is repeatedly deadlocking the system, make a timestamped backup of ~/.openclaw/cron/jobs.json, remove or disable only the offending job, validate the JSON with jq, then restart the gateway. Treat direct jobs.json edits as emergency recovery, not routine operations.
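The backup, disable, and validate steps of the break-glass procedure can be sketched in Python. This is an illustration under assumptions, not a supported tool: it assumes jobs.json holds a `jobs` list whose entries have `name` and `enabled` fields, it uses `json.loads` as a stand-in for the `jq` validation step, and `disable_job` is a hypothetical name. Restarting the gateway is left out.

```python
import json
import shutil
import time
from pathlib import Path


def disable_job(jobs_path, job_name):
    """Back up jobs.json, set enabled=False on the named job, validate, then write.
    Returns the backup path. Raises KeyError if no job has that name."""
    path = Path(jobs_path).expanduser()
    stamp = time.strftime("%Y%m%d-%H%M%S")
    backup = path.with_name(f"{path.name}.{stamp}.bak")
    shutil.copy2(path, backup)  # timestamped backup before any change

    config = json.loads(path.read_text())
    matches = [job for job in config["jobs"] if job.get("name") == job_name]
    if not matches:
        raise KeyError(job_name)
    for job in matches:
        job["enabled"] = False  # disable only the offending job

    text = json.dumps(config, indent=2)
    json.loads(text)  # the new content must parse before it replaces the file
    path.write_text(text + "\n")
    return backup
```

Because the backup is taken first, restoring the previous state is a single file copy if the edit turns out to be wrong.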
Compaction keeps timing out
Usually means a session is too large OR AGENTS.md is bloated. Check both:
CODEBLOCK5
Agent responds but ignores instructions
AGENTS.md is likely truncated. Run `bootstrap-budget-check.sh` to see utilization. If above 90%, move sections to AGENTS-REFERENCE.md immediately.
Tool writes fail with "Path escapes sandbox"
The agent tried to write to `/tmp/` or another path outside `~/.openclaw/workspace/`. All file operations must stay within the workspace sandbox. In AGENTS.md, instruct the agent to use workspace-relative paths.
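The containment rule can be illustrated with a small path check in Python. This sketch shows the idea of "resolves inside the workspace" only; it is not the gateway's actual sandbox check, and `inside_workspace` is a hypothetical helper.

```python
from pathlib import Path


def inside_workspace(candidate, workspace="~/.openclaw/workspace"):
    """True if candidate resolves to the workspace root or a path beneath it.
    Relative candidates are interpreted relative to the workspace."""
    root = Path(workspace).expanduser().resolve()
    target = Path(candidate).expanduser()
    if not target.is_absolute():
        target = root / target
    target = target.resolve()  # collapses '..' segments and symlinks
    return target == root or root in target.parents
```

A workspace-relative path such as `notes/out.txt` passes, while `/tmp/report.txt` and `../escape.txt` do not, which matches the two kinds of path that trigger the error.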