Incident Response
Seven phases, in order. Never skip. Never assume — follow the evidence.
Outputs produced by this skill:
- Root cause statement (5 Whys chain with evidence citations)
- Restore confirmation (what was restored, verified working)
- Prevention commit (git commit hash of guard/rule added)
- Monitoring cron (job ID + schedule)
- Learning entry (appended to ~/.openclaw/learnings/rules.md)
Phase 0: Triage (2 min)
Check current state FIRST before investigating history.
CODEBLOCK0
If currently working → report "recovered, investigating cause." If still broken → proceed.
Phase 1: Evidence Collection
Gather hard evidence from four sources:
1a. Config backups timeline
CODEBLOCK1
1b. Git audit trail
CODEBLOCK2
1c. Session logs (who did what)
CODEBLOCK3
1d. Config backup diff (find the exact moment of change)
CODEBLOCK4
Stop and document: Who changed what, when, which session, which tool call.
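The "exact moment of change" can also be located mechanically. The sketch below is illustrative only: it assumes backups are JSON files named openclaw-*.json with a top-level list field such as bindings, and the helper names load_timeline and find_first_drop are hypothetical, not part of OpenClaw.

```python
import json
import os
from datetime import datetime
from pathlib import Path


def load_timeline(backup_dir, field="bindings"):
    """Return [(mtime, filename, count)] for each backup, oldest first."""
    rows = []
    for path in Path(backup_dir).expanduser().glob("openclaw-*.json"):
        with open(path) as fh:
            config = json.load(fh)
        mtime = datetime.fromtimestamp(os.path.getmtime(path))
        # Assumption: the field of interest is a list at the top level.
        rows.append((mtime, path.name, len(config.get(field, []))))
    return sorted(rows)


def find_first_drop(timeline):
    """Return (before, after) for the first backup whose count fell, else None."""
    for before, after in zip(timeline, timeline[1:]):
        if after[2] < before[2]:
            return before, after
    return None
```

Calling find_first_drop(load_timeline("~/.openclaw/config-backups/")) returns the pair of backups that bracket the change, which gives the timestamps to cite in Phase 2.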
Phase 2: 5 Whys Analysis
Write each "why" as a statement of fact backed by evidence from Phase 1.
CODEBLOCK5
Rule: Every "why" must cite a specific file, log entry, timestamp, or command output. No assumptions.
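One way to enforce the evidence rule is to hold the chain as data and reject any entry without a citation. This is a minimal sketch; the Why class and validate_chain function are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass
class Why:
    level: int       # 1 (symptom) through 5 (root cause)
    statement: str   # a statement of fact, not a guess
    evidence: str    # file, log entry, timestamp, or command output


def validate_chain(chain):
    """Return a list of problems; an empty list means the chain is acceptable."""
    problems = []
    if [w.level for w in chain] != [1, 2, 3, 4, 5]:
        problems.append("chain must contain levels 1 through 5 in order")
    for w in chain:
        # Whitespace-only evidence counts as missing.
        if not w.evidence.strip():
            problems.append(f"why {w.level} has no evidence citation")
    return problems
```

The check only confirms that a citation is present. Whether the cited evidence actually supports the statement still needs a human or agent review.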
Phase 3: Restore
Restore to last known-good state using backup timeline from Phase 1.
CODEBLOCK6
Verify restore: Check that the restored value matches the good backup. Re-run the user's original failing action.
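The value comparison in the verification step can be sketched as a plain dictionary check. The function name verify_restore and the idea of passing both configs as already-loaded dictionaries are assumptions made for this example.

```python
def verify_restore(good_backup, live_config, fields):
    """Return {field: (expected, actual)} for every field that still differs."""
    mismatches = {}
    for field in fields:
        expected = good_backup.get(field)
        actual = live_config.get(field)
        if expected != actual:
            mismatches[field] = (expected, actual)
    return mismatches


# Example: the live config is still missing one binding after the restore.
good = {"bindings": [{"agentId": "main"}, {"agentId": "ops"}]}
live = {"bindings": [{"agentId": "main"}]}
remaining = verify_restore(good, live, ["bindings"])
```

An empty result means the restored fields match the good backup. It does not replace re-running the user's original failing action.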
Phase 4: Prevention
Add guards proportional to the severity and recurrence risk. See references/prevention-patterns.md for full patterns. Quick reference:
- For config fields that must not decrease: add a guard to config-validate.sh --merge (see references for template)
- For agent behavior rules: add to ~/.openclaw/agents/<id>/agent/SOUL.md as a Hard Rule (HR-NNN)
- For recurring mistakes: add to ~/.openclaw/learnings/rules.md with category and date
- For schema validation gaps: update the config-validate.sh valid_keys list after verifying against DeepWiki
Always commit prevention changes to git:
CODEBLOCK7
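For the first quick-reference case (config fields that must not decrease), the guard logic might look like the sketch below. It is not the template from references/prevention-patterns.md and not the real config-validate.sh. The function name find_decreases is hypothetical, and it assumes the guarded fields are lists and that a merge leaves omitted fields unchanged.

```python
def find_decreases(current, proposed, guarded_fields):
    """Return messages for guarded list fields that would shrink after a merge."""
    violations = []
    for field in guarded_fields:
        before = len(current.get(field, []))
        # A merge patch that omits the field keeps the current value.
        after = len(proposed.get(field, current.get(field, [])))
        if after < before:
            violations.append(f"{field}: {before} -> {after}")
    return violations
```

A merge step could refuse to apply the patch whenever this returns a non-empty list, unless the operator explicitly confirms that the decrease is intended.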
Phase 5: Monitor
Set a recurring cron job that runs until user confirms "good enough" (minimum 7 days, 30 days for recurring incidents).
CODEBLOCK8
See references/cron-template.md for the full cron job prompt template.
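The stop and escalate rules for the monitoring job can be expressed as a small decision function. This is a sketch under assumptions: the name monitoring_decision, the per-day history format, and the default thresholds are illustrative. The 3-day escalation threshold follows the skill's cron structure, and the 7-day default follows the minimum stated above.

```python
def monitoring_decision(daily_fixes, clean_days_to_stop=7, escalate_after=3):
    """Decide what the monitoring job should do next.

    daily_fixes has one entry per day, oldest first: None means no incident,
    otherwise a short label for the fix that had to be applied that day.
    """
    # Escalate when the most recent days all needed the same fix.
    recent = daily_fixes[-escalate_after:]
    if len(recent) == escalate_after and recent[0] is not None and len(set(recent)) == 1:
        return "escalate"
    # Stop once the trailing run of clean days reaches the threshold.
    clean_run = 0
    for fix in reversed(daily_fixes):
        if fix is not None:
            break
        clean_run += 1
    if clean_run >= clean_days_to_stop:
        return "stop"
    return "continue"
```

For a recurring incident, pass clean_days_to_stop=30. The sketch does not model the user explicitly confirming "good enough", which ends monitoring regardless of the history.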
Phase 6: Document
Write to ~/.openclaw/learnings/rules.md if a rule should be added:
- Category: HR (Hard Rule, recurring) or SR (Soft Rule, first offense)
- Include: what triggered, what the rule is, date learned, why it matters
Update MEMORY.md with incident summary if it's systemic.
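A learning entry that carries the fields listed above could be formatted and appended as follows. The entry layout and the function names format_learning_entry and append_learning are assumptions for illustration. The source does not specify the file format of rules.md.

```python
from datetime import date
from pathlib import Path


def format_learning_entry(category, trigger, rule, why, learned=None):
    """Build one entry; category is 'HR' (Hard Rule) or 'SR' (Soft Rule)."""
    if category not in ("HR", "SR"):
        raise ValueError("category must be 'HR' or 'SR'")
    learned = learned or date.today()
    return (
        f"- [{category}] {learned.isoformat()}: {rule}\n"
        f"  - Trigger: {trigger}\n"
        f"  - Why it matters: {why}\n"
    )


def append_learning(path, entry):
    """Append the entry, creating the file and parent directory if needed."""
    path = Path(path).expanduser()
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(entry)
```

Opening the file in append mode keeps earlier entries intact, which matches the "appended to" wording in the outputs list.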
Configuration
No persistent configuration required. Adapt the following to your environment:
| Variable | Description | Example |
|---|---|---|
| Remote host | SSH target for remote investigations | <remote-host> → your Titan/server hostname |
| Config backup path | Where OpenClaw stores automatic config backups | ~/.openclaw/config-backups/ |
| Session key | Your messaging session key for cron reports | agent:main-signal:signal:<your-number> |
| Learnings path | Where rules are persisted | ~/.openclaw/learnings/rules.md |
See references/cron-template.md for full cron report configuration.
Quick Diagnosis Checklists
See references/checklists.md for:
- Gateway crash checklist
- Binding loss checklist
- Config key disappeared checklist
- Agent routing wrong checklist
- Vector search not finding content checklist
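Incident Response
Seven phases, in order. Never skip. Never assume; follow the evidence.
Outputs produced by this skill:
- Root cause statement (5 Whys chain with evidence citations)
- Restore confirmation (what was restored, verified working)
- Prevention commit (git commit hash of the guard/rule added)
- Monitoring cron (job ID + schedule)
- Learning entry (appended to ~/.openclaw/learnings/rules.md)
Phase 0: Triage (2 min)
Check the current state first, before investigating history.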
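```bash
# Is it actually broken right now?
openclaw status
ssh <remote-host> "launchctl list | grep openclaw"
# Test with the correct protocol (check the source: HTTP or HTTPS?)
```
If currently working → report "recovered, investigating cause." If still broken → proceed.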
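Phase 1: Evidence Collection
Gather hard evidence from four sources:
1a. Config backups timeline
```bash
# See how the binding/setting count changed over time
ssh <remote-host> python3 << 'EOF'
import datetime, glob, json, os

# Python does not expand "~" on its own, so expand it before globbing
pattern = os.path.expanduser("~/.openclaw/config-backups/openclaw-*.json")
for f in sorted(glob.glob(pattern), key=os.path.getmtime):
    with open(f) as fh:
        d = json.load(fh)
    dt = datetime.datetime.fromtimestamp(os.path.getmtime(f)).strftime("%Y-%m-%d %H:%M")
    # Customize: bindings, agents, channels, etc.
    count = len(d.get("bindings", []))
    ids = [b.get("agentId") for b in d.get("bindings", [])]
    print(f"{dt} [{count}] {ids}")
EOF
```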
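1b. Git audit trail
```bash
ssh <remote-host> "cd ~/.openclaw && git log --oneline -20"
# Keep only added/removed lines; drop the ---/+++ file headers
ssh <remote-host> "cd ~/.openclaw && git diff -- openclaw.json | grep '^[+-]' | grep -Ev '^(---|\+\+\+)'"
```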
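1c. Session logs (who did what)
```bash
# Find sessions that touched the broken config key
# (rg searches recursively by default; -l lists matching files)
ssh <remote-host> "rg -l 'keyword' ~/.openclaw/agents/*/sessions/*.jsonl | head -5"
# Extract tool calls from a session
ssh <remote-host> python3 << 'EOF'
import json

with open("SESSION.jsonl") as fh:
    for line in fh:
        obj = json.loads(line)
        if obj.get("type") != "message":
            continue
        for block in obj.get("message", {}).get("content", []):
            if block.get("type") == "toolCall" and block.get("name") in ["Write", "Edit", "gateway", "exec"]:
                print(obj["timestamp"], block["name"], str(block.get("input", ""))[:200])
EOF
```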
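1d. Config backup diff (find the exact moment of change)
```bash
# Compare the backups just before and just after the suspect change
python3 -c '
import json
a = json.load(open("backup-before.json"))
b = json.load(open("backup-after.json"))
# Compare a specific field
print("Before:", a.get("bindings"))
print("After:", b.get("bindings"))
'
```
Stop and document: who changed what, when, in which session, and with which tool call.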
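Phase 2: 5 Whys Analysis
Write each "why" as a statement of fact backed by evidence from Phase 1.
```
Why 1: [Symptom], e.g. binding count dropped from 17 to 1
  Evidence: backup timestamps + counts
Why 2: [Direct cause], e.g. a full config replacement was written at 09:38 PST
  Evidence: backup mtime + content diff
Why 3: [Mechanism], e.g. the agent wrote a new config from scratch instead of basing it on the current one
  Evidence: session log tool call + content
Why 4: [System gap], e.g. config-validate.sh --merge has no guard against a drop in binding count
  Evidence: inspecting the script shows no such check
Why 5: [Root cause], e.g. nothing automatically detects the problem between the config write and the user's next report
  Evidence: no monitoring cron at the time, no git
```
Rule: Every "why" must cite a specific file, log entry, timestamp, or command output. No assumptions.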
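Phase 3: Restore
Restore to the last known-good state using the backup timeline from Phase 1.
```bash
# Restore a specific field (always merge, never replace)
PATCH=$(python3 -c '
import json
good = json.load(open("/path/to/good-backup.json"))
patch = {"bindings": good["bindings"]}  # customize the field
print(json.dumps(patch))
')
echo "$PATCH" | ssh <remote-host> "~/.openclaw/scripts/config-validate.sh --merge"
# Restart the gateway
ssh <remote-host> "launchctl stop ai.openclaw.gateway && sleep 2 && launchctl start ai.openclaw.gateway"
ssh <remote-host> "launchctl list | grep ai.openclaw.gateway"  # verify exit code is 0
```
Verify restore: Check that the restored value matches the good backup. Re-run the user's original failing action.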
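Phase 4: Prevention
Add guards proportional to the severity and recurrence risk. See references/prevention-patterns.md for full patterns. Quick reference:
- For config fields that must not decrease: add a guard to config-validate.sh --merge (see references for template)
- For agent behavior rules: add to ~/.openclaw/agents/<id>/agent/SOUL.md as a Hard Rule (HR-NNN)
- For recurring mistakes: add to ~/.openclaw/learnings/rules.md with category and date
- For schema validation gaps: update the config-validate.sh valid_keys list after verifying against DeepWiki
Always commit prevention changes to git:
```bash
ssh <remote-host> "cd ~/.openclaw && git add -A && git commit -m 'prevention: <what was added> after <incident>'"
```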
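Phase 5: Monitor
Set a recurring cron job that runs until the user confirms "good enough" (minimum 7 days, 30 days for recurring incidents).
Cron job structure:
- Schedule: every 24 hours (every N hours for high-severity incidents)
- Task: check the specific metric → compare to baseline → if degraded: restore + 5 Whys → report
- Report channel: sessions_send to your preferred channel (Signal, Telegram, Discord)
- Auto-escalation: if the same fix is needed 3 or more days in a row → escalate prevention
- Termination: the user explicitly says to stop monitoring, or N consecutive days pass without an incident
See references/cron-template.md for the full cron job prompt template.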
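Phase 6: Document
Write to ~/.openclaw/learnings/rules.md if a rule should be added:
- Category: HR (Hard Rule, recurring) or SR (Soft Rule, first offense)
- Include: what triggered it, what the rule is, date learned, why it matters
Update MEMORY.md with an incident summary if the problem is systemic.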
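Configuration
No persistent configuration required. Adapt the following to your environment:
| Variable | Description | Example |
|---|---|---|
| Remote host | SSH target for remote investigations | <remote-host> → your Titan/server hostname |
| Config backup path | Where OpenClaw stores automatic config backups | ~/.openclaw/config-backups/ |
| Session key | Your messaging session key for cron reports | agent:main-signal:signal:<your-number> |
| Learnings path | Where rules are persisted | ~/.openclaw/learnings/rules.md |
See references/cron-template.md for full cron report configuration.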
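Quick Diagnosis Checklists
See references/checklists.md for:
- Gateway crash checklist
- Binding loss checklist
- Config key disappeared checklist
- Agent routing wrong checklist
- Vector search not finding content checklist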