Incident Response

Seven phases, in order. Never skip. Never assume — follow the evidence.

Outputs produced by this skill:

- Root cause statement (5 Whys chain with evidence citations)
- Restore confirmation (what was restored, verified working)
- Prevention commit (git commit hash of guard/rule added)
- Monitoring cron (job ID + schedule)
- Learning entry (appended to ~/.openclaw/learnings/rules.md)

Phase 0: Triage (2 min)

Check current state FIRST before investigating history.

CODEBLOCK0

If currently working → report "recovered, investigating cause." If still broken → proceed.

Phase 1: Evidence Collection

Gather hard evidence from four sources:

1a. Config backups timeline

CODEBLOCK1

1b. Git audit trail

CODEBLOCK2

1c. Session logs (who did what)

CODEBLOCK3

1d. Config backup diff (find the exact moment of change)

CODEBLOCK4

Stop and document: Who changed what, when, which session, which tool call.
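To narrow down "when", it can help to scan the whole backup timeline for the first snapshot where a count fell. The following is a minimal sketch, assuming backups are JSON files named openclaw-*.json in one directory and that the field of interest is a list; `find_drops` is a hypothetical helper, not part of the skill's scripts.

```python
import json
from pathlib import Path


def find_drops(backup_dir, field="bindings"):
    """Return (previous_file, file, previous_count, count) for every backup
    where the list-valued `field` is shorter than in the backup before it."""
    # Oldest first by modification time, so adjacent pairs are consecutive snapshots.
    files = sorted(Path(backup_dir).glob("openclaw-*.json"),
                   key=lambda p: p.stat().st_mtime)
    drops = []
    prev_file, prev_count = None, None
    for path in files:
        with path.open() as fh:
            count = len(json.load(fh).get(field, []))
        if prev_count is not None and count < prev_count:
            drops.append((prev_file.name, path.name, prev_count, count))
        prev_file, prev_count = path, count
    return drops
```

Each returned pair names the "before" and "after" backups to diff, and the modification time of the "after" file bounds the time window to search in the session logs.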

Phase 2: 5 Whys Analysis

Write each "why" as a statement of fact backed by evidence from Phase 1.

CODEBLOCK5

Rule: Every "why" must cite a specific file, log entry, timestamp, or command output. No assumptions.
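The evidence rule can be checked mechanically before the root cause statement is reported. This is only an illustrative sketch: the `Why` record and `validate_chain` function are hypothetical names, and the check only confirms that evidence is present, not that it is correct.

```python
from dataclasses import dataclass


@dataclass
class Why:
    statement: str  # statement of fact, e.g. "bindings dropped from 17 to 1"
    evidence: str   # file, log entry, timestamp, or command output backing it


def validate_chain(chain):
    """Return a list of problems; an empty list means the chain passes."""
    problems = []
    if len(chain) != 5:
        problems.append(f"expected 5 whys, got {len(chain)}")
    for i, why in enumerate(chain, start=1):
        if not why.statement.strip():
            problems.append(f"why {i}: empty statement")
        if not why.evidence.strip():
            problems.append(f"why {i}: no evidence cited")
    return problems
```

A chain with a blank evidence field on the third entry would be reported as "why 3: no evidence cited", which is the cue to go back to Phase 1 rather than guess.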

Phase 3: Restore

Restore to last known-good state using backup timeline from Phase 1.

CODEBLOCK6

Verify restore: Check that the restored value matches the good backup. Re-run the user's original failing action.
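The first half of that verification (restored value equals the good backup) might be sketched as below. It assumes both files are JSON and that the live config path is passed in by the caller; `field_matches` is a hypothetical helper. Re-running the user's original failing action still has to be done separately.

```python
import json


def field_matches(good_backup_path, live_config_path, field="bindings"):
    """True when `field` in the live config equals the value in the good backup."""
    with open(good_backup_path) as fh:
        good = json.load(fh)
    with open(live_config_path) as fh:
        live = json.load(fh)
    # A field missing on either side counts as a mismatch rather than a pass.
    if field not in good or field not in live:
        return False
    return good[field] == live[field]
```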

Phase 4: Prevention

Add guards proportional to the severity and recurrence risk. See references/prevention-patterns.md for full patterns. Quick reference:

For config fields that must not decrease:
Add guard to config-validate.sh --merge (see references for template)

For agent behavior rules:
Add to ~/.openclaw/agents/<id>/agent/SOUL.md as a Hard Rule (HR-NNN)

For recurring mistakes:
Add to ~/.openclaw/learnings/rules.md with category and date

For schema validation gaps:
Update config-validate.sh valid_keys list after verifying against DeepWiki

Always commit prevention changes to git:
CODEBLOCK7
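For the "must not decrease" case, the idea behind the guard can be illustrated in a few lines. This is a sketch of the check only, written in Python for clarity; it is not the template from references/prevention-patterns.md, and `check_no_decrease` is a hypothetical name. It assumes the guarded fields are lists.

```python
def check_no_decrease(current, proposed, fields=("bindings",)):
    """Return a list of violations for list-valued fields that would get shorter."""
    violations = []
    for field in fields:
        before = len(current.get(field, []))
        after = len(proposed.get(field, []))
        if after < before:
            violations.append(f"{field}: count would drop from {before} to {after}")
    return violations
```

A merge step could refuse to write the proposed config whenever this list is non-empty, which turns a silent loss into an immediate, visible failure.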

Phase 5: Monitor

Set a recurring cron job that runs until user confirms "good enough" (minimum 7 days, 30 days for recurring incidents).

CODEBLOCK8

See references/cron-template.md for the full cron job prompt template.
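The monitoring job's escalation and termination rules (escalate prevention when the same fix is needed 3 or more days in a row; stop after N consecutive incident-free days) can be expressed as a small decision function. This is a sketch under assumptions: `next_action` and its arguments are hypothetical, the value of N is left to the caller, and an explicit "stop monitoring" from the user is handled outside this function.

```python
def next_action(daily_fixes, quiet_days_to_stop, escalate_after=3):
    """Decide what the monitor should do after today's check.

    daily_fixes: one entry per day, oldest first; the fix applied that day,
    or None when the check passed without intervention.
    """
    # Count trailing days with no incident.
    quiet = 0
    for fix in reversed(daily_fixes):
        if fix is not None:
            break
        quiet += 1
    if daily_fixes and quiet >= quiet_days_to_stop:
        return "stop"

    # Count trailing days on which the same fix was needed.
    streak = 0
    last = daily_fixes[-1] if daily_fixes else None
    if last is not None:
        for fix in reversed(daily_fixes):
            if fix != last:
                break
            streak += 1
    if streak >= escalate_after:
        return "escalate"
    return "continue"
```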

Phase 6: Document

Write to ~/.openclaw/learnings/rules.md if a Hard Rule should be added:

- Category: HR (Hard Rule, recurring) or SR (Soft Rule, first offense)
- Include: what triggered, what the rule is, date learned, why it matters

Update MEMORY.md with incident summary if it's systemic.
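A learning entry carrying those four pieces of information could be rendered as follows. The layout is purely illustrative, since the actual format of rules.md is not specified here; `format_rule_entry` is a hypothetical helper, and numbering soft rules as SR-NNN by analogy with HR-NNN is an assumption.

```python
from datetime import date


def format_rule_entry(category, number, trigger, rule, why, learned=None):
    """Render one rules.md entry; the layout is illustrative only."""
    if category not in ("HR", "SR"):
        raise ValueError("category must be 'HR' (hard rule) or 'SR' (soft rule)")
    learned = learned or date.today()
    return (
        f"{category}-{number:03d} ({learned.isoformat()})\n"
        f"Trigger: {trigger}\n"
        f"Rule: {rule}\n"
        f"Why it matters: {why}\n"
    )
```

The returned text can then be appended to ~/.openclaw/learnings/rules.md and committed along with the other prevention changes.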

Configuration

No persistent configuration required. Adapt the following to your environment:

| Variable | Description | Example |
| --- | --- | --- |
| Remote host | SSH target for remote investigations | `<remote-host>` → your Titan/server hostname |
| Config backup path | | |

See references/cron-template.md for full cron report configuration.

Quick Diagnosis Checklists

See references/checklists.md for:

- Gateway crash checklist
- Binding loss checklist
- Config key disappeared checklist
- Agent routing wrong checklist
- Vector search not finding content checklist

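Incident Response

Seven phases, in order. Never skip. Never assume: follow the evidence.

Outputs produced by this skill:

- Root cause statement (5 Whys chain with evidence citations)
- Restore confirmation (what was restored, verified working)
- Prevention commit (git commit hash of the guard/rule added)
- Monitoring cron (job ID + schedule)
- Learning entry (appended to ~/.openclaw/learnings/rules.md)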

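Phase 0: Triage (2 min)

Check current state FIRST before investigating history.

```bash
# Is it actually broken right now?
openclaw status
ssh <remote-host> 'launchctl list | grep openclaw'

# Test with the correct protocol (check the source: HTTP or HTTPS?)
```

If currently working → report "recovered, investigating cause." If still broken → proceed.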

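Phase 1: Evidence Collection

Gather hard evidence from four sources:

1a. Config backups timeline

```bash
# See binding/setting count over time
ssh <remote-host> python3 << 'EOF'
import datetime
import glob
import json
import os

# glob does not expand "~", so expand it explicitly
pattern = os.path.expanduser("~/.openclaw/config-backups/openclaw-*.json")
for f in sorted(glob.glob(pattern), key=os.path.getmtime):
    with open(f) as fh:
        d = json.load(fh)
    dt = datetime.datetime.fromtimestamp(os.path.getmtime(f)).strftime("%Y-%m-%d %H:%M")
    # Customize: bindings, agents, channels, etc.
    count = len(d.get("bindings", []))
    ids = [b.get("agentId") for b in d.get("bindings", [])]
    print(f"{dt} [{count}] {ids}")
EOF
```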

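1b. Git audit trail

```bash
# Recent commits in the config repository
ssh <remote-host> 'cd ~/.openclaw && git log --oneline -20'

# Changed lines in openclaw.json only, without the diff file headers
ssh <remote-host> 'cd ~/.openclaw && git diff -- openclaw.json' | grep '^[+-]' | grep -v '^---\|^+++'
```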

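1c. Session logs (who did what)

```bash
# Find sessions that touched the broken config key
# (-l lists matching files; the glob is quoted so it expands on the remote host)
ssh <remote-host> 'rg -l "keyword" ~/.openclaw/agents/*/sessions/*.jsonl | head -5'

# Extract tool calls from a session
ssh <remote-host> python3 << 'EOF'
import json

# SESSION.jsonl is a placeholder for one of the files found above
with open("SESSION.jsonl") as fh:
    for line in fh:
        obj = json.loads(line)
        if obj.get("type") != "message":
            continue
        for block in obj.get("message", {}).get("content", []):
            if block.get("type") == "toolCall" and block.get("name") in ["Write", "Edit", "gateway", "exec"]:
                print(obj["timestamp"], block["name"], str(block.get("input", ""))[:200])
EOF
```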

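1d. Config backup diff (find the exact moment of change)

```bash
# Compare the backups before and after the suspicious one
python3 -c "
import json
a = json.load(open('backup-before.json'))
b = json.load(open('backup-after.json'))
# Compare specific fields
print('Before:', a.get('bindings'))
print('After:', b.get('bindings'))
"
```

Stop and document: Who changed what, when, which session, which tool call.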

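Phase 2: 5 Whys Analysis

Write each "why" as a statement of fact backed by evidence from Phase 1.

Why 1: [Symptom], e.g. "bindings dropped from 17 to 1"
Evidence: backup timestamps + counts

Why 2: [Direct cause], e.g. "a full config replacement was written at 09:38 PST"
Evidence: backup mtime + content diff

Why 3: [Mechanism], e.g. "the agent wrote a new config from scratch instead of basing it on the current one"
Evidence: session log tool call + content

Why 4: [System gap], e.g. "config-validate.sh --merge had no guard against a drop in binding count"
Evidence: script inspection shows no such check

Why 5: [Root cause], e.g. "no automated detection between the config write and the user's next report"
Evidence: no monitoring cron and no git at the time

Rule: Every "why" must cite a specific file, log entry, timestamp, or command output. No assumptions.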

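Phase 3: Restore

Restore to the last known-good state using the backup timeline from Phase 1.

```bash
# Restore a specific field (always merge, never replace)
PATCH=$(python3 -c "
import json
good = json.load(open('/path/to/good-backup.json'))
patch = {'bindings': good['bindings']}  # customize the field
print(json.dumps(patch))
")
echo "$PATCH" | ssh <remote-host> '~/.openclaw/scripts/config-validate.sh --merge'

# Restart the gateway (quoted so that stop, sleep, and start all run on the remote host)
ssh <remote-host> 'launchctl stop ai.openclaw.gateway && sleep 2 && launchctl start ai.openclaw.gateway'
ssh <remote-host> 'launchctl list | grep ai.openclaw.gateway'  # verify the exit code is 0
```

Verify restore: Check that the restored value matches the good backup. Re-run the user's original failing action.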

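Phase 4: Prevention

Add guards proportional to the severity and recurrence risk. See references/prevention-patterns.md for full patterns. Quick reference:

For config fields that must not decrease:
Add a guard to config-validate.sh --merge (see references for the template)

For agent behavior rules:
Add to ~/.openclaw/agents/<id>/agent/SOUL.md as a Hard Rule (HR-NNN)

For recurring mistakes:
Add to ~/.openclaw/learnings/rules.md with category and date

For schema validation gaps:
Update the config-validate.sh valid_keys list after verifying against DeepWiki

Always commit prevention changes to git:

```bash
ssh <remote-host> 'cd ~/.openclaw && git add -A && git commit -m "prevention: <what was added> after <incident>"'
```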

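Phase 5: Monitor

Set a recurring cron job that runs until the user confirms "good enough" (minimum 7 days, 30 days for recurring incidents).

Cron job structure:

- Schedule: every 24 hours (every N hours for high-severity incidents)
- Task: check the specific metric → compare to baseline → if degraded: restore + 5 Whys → report
- Report channel: sessions_send to your preferred channel (Signal, Telegram, Discord)
- Auto-escalation: if the same fix is needed 3+ days in a row → escalate prevention
- Termination: the user explicitly says "stop monitoring", or N consecutive days pass without an incident

See references/cron-template.md for the full cron job prompt template.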

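Phase 6: Document

Write to ~/.openclaw/learnings/rules.md if a Hard Rule should be added:

- Category: HR (Hard Rule, recurring) or SR (Soft Rule, first offense)
- Include: what triggered it, what the rule is, date learned, why it matters

Update MEMORY.md with an incident summary if the problem is systemic.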

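Configuration

No persistent configuration required. Adapt the following to your environment:

| Variable | Description | Example |
| --- | --- | --- |
| Remote host | SSH target for remote investigations | `<remote-host>` → your Titan/server hostname |
| Config backup path | | |

See references/cron-template.md for full cron report configuration.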

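Quick Diagnosis Checklists

See references/checklists.md for:

- Gateway crash checklist
- Binding loss checklist
- Config key disappeared checklist
- Agent routing wrong checklist
- Vector search not finding content checklist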
