Incident Replay Agent Failure Forensics

Post-mortem analysis for AI agent failures. Capture state, reconstruct timelines, identify root causes. When your agent breaks, know what happened, why, and how to prevent it.

When your agent breaks, you need to know what happened, why, and how to prevent it next time. Incident Replay captures workspace state at points in time, detects when things go wrong, reconstructs the sequence of events, and classifies root causes with actionable remediation steps.

The Problem

Your agent crashed overnight. Files are missing. The config looks wrong. The logs are a wall of text. What happened? When? Why?

Without forensics tooling, post-mortem analysis is manual detective work: diffing files by hand, grepping logs, guessing at causation. Incident Replay automates the mechanics so you can focus on understanding.

What It Does

1. Capture (`incident_capture.py`)

- Take point-in-time snapshots of your workspace (files, sizes, hashes, content)
- Configurable include/exclude patterns (track what matters, ignore noise)
- Automatic snapshot pruning (keep last N)
- Compare any two snapshots to see exactly what changed
- Trigger detection that automatically flags incidents based on:
  - Log patterns (tracebacks, errors, fatal messages)
  - File changes (unexpected deletions, config modifications)
  - Content patterns (secrets in output, constraint violations)
  - Empty output files
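To make the snapshot-and-compare idea concrete, here is a minimal standalone sketch. It is not the code in `incident_capture.py`: the function names `snapshot_tree` and `diff_trees` and the dictionary layout are assumptions chosen for illustration, and include/exclude patterns and pruning are left out.

```python
import hashlib
from pathlib import Path


def snapshot_tree(root):
    """Map each file's relative path to its size and SHA-256 digest."""
    root = Path(root)
    snapshot = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            data = path.read_bytes()
            snapshot[path.relative_to(root).as_posix()] = {
                "size": len(data),
                "sha256": hashlib.sha256(data).hexdigest(),
            }
    return snapshot


def diff_trees(before, after):
    """Classify paths as added, removed, or modified between two snapshots."""
    return {
        "added": sorted(set(after) - set(before)),
        "removed": sorted(set(before) - set(after)),
        "modified": sorted(
            p for p in set(before) & set(after)
            if before[p]["sha256"] != after[p]["sha256"]
        ),
    }
```

Comparing digests rather than timestamps means a file that was rewritten with identical content does not show up as modified.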

2. Replay (`incident_replay.py`)

- Build chronological timelines from snapshots, file changes, and triggers
- Extract decision chains from agent logs and memory files
- Heuristic root cause classification:
  - Config error: misconfiguration caused the failure
  - Data corruption: input data was malformed or missing
  - Drift: gradual workspace state degradation
  - External failure: API/network/filesystem dependency failed
  - Logic error: bug in agent logic or prompt
  - Resource exhaustion: ran out of memory, disk, tokens, or time
- Remediation suggestions tailored to each root cause category
- Incident database with persistent storage and pattern tracking
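One plausible way to do heuristic classification is keyword scoring over the collected evidence. The sketch below is an assumption about the general approach, not the logic in `incident_replay.py`: the `CATEGORY_HINTS` table, its patterns, and the `classify_root_cause` function are hypothetical, and only four of the six categories are covered to keep it short.

```python
import re

# Hypothetical keyword hints per category; the real tool reads its
# categories from ROOT_CAUSE_CATEGORIES in the config.
CATEGORY_HINTS = {
    "config_error": [r"invalid config", r"missing key", r"KeyError"],
    "data_corruption": [r"JSONDecodeError", r"malformed", r"unexpected EOF"],
    "external_failure": [r"ConnectionError", r"timed? ?out", r"HTTP 5\d\d"],
    "resource_exhaustion": [r"MemoryError", r"No space left", r"rate limit"],
}


def classify_root_cause(evidence):
    """Return (category, score) for the category whose hints match most lines."""
    scores = {name: 0 for name in CATEGORY_HINTS}
    for line in evidence:
        for name, patterns in CATEGORY_HINTS.items():
            if any(re.search(p, line, re.IGNORECASE) for p in patterns):
                scores[name] += 1
    best = max(scores, key=scores.get)
    return (best, scores[best]) if scores[best] > 0 else ("unknown", 0)
```

A scorer like this is only a first guess. Returning the score alongside the category lets a report show how much evidence backed the classification.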

3. Report (`incident_report.py`)

- Full incident reports with timeline, changes, triggers, and remediation
- Summary reports across all incidents with severity and root cause breakdowns
- Decision chain visualisation (what the agent decided and why)
- Export markdown or JSON
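As an illustration of the two export formats, the following sketch renders one incident either way. The incident is modelled as a plain dictionary; apart from `id` and `root_cause`, the field names (`title`, `timeline`, `remediation`) and the `render_report` function are assumptions, not the schema used by `incident_report.py`.

```python
import json


def render_report(incident, fmt="markdown"):
    """Render an incident dict as a JSON string or a small markdown document."""
    if fmt == "json":
        return json.dumps(incident, indent=2, sort_keys=True)
    if fmt != "markdown":
        raise ValueError(f"unsupported format: {fmt}")
    lines = [
        f"# {incident['id']}: {incident['title']}",
        "",
        f"Root cause: {incident['root_cause']}",
        "",
        "## Timeline",
    ]
    lines += [f"- {e['time']} {e['event']}" for e in incident["timeline"]]
    lines += ["", "## Remediation"]
    lines += [f"- {step}" for step in incident["remediation"]]
    return "\n".join(lines)
```

Keeping the JSON output a direct dump of the incident means it can be loaded back without loss, while the markdown form is meant for people.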

Quick Start

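```bash
# 1. Configure
cp config_example.json incident_config.json
# Edit the workspace root, triggers, and log patterns

# 2. Take a baseline snapshot
python3 incident_capture.py --config incident_config.json --snapshot --label baseline

# 3. ... agent does work, something breaks ...

# 4. Take a post-incident snapshot
python3 incident_capture.py --config incident_config.json --snapshot --label post-incident

# 5. See what changed
python3 incident_capture.py --config incident_config.json \
    --diff incident_data/snapshots/SNAP1.json incident_data/snapshots/SNAP2.json

# 6. Check triggers
python3 incident_capture.py --config incident_config.json \
    --triggers incident_data/snapshots/SNAP1.json incident_data/snapshots/SNAP2.json

# 7. Full analysis: create an incident with timeline, root cause, and remediation
python3 incident_replay.py --config incident_config.json \
    --analyze incident_data/snapshots/SNAP1.json incident_data/snapshots/SNAP2.json \
    --title "Agent crashed during deployment"

# 8. Generate an incident report
python3 incident_report.py --config incident_config.json --incident INC-0001

# 9. View all incidents and patterns
python3 incident_replay.py --config incident_config.json --incidents
python3 incident_replay.py --config incident_config.json --patterns
python3 incident_report.py --config incident_config.json --summary
```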

Programmatic Usage

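```python
from incident_capture import Capturer, Snapshot, load_config
from incident_replay import Analyzer

cfg = load_config("incident_config.json")
cap = Capturer(cfg)
analyzer = Analyzer(cfg)

# Take snapshots
before = cap.take_snapshot(label="before")
# ... agent runs ...
after = cap.take_snapshot(label="after")

# Analyze
changes = cap.diff_snapshots(before, after)
triggers = cap.check_triggers(before, after)
decisions = analyzer.extract_decisions(after)
timeline = analyzer.build_timeline(
    [before, after],
    triggers=[t.to_dict() for t in triggers],
    changes=changes,
)

# Create incident
incident = analyzer.create_incident(
    title="Agent failed during task X",
    timeline=timeline,
    triggers=[t.to_dict() for t in triggers],
    file_changes=changes,
    decisions=decisions,
)
print(f"Created {incident.id}: {incident.root_cause}")
```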

Use Cases

- Overnight failure analysis: Agent ran unattended and broke. What happened?
- Config change impact: Track exactly what changed after a config update
- Drift detection: Compare weekly snapshots to catch gradual degradation
- Secret leak detection: Catch credentials or sensitive data in agent outputs
- Regression forensics: Agent used to work, now it doesn't. Find the divergence point
- Team incident management: Track incidents over time, find recurring patterns
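For the secret leak use case, a content scan can be as simple as running a few regular expressions over each output line. This is a rough sketch under stated assumptions: the `SECRET_PATTERNS` table and `scan_for_secrets` function are hypothetical, and the patterns are illustrative rather than the ones shipped in the tool's config.

```python
import re

# Illustrative patterns only; tune these to the credentials you actually use.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_header": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    "assigned_secret": re.compile(
        r"(?i)\b(?:api[_-]?key|token|password)\s*[=:]\s*\S{8,}"
    ),
}


def scan_for_secrets(text):
    """Return (line_number, pattern_name) pairs for lines that look like secrets."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                hits.append((number, name))
    return hits
```

Reporting the line number and pattern name, rather than the matched text, avoids copying the secret into the incident record.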

What's Included

| File | Purpose |
| --- | --- |
| `incident_capture.py` | State snapshot and change detection |
| `incident_replay.py` | |

Requirements

- Python 3.8+
- No external dependencies (stdlib only)
- Works on any OS
- Platform-agnostic (works with any file-based AI agent workspace)

Configuration

See `config_example.json` for the complete reference. Key areas:

- WORKSPACE_ROOT: Directory to monitor
- INCLUDE/EXCLUDE_PATTERNS: Which files to capture
- TRIGGERS: Conditions that flag incidents (log patterns, file changes, content scans)
- ROOT_CAUSE_CATEGORIES: Classification categories with descriptions and remediation
- DECISION_MARKERS: Regex patterns to extract agent decisions from logs
- LOG_FILES: Which files to scan for decision chains
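To show how regex decision markers might pull a decision chain out of a log, here is a small sketch. The marker patterns, the named group `decision`, and the function signature are assumptions for illustration; the shipped `config_example.json` may define DECISION_MARKERS differently.

```python
import re

# Hypothetical marker patterns; each needs a named group called "decision".
DECISION_MARKERS = [
    r"DECISION:\s*(?P<decision>.+)",
    r"I (?:will|decided to)\s+(?P<decision>.+)",
]


def extract_decisions(log_text, markers=DECISION_MARKERS):
    """Return (line_number, decision_text) pairs in the order they appear."""
    compiled = [re.compile(m) for m in markers]
    decisions = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        for pattern in compiled:
            match = pattern.search(line)
            if match:
                decisions.append((number, match.group("decision").strip()))
                break  # first matching marker wins for this line
    return decisions
```

Keeping the line number with each decision makes it possible to place decisions on the same timeline as file changes and triggers.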

License

MIT — See LICENSE file.

⚠️ Security Note — Config File

Configuration is loaded from a JSON file. This is safe to share — no code execution.

- Config path is validated for existence and size (1MB cap) before loading
- Must be a .json file; raises ValueError if given a non-JSON path
- Keep your config under version control; it defines what triggers are watched and what's protected
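The checks listed above could look roughly like the sketch below. Only the `.json` requirement, the `ValueError` for non-JSON paths, and the 1MB cap come from this README; the function name `load_config_checked`, the order of the checks, and the exceptions raised for missing or oversized files are assumptions.

```python
import json
from pathlib import Path

MAX_CONFIG_BYTES = 1024 * 1024  # the 1MB cap described above


def load_config_checked(path):
    """Load a JSON config after checking extension, existence, and size."""
    path = Path(path)
    if path.suffix.lower() != ".json":
        raise ValueError(f"config must be a .json file: {path}")
    if not path.is_file():
        raise FileNotFoundError(f"config not found: {path}")
    if path.stat().st_size > MAX_CONFIG_BYTES:
        raise ValueError(f"config exceeds {MAX_CONFIG_BYTES} bytes: {path}")
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)
```

Because the file is parsed with `json.load`, its contents are treated as data only and nothing in it is executed.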

⚠️ Disclaimer

This software is provided "AS IS", without warranty of any kind, express or implied.

USE AT YOUR OWN RISK.

- The author(s) are NOT liable for any damages, losses, or consequences arising from the use or misuse of this software, including but not limited to financial loss, data loss, security breaches, business interruption, or any indirect/consequential damages.
- This software does NOT constitute financial, legal, trading, or professional advice.
- Users are solely responsible for evaluating whether this software is suitable for their use case, environment, and risk tolerance.
- No guarantee is made regarding accuracy, reliability, completeness, or fitness for any particular purpose.
- The author(s) are not responsible for how third parties use, modify, or distribute this software after purchase.

By downloading, installing, or using this software, you acknowledge that you have read
this disclaimer and agree to use the software entirely at your own risk.

DATA DISCLAIMER: This software processes and stores data locally on your system.
The author(s) are not responsible for data loss, corruption, or unauthorized access
resulting from software bugs, system failures, or user error. Always maintain
independent backups of important data. This software does not transmit data externally
unless explicitly configured by the user.

Support & Links


🐛 Bug Reports	TheShadowyRose@proton.me
☕ Ko-fi

Built with OpenClaw — thank you for making this possible.

🛠️ Need something custom? Custom OpenClaw agents & skills starting at $500. If you can describe it, I can build it. → Hire me on Fiverr
