restart-recovery (Restart Recovery)

Make OpenClaw agent workflows restart-safe using checkpoint files, idempotent step tracking, wake/resume handoff, and stale-checkpoint monitoring. Use when users ask to recover from restarts, preserve progress across updates/config restarts, or implement checkpoint → restart → wake → resume patterns.

Author: admin | Source: ClawHub

Restart Recovery

Implement restart-safe execution with this sequence:

1. checkpoint
2. restart
3. wake
4. resume from file

Use bundled scripts

- Use scripts/checkpoint_tool.py for the deterministic checkpoint lifecycle: start, update, resume, complete, list.

- Use scripts/checkpoint_selfcheck.py for stale unfinished checkpoint alerts without LLM/tool-token usage.

Required operating rules

- Write checkpoints before any restart-prone operation (config patch/apply, update, service restart, long multi-step jobs).
- Use atomic file writes (.tmp then rename).
- Track completed and remaining steps explicitly.
- Include an idempotency key per workflow to avoid duplicate side effects after resume.
- Never write secrets/tokens to checkpoint files.
- Acquire a resume lock before continuing unfinished work.
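
The atomic-write and idempotency-key rules above can be sketched as follows. This is a minimal illustration, not the bundled checkpoint_tool.py: the function names (write_checkpoint, new_checkpoint) and the record fields (idempotency_key, status, completed, remaining) are assumptions chosen for the example.

```python
import json
import os
import tempfile
import uuid
from pathlib import Path


def write_checkpoint(path: Path, record: dict) -> None:
    """Write record to path atomically: temp file in the same directory, then rename."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file must live on the same filesystem as the target for the
    # rename to be atomic, so it is created next to the checkpoint.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(record, tmp, indent=2)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_name, path)  # readers see the old file or the new one
    except BaseException:
        # Do not leave a half-written temp file behind on failure.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def new_checkpoint(workflow: str, steps: list[str]) -> dict:
    """Build a fresh record (hypothetical schema) with explicit step tracking."""
    return {
        "workflow": workflow,
        "idempotency_key": uuid.uuid4().hex,  # one key per workflow run
        "status": "in_progress",
        "completed": [],
        "remaining": list(steps),
    }
```

The record deliberately holds only step names and a random key, which keeps secrets and tokens out of the file.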

Recommended checkpoint location

- Per agent: INLINECODE8
- Shared/default workspace flows: memory/checkpoints/*.json at workspace root

Startup instruction to add in AGENTS.md

Add this exact section:

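Restart-safe workflow rule

On startup, check memory/checkpoints/*.json for unfinished workflows. If any are found, acquire the resume lock, validate the checkpoint schema/hash, and continue from the last completed idempotent step.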
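
A startup check along these lines could scan the shared checkpoint directory, skip finished or corrupted records, and take a resume lock before handing work back. This is a hedged sketch only: the sha256 field, the .lock sidecar file, and all function names are assumptions for illustration, not the behavior of checkpoint_tool.py resume.

```python
import hashlib
import json
import os
from pathlib import Path

CHECKPOINT_DIR = Path("memory/checkpoints")


def payload_hash(record: dict) -> str:
    """SHA-256 over the record minus its own (hypothetical) sha256 field."""
    body = {k: v for k, v in record.items() if k != "sha256"}
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def acquire_resume_lock(checkpoint: Path) -> bool:
    """Create a sidecar lock file exclusively; False means another resumer holds it."""
    lock = checkpoint.with_suffix(".lock")
    try:
        fd = os.open(lock, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as fh:
        fh.write(str(os.getpid()))  # record the owner for later inspection
    return True


def release_resume_lock(checkpoint: Path) -> None:
    checkpoint.with_suffix(".lock").unlink(missing_ok=True)


def find_resumable(directory: Path = CHECKPOINT_DIR) -> list[tuple[Path, dict]]:
    """Return (path, record) pairs that are unfinished, intact, and now locked by us."""
    resumable = []
    for path in sorted(directory.glob("*.json")):
        try:
            record = json.loads(path.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError):
            continue  # unreadable file: leave it for manual inspection
        if not isinstance(record, dict) or record.get("status") == "complete":
            continue
        if record.get("sha256") != payload_hash(record):
            continue  # hash mismatch: do not resume from a damaged record
        if acquire_resume_lock(path):
            resumable.append((path, record))
    return resumable
```

A process that dies while holding the lock leaves the .lock file behind, so a real implementation would also need a rule for expiring abandoned locks.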

No-LLM stale checkpoint monitor

Use host scheduler (launchd/systemd/cron), not LLM cron jobs.

- Run every 10 minutes.
- Alert only when unfinished checkpoints are older than threshold.
- Log to local file for audit.

Suggested execution flow

1. checkpoint_tool.py start before the risky step.
2. Perform the step.
3. checkpoint_tool.py update --complete <step> --step <next step>.
4. If a restart happens, wake the session/process.
5. On startup/re-entry, checkpoint_tool.py resume and continue.
6. checkpoint_tool.py complete when done.
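
The same start, update, resume, complete loop can be expressed in-process. The sketch below does not call the bundled script; run_workflow, save, and the record layout are hypothetical names used only to show how progress survives an interruption.

```python
import json
import os
from pathlib import Path


def save(path: Path, record: dict) -> None:
    """Atomic write: .tmp file first, then rename over the checkpoint."""
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(record), encoding="utf-8")
    os.replace(tmp, path)


def run_workflow(path: Path, steps: list[str], actions: dict) -> dict:
    """Run steps in order, checkpointing after each one.

    steps is the ordered list of step names; actions maps each name to a
    zero-argument callable. If a checkpoint already exists at path, the
    run resumes from its remaining steps instead of starting over.
    """
    if path.exists():
        # resume: pick up where the previous process stopped
        record = json.loads(path.read_text(encoding="utf-8"))
    else:
        # start: write the checkpoint before the first risky step
        record = {"status": "in_progress", "completed": [], "remaining": list(steps)}
        save(path, record)

    while record["remaining"]:
        step = record["remaining"][0]
        actions[step]()  # perform the step
        record["completed"].append(step)  # update: move it to completed
        record["remaining"].pop(0)
        save(path, record)

    record["status"] = "complete"  # complete: mark the workflow finished
    save(path, record)
    return record
```

If the process dies inside a step, calling run_workflow again with the same path skips everything already listed under completed. A crash after an action finishes but before save runs would repeat that one action on resume, which is the gap the per-workflow idempotency key is meant to close.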

Validation checklist

- Simulate mid-work restart and verify resume from last completed step.
- Confirm idempotency (no duplicate sends/writes/actions).
- Confirm stale-check script only alerts after threshold.
- Confirm old checkpoint cleanup policy (expiry).
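
One way to make the idempotency check concrete is a small ledger keyed by idempotency key and step name, so a replayed step can be detected and skipped. The class below is an illustrative assumption (SideEffectLedger and its file format are not part of the bundled scripts).

```python
import json
import os
from pathlib import Path


class SideEffectLedger:
    """Remembers which (idempotency key, step) pairs already produced their side effect."""

    def __init__(self, path: Path):
        self.path = path
        if path.exists():
            self.done = set(json.loads(path.read_text(encoding="utf-8")))
        else:
            self.done = set()

    def run_once(self, idempotency_key: str, step: str, action) -> bool:
        """Run action unless this key/step pair is already recorded.

        Returns True if the action ran, False if it was skipped as a duplicate.
        """
        token = f"{idempotency_key}:{step}"
        if token in self.done:
            return False
        action()
        self.done.add(token)
        # Persist atomically so the record survives the next restart.
        tmp = self.path.with_suffix(".tmp")
        tmp.write_text(json.dumps(sorted(self.done)), encoding="utf-8")
        os.replace(tmp, self.path)
        return True
```

A validation run can execute a step, reload the ledger from disk to imitate a restart, execute the same step again, and confirm the side effect happened exactly once. Where the receiving system accepts a deduplication token, passing the same key/step string to it covers a crash between the action and the ledger write.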
