Seamless Restart
Zero-downtime gateway restart protocol with automatic context recovery. Prevents the
common problem where agents lose context and go silent after gateway restarts.
Why This Exists
Gateway restarts (config changes, updates, manual restarts) cause context loss because
a new API session starts with no conversation history. Without this protocol, the agent
wakes up with no memory of what it was doing and no way to proactively notify the user.
The Protocol
Every gateway restart MUST follow these three steps in order:
Step 1: Save State to NOW.md
Before restarting, update NOW.md in the workspace root with:
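    NOW.md - Current State Snapshot
    Last Updated
    - Time: [current timestamp]
    - Session: [channel/chat the user is in]
    - Status: [what is happening]
    Active Tasks
    Recent Context
    Post-Restart Action
    - [specific actions to take after restart, e.g. notify the user, resume a task]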
Keep it concise. This file is read on every session start, so avoid bloat.
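As an illustration of Step 1, here is a minimal Python sketch that renders a compact snapshot. The function name `render_now_md`, its parameters, and the exact section labels are assumptions made for this example, not part of the gateway or of any required NOW.md schema.

```python
from datetime import datetime, timezone
from typing import List, Optional


def render_now_md(session: str, status: str, active_tasks: List[str],
                  post_restart_actions: List[str],
                  now: Optional[datetime] = None) -> str:
    """Render a compact NOW.md snapshot (hypothetical helper, assumed layout)."""
    # Expect a timezone-aware datetime; normalise to UTC before formatting.
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)

    def bullets(items: List[str]) -> List[str]:
        # An empty section is written as "- none" so it is never ambiguous.
        return [f"- {item}" for item in items] or ["- none"]

    lines = [
        "NOW.md - Current State Snapshot",
        "Last Updated",
        f"- Time: {now.strftime('%Y-%m-%dT%H:%M:%SZ')}",
        f"- Session: {session}",
        f"- Status: {status}",
        "Active Tasks",
        *bullets(active_tasks),
        "Post-Restart Action",
        *bullets(post_restart_actions),
    ]
    return "\n".join(lines) + "\n"
```

Writing the result to NOW.md in the workspace root is left to the caller; keeping the renderer pure makes it easy to check that the snapshot stays short.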
Step 2: Notify + Schedule Recovery Cron
Send a pre-restart notification to the user, then schedule a one-shot cron job
that fires about 1 minute later, after the gateway is back up, to trigger recovery:
Send notification:
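    message(action=send, channel=<current channel>, target=<current channel ID>,
            message="⚡ Restarting gateway, back in ~1 minute...")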
Schedule recovery cron:
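    cron(action=add, job={
      name: "restart-recovery",
      schedule: {kind: "at", at: "<1 minute from now, ISO-8601 UTC>"},
      payload: {
        kind: "systemEvent",
        text: "Restart recovery: you just restarted. Read NOW.md immediately.
               Then notify the user that you are back and summarize what you were doing.
               Send the notification to the same channel the user was in before the restart."
      },
      sessionTarget: "main",
      enabled: true
    })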
The cron job is automatically deleted after it fires (one-shot).
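The one-shot schedule needs an ISO-8601 UTC timestamp roughly one minute in the future. A minimal Python sketch of building such a job payload follows. The helper name `build_recovery_job` and the shortened payload text are assumptions for illustration, and the dictionary only mirrors the field names used in this document rather than a verified gateway schema.

```python
from datetime import datetime, timedelta, timezone


def build_recovery_job(now: datetime, delay_seconds: int = 60) -> dict:
    """Build a one-shot recovery job dict (hypothetical helper).

    `now` must be timezone-aware; it is converted to UTC before the delay
    is added so the timestamp is unambiguous after the restart.
    """
    fire_at = now.astimezone(timezone.utc) + timedelta(seconds=delay_seconds)
    return {
        "name": "restart-recovery",
        "schedule": {"kind": "at", "at": fire_at.strftime("%Y-%m-%dT%H:%M:%SZ")},
        "payload": {
            "kind": "systemEvent",
            # Assumed wording; adapt to the recovery prompt you actually use.
            "text": "Restart recovery: read NOW.md, notify the user, resume tasks.",
        },
        "sessionTarget": "main",
        "enabled": True,
    }
```

Computing the timestamp in UTC avoids the recovery job firing early or late when the host's local timezone differs from the gateway's.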
Step 3: Execute Restart
Now restart the gateway:
    gateway(action=restart, note=<human-readable reason>)
Or for config changes:
    gateway(action=config.patch, raw=<config>, note=<reason>)
Both trigger a SIGUSR1 restart.
Post-Restart (Automatic)
When the recovery cron fires after restart:
1. Read NOW.md to restore context
2. Send a recovery notification to the user confirming the restart completed
3. Resume any active tasks listed in NOW.md
4. Clear the Post-Restart Action section of NOW.md (set it to "none")
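The last recovery step, resetting the Post-Restart Action section to "none", can be sketched as a small text transformation. This is a minimal Python sketch that assumes the section is a heading line followed by bullet lines; the function name `clear_post_restart_action` is hypothetical.

```python
def clear_post_restart_action(now_md: str,
                              heading: str = "Post-Restart Action") -> str:
    """Replace the bullets under `heading` with a single "- none" bullet."""
    out = []
    in_section = False
    for line in now_md.splitlines():
        # Accept the heading with or without leading markdown '#' markers.
        if line.strip().lstrip("# ").strip() == heading:
            in_section = True
            out.append(line)
            out.append("- none")
            continue
        if in_section:
            if line.lstrip().startswith("-"):
                continue  # drop the old action bullets
            if line.strip():
                in_section = False  # first non-bullet text ends the section
        out.append(line)
    return "\n".join(out) + "\n"
```

Clearing the section after recovery keeps a later session from repeating the same notification when it reads NOW.md.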
Channel-Specific Notification
Adapt the notification target based on where the user was chatting:
| Channel | Notification Method |
|---|---|
| Discord | message(action=send, channel=discord, target=<channelId>, guildId=<guildId>) |
| Telegram | message(action=send, channel=telegram, target=<chatId>) |
| Other | Use the channel and target from the pre-restart session |
Always include the channel and target in NOW.md so the recovery cron knows where to send the notification.
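The channel-specific arguments can be assembled by one small helper. This is a minimal Python sketch under the assumption that only Discord takes an extra `guildId`; the function name `notification_args` is hypothetical and the returned dictionary simply mirrors the argument names used in this document.

```python
from typing import Dict, Optional


def notification_args(channel: str, target: str, text: str,
                      guild_id: Optional[str] = None) -> Dict[str, str]:
    """Build the arguments for a send-message call (hypothetical helper)."""
    args = {"action": "send", "channel": channel, "target": target, "message": text}
    # Assumption: Discord is the only channel that carries a guild ID.
    if channel == "discord" and guild_id:
        args["guildId"] = guild_id
    return args
```

Storing the same `channel`, `target`, and (for Discord) guild ID in NOW.md lets the recovery step rebuild these arguments without any conversation history.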
Edge Cases
Multiple restarts in quick succession:
If another restart is needed before the recovery cron fires, cancel the old cron
and create a new one. Only one recovery cron should exist at a time.
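The "only one recovery cron" rule amounts to replacing any job with the same name. A minimal Python sketch over an in-memory job list follows; the function name `replace_recovery_job` is hypothetical, and how jobs are actually listed or cancelled on the gateway is not shown here.

```python
from typing import List


def replace_recovery_job(jobs: List[dict], new_job: dict,
                         name: str = "restart-recovery") -> List[dict]:
    """Return the job list with every old recovery job replaced by `new_job`."""
    kept = [job for job in jobs if job.get("name") != name]
    return kept + [new_job]
```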
Restart during sub-agent tasks:
Note any running sub-agents in NOW.md. After restart, sub-agents that were in progress
will have been terminated. The user should be informed which tasks were interrupted.
Unexpected restarts (crashes):
This protocol only covers intentional restarts. For crash recovery, the heartbeat
mechanism is the fallback: if the agent misses heartbeats, it should check NOW.md
on the next activation.
Integration with AGENTS.md
Add this to your User Rules or Operations section:
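    - Gateway restart protocol: every gateway restart MUST use the Seamless Restart skill. Three steps: (1) update NOW.md, (2) notify + schedule the recovery cron, (3) restart. Do not restart before steps 1 and 2 are complete.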
Example: Full Restart Flow
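    1. Agent needs to restart for a config change
    2. Agent updates NOW.md:
       Status: Applying new Gemini API key. Session: Discord #misc.
       Post-restart: Notify Zihao in Discord #misc that the config was applied.
    3. Agent sends: ⚡ Applying config change. Restarting, back in ~1 minute...
    4. Agent creates a one-shot cron at T+60s:
       systemEvent → Restart recovery: read NOW.md, notify the user, resume tasks.
    5. Agent calls: gateway(action=config.patch, raw={...}, note=new API key)
    6. Gateway restarts. About 60 seconds later, the cron fires.
    7. Agent reads NOW.md and sends: ✅ Back online. Config change applied successfully.
    8. Agent clears the Post-Restart Action in NOW.md.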
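The ordering requirement (save state, notify, schedule recovery, and only then restart) can be enforced by a small orchestrator. This is a minimal Python sketch in which the four callables are hypothetical stand-ins for the real tool calls; if any preparatory step raises, the restart is never issued.

```python
from typing import Callable, List


def seamless_restart(save_state: Callable[[], None],
                     notify: Callable[[], None],
                     schedule_recovery: Callable[[], None],
                     restart: Callable[[], None]) -> List[str]:
    """Run the protocol steps in order and return the names of completed steps."""
    completed: List[str] = []
    preparation = (
        ("save_state", save_state),
        ("notify", notify),
        ("schedule_recovery", schedule_recovery),
    )
    for name, step in preparation:
        step()  # an exception here aborts the flow before the restart
        completed.append(name)
    restart()
    completed.append("restart")
    return completed
```

Passing the steps in as callables keeps the sketch independent of how the gateway, cron, and messaging tools are actually invoked.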
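Seamless Restart
Zero-downtime gateway restart protocol with automatic context recovery. Prevents the common problem where agents lose context and go silent after gateway restarts.
Why This Protocol Exists
Gateway restarts (config changes, updates, manual restarts) cause context loss because a new API session starts with no conversation history. Without this protocol, the agent wakes up with no memory of what it was doing and no way to proactively notify the user.
The Protocol
Every gateway restart MUST follow these three steps in order: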
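Step 1: Save State to NOW.md
Before restarting, update NOW.md in the workspace root:
    NOW.md - Current State Snapshot
    Last Updated
    - Time: [current timestamp]
    - Session: [channel/chat the user is in]
    - Status: [what is happening]
    Active Tasks
    Recent Context
    Post-Restart Action
    - [specific actions to take after restart, e.g. notify the user, resume a task]
Keep it concise. This file is read on every session start, so avoid bloat.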
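Step 2: Notify + Schedule Recovery Cron
Send the user a pre-restart notification, then schedule a one-shot cron job that fires about 1 minute later, after the gateway is back up, to trigger recovery:
Send notification:
    message(action=send, channel=<current channel>, target=<current channel ID>,
            message="⚡ Restarting gateway, back in ~1 minute...")
Schedule recovery cron:
    cron(action=add, job={
      name: "restart-recovery",
      schedule: {kind: "at", at: "<1 minute from now, ISO-8601 UTC>"},
      payload: {
        kind: "systemEvent",
        text: "Restart recovery: you just restarted. Read NOW.md immediately.
               Then notify the user that you are back and summarize what you were doing.
               Send the notification to the same channel the user was in before the restart."
      },
      sessionTarget: "main",
      enabled: true
    })
The cron job is automatically deleted after it fires (one-shot).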
Step 3: Execute Restart
Now restart the gateway:
    gateway(action=restart, note=<human-readable reason>)
Or for config changes:
    gateway(action=config.patch, raw=<config>, note=<reason>)
Both trigger a SIGUSR1 restart.
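Post-Restart (Automatic)
When the recovery cron fires after restart:
1. Read NOW.md to restore context
2. Send a recovery notification to the user confirming the restart completed
3. Resume any active tasks listed in NOW.md
4. Clear the Post-Restart Action section of NOW.md (set it to "none")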
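Channel-Specific Notification
Adapt the notification target based on where the user was chatting:
| Channel | Notification Method |
|---|---|
| Discord | message(action=send, channel=discord, target=<channelId>, guildId=<guildId>) |
| Telegram | message(action=send, channel=telegram, target=<chatId>) |
| Other | Use the channel and target from the pre-restart session |
Always include the channel and target in NOW.md so the recovery cron knows where to send the notification.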
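Edge Cases
Multiple restarts in quick succession:
If another restart is needed before the recovery cron fires, cancel the old cron and create a new one. Only one recovery cron should exist at a time.
Restart during sub-agent tasks:
Note any running sub-agents in NOW.md. After restart, sub-agents that were in progress will have been terminated. The user should be informed which tasks were interrupted.
Unexpected restarts (crashes):
This protocol only covers intentional restarts. For crash recovery, the heartbeat mechanism is the fallback: if the agent misses heartbeats, it should check NOW.md on the next activation.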
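Integration with AGENTS.md
Add this to your User Rules or Operations section:
    - Gateway restart protocol: every gateway restart MUST use the Seamless Restart skill. Three steps: (1) update NOW.md, (2) notify + schedule the recovery cron, (3) restart. Do not restart before steps 1 and 2 are complete.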
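Example: Full Restart Flow
    1. Agent needs to restart for a config change
    2. Agent updates NOW.md:
       Status: Applying new Gemini API key. Session: Discord #misc.
       Post-restart: Notify Zihao in Discord #misc that the config was applied.
    3. Agent sends: ⚡ Applying config change. Restarting, back in ~1 minute...
    4. Agent creates a one-shot cron at T+60s:
       systemEvent → Restart recovery: read NOW.md, notify the user, resume tasks.
    5. Agent calls: gateway(action=config.patch, raw={...}, note=new API key)
    6. Gateway restarts. About 60 seconds later, the cron fires.
    7. Agent reads NOW.md and sends: ✅ Back online. Config change applied successfully.
    8. Agent clears the Post-Restart Action in NOW.md.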