tonic-system-deploy

Software Deployment Workflow — Dual-Environment (UAT + PROD)

Background & Design Rationale

This skill was designed for systems where:

- Two live environments co-exist: UAT (testing/staging) and PROD (production)
- Versions can diverge: UAT may be ahead of PROD by several releases
- Deployments are nightly: automated pipelines run at scheduled times
- Human approval is mandatory: no code goes to PROD without explicit admin sign-off
- Bugs require structured triage: severity, origin environment, and version state all affect the deploy path

The key insight: choosing the wrong deploy flow when versions are mismatched can introduce regressions. Flow 1 assumes parity; Flow 2 handles divergence safely.

Prerequisites — Before Choosing a Flow

Step 0: Version Check (always do this first)

| Question | Answer |
|---|---|
| Are UAT and PROD on the same version? | → Flow 1 |
| Is UAT ahead of PROD by any version? | → Flow 2 |

Version Mismatch Decision Tree

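Bug found
│
├─ Severity = critical/high?
│   └─ Yes → Emergency hotfix (bypass pipeline)
│
├─ UAT version == PROD version?
│   └─ Yes → Flow 1
│
└─ UAT version > PROD version?
    └─ Yes → Flow 2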
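The version check and the severity rule can be combined into one small routing function. The sketch below is illustrative only: the names `DeployPath`, `parse_version` and `choose_deploy_path` are hypothetical, versions are assumed to be dotted integers such as `1.4.2`, and the case where PROD is ahead of UAT is rejected because this workflow does not define a path for it.

```python
from enum import Enum


class DeployPath(Enum):
    EMERGENCY_HOTFIX = "emergency_hotfix"
    FLOW_1 = "flow1"
    FLOW_2 = "flow2"


def parse_version(text: str) -> tuple:
    """Turn '1.10.0' into (1, 10, 0) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def choose_deploy_path(severity: str, uat_version: str, prod_version: str) -> DeployPath:
    # High/critical bugs never wait for the scheduled pipeline.
    if severity.lower() in {"critical", "high"}:
        return DeployPath.EMERGENCY_HOTFIX
    uat, prod = parse_version(uat_version), parse_version(prod_version)
    if uat == prod:
        return DeployPath.FLOW_1
    if uat > prod:
        return DeployPath.FLOW_2
    # Assumption: PROD ahead of UAT is outside this workflow.
    raise ValueError("PROD is ahead of UAT; no deploy path is defined for this case")
```

Comparing parsed tuples rather than raw strings matters once a component reaches two digits: as strings, `"1.10.0"` sorts before `"1.9.3"`.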

Flow 1 — UAT-First (Versions Aligned)

Scenario: Bug found in UAT or PROD when both environments run the same version.
Goal: Fix → validate in UAT → promote to PROD.
Result: UAT and PROD converge to same patched version.

Timeline

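Bug reported
│
│  🧑 Human: Admin reviews and confirms the bug (status: confirmed)
▼
[confirmed] (low/medium severity only)
│
│  🤖 System: scheduled deploy time T1 (e.g. 20:00)
│     - AI analysis of root cause; fix plan recorded
│     - Fix deployed to the UAT environment
│     - Status → deployed_uat
▼
[deployed_uat]
│  📲 Telegram: Fix deployed to UAT. Please validate.
│
│  🧑 Human: Admin logs in to UAT and validates the fix
│     - Runs the affected workflow
│     - Confirms no regression
│     - Clicks "Approve PROD Deploy" → status: pending_prod
▼
[pending_prod]
│  📲 Telegram: Queued for T2 deploy to PROD.
│
│  🤖 System: scheduled deploy time T2 (e.g. 22:00)
│     - Fix deployed to the PROD environment
│     - Status → deployed_prod
▼
[deployed_prod] ✅ Flow 1 complete
│  📲 Telegram: Deployed to PROD. Flow 1 complete.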

Human Checkpoints (Flow 1)

| Checkpoint | Who | Action | Gate Condition |
|---|---|---|---|
| Confirm bug | Admin/Manager | Mark as confirmed | Bug is reproducible and valid |
| UAT validation | Admin/Manager | Click "Approve PROD Deploy" | Fix works, no regression in UAT |

Automation Nodes (Flow 1)

| Time | Node | Input Status | Output Status | Action |
|---|---|---|---|---|
| T1 | Phase 1 | confirmed/planned | deployed_uat | AI analysis + UAT deploy |
| T2 | Phase 2 | pending_prod | deployed_prod | PROD deploy |
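The two automation nodes behave like a lookup from (slot, input status) to output status. Here is a minimal sketch of that idea, assuming bugs are plain dictionaries with a `status` key and that status names use underscores (`deployed_uat`, `pending_prod`, `deployed_prod`). `FLOW1_TRANSITIONS` and `run_slot` are hypothetical names, not part of any existing pipeline.

```python
# (slot, input status) -> output status for Flow 1, as listed in the table above.
FLOW1_TRANSITIONS = {
    ("T1", "confirmed"): "deployed_uat",
    ("T1", "planned"): "deployed_uat",
    ("T2", "pending_prod"): "deployed_prod",
}


def run_slot(slot, bugs):
    """Return new bug dicts with statuses advanced for the given slot.

    Bugs whose status has no transition in this slot are returned unchanged.
    """
    updated = []
    for bug in bugs:
        new_status = FLOW1_TRANSITIONS.get((slot, bug["status"]), bug["status"])
        updated.append({**bug, "status": new_status})
    return updated
```

There is deliberately no automated transition out of `deployed_uat`: that step belongs to the human checkpoint, so a bug that nobody has approved is never picked up by T2.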

Flow 2 — PROD-First (Versions Misaligned)

Scenario: Bug found in PROD when UAT is ahead by one or more versions.
Why not Flow 1? Validating a PROD fix in a newer UAT environment risks false confidence — the fix may behave differently on the older PROD codebase.
Goal: Fix PROD directly → validate in PROD → cherry-pick back to UAT.
Result: PROD gets the fix immediately; UAT gets it merged back after PROD validation.

Timeline

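Bug found in PROD (UAT is ahead)
│
│  🧑 Human: Admin reviews and confirms the bug
│     - Selects: found-in environment = prod, fix flow = flow2
│     - Status → confirmed
▼
[confirmed]
│
│  🤖 System: scheduled deploy time T1 (e.g. 20:00)
│     - AI analysis of root cause; fix plan recorded
│     - UAT is skipped entirely
│     - Queued for PROD deploy → status: pending_prod
▼
[pending_prod]
│  📲 Telegram: PROD deploy queued for T2 (Flow 2).
│
│  🤖 System: scheduled deploy time T2 (e.g. 22:00)
│     - Fix deployed to PROD
│     - Status → deployed_prod
▼
[deployed_prod]
│  📲 Telegram: Deployed to PROD. Please validate PROD. Approve the UAT merge when ready.
│
│  🧑 Human: Admin validates the fix in PROD
│     - Confirms the fix works on production data/config
│     - No regression in PROD workflows
│     - Clicks "Approve Merge UAT" → status: pending UAT merge
▼
[pending UAT merge]
│  📲 Telegram: UAT merge queued for tonight's T2.
│
│  🤖 System: next T2 cycle (22:00)
│     - Fix deployed/merged to the UAT environment
│     - Status → merged to UAT
▼
[merged to UAT] ✅ Flow 2 complete
│  📲 Telegram: Merged to UAT. Flow 2 complete.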

Human Checkpoints (Flow 2)

| Checkpoint | Who | Action | Gate Condition |
|---|---|---|---|
| Confirm bug | Admin/Manager | Mark as confirmed + select flow2 | Bug confirmed in PROD, version mismatch verified |
| PROD validation | Admin/Manager | Click "Approve Merge UAT" | Fix verified in PROD, no regression |

Automation Nodes (Flow 2)

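| Time | Node | Input Status | Output Status | Action |
|---|---|---|---|---|
| T1 | Phase 1 | confirmed/planned | pending_prod | AI analysis (skip UAT) |
| T2 | Phase 2 | pending_prod | deployed_prod | PROD deploy |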

Flow 2 Important Note

T2 deadline matters. If the admin approves the UAT merge before T2, the merge runs that same night. If approval comes after T2, the merge waits for the following night's T2. Always communicate the cutoff time to the team.
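The cutoff rule is simple date arithmetic. The sketch below assumes T2 is 22:00 (the example time used in this document), that timestamps are naive local datetimes, and that an approval landing exactly at T2 counts as having missed the cutoff. `next_t2_run` is a hypothetical helper name.

```python
from datetime import datetime, time, timedelta


def next_t2_run(approved_at: datetime, t2: time = time(22, 0)) -> datetime:
    """Return the T2 run that will pick up an approval made at `approved_at`."""
    todays_run = datetime.combine(approved_at.date(), t2)
    if approved_at < todays_run:
        return todays_run  # approved before the cutoff: runs tonight
    return todays_run + timedelta(days=1)  # missed the cutoff: runs tomorrow night
```

A value like this could be included in the approval notification so the team can see which night the merge will actually run.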

Status Reference

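| Status | Flow | Colour | Meaning | Next Action |
|---|---|---|---|---|
| INLINECODE0 | Both | Grey | Bug reported, awaiting review | Admin confirms/rejects |
| confirmed | Both | Blue | Valid bug, enters pipeline | T1 processes automatically |
| analyzing | Both | Purple | AI analysis running (transient state) | Automatic → planned |
| planned | Both | Indigo | AI fix plan recorded | T1 deploys automatically |
| deployed_uat | Flow 1 | Cyan | Deployed to UAT, awaiting human validation | Admin approves PROD |
| pending_prod | Both | Yellow | Queued for the next T2 | T2 deploys to PROD automatically |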

Severity Rules

| Severity | Pipeline Eligible? | Notes |
|---|---|---|
| INLINECODE11 | ✅ Yes | Both flows |
INLINECODE12

Never let high/critical bugs wait for a scheduled pipeline. Treat them as emergency hotfixes.

Emergency Hotfix (Bypass Pipeline)

Scenario: Critical or high severity bug in PROD. Cannot wait for scheduled T1/T2.

Process

CODEBLOCK3

Checklist for Emergency Hotfix

- [ ] Severity confirmed as critical/high before bypassing pipeline
- [ ] At least one other team member notified before deploy
- [ ] Fix deployed and validated within agreed SLA (e.g. P1: 1 hour, P2: 4 hours)
- [ ] Post-deploy smoke test completed (login, core workflow, affected feature)
- [ ] Bug status updated manually in system
- [ ] Telegram/Slack notification sent to stakeholders
- [ ] Post-incident note added to bug record (root cause, fix summary)
- [ ] UAT updated (cherry-pick or re-sync if needed)
- [ ] Incident review scheduled (within 48h for P1)
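The SLA and review items in this checklist translate into concrete deadlines. The sketch below uses the example figures from the checklist (P1: 1 hour, P2: 4 hours, incident review within 48h for P1) and assumes the clock starts when the bug is reported. That starting point is an assumption, since the checklist does not say. `FIX_SLA`, `REVIEW_WINDOW` and `hotfix_deadlines` are hypothetical names.

```python
from datetime import datetime, timedelta

# Example SLA values from the checklist; adjust per project.
FIX_SLA = {"P1": timedelta(hours=1), "P2": timedelta(hours=4)}
REVIEW_WINDOW = {"P1": timedelta(hours=48)}


def hotfix_deadlines(priority: str, reported_at: datetime) -> dict:
    """Return the fix deadline and, where one applies, the incident review deadline."""
    deadlines = {"fix_by": reported_at + FIX_SLA[priority]}
    if priority in REVIEW_WINDOW:
        deadlines["review_by"] = reported_at + REVIEW_WINDOW[priority]
    return deadlines
```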

Rollback Procedure

Scenario: A deploy (T1 or T2) introduces a regression or new failure.

Decision: When to Rollback

CODEBLOCK4

Rollback Process

CODEBLOCK5

Rollback Checklist

- [ ] Previous working commit/tag identified (git log)
- [ ] Rollback scope defined (frontend / backend / both / DB)
- [ ] Affected bug statuses reverted in system
- [ ] Smoke test completed after rollback
- [ ] Root cause of regression documented
- [ ] Team + stakeholders notified
- [ ] Fix plan for the reverted change recorded

Scheduled Deploy Times (Reference Only)

⚠️ These times are project-specific. Adapt per project SLA and business hours.

| Slot | Name | Phase | Typical Window |
|---|---|---|---|
| T1 | UAT/PROD-queue Deploy | Phase 1 | Off-peak evening (e.g. 20:00) |
| T2 | PROD/UAT-merge Deploy | Phase 2 | Late evening (e.g. 22:00) |

Principles for choosing T1/T2:

- T1 and T2 must have enough gap for human validation (minimum 1–2 hours)
- Both should be outside business hours unless urgency demands otherwise
- For 24/7 systems: choose lowest traffic window (check metrics)
- Emergency hotfix: no scheduled time, deploy ASAP after approval
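The gap requirement is easy to check when a project picks its slots. A minimal sketch, assuming T1 and T2 are wall-clock times and that a T2 earlier than T1 means T2 falls after midnight. `validation_gap` and `schedule_is_valid` are hypothetical names, and the default minimum of one hour is the lower bound quoted above.

```python
from datetime import date, datetime, time, timedelta


def validation_gap(t1: time, t2: time) -> timedelta:
    """Time available for human validation between T1 and T2."""
    day = date(2000, 1, 1)  # arbitrary anchor date; only the difference matters
    start = datetime.combine(day, t1)
    end = datetime.combine(day, t2)
    if end <= start:
        end += timedelta(days=1)  # T2 is after midnight
    return end - start


def schedule_is_valid(t1: time, t2: time, minimum: timedelta = timedelta(hours=1)) -> bool:
    return validation_gap(t1, t2) >= minimum
```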

Telegram Notification Templates

Use these as the standard message format for each pipeline node.

T1 Complete — Flow 1 (UAT deployed)

CODEBLOCK6

T2 Complete — Flow 1 (PROD deployed)

CODEBLOCK7

T1 Complete — Flow 2 (PROD queued)

CODEBLOCK8

T2 Complete — Flow 2 (PROD deployed, UAT pending)

CODEBLOCK9

T2 Complete — Flow 2 (UAT merged)

CODEBLOCK10

Emergency Hotfix

CODEBLOCK11

Rollback

⚠️ Rollback executed - <environment>

Reason: <brief reason>
Rolled back to: <version/commit>
Time: <datetime>
Executed by: <admin>

Status: Rolled back, monitoring in progress
Next step: <scheduled fix / investigation>

Adapting This Workflow to a New Project

When setting up a new project with this workflow:

1. Define T1 and T2: pick times based on traffic patterns and SLA
2. Set severity policy: confirm which severities enter pipeline vs emergency hotfix
3. Configure Telegram/notification channels: who receives which notifications
4. Add DB columns: fix_flow, found_in_env, and status enum (see Status Reference)
5. Implement Phase 1 + Phase 2 cron jobs: schedule at T1 and T2
6. Add approval endpoints: approve-prod, approve-uat-merge, batch variants
7. Add status badges + action buttons: frontend must reflect all statuses clearly
8. Test the full cycle in UAT first: simulate a bug through both flows before going live
9. Document rollback steps: specific to the project's tech stack and DB
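As an illustration of the DB columns step, the record below models `fix_flow` and `found_in_env` as small enums. It is a sketch under assumptions: the stored values `uat` and `flow1` are guessed by analogy with `prod` and `flow2`, `status` is kept as a plain string because the full status enum is project-specific, and `BugRecord` and `validate_record` are hypothetical names.

```python
from dataclasses import dataclass
from enum import Enum


class FoundInEnv(str, Enum):
    UAT = "uat"    # assumed value
    PROD = "prod"


class FixFlow(str, Enum):
    FLOW_1 = "flow1"  # assumed value
    FLOW_2 = "flow2"


@dataclass
class BugRecord:
    bug_id: int
    status: str  # project-specific status enum, see Status Reference
    found_in_env: FoundInEnv
    fix_flow: FixFlow


def validate_record(record: BugRecord) -> None:
    """Reject combinations the workflow does not describe."""
    # Flow 2 is only defined for bugs found in PROD while UAT is ahead.
    if record.fix_flow is FixFlow.FLOW_2 and record.found_in_env is not FoundInEnv.PROD:
        raise ValueError("Flow 2 applies only to bugs found in PROD")
```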

Pre-Deploy Checklist (T1 / T2)

Run before every scheduled deploy window.

- [ ] DB backup confirmed: last backup < 24h, or trigger manual backup now
- [ ] Monitoring alerts active: error rate, response time, queue depth dashboards open
- [ ] On-call admin reachable: someone available to respond within 15 min post-deploy
- [ ] Change freeze check: not within a freeze window (see Change Freeze Policy)
- [ ] Rollback path clear: previous working commit/tag identified and noted
- [ ] Dependent services healthy: upstream/downstream APIs, DBs, message queues OK
- [ ] Disk + memory OK: server has headroom (>20% free disk, <80% memory)
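The measurable items in this checklist (backup age, freeze window, disk and memory headroom) can be gated automatically before the cron job proceeds. A minimal sketch, assuming the caller has already collected the numbers. `predeploy_blockers` and its parameters are hypothetical, and the human items (on-call reachable, dashboards open) are left out on purpose.

```python
from datetime import datetime, timedelta


def predeploy_blockers(last_backup_at: datetime, now: datetime,
                       disk_free_fraction: float, memory_used_fraction: float,
                       in_freeze_window: bool) -> list:
    """Return the reasons the deploy should not start; an empty list means go."""
    blockers = []
    if now - last_backup_at >= timedelta(hours=24):
        blockers.append("DB backup not within the last 24h")
    if in_freeze_window:
        blockers.append("inside a change freeze window")
    if disk_free_fraction <= 0.20:
        blockers.append("free disk at or below 20%")
    if memory_used_fraction >= 0.80:
        blockers.append("memory usage at or above 80%")
    return blockers
```

Returning every failed check, rather than stopping at the first, gives the on-call admin the whole picture in one notification.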

Post-Deploy Monitoring

After each T1 or T2 deploy, monitor for a minimum of 10 minutes before standing down.

Metrics to Watch

| Metric | Healthy Threshold | Action if Breached |
|---|---|---|
| HTTP 5xx error rate | < 0.5% | Investigate immediately, consider rollback |
API response time (p95)

Smoke Test Sequence (2–3 min)

1. Log in with an admin account
2. Navigate to the affected feature
3. Perform the action that triggered the bug
4. Confirm the fix is working
5. Check 2–3 adjacent features for regression
6. Check system logs for new errors

If any smoke test step fails → rollback immediately, do not wait.
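The fail-fast rule can be encoded in a small runner. The sketch below assumes each smoke step is a callable returning `True` on success. `run_smoke_tests` is a hypothetical name, and a step that raises an exception is treated as a failure.

```python
def run_smoke_tests(steps):
    """Run (name, check) pairs in order and stop at the first failure.

    Returns ("ok", None) if every step passes, otherwise ("rollback", failed_step_name).
    """
    for name, check in steps:
        try:
            passed = bool(check())
        except Exception:
            passed = False  # an exception in a step counts as a failed step
        if not passed:
            return ("rollback", name)
    return ("ok", None)
```

Stopping at the first failure mirrors the rule above: once one step fails there is nothing to gain from running the rest before starting the rollback.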

Multi-Service Deploy (Cross-Service Fixes)

Scenario: A bug fix requires changes to more than one service (e.g. backend API + frontend, or service A + service B).

Deploy Order Principle

CODEBLOCK13

Coordination Steps

1. Map dependencies: list all services affected and their dependency order
2. Stage the deploys: do not deploy all services simultaneously
3. Validate between services: after each service deploy, run a quick health check before the next
4. Single rollback plan: define the exact reverse order and what to check at each step
5. Lock window: communicate to the team that a multi-service deploy is in progress (no other deploys)
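The coordination steps amount to a topological sort plus a health gate. The sketch below uses Python's standard `graphlib` module and assumes that a service's dependencies deploy before the service itself, which should be checked against each project's own deploy order. `deploy_order` and `staged_deploy` are hypothetical names, and `deploy` and `is_healthy` are callables supplied by the caller.

```python
from graphlib import TopologicalSorter


def deploy_order(dependencies):
    """`dependencies` maps each service to the set of services it depends on."""
    return list(TopologicalSorter(dependencies).static_order())


def staged_deploy(dependencies, deploy, is_healthy):
    """Deploy one service at a time, health-checking each before the next.

    Returns whether every service passed, plus the reverse order to use for rollback.
    """
    done = []
    for service in deploy_order(dependencies):
        deploy(service)
        done.append(service)
        if not is_healthy(service):
            return {"ok": False, "rollback_order": list(reversed(done))}
    return {"ok": True, "rollback_order": list(reversed(done))}
```

The rollback order is returned even on success, so it can be noted in the release record before anyone stands down.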

Status Tracking for Multi-Service

Tag the bug with affected services. Use release notes to list which service each fix applies to:

[backend] Fix: null pointer in task update handler
[frontend] Fix: error boundary not catching API timeout

Change Freeze Policy

Certain periods should have no scheduled pipeline deploys (T1/T2 suspended). Emergency hotfixes may still be approved by escalation.

Recommended Freeze Windows

| Period | Recommended Action |
|---|---|
| Public holidays | Suspend T1/T2. Emergency hotfix requires 2-person approval. |
Lunar New Year (3 days)

Declaring a Freeze

1. Update HEARTBEAT.md or project config with freeze start/end dates
2. Notify team via Telegram/channel
3. Pipeline cron jobs remain scheduled but the agent checks the freeze flag before executing
4. Emergency hotfix during freeze: requires explicit approval from admin + one other senior (two-person rule)

Freeze Flag (implementation)

In pipeline config or environment variable:

DEPLOY_FREEZE=true              # hard freeze, all deploys blocked
DEPLOY_FREEZE_MODULES=financial # module-specific freeze
DEPLOY_FREEZE_UNTIL=2026-02-05  # auto-lift date
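One way the agent might read these flags before executing a scheduled deploy is sketched below. The interpretation is an assumption: the freeze is treated as lifted from the `DEPLOY_FREEZE_UNTIL` date onward, and `DEPLOY_FREEZE_MODULES` is treated as a comma-separated list. Neither behaviour is specified above. `deploy_blocked` is a hypothetical name, and `env` can be `os.environ` or any mapping.

```python
from datetime import date


def deploy_blocked(env, today: date, module=None) -> bool:
    """Return True if a scheduled deploy should be skipped because of a freeze."""
    until = env.get("DEPLOY_FREEZE_UNTIL")
    if until and today >= date.fromisoformat(until):
        return False  # assumed: auto-lift date reached, freeze no longer applies

    if env.get("DEPLOY_FREEZE", "").lower() == "true":
        return True  # hard freeze: every deploy is blocked

    # Assumed format: comma-separated module names.
    frozen = {m.strip() for m in env.get("DEPLOY_FREEZE_MODULES", "").split(",") if m.strip()}
    return module is not None and module in frozen
```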

Quick Reference Card

CODEBLOCK16
