tonic-system-deploy
Software Deployment Workflow — Dual-Environment (UAT + PROD)
Background & Design Rationale
This skill was designed for systems where:
- Two live environments co-exist: UAT (testing/staging) and PROD (production)
- Versions can diverge: UAT may be ahead of PROD by several releases
- Deployments are nightly: automated pipelines run at scheduled times
- Human approval is mandatory: no code goes to PROD without explicit admin sign-off
- Bugs require structured triage: severity, origin environment, and version state all affect the deploy path
The key insight: choosing the wrong deploy flow when versions are mismatched can introduce regressions. Flow 1 assumes parity; Flow 2 handles divergence safely.
Prerequisites — Before Choosing a Flow
Step 0: Version Check (always do this first)
| Question | Answer → |
|---|---|
| Are UAT and PROD on the same version? | → Flow 1 |
| Is UAT ahead of PROD by any version? | → Flow 2 |
| Is this a critical/high severity bug? | → Emergency Hotfix (bypass pipeline) |
| Do you need to undo a bad deploy? | → Rollback |
Version Mismatch Decision Tree
CODEBLOCK0
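The version check and decision tree can be sketched as a small routing function. This is a minimal illustration, not the project's implementation: the names `DeployPath`, `parse_version`, and `choose_deploy_path` are hypothetical, and it assumes dotted numeric version strings such as `1.4.2`. The case where PROD is ahead of UAT is not described in this workflow, so the sketch raises an error instead of guessing.

```python
from enum import Enum


class DeployPath(Enum):
    FLOW_1 = "flow1"
    FLOW_2 = "flow2"
    EMERGENCY_HOTFIX = "emergency_hotfix"


def parse_version(version: str) -> tuple[int, ...]:
    """Turn '1.4.2' into (1, 4, 2) so versions compare numerically, not as text."""
    return tuple(int(part) for part in version.split("."))


def choose_deploy_path(severity: str, uat_version: str, prod_version: str) -> DeployPath:
    """Pick a deploy path from severity and the UAT/PROD version state."""
    # Severity is checked first: high/critical bugs never wait for T1/T2.
    if severity in {"high", "critical"}:
        return DeployPath.EMERGENCY_HOTFIX

    uat, prod = parse_version(uat_version), parse_version(prod_version)
    if uat == prod:
        return DeployPath.FLOW_1
    if uat > prod:
        return DeployPath.FLOW_2

    # PROD ahead of UAT is outside the documented flows (assumption: treat as an error).
    raise ValueError(f"PROD ({prod_version}) is ahead of UAT ({uat_version})")
```

Comparing parsed tuples rather than raw strings avoids ranking `1.10.0` below `1.9.3`.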
Flow 1 — UAT-First (Versions Aligned)
Scenario: Bug found in UAT or PROD when both environments run the same version.
Goal: Fix → validate in UAT → promote to PROD.
Result: UAT and PROD converge to same patched version.
Timeline
CODEBLOCK1
Human Checkpoints (Flow 1)
| Checkpoint | Who | Action | Gate Condition |
|---|---|---|---|
| Confirm bug | Admin/Manager | Mark as confirmed | Bug is reproducible and valid |
| UAT validation | Admin/Manager | Click "Approve PROD Deploy" | Fix works, no regression in UAT |
Automation Nodes (Flow 1)
| Time | Node | Input Status | Output Status | Action |
|---|---|---|---|---|
| T1 | Phase 1 | confirmed/planned | deployed_uat | AI analysis + UAT deploy |
| T2 | Phase 2 | pending_prod | deployed_prod | PROD deploy |
Flow 2 — PROD-First (Versions Misaligned)
Scenario: Bug found in PROD when UAT is ahead by one or more versions.
Why not Flow 1? Validating a PROD fix in a newer UAT environment risks false confidence — the fix may behave differently on the older PROD codebase.
Goal: Fix PROD directly → validate in PROD → cherry-pick back to UAT.
Result: PROD gets the fix immediately; UAT gets it merged back after PROD validation.
Timeline
CODEBLOCK2
Human Checkpoints (Flow 2)
| Checkpoint | Who | Action | Gate Condition |
|---|---|---|---|
| Confirm bug | Admin/Manager | Mark as confirmed + select flow2 | Bug confirmed in PROD, version mismatch verified |
| PROD validation | Admin/Manager | Click "Approve Merge UAT" | Fix verified in PROD, no regression |
Automation Nodes (Flow 2)
| Time | Node | Input Status | Output Status | Action |
|---|---|---|---|---|
| T1 | Phase 1 | confirmed/planned | pending_prod | AI analysis (skip UAT) |
| T2 | Phase 2a | pending_prod | deployed_prod | PROD deploy |
| T2 (next) | Phase 2b | pending_uat_merge | uat_merged | UAT deploy/merge |
Flow 2 Important Note
T2 deadline matters. If admin approves UAT merge before T2 on the same day, the merge runs that night. If approved after T2, it runs the following night's T2. Always communicate the cutoff time to the team.
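The cutoff rule can be made concrete with a short calculation. The sketch below assumes T2 is 22:00 (the example slot used in this document) and uses a hypothetical helper name, `next_t2_run`. An approval made exactly at T2 is treated as having missed the cutoff; that boundary choice is an assumption, since the note only covers "before" and "after".

```python
from datetime import datetime, time, timedelta

# Assumed slot, matching the "e.g. 22:00" example; adjust per project.
T2 = time(22, 0)


def next_t2_run(approved_at: datetime, t2: time = T2) -> datetime:
    """Return the T2 run that will pick up an approval given at `approved_at`."""
    same_day_t2 = datetime.combine(approved_at.date(), t2)
    if approved_at < same_day_t2:
        # Approved before the cutoff: the merge runs tonight.
        return same_day_t2
    # Approved at or after the cutoff: the merge waits for the following night.
    return same_day_t2 + timedelta(days=1)
```

For example, an approval at 21:30 is picked up the same night, while one at 22:30 waits roughly 23.5 hours.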
Status Reference
| Status | Flow | Colour | Meaning | Next Action |
|---|---|---|---|---|
| INLINECODE0 | Both | Grey | Bug reported, awaiting review | Admin confirms/rejects |
| INLINECODE1 | Both | Blue | Valid bug, enters pipeline | T1 auto-process |
| analyzing | Both | Purple | AI analysis running (transient) | Auto → planned |
| planned | Both | Indigo | AI fix plan recorded | T1 auto-deploy |
| deployed_uat | Flow 1 | Cyan | UAT deployed, awaiting human validation | Admin approves PROD |
| pending_prod | Both | Yellow | Queued for PROD at next T2 | T2 auto-deploy |
| deployed_prod | Both | Green | PROD deployed | Flow 1: done; Flow 2: admin approves UAT merge |
| pending_uat_merge | Flow 2 | Purple | Queued for UAT merge at next T2 | T2 auto-merge |
| uat_merged | Flow 2 | Teal | UAT updated with PROD fix | Flow 2 complete ✅ |
| closed | Both | Emerald | Manually closed | - |
| rejected | Both | Red | Not a valid bug | - |
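The status table implies a happy-path state machine per flow. The sketch below is one possible encoding, assuming the status names listed above and the flow keys `flow1`/`flow2` (`flow1` is assumed by analogy with `flow2`). The initial "reported" status and the manual `closed`/`rejected` outcomes are left out, and `FLOW_TRANSITIONS`, `next_status`, and `happy_path` are hypothetical names.

```python
from typing import Optional

# Happy-path transitions per flow, read off the Status Reference table.
FLOW_TRANSITIONS = {
    "flow1": {
        "confirmed": "analyzing",
        "analyzing": "planned",
        "planned": "deployed_uat",        # T1: deploy to UAT
        "deployed_uat": "pending_prod",   # admin approves PROD deploy
        "pending_prod": "deployed_prod",  # T2: deploy to PROD (flow ends here)
    },
    "flow2": {
        "confirmed": "analyzing",
        "analyzing": "planned",
        "planned": "pending_prod",             # T1: skip UAT, queue for PROD
        "pending_prod": "deployed_prod",       # T2: deploy to PROD
        "deployed_prod": "pending_uat_merge",  # admin approves UAT merge
        "pending_uat_merge": "uat_merged",     # next T2: merge into UAT
    },
}


def next_status(flow: str, status: str) -> Optional[str]:
    """Return the next happy-path status, or None when the flow is complete."""
    return FLOW_TRANSITIONS[flow].get(status)


def happy_path(flow: str, start: str = "confirmed") -> list[str]:
    """List every status a bug passes through from `start` to completion."""
    path = [start]
    while (following := next_status(flow, path[-1])) is not None:
        path.append(following)
    return path
```

Keeping the two flows in separate maps makes the one shared status with different outcomes, `deployed_prod`, explicit: it is terminal in Flow 1 and an approval gate in Flow 2.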
Severity Rules
| Severity | Pipeline Eligible? | Notes |
|---|---|---|
| INLINECODE11 | ✅ Yes | Both flows |
| INLINECODE12 | ✅ Yes | Both flows |
| high | ❌ No | Emergency Hotfix only |
| critical | ❌ No | Emergency Hotfix, immediate escalation |
Never let high/critical bugs wait for a scheduled pipeline. Treat them as emergency hotfixes.
Emergency Hotfix (Bypass Pipeline)
Scenario: Critical or high severity bug in PROD. Cannot wait for scheduled T1/T2.
Process
CODEBLOCK3
Checklist for Emergency Hotfix
- [ ] Severity confirmed as critical/high before bypassing pipeline
- [ ] At least one other team member notified before deploy
- [ ] Fix deployed and validated within agreed SLA (e.g. P1: 1 hour, P2: 4 hours)
- [ ] Post-deploy smoke test completed (login, core workflow, affected feature)
- [ ] Bug status updated manually in system
- [ ] Telegram/Slack notification sent to stakeholders
- [ ] Post-incident note added to bug record (root cause, fix summary)
- [ ] UAT updated (cherry-pick or re-sync if needed)
- [ ] Incident review scheduled (within 48h for P1)
Rollback Procedure
Scenario: A deploy (T1 or T2) introduces a regression or new failure.
Decision: When to Rollback
CODEBLOCK4
Rollback Process
CODEBLOCK5
Rollback Checklist
- [ ] Previous working commit/tag identified (git log)
- [ ] Rollback scope defined (frontend / backend / both / DB)
- [ ] Affected bug statuses reverted in system
- [ ] Smoke test completed after rollback
- [ ] Root cause of regression documented
- [ ] Team + stakeholders notified
- [ ] Fix plan for the reverted change recorded
Scheduled Deploy Times (Reference Only)
⚠️ These times are project-specific. Adapt per project SLA and business hours.
| Slot | Name | Phase | Typical Window |
|---|---|---|---|
| T1 | UAT/PROD-queue Deploy | Phase 1 | Off-peak evening (e.g. 20:00) |
| T2 | PROD/UAT-merge Deploy | Phase 2 | Late evening (e.g. 22:00) |
Principles for choosing T1/T2:
- T1 and T2 must have enough gap for human validation (minimum 1–2 hours)
- Both should be outside business hours unless urgency demands otherwise
- For 24/7 systems: choose lowest traffic window (check metrics)
- Emergency hotfix: no scheduled time — deploy ASAP after approval
Telegram Notification Templates
Use these as the standard message format for each pipeline node.
T1 Complete — Flow 1 (UAT deployed)
CODEBLOCK6
T2 Complete — Flow 1 (PROD deployed)
CODEBLOCK7
T1 Complete — Flow 2 (PROD queued)
CODEBLOCK8
T2 Complete — Flow 2 (PROD deployed, UAT pending)
CODEBLOCK9
T2 Complete — Flow 2 (UAT merged)
CODEBLOCK10
Emergency Hotfix
CODEBLOCK11
Rollback
⚠️ Rollback executed: <environment>
Reason: <brief reason>
Rolled back to: <version/commit>
Time: <datetime>
Executed by: <admin>
Status: Rolled back, monitoring
Next step: <scheduled fix / investigation>
Adapting This Workflow to a New Project
When setting up a new project with this workflow:
1. Define T1 and T2: pick times based on traffic patterns and SLA
2. Set severity policy: confirm which severities enter the pipeline and which go to emergency hotfix
3. Configure Telegram/notification channels: decide who receives which notifications
4. Add DB columns: fix_flow, found_in_env, and status enum (see Status Reference)
5. Implement Phase 1 + Phase 2 cron jobs: schedule at T1 and T2
6. Add approval endpoints: approve-prod, approve-uat-merge, batch variants
7. Add status badges + action buttons: frontend must reflect all statuses clearly
8. Test the full cycle in UAT first: simulate a bug through both flows before going live
9. Document rollback steps: specific to the project's tech stack and DB
Pre-Deploy Checklist (T1 / T2)
Run before every scheduled deploy window.
- [ ] DB backup confirmed: last backup < 24h, or trigger manual backup now
- [ ] Monitoring alerts active: error rate, response time, queue depth dashboards open
- [ ] On-call admin reachable: someone available to respond within 15 min post-deploy
- [ ] Change freeze check: not within a freeze window (see Change Freeze Policy)
- [ ] Rollback path clear: previous working commit/tag identified and noted
- [ ] Dependent services healthy: upstream/downstream APIs, DBs, message queues OK
- [ ] Disk + memory OK: server has headroom (>20% free disk, <80% memory)
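The measurable items in this checklist can be turned into an automated gate. The following is a rough sketch under stated assumptions: `PreDeployState`, `disk_free_fraction`, and `pre_deploy_blockers` are hypothetical names, and items that need human judgement (monitoring dashboards, dependent-service health) are deliberately left out. Only `shutil.disk_usage` reads from the real system; everything else is passed in.

```python
import shutil
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class PreDeployState:
    """Hypothetical snapshot of the checklist inputs; field names are illustrative."""
    last_backup_at: datetime
    disk_free_fraction: float    # 0.0-1.0, share of disk still free
    memory_used_fraction: float  # 0.0-1.0, share of memory in use
    on_call_reachable: bool
    in_freeze_window: bool
    rollback_ref: str            # previous working commit/tag, "" if unknown


def disk_free_fraction(path: str = "/") -> float:
    """Read the free-disk share for `path` using the standard library."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def pre_deploy_blockers(state: PreDeployState, now: datetime) -> list[str]:
    """Return the failed checklist items; an empty list means the deploy may proceed."""
    blockers = []
    if now - state.last_backup_at >= timedelta(hours=24):
        blockers.append("DB backup is not newer than 24h")
    if state.disk_free_fraction <= 0.20:
        blockers.append("free disk is not above 20%")
    if state.memory_used_fraction >= 0.80:
        blockers.append("memory usage is not below 80%")
    if not state.on_call_reachable:
        blockers.append("no on-call admin reachable")
    if state.in_freeze_window:
        blockers.append("inside a change freeze window")
    if not state.rollback_ref:
        blockers.append("no rollback commit/tag recorded")
    return blockers
```

Returning the full list of blockers, rather than stopping at the first failure, lets the operator fix everything in one pass before the deploy window opens.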
Post-Deploy Monitoring
After each T1 or T2 deploy, monitor for a minimum of 10 minutes before standing down.
Metrics to Watch
| Metric | Healthy Threshold | Action if Breached |
|---|---|---|
| HTTP 5xx error rate | < 0.5% | Investigate immediately, consider rollback |
| API response time (p95) | < baseline + 20% | Check DB queries, cache hit rate |
| Memory usage | < 85% | Check for memory leaks in new code |
| CPU usage | < 80% sustained | Check for infinite loops or expensive queries |
| Login / auth success rate | > 99% | Auth regression; rollback candidate |
| Key business flow (e.g. task create) | Working end-to-end | Smoke test immediately post-deploy |
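The numeric thresholds in this table lend themselves to an automated check during the monitoring window. The sketch below is illustrative only: `MetricsSnapshot`, `breached_metrics`, and `is_rollback_candidate` are hypothetical names, rates are expressed as fractions (0.005 means 0.5%), and a single sample stands in for "sustained" CPU usage, which a real check would average over time.

```python
from dataclasses import dataclass


@dataclass
class MetricsSnapshot:
    """Hypothetical post-deploy sample; all rates are fractions between 0 and 1."""
    http_5xx_rate: float
    p95_ms: float
    baseline_p95_ms: float
    memory_used: float
    cpu_used: float
    auth_success_rate: float


def breached_metrics(sample: MetricsSnapshot) -> list[str]:
    """Return the names of metrics outside the healthy thresholds in the table."""
    breaches = []
    if sample.http_5xx_rate >= 0.005:
        breaches.append("http_5xx_rate")
    if sample.p95_ms >= sample.baseline_p95_ms * 1.20:
        breaches.append("p95_response_time")
    if sample.memory_used >= 0.85:
        breaches.append("memory_usage")
    if sample.cpu_used >= 0.80:
        breaches.append("cpu_usage")
    if sample.auth_success_rate <= 0.99:
        breaches.append("auth_success_rate")
    return breaches


def is_rollback_candidate(breaches: list[str]) -> bool:
    """The table flags 5xx and auth breaches as the ones that may warrant rollback."""
    return bool({"http_5xx_rate", "auth_success_rate"} & set(breaches))
```

The key business flow row is not covered here because it depends on an end-to-end smoke test rather than a metric.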
Smoke Test Sequence (2–3 min)
1. Log in with an admin account
2. Navigate to the affected feature
3. Perform the action that triggered the bug
4. Confirm the fix is working
5. Check 2–3 adjacent features for regression
6. Check system logs for new errors
If any smoke test step fails → rollback immediately, do not wait.
Multi-Service Deploy (Cross-Service Fixes)
Scenario: A bug fix requires changes to more than one service (e.g. backend API + frontend, or service A + service B).
Deploy Order Principle
CODEBLOCK13
Coordination Steps
1. Map dependencies: list all services affected and their dependency order
2. Stage the deploys: do not deploy all services simultaneously
3. Validate between services: after each service deploy, run a quick health check before the next
4. Single rollback plan: define the exact reverse order and what to check at each step
5. Lock window: communicate to the team that a multi-service deploy is in progress (no other deploys)
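Mapping dependencies and deriving the reverse rollback order can be automated with a topological sort. This sketch assumes a service is deployed only after everything it depends on, and that the dependency map is maintained by hand; `deploy_order` and `rollback_order` are hypothetical helpers, and the service names are placeholders. It uses `graphlib.TopologicalSorter` from the Python standard library (3.9+).

```python
from graphlib import TopologicalSorter


def deploy_order(dependencies: dict[str, set[str]]) -> list[str]:
    """Order services so that each one deploys after the services it depends on.

    `dependencies` maps a service name to the set of services it depends on.
    A circular dependency raises graphlib.CycleError.
    """
    return list(TopologicalSorter(dependencies).static_order())


def rollback_order(dependencies: dict[str, set[str]]) -> list[str]:
    """Rollback walks the deploy order in exact reverse."""
    return list(reversed(deploy_order(dependencies)))


if __name__ == "__main__":
    # Placeholder services for illustration only.
    services = {
        "frontend": {"backend-api"},
        "backend-api": {"db-migrations"},
        "db-migrations": set(),
    }
    print(deploy_order(services))
    print(rollback_order(services))
```

A cycle in the map is a useful early warning: it means the services cannot be staged one at a time and the fix needs to be split differently.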
Status Tracking for Multi-Service
Tag the bug with affected services. Use release notes to list which service each fix applies to:
[backend] Fix: null pointer in task update handler
[frontend] Fix: error boundary not catching API timeout
Change Freeze Policy
Certain periods should have no scheduled pipeline deploys (T1/T2 suspended). Emergency hotfixes may still be approved by escalation.
Recommended Freeze Windows
| Period | Recommended Action |
|---|---|
| Public holidays | Suspend T1/T2. Emergency hotfix requires 2-person approval. |
| Lunar New Year (3 days) | Full freeze. P1 only with CTO sign-off. |
| Major client go-live week | Freeze for that client's system. Other systems normal. |
| End-of-month financial close | Freeze financial modules. Other modules normal. |
| Planned system maintenance | Coordinate freeze window in advance, notify stakeholders. |
Declaring a Freeze
1. Update HEARTBEAT.md or project config with freeze start/end dates
2. Notify team via Telegram/channel
3. Pipeline cron jobs remain scheduled but the agent checks the freeze flag before executing
4. Emergency hotfix during freeze: requires explicit approval from admin + one other senior (two-person rule)
Freeze Flag (implementation)
In pipeline config or environment variable:
DEPLOY_FREEZE=true # hard freeze, all deploys blocked
DEPLOY_FREEZE_MODULES=financial # module-specific freeze
DEPLOY_FREEZE_UNTIL=2026-02-05 # auto-lift date
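A pipeline job could read these flags before executing, along the following lines. This is a sketch with several assumptions that the flags themselves do not settle: `DEPLOY_FREEZE_UNTIL` is read as the first day on which deploys resume and applies to both kinds of freeze, `DEPLOY_FREEZE_MODULES` is treated as a comma-separated list, and `is_deploy_blocked` is a hypothetical function name.

```python
import os
from datetime import date
from typing import Mapping, Optional


def is_deploy_blocked(module: str, today: date,
                      env: Optional[Mapping[str, str]] = None) -> bool:
    """Decide whether a scheduled deploy for `module` must be skipped today."""
    env = os.environ if env is None else env

    # Assumption: the freeze lifts automatically once the UNTIL date is reached.
    until = env.get("DEPLOY_FREEZE_UNTIL", "")
    if until and today >= date.fromisoformat(until):
        return False

    # Hard freeze: every deploy is blocked.
    if env.get("DEPLOY_FREEZE", "").lower() == "true":
        return True

    # Module-specific freeze (assumed comma-separated).
    frozen = {name.strip() for name in env.get("DEPLOY_FREEZE_MODULES", "").split(",")}
    return module in frozen
```

Passing `env` explicitly keeps the check easy to test; in the cron job it can be called with only the module name and `date.today()`.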
Quick Reference Card
CODEBLOCK16
tonic-system-deploy
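Software Deployment Workflow: Dual-Environment (UAT + PROD)
Background & Design Rationale
This skill was designed for systems where:
- Two live environments co-exist: UAT (testing/staging) and PROD (production)
- Versions can diverge: UAT may be ahead of PROD by several releases
- Deployments are nightly: automated pipelines run at scheduled times
- Human approval is mandatory: no code goes to PROD without explicit admin sign-off
- Bugs require structured triage: severity, origin environment, and version state all affect the deploy path
The key insight: choosing the wrong deploy flow when versions are mismatched can introduce regressions. Flow 1 assumes parity; Flow 2 handles divergence safely.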
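Prerequisites: Before Choosing a Flow
Step 0: Version Check (always do this first)
| Question | Answer → |
|---|---|
| Are UAT and PROD on the same version? | → Flow 1 |
| Is UAT ahead of PROD by any version? | → Flow 2 |
| Is this a critical/high severity bug? | → Emergency Hotfix (bypass pipeline) |
| Do you need to undo a bad deploy? | → Rollback |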
Version Mismatch Decision Tree
Bug found
│
├─ Severity = critical/high?
│   └─ Yes → Emergency Hotfix (skip pipeline)
│
├─ UAT version == PROD version?
│   └─ Yes → Flow 1
│
└─ UAT version > PROD version?
    └─ Yes → Flow 2
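Flow 1: UAT-First (Versions Aligned)
Scenario: Bug found in UAT or PROD when both environments run the same version.
Goal: Fix → validate in UAT → promote to PROD.
Result: UAT and PROD converge to same patched version.
Timeline
Bug reported
│
│ 🧑 Human: Admin reviews and confirms the bug (status: confirmed)
▼
[confirmed] (low/medium severity only)
│
│ 🤖 System: scheduled deploy time T1 (e.g. 20:00)
│     - AI analyses the root cause and records a fix plan
│     - Deploys the fix to the UAT environment
│     - Status → deployed_uat
▼
[deployed_uat]
│ 📲 Telegram: Fix deployed to UAT. Please validate.
│
│ 🧑 Human: Admin logs in to UAT and validates the fix
│     - Runs the affected workflow
│     - Confirms no regression
│     - Clicks "Approve PROD Deploy" → status: pending_prod
▼
[pending_prod]
│ 📲 Telegram: Queued for PROD deploy at T2.
│
│ 🤖 System: scheduled deploy time T2 (e.g. 22:00)
│     - Deploys the fix to the PROD environment
│     - Status → deployed_prod
▼
[deployed_prod] ✅ Flow 1 complete
│ 📲 Telegram: Deployed to PROD. Flow 1 complete.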
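Human Checkpoints (Flow 1)
| Checkpoint | Who | Action | Gate Condition |
|---|---|---|---|
| Confirm bug | Admin/Manager | Mark as confirmed | Bug is reproducible and valid |
| UAT validation | Admin/Manager | Click "Approve PROD Deploy" | Fix works, no regression in UAT |
Automation Nodes (Flow 1)
| Time | Node | Input Status | Output Status | Action |
|---|---|---|---|---|
| T1 | Phase 1 | confirmed/planned | deployed_uat | AI analysis + UAT deploy |
| T2 | Phase 2 | pending_prod | deployed_prod | PROD deploy |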
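Flow 2: PROD-First (Versions Misaligned)
Scenario: Bug found in PROD when UAT is ahead by one or more versions.
Why not Flow 1? Validating a PROD fix in a newer UAT environment risks false confidence: the fix may behave differently on the older PROD codebase.
Goal: Fix PROD directly → validate in PROD → cherry-pick back to UAT.
Result: PROD gets the fix immediately; UAT gets it merged back after PROD validation.
Timeline
Bug found in PROD (UAT is ahead)
│
│ 🧑 Human: Admin reviews and confirms the bug
│     - Selects: found_in_env = prod, fix_flow = flow2
│     - Status → confirmed
▼
[confirmed]
│
│ 🤖 System: scheduled deploy time T1 (e.g. 20:00)
│     - AI analyses the root cause and records a fix plan
│     - Skips UAT entirely
│     - Queues for PROD deploy → status: pending_prod
▼
[pending_prod]
│ 📲 Telegram: PROD deploy queued for T2 (Flow 2).
│
│ 🤖 System: scheduled deploy time T2 (e.g. 22:00)
│     - Deploys the fix to PROD
│     - Status → deployed_prod
▼
[deployed_prod]
│ 📲 Telegram: Deployed to PROD. Please validate PROD. Approve the UAT merge when ready.
│
│ 🧑 Human: Admin validates the fix in PROD
│     - Confirms the fix works on production data/config
│     - No regression in PROD workflows
│     - Clicks "Approve Merge UAT" → status: pending_uat_merge
▼
[pending_uat_merge]
│ 📲 Telegram: UAT merge queued for tonight's T2.
│
│ 🤖 System: next T2 cycle (22:00)
│     - Deploys/merges the fix to the UAT environment
│     - Status → uat_merged
▼
[uat_merged] ✅ Flow 2 complete
│ 📲 Telegram: Merged to UAT. Flow 2 complete.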
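Human Checkpoints (Flow 2)
| Checkpoint | Who | Action | Gate Condition |
|---|---|---|---|
| Confirm bug | Admin/Manager | Mark as confirmed + select flow2 | Bug confirmed in PROD, version mismatch verified |
| PROD validation | Admin/Manager | Click "Approve Merge UAT" | Fix verified in PROD, no regression |
Automation Nodes (Flow 2)
| Time | Node | Input Status | Output Status | Action |
|---|---|---|---|---|
| T1 | Phase 1 | confirmed/planned | pending_prod | AI analysis (skip UAT) |
| T2 | Phase 2a | pending_prod | deployed_prod | PROD deploy |
| T2 (next) | Phase 2b | pending_uat_merge | uat_merged | UAT deploy/merge |
Flow 2 Important Note
T2 deadline matters. If admin approves UAT merge before T2 on the same day, the merge runs that night. If approved after T2, it runs the following night's T2. Always communicate the cutoff time to the team.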
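Status Reference
| Status | Flow | Colour | Meaning | Next Action |
|---|---|---|---|---|
| submitted | Both | Grey | Bug reported, awaiting review | Admin confirms/rejects |
| confirmed | Both | Blue | Valid bug, enters pipeline | T1 auto-process |
| analyzing | Both | Purple | AI analysis running (transient) | Auto → planned |
| planned | Both | Indigo | AI fix plan recorded | T1 auto-deploy |
| deployed_uat | Flow 1 | Cyan | UAT deployed, awaiting human validation | Admin approves PROD |
| pending_prod | Both | Yellow | Queued for the next T2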