Runbook Generator
Tier: POWERFUL
Category: Engineering
Domain: DevOps / Site Reliability Engineering
Overview
Analyze a codebase and generate production-grade operational runbooks. Detects your stack (CI/CD, database, hosting, containers), then produces step-by-step runbooks with copy-paste commands, verification checks, rollback procedures, escalation paths, and time estimates. Keeps runbooks fresh with staleness detection linked to config file modification dates.
Core Capabilities
- Stack detection: auto-identify CI/CD, database, hosting, orchestration from repo files
- Runbook types: deployment, incident response, database maintenance, scaling, monitoring setup
- Format discipline: numbered steps, copy-paste commands, ✅ verification checks, time estimates
- Escalation paths: L1 → L2 → L3 with contact info and decision criteria
- Rollback procedures: every deployment step has a corresponding undo
- Staleness detection: runbook sections reference config files; flag when source changes
- Testing methodology: dry-run framework for staging validation, quarterly review cadence
When to Use
Use when:
- A codebase has no runbooks and you need to bootstrap them fast
- Existing runbooks are outdated or incomplete (point at the repo, regenerate)
- Onboarding a new engineer who needs clear operational procedures
- Preparing for an incident response drill or audit
- Setting up monitoring and on-call rotation from scratch
Skip when:
- The system is too early-stage to have stable operational patterns
- Runbooks already exist and only need minor updates (edit directly)
Stack Detection
When given a repo, scan for these signals before writing a single runbook line:
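```bash
# CI/CD
ls .github/workflows/         # → GitHub Actions
ls .gitlab-ci.yml             # → GitLab CI
ls Jenkinsfile                # → Jenkins
ls .circleci/                 # → CircleCI
ls bitbucket-pipelines.yml    # → Bitbucket Pipelines

# Database
grep -r "postgresql\|postgres\|pg" package.json pyproject.toml   # → PostgreSQL
grep -r "mysql\|mariadb" package.json                            # → MySQL
grep -r "mongodb\|mongoose" package.json                         # → MongoDB
grep -r "redis" package.json                                     # → Redis
ls prisma/schema.prisma       # → Prisma ORM (check the provider field)
ls drizzle.config.*           # → Drizzle ORM

# Hosting
ls vercel.json                # → Vercel
ls railway.toml               # → Railway
ls fly.toml                   # → Fly.io
ls .ebextensions/             # → AWS Elastic Beanstalk
ls terraform/ *.tf            # → Custom AWS/GCP/Azure (check provider)
ls kubernetes/ k8s/           # → Kubernetes
ls docker-compose.yml         # → Docker Compose

# Framework
ls next.config.*              # → Next.js
ls nuxt.config.*              # → Nuxt
ls svelte.config.*            # → SvelteKit
jq .scripts package.json      # → check build/start commands
```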
Map detected stack → runbook templates. A Next.js + PostgreSQL + Vercel + GitHub Actions repo needs:
- Deployment runbook (Vercel + GitHub Actions)
- Database runbook (PostgreSQL backup, migration, vacuum)
- Incident response (with Vercel logs + pg query debugging)
- Monitoring setup (Vercel Analytics, pg_stat, alerting)
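The marker-file checks this section describes (for example `vercel.json` or `.github/workflows/`) can be wrapped in a small function, so the mapping to runbook templates starts from a machine-readable list. This is a minimal sketch covering only a subset of the signals. The function name `detect_stack` and its `category=tool` output format are assumptions made for illustration.

```bash
#!/usr/bin/env bash
# Sketch only: detect_stack and its "category=tool" output format are
# illustrative assumptions, not part of the generator itself.
detect_stack() {
  local repo="${1:-.}"

  # CI/CD
  if [ -d "$repo/.github/workflows" ]; then echo "ci=github-actions"; fi
  if [ -f "$repo/.gitlab-ci.yml" ];    then echo "ci=gitlab-ci"; fi
  if [ -f "$repo/Jenkinsfile" ];       then echo "ci=jenkins"; fi

  # Database and ORM
  if grep -qE 'postgres|"pg"' "$repo/package.json" 2>/dev/null; then
    echo "db=postgresql"
  fi
  if [ -f "$repo/prisma/schema.prisma" ]; then echo "orm=prisma"; fi

  # Hosting
  if [ -f "$repo/vercel.json" ]; then echo "hosting=vercel"; fi
  if [ -f "$repo/fly.toml" ];    then echo "hosting=fly"; fi

  # Framework (the config file may be .js, .mjs, or .ts)
  if ls "$repo"/next.config.* >/dev/null 2>&1; then echo "framework=nextjs"; fi
}

# Usage: detect_stack /path/to/repo
```

Each output line can then select a template: `hosting=vercel` plus `ci=github-actions` picks the deployment runbook, and `db=postgresql` picks the database runbook.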
Runbook Types
1. Deployment Runbook
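# Deployment Runbook: [App Name]
**Stack:** Next.js 14 + PostgreSQL 15 + Vercel
**Last verified:** 2025-03-01
**Source config:** vercel.json (modified: `git log -1 --format=%ci -- vercel.json`)
**Owner:** Platform Team
**Estimated total time:** 15–25 min
## Pre-deployment Checklist
- [ ] All PRs merged to main
- [ ] CI passing on main (GitHub Actions green)
- [ ] Database migrations tested in staging
- [ ] Rollback plan confirmed
## Steps
### Step 1: Run CI checks locally (3 min)
```bash
pnpm test
pnpm lint
pnpm build
```
✅ Expected: All pass with 0 errors. Build output in `.next/`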
### Step 2 — Apply database migrations (5 min)
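```bash
# Staging first
DATABASE_URL=$STAGING_DATABASE_URL npx prisma migrate deploy
```
✅ Expected: `All migrations have been successfully applied.`
```bash
# Verify migration applied (_prisma_migrations is Prisma's default bookkeeping table)
psql $STAGING_DATABASE_URL -c "SELECT migration_name, finished_at FROM _prisma_migrations ORDER BY finished_at DESC LIMIT 5;"
```
✅ Expected: The newest row is the new migration, with today's date in `finished_at`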
### Step 3 — Deploy to production (5 min)
```bash
git push origin main
# OR trigger manually:
vercel --prod
```
✅ Expected: Vercel dashboard shows deployment in progress. URL format:
`https://app-name-<hash>-team.vercel.app`
### Step 4 — Smoke test production (5 min)
```bash
# Health check
curl -sf https://your-app.vercel.app/api/health | jq .
# Critical path
curl -sf https://your-app.vercel.app/api/users/me \
  -H "Authorization: Bearer $TEST_TOKEN" | jq '.id'
```
✅ Expected: health returns `{"status":"ok","db":"connected"}`. Users API returns valid ID.
### Step 5 — Monitor for 10 min
- Check Vercel Functions log for errors: `vercel logs --since=10m`
- Check error rate in Vercel Analytics: < 1% 5xx
- Check DB connection pool: `SELECT count(*) FROM pg_stat_activity;` (< 80% of max_connections)
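The 80% threshold on the connection pool can be checked mechanically instead of by eye. The sketch below uses integer arithmetic only. The helper name `pool_within_limit` is hypothetical, and the commented `psql` calls assume `$DATABASE_URL` points at the database being monitored.

```bash
# pool_within_limit ACTIVE MAX [PCT]
# Succeeds when ACTIVE connections are below PCT percent of MAX (default 80).
pool_within_limit() {
  local active="$1" max="$2" pct="${3:-80}"
  # active / max < pct / 100  is equivalent to  active * 100 < max * pct
  [ $((active * 100)) -lt $((max * pct)) ]
}

# Usage (needs database access, so left commented out):
#   active=$(psql "$DATABASE_URL" -Atc "SELECT count(*) FROM pg_stat_activity;")
#   max=$(psql "$DATABASE_URL" -Atc "SHOW max_connections;")
#   pool_within_limit "$active" "$max" || echo "Pool above 80% of max_connections"
```

With `max_connections` at 100, 79 active connections pass the check and 80 do not.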
---
## Rollback
If smoke tests fail or error rate spikes:
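```bash
# Instant rollback via Vercel (preferred, < 30 sec)
vercel rollback [previous-deployment-url]
# Database rollback (only if a migration was applied)
# WARNING: do NOT run `prisma migrate reset` against production. It drops all data
# and re-applies every migration. Prisma has no automatic down migrations, so apply
# a reviewed down-migration script instead ([down-migration.sql] is a placeholder).
DATABASE_URL=$PROD_DATABASE_URL npx prisma db execute --file [down-migration.sql] --schema prisma/schema.prisma
# If the migration failed part-way, also mark it as rolled back:
#   npx prisma migrate resolve --rolled-back [migration-name]
# Confirm data impact first.
```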
✅ Expected after rollback: Previous deployment URL becomes active. Verify with smoke test.
---
## Escalation
- **L1 (on-call engineer):** Check Vercel logs, run smoke tests, attempt rollback
- **L2 (platform lead):** DB issues, data loss risk, rollback failed — Slack: @platform-lead
- **L3 (CTO):** Production down > 30 min, data breach — PagerDuty: #critical-incidents
2. Incident Response Runbook
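# Incident Response Runbook
**Severity levels:** P1 (down), P2 (degraded), P3 (minor)
**Estimated total time:** P1: 30–60 min, P2: 1–4 hours
## Phase 1: Triage (5 min)
### Confirm the incident
```bash
# Is the app responding?
curl -sw "%{http_code}" https://your-app.vercel.app/api/health -o /dev/null
# Check Vercel function errors (last 15 min)
vercel logs --since=15m | grep -i "error\|exception\|5[0-9][0-9]"
```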
✅ 200 = app up. 5xx or timeout = incident confirmed.
Declare severity:
- Site completely down → P1 — page L2/L3 immediately
- Partial degradation / slow responses → P2 — notify team channel
- Single feature broken → P3 — create ticket, fix in business hours
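The confirmation rule (200 means up, 5xx or a timeout means an incident) can be encoded so that an alerting script and a human reach the same verdict. This is a sketch: the function name `classify_health` and the `investigate` verdict for any other status code are assumptions, not something the runbook defines.

```bash
# classify_health CODE
# Map the HTTP status printed by the health-check curl to a triage verdict.
classify_health() {
  case "$1" in
    200)         echo "up" ;;
    5[0-9][0-9]) echo "incident" ;;
    000|"")      echo "incident" ;;     # curl reports 000 when it got no HTTP response
    *)           echo "investigate" ;;  # assumption: other codes need a human look
  esac
}

# Usage (needs network access, so left commented out):
#   code=$(curl -s -o /dev/null -m 10 -w '%{http_code}' https://your-app.vercel.app/api/health)
#   classify_health "$code"
```

The `-m 10` timeout in the usage line is an illustrative value; without a timeout, a hung endpoint would block the check rather than report `000`.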
---
## Phase 2 — Diagnose (10–15 min)
```bash
# Recent deployments: did something just ship?
vercel ls --limit=5
# Database health
psql $DATABASE_URL -c "SELECT pid, state, wait_event, query FROM pg_stat_activity WHERE state != 'idle' LIMIT 20;"
# Long-running queries (> 30 sec)
psql $DATABASE_URL -c "SELECT pid, now() - pg_stat_activity.query_start AS duration, query FROM pg_stat_activity WHERE state = 'active' AND now() - pg_stat_activity.query_start > interval '30 seconds';"
# Connection pool saturation
psql $DATABASE_URL -c "SELECT count(*), max_conn FROM pg_stat_activity, (SELECT setting::int AS max_conn FROM pg_settings WHERE name='max_connections') t GROUP BY max_conn;"
```
Diagnostic decision tree:
- Recent deploy + new errors → rollback (see Deployment Runbook)
- DB query timeout / pool saturation → kill long queries, scale connections
- External dependency failing → check status pages, add circuit breaker
- Memory/CPU spike → check Vercel function logs for infinite loops
---
## Phase 3 — Mitigate (variable)
```bash
# Kill a runaway DB query (take <pid> from the long-running query output)
psql $DATABASE_URL -c "SELECT pg_terminate_backend(<pid>);"
# Scale DB connections (Supabase/Neon: adjust pool size)
# Vercel → Settings → Environment Variables → update DATABASE_POOL_MAX
# Enable maintenance mode (if you have a feature flag)
# `vercel env add` reads the value from stdin, not from an argument
printf 'true' | vercel env add MAINTENANCE_MODE production
vercel --prod  # redeploy with flag
```
---
## Phase 4 — Resolve & Postmortem
After incident is resolved, within 24 hours:
1. Write incident timeline (what happened, when, who noticed, what fixed it)
2. Identify root cause (5-Whys)
3. Define action items with owners and due dates
4. Update this runbook if a step was missing or wrong
5. Add monitoring/alert that would have caught this earlier
**Postmortem template:** `docs/postmortems/YYYY-MM-DD-incident-title.md`
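To keep postmortem file names uniform, the `YYYY-MM-DD-incident-title.md` pattern can be generated rather than typed. This is a small sketch in which the helper name `postmortem_path` is hypothetical.

```bash
# postmortem_path DATE TITLE
# Build docs/postmortems/YYYY-MM-DD-incident-title.md from a date and a free-text title.
postmortem_path() {
  local day="$1" title="$2" slug
  # Lowercase, collapse every run of non-alphanumerics into one "-", trim edge dashes
  slug=$(printf '%s' "$title" \
    | tr '[:upper:]' '[:lower:]' \
    | tr -cs 'a-z0-9' '-' \
    | sed 's/^-//; s/-$//')
  echo "docs/postmortems/${day}-${slug}.md"
}

# Usage: postmortem_path "$(date +%Y-%m-%d)" "DB Pool Exhausted"
```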
---
## Escalation Path
| Level | Who | When | Contact |
|-------|-----|------|---------|
| L1 | On-call engineer | Always first | PagerDuty rotation |
| L2 | Platform lead | DB issues, rollback needed | Slack @platform-lead |
| L3 | CTO/VP Eng | P1 > 30 min, data loss | Phone + PagerDuty |
3. Database Maintenance Runbook
## Backup
```bash
# Full backup
pg_dump $DATABASE_URL \
  --format=custom \
  --compress=9 \
  --file="backup-$(date +%Y%m%d-%H%M%S).dump"
```
✅ Expected: File created, size > 0. `pg_restore --list backup.dump | head -20` shows tables.
Verify backup is restorable (test monthly):
```bash
pg_restore --dbname=$STAGING_DATABASE_URL backup.dump
psql $STAGING_DATABASE_URL -c "SELECT count(*) FROM users;"
```
✅ Expected: Row count matches production.
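The "file created, size > 0" check and the choice of which dump to restore can be scripted. The sketch below relies on the `backup-YYYYmmdd-HHMMSS.dump` naming from the backup command, which sorts chronologically as plain text. The helper names `backup_ok` and `latest_backup` are hypothetical.

```bash
# backup_ok FILE
# Succeeds if FILE exists and is non-empty (the "size > 0" check).
backup_ok() {
  [ -s "$1" ]
}

# latest_backup DIR
# Print the newest backup-YYYYmmdd-HHMMSS.dump in DIR, or nothing if none exist.
latest_backup() {
  ls "$1"/backup-*.dump 2>/dev/null | sort | tail -n 1
}

# Usage:
#   dump=$(latest_backup .)
#   backup_ok "$dump" || echo "Backup missing or empty: $dump"
```

A non-empty file is only a first gate. The monthly restore into staging remains the real test that a backup is usable.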
## Migration
```bash
# Always test in staging first
DATABASE_URL=$STAGING_DATABASE_URL npx prisma migrate deploy
# Verify, then:
DATABASE_URL=$PROD_DATABASE_URL npx prisma migrate deploy
```
✅ Expected: `All migrations have been successfully applied.`
⚠️ For large table migrations (> 1M rows), use `pg_repack` or add column with DEFAULT separately to avoid table locks.
## Vacuum & Reindex
```bash
# Check bloat before deciding
# (pg_stat_user_tables exposes the table name as relname and its OID as relid)
psql $DATABASE_URL -c "
SELECT schemaname, relname AS tablename,
  pg_size_pretty(pg_total_relation_size(relid)) AS total_size,
  n_dead_tup, n_live_tup,
  ROUND(n_dead_tup::numeric / NULLIF(n_live_tup + n_dead_tup, 0) * 100, 1) AS dead_ratio
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC LIMIT 10;"
# Vacuum high-bloat tables (non-blocking)
psql $DATABASE_URL -c "VACUUM ANALYZE users;"
psql $DATABASE_URL -c "VACUUM ANALYZE events;"
# Reindex (use CONCURRENTLY to avoid locks; requires PostgreSQL 12+)
psql $DATABASE_URL -c "REINDEX INDEX CONCURRENTLY users_email_idx;"
```
✅ Expected: dead_ratio drops below 5% after vacuum.
Staleness Detection
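Add a staleness header to every runbook: a "Last verified" date plus the list of source config files the runbook depends on, each with its last-modified date (for example from `git log -1 --format=%ci -- vercel.json`).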
Automation: Add a CI job that runs weekly and comments on the runbook doc if any referenced file was modified more recently than the runbook's "Last verified" date.
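The comparison such a CI job performs could look like the sketch below. It assumes dates are written as `YYYY-MM-DD` and that the runbook's referenced files are passed in explicitly. The function names `is_stale`, `last_modified`, and `check_runbook` are hypothetical, and posting the comment on the runbook doc is left out.

```bash
# is_stale LAST_VERIFIED FILE_MODIFIED   (both YYYY-MM-DD)
# Succeeds when the file changed after the runbook was last verified.
is_stale() {
  local verified modified
  verified=$(printf '%s' "$1" | tr -d '-')
  modified=$(printf '%s' "$2" | tr -d '-')
  [ -n "$modified" ] || return 1     # unknown modification date: do not flag
  [ "$modified" -gt "$verified" ]    # 20250302 > 20250301
}

# last_modified FILE
# Date of the last commit touching FILE (requires a git checkout).
last_modified() {
  git log -1 --format=%cd --date=short -- "$1"
}

# check_runbook LAST_VERIFIED FILE...
# Print every referenced file that changed since LAST_VERIFIED.
check_runbook() {
  local verified="$1" f
  shift
  for f in "$@"; do
    if is_stale "$verified" "$(last_modified "$f")"; then
      echo "STALE: $f"
    fi
  done
}

# Usage: check_runbook 2025-03-01 vercel.json prisma/schema.prisma
```

A file changed on the same day as the last verification is not flagged. Whether that boundary should count as stale is a policy choice the job owner would need to make.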
Runbook Testing Methodology
Dry-Run in Staging
Before trusting a runbook in production, validate every step in staging: run each command, compare its output with the ✅ expected result, and exercise the rollback.
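One way to make staging validation repeatable is to route every runbook command through a wrapper that can either print or execute it. This is a minimal sketch. The `run_step` function and the `DRY_RUN` variable are assumptions introduced here, not an existing tool.

```bash
# run_step DESCRIPTION COMMAND [ARGS...]
# With DRY_RUN=1 the command is printed instead of executed.
run_step() {
  local desc="$1"
  shift
  echo "==> $desc"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "DRY-RUN: $*"
    return 0
  fi
  if "$@"; then
    echo "OK: $desc"
  else
    local status=$?
    echo "FAILED ($status): $desc" >&2
    return "$status"
  fi
}

# Usage:
#   DRY_RUN=1 run_step "Run tests" pnpm test    # prints the command only
#   run_step "Run tests" pnpm test              # executes and reports OK or FAILED
```

Printing first gives a reviewer the exact command list before anything runs. The same script then executes unchanged against staging once `DRY_RUN` is unset.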
Quarterly Review Cadence
Schedule a 1-hour review every quarter:
1. Run each command in staging: does it still work?
2. Check config drift: compare "Last Modified" dates vs "Last verified"
3. Test rollback procedures: actually roll back in staging
4. Update contact info: L1/L2/L3 may have changed
5. Add new failure modes discovered in the past quarter
6. Update "Last verified" date at top of runbook
Common Pitfalls
| Pitfall | Fix |
|---|---|
| Commands that require manual copy of dynamic values | Use env vars: `$DATABASE_URL`, not a hard-coded connection string |
| No expected output specified | Add ✅ with exact expected string after every verification step |
| Rollback steps missing | Every destructive step needs a corresponding undo |
| Runbooks that never get tested | Schedule quarterly staging dry-runs in team calendar |
| L3 escalation contact is the former CTO | Review contact info every quarter |
| Migration runbook doesn't mention table locks | Call out lock risk for large table operations explicitly |
Best Practices
1. Every command must be copy-pasteable: no placeholder text, use env vars
2. ✅ after every step: explicit expected output, not "it should work"
3. Time estimates are mandatory: engineers need to know if they have time to fix before SLA breach
4. Rollback before you deploy: plan the undo before executing
5. Runbooks live in the repo: `docs/runbooks/`, versioned with the code they describe
6. Postmortem → runbook update: every incident should improve a runbook
7. Link, don't duplicate: reference the canonical config file, don't copy its contents into the runbook
8. Test runbooks like you test code: untested runbooks are worse than no runbooks (false confidence)
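Runbook Generator
Tier: POWERFUL
Category: Engineering
Domain: DevOps / Site Reliability Engineering
Overview
Analyze a codebase and generate production-grade operational runbooks. Detects your stack (CI/CD, database, hosting, containers), then produces step-by-step runbooks with copy-paste commands, verification checks, rollback procedures, escalation paths, and time estimates. Keeps runbooks fresh with staleness detection linked to config file modification dates.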
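Core Capabilities
- Stack detection: auto-identify CI/CD, database, hosting, orchestration from repo files
- Runbook types: deployment, incident response, database maintenance, scaling, monitoring setup
- Format discipline: numbered steps, copy-paste commands, ✅ verification checks, time estimates
- Escalation paths: L1 → L2 → L3 with contact info and decision criteria
- Rollback procedures: every deployment step has a corresponding undo
- Staleness detection: runbook sections reference config files; flag when source changes
- Testing methodology: dry-run framework for staging validation, quarterly review cadence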
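When to Use
Use when:
- A codebase has no runbooks and you need to bootstrap them fast
- Existing runbooks are outdated or incomplete (point at the repo, regenerate)
- Onboarding a new engineer who needs clear operational procedures
- Preparing for an incident response drill or audit
- Setting up monitoring and on-call rotation from scratch
Skip when:
- The system is too early-stage to have stable operational patterns
- Runbooks already exist and only need minor updates (edit directly)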
Stack Detection
When given a repo, scan for these signals before writing a single runbook line:
```bash
# CI/CD
ls .github/workflows/         # → GitHub Actions
ls .gitlab-ci.yml             # → GitLab CI
ls Jenkinsfile                # → Jenkins
ls .circleci/                 # → CircleCI
ls bitbucket-pipelines.yml    # → Bitbucket Pipelines

# Database
grep -r "postgresql\|postgres\|pg" package.json pyproject.toml   # → PostgreSQL
grep -r "mysql\|mariadb" package.json                            # → MySQL
grep -r "mongodb\|mongoose" package.json                         # → MongoDB
grep -r "redis" package.json                                     # → Redis
ls prisma/schema.prisma       # → Prisma ORM (check the provider field)
ls drizzle.config.*           # → Drizzle ORM

# Hosting
ls vercel.json                # → Vercel
ls railway.toml               # → Railway
ls fly.toml                   # → Fly.io
ls .ebextensions/             # → AWS Elastic Beanstalk
ls terraform/ *.tf            # → Custom AWS/GCP/Azure (check provider)
ls kubernetes/ k8s/           # → Kubernetes
ls docker-compose.yml         # → Docker Compose

# Framework
ls next.config.*              # → Next.js
ls nuxt.config.*              # → Nuxt
ls svelte.config.*            # → SvelteKit
jq .scripts package.json      # → check build/start commands
```
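Map detected stack → runbook templates. A Next.js + PostgreSQL + Vercel + GitHub Actions repo needs:
- Deployment runbook (Vercel + GitHub Actions)
- Database runbook (PostgreSQL backup, migration, vacuum)
- Incident response (with Vercel logs + pg query debugging)
- Monitoring setup (Vercel Analytics, pg_stat, alerting)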
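Runbook Types
1. Deployment Runbook
# Deployment Runbook: [App Name]
**Stack:** Next.js 14 + PostgreSQL 15 + Vercel
**Last verified:** 2025-03-01
**Source config:** vercel.json (modified: `git log -1 --format=%ci -- vercel.json`)
**Owner:** Platform Team
**Estimated total time:** 15–25 min
## Pre-deployment Checklist
- [ ] All PRs merged to main
- [ ] CI passing on main (GitHub Actions green)
- [ ] Database migrations tested in staging
- [ ] Rollback plan confirmed
## Steps
### Step 1: Run CI checks locally (3 min)
```bash
pnpm test
pnpm lint
pnpm build
```
✅ Expected: All pass with 0 errors. Build output in `.next/`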
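### Step 2: Apply database migrations (5 min)
```bash
# Staging first
DATABASE_URL=$STAGING_DATABASE_URL npx prisma migrate deploy
```
✅ Expected: `All migrations have been successfully applied.`
```bash
# Verify migration applied (_prisma_migrations is Prisma's default bookkeeping table)
psql $STAGING_DATABASE_URL -c "SELECT migration_name, finished_at FROM _prisma_migrations ORDER BY finished_at DESC LIMIT 5;"
```
✅ Expected: The newest row is the new migration, with today's date in `finished_at`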
### Step 3: Deploy to production (5 min)
```bash
git push origin main
# OR trigger manually:
vercel --prod
```
✅ Expected: Vercel dashboard shows deployment in progress. URL format:
`https://app-name-<hash>-team.vercel.app`
### Step 4: Smoke test production (5 min)
```bash
# Health check
curl -sf https://your-app.vercel.app/api/health | jq .
# Critical path
curl -sf https://your-app.vercel.app/api/users/me \
  -H "Authorization: Bearer $TEST_TOKEN" | jq '.id'
```
✅ Expected: health returns `{"status":"ok","db":"connected"}`. Users API returns valid ID.
### Step 5: Monitor for 10 min
- Check Vercel Functions log for errors: `vercel logs --since=10m`
- Check error rate in Vercel Analytics: < 1% 5xx
- Check DB connection pool: `SELECT count(*) FROM pg_stat_activity;` (< 80% of max_connections)
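## Rollback
If smoke tests fail or error rate spikes:
```bash
# Instant rollback via Vercel (preferred, < 30 sec)
vercel rollback [previous-deployment-url]
# Database rollback (only if a migration was applied)
# WARNING: do NOT run `prisma migrate reset` against production. It drops all data
# and re-applies every migration. Prisma has no automatic down migrations, so apply
# a reviewed down-migration script instead ([down-migration.sql] is a placeholder).
DATABASE_URL=$PROD_DATABASE_URL npx prisma db execute --file [down-migration.sql] --schema prisma/schema.prisma
# If the migration failed part-way, also mark it as rolled back:
#   npx prisma migrate resolve --rolled-back [migration-name]
# Confirm data impact first.
```
✅ Expected after rollback: Previous deployment URL becomes active. Verify with smoke test.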
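## Escalation
- **L1 (on-call engineer):** Check Vercel logs, run smoke tests, attempt rollback
- **L2 (platform lead):** DB issues, data loss risk, rollback failed. Slack: @platform-lead
- **L3 (CTO):** Production down > 30 min, data breach. PagerDuty: #critical-incidents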
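2. Incident Response Runbook
# Incident Response Runbook
**Severity levels:** P1 (down), P2 (degraded), P3 (minor)
**Estimated total time:** P1: 30–60 min, P2: 1–4 hours
## Phase 1: Triage (5 min)
### Confirm the incident
```bash
# Is the app responding?
curl -sw "%{http_code}" https://your-app.vercel.app/api/health -o /dev/null
# Check Vercel function errors (last 15 min)
vercel logs --since=15m | grep -i "error\|exception\|5[0-9][0-9]"
```
✅ 200 = app up. 5xx or timeout = incident confirmed.
Declare severity:
- Site completely down → P1: page L2/L3 immediately
- Partial degradation / slow responses → P2: notify team channel
- Single feature broken → P3: create ticket, fix in business hours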
## Phase 2: Diagnose (10–15 min)
```bash
# Recent deployments: did something just ship?
vercel ls --limit=5
# Database health
psql $DATABASE_URL -c "SELECT pid, state, wait_event, query FROM pg_stat_activity WHERE state != 'idle' LIMIT 20;"
# Long-running queries (> 30 sec)
psql $DATABASE_URL -c "SELECT pid, now() - pg_stat_activity.query_start AS duration, query FROM pg_stat_activity WHERE state = 'active' AND now() - pg_stat_activity.query_start > interval '30 seconds';"
```
Connection pool saturation
psql $DATABASE_URL -c SELECT count(*), max