Runbook Generator

Tier: POWERFUL
Category: Engineering
Domain: DevOps / Site Reliability Engineering

Overview

Analyze a codebase and generate production-grade operational runbooks. Detects your stack (CI/CD, database, hosting, containers), then produces step-by-step runbooks with copy-paste commands, verification checks, rollback procedures, escalation paths, and time estimates. Keeps runbooks fresh with staleness detection linked to config file modification dates.

Core Capabilities

- Stack detection: auto-identify CI/CD, database, hosting, and orchestration from repo files
- Runbook types: deployment, incident response, database maintenance, scaling, monitoring setup
- Format discipline: numbered steps, copy-paste commands, ✅ verification checks, time estimates
- Escalation paths: L1 → L2 → L3 with contact info and decision criteria
- Rollback procedures: every deployment step has a corresponding undo
- Staleness detection: runbook sections reference config files; flag when the source changes
- Testing methodology: dry-run framework for staging validation, quarterly review cadence

When to Use

Use when:

- A codebase has no runbooks and you need to bootstrap them fast
- Existing runbooks are outdated or incomplete (point at the repo, regenerate)
- Onboarding a new engineer who needs clear operational procedures
- Preparing for an incident response drill or audit
- Setting up monitoring and on-call rotation from scratch

Skip when:

- The system is too early-stage to have stable operational patterns
- Runbooks already exist and only need minor updates (edit directly)

Stack Detection

When given a repo, scan for these signals before writing a single runbook line:

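```bash
# CI/CD
ls .github/workflows/        # → GitHub Actions
ls .gitlab-ci.yml            # → GitLab CI
ls Jenkinsfile               # → Jenkins
ls .circleci/                # → CircleCI
ls bitbucket-pipelines.yml   # → Bitbucket Pipelines

# Database
grep -r "postgresql\|postgres\|pg" package.json pyproject.toml   # → PostgreSQL
grep -r "mysql\|mariadb" package.json                            # → MySQL
grep -r "mongodb\|mongoose" package.json                         # → MongoDB
grep -r "redis" package.json                                     # → Redis
ls prisma/schema.prisma      # → Prisma ORM (check the provider field)
ls drizzle.config.*          # → Drizzle ORM

# Hosting
ls vercel.json               # → Vercel
ls railway.toml              # → Railway
ls fly.toml                  # → Fly.io
ls .ebextensions/            # → AWS Elastic Beanstalk
ls terraform/ *.tf           # → Custom AWS/GCP/Azure (check the provider)
ls kubernetes/ k8s/          # → Kubernetes
ls docker-compose.yml        # → Docker Compose

# Framework
ls next.config.*             # → Next.js
ls nuxt.config.*             # → Nuxt
ls svelte.config.*           # → SvelteKit
cat package.json | jq .scripts   # → check build/start commands
```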

Map detected stack → runbook templates. A Next.js + PostgreSQL + Vercel + GitHub Actions repo needs:

- Deployment runbook (Vercel + GitHub Actions)
- Database runbook (PostgreSQL backup, migration, vacuum)
- Incident response (with Vercel logs + pg query debugging)
- Monitoring setup (Vercel Analytics, pg_stat, alerting)
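
The detection step could be automated with a small script. The following is a minimal sketch, not the generator's actual implementation: the function name `detect_stack` and its output labels are assumptions for illustration, and it checks only a handful of marker files.

```bash
# Hypothetical helper: print one "category: tool" line per detected marker file.
# Usage: detect_stack <repo-dir>
detect_stack() {
  repo="${1:-.}"

  # CI/CD
  [ -d "$repo/.github/workflows" ] && echo "ci: GitHub Actions"
  [ -f "$repo/.gitlab-ci.yml" ]    && echo "ci: GitLab CI"
  [ -f "$repo/Jenkinsfile" ]       && echo "ci: Jenkins"

  # Hosting
  [ -f "$repo/vercel.json" ]        && echo "hosting: Vercel"
  [ -f "$repo/fly.toml" ]           && echo "hosting: Fly.io"
  [ -f "$repo/docker-compose.yml" ] && echo "hosting: Docker Compose"

  # Database (read from the Prisma schema's provider field, if present)
  if [ -f "$repo/prisma/schema.prisma" ] &&
     grep -q 'provider *= *"postgresql"' "$repo/prisma/schema.prisma"; then
    echo "database: PostgreSQL"
  fi

  return 0  # missing markers are not an error
}
```

Each output line can then be used to pick the runbook templates to generate, for example `detect_stack . | grep '^hosting:'`.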

Runbook Types

1. Deployment Runbook

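# Deployment Runbook - [App Name]

**Stack:** Next.js 14 + PostgreSQL 15 + Vercel
**Last verified:** 2025-03-01
**Source configs:** `vercel.json` (modified: `git log -1 --format=%ci -- vercel.json`)
**Owner:** Platform Team
**Est. total time:** 15–25 min

## Pre-deployment Checklist

- [ ] All PRs merged to main
- [ ] CI passing on main (GitHub Actions green)
- [ ] Database migrations tested in staging
- [ ] Rollback plan confirmed

## Steps

### Step 1 - Run CI checks locally (3 min)

```bash
pnpm test
pnpm lint
pnpm build
```

✅ Expected: All pass with 0 errors. Build output in `.next/`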

### Step 2 — Apply database migrations (5 min)

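```bash
# Staging first
DATABASE_URL=$STAGING_DATABASE_URL npx prisma migrate deploy
```

✅ Expected: `All migrations have been successfully applied.`

```bash
# Verify the migration was recorded in Prisma's migration table
psql "$STAGING_DATABASE_URL" -c "SELECT migration_name, finished_at FROM _prisma_migrations ORDER BY finished_at DESC LIMIT 5;"
```

✅ Expected: The newest row in `_prisma_migrations` is the new migration, with today's date in `finished_at`.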

### Step 3 — Deploy to production (5 min)

```bash
git push origin main
# OR trigger manually:
vercel --prod
```

✅ Expected: Vercel dashboard shows deployment in progress. URL format:
`https://app-name-<hash>-team.vercel.app`

### Step 4 — Smoke test production (5 min)

```bash
# Health check
curl -sf https://your-app.vercel.app/api/health | jq .

# Critical path
curl -sf https://your-app.vercel.app/api/users/me \
  -H "Authorization: Bearer $TEST_TOKEN" | jq '.id'
```

✅ Expected: health returns `{"status":"ok","db":"connected"}`. Users API returns valid ID.

### Step 5 — Monitor for 10 min
- Check Vercel Functions log for errors: `vercel logs --since=10m`
- Check error rate in Vercel Analytics: < 1% 5xx
- Check DB connection pool: `SELECT count(*) FROM pg_stat_activity;` (< 80% of max_connections)

---

## Rollback

If smoke tests fail or error rate spikes:

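```bash
# Instant rollback via Vercel (preferred, < 30 sec)
vercel rollback [previous-deployment-url]

# Database rollback (only if a migration was applied)
# WARNING: do NOT run `prisma migrate reset` against production. It drops the
# database and re-applies every migration, which destroys all data.
# Revert the schema with a reviewed down-migration script instead
# (down.sql is an assumed file name), and confirm data impact first.
psql "$PROD_DATABASE_URL" -f down.sql
```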


✅ Expected after rollback: Previous deployment URL becomes active. Verify with smoke test.

---

## Escalation
- **L1 (on-call engineer):** Check Vercel logs, run smoke tests, attempt rollback
- **L2 (platform lead):** DB issues, data loss risk, rollback failed — Slack: @platform-lead
- **L3 (CTO):** Production down > 30 min, data breach — PagerDuty: #critical-incidents
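
Step 5's connection-pool threshold is easy to misjudge by eye during a deploy. A minimal sketch of the arithmetic follows; `pool_ok` is a hypothetical helper name, and in practice the two numbers would come from `pg_stat_activity` and the `max_connections` setting.

```bash
# Hypothetical helper: succeed when active connections are below 80% of the limit.
# Usage: pool_ok <current-connections> <max-connections>
pool_ok() {
  current="$1"
  max="$2"
  # Integer math: current/max < 80/100  <=>  current*100 < max*80
  [ $((current * 100)) -lt $((max * 80)) ]
}

# Example: 79 of 100 connections is fine, 80 of 100 is not.
pool_ok 79 100 && echo "pool ok"
pool_ok 80 100 || echo "pool saturated: escalate to L2"
```

Multiplying both sides instead of dividing keeps the check in integer arithmetic, which is all that POSIX shell arithmetic supports.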

2. Incident Response Runbook

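# Incident Response Runbook

**Severity levels:** P1 (down), P2 (degraded), P3 (minor)
**Est. total time:** P1: 30–60 min, P2: 1–4 hours

## Phase 1 - Triage (5 min)

### Confirm the incident

```bash
# Is the app responding?
curl -sw "%{http_code}" https://your-app.vercel.app/api/health -o /dev/null

# Check Vercel function errors (last 15 min)
vercel logs --since=15m | grep -i "error\|exception\|5[0-9][0-9]"
```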

✅ 200 = app up. 5xx or timeout = incident confirmed.

Declare severity:
- Site completely down → P1 — page L2/L3 immediately
- Partial degradation / slow responses → P2 — notify team channel
- Single feature broken → P3 — create ticket, fix in business hours

---

## Phase 2 — Diagnose (10–15 min)

```bash
# Recent deployments: did something just ship?
vercel ls --limit=5

# Database health
psql "$DATABASE_URL" -c "SELECT pid, state, wait_event, query FROM pg_stat_activity WHERE state != 'idle' LIMIT 20;"

# Long-running queries (> 30 sec)
psql "$DATABASE_URL" -c "SELECT pid, now() - pg_stat_activity.query_start AS duration, query FROM pg_stat_activity WHERE state = 'active' AND now() - pg_stat_activity.query_start > interval '30 seconds';"

# Connection pool saturation
psql "$DATABASE_URL" -c "SELECT count(*), max_conn FROM pg_stat_activity, (SELECT setting::int AS max_conn FROM pg_settings WHERE name='max_connections') t GROUP BY max_conn;"
```


Diagnostic decision tree:
- Recent deploy + new errors → rollback (see Deployment Runbook)
- DB query timeout / pool saturation → kill long queries, scale connections
- External dependency failing → check status pages, add circuit breaker
- Memory/CPU spike → check Vercel function logs for infinite loops

---

## Phase 3 — Mitigate (variable)

```bash
# Kill a runaway DB query (take the pid from the diagnosis queries)
psql "$DATABASE_URL" -c "SELECT pg_terminate_backend(<pid>);"

# Scale DB connections (Supabase/Neon: adjust pool size)
# Vercel → Settings → Environment Variables → update DATABASE_POOL_MAX

# Enable maintenance mode (if you have a feature flag)
# The Vercel CLI reads the variable's value from stdin.
printf 'true' | vercel env add MAINTENANCE_MODE production
vercel --prod   # redeploy with flag
```


---

## Phase 4 — Resolve & Postmortem

After incident is resolved, within 24 hours:

1. Write incident timeline (what happened, when, who noticed, what fixed it)
2. Identify root cause (5-Whys)
3. Define action items with owners and due dates
4. Update this runbook if a step was missing or wrong
5. Add monitoring/alert that would have caught this earlier

**Postmortem template:** `docs/postmortems/YYYY-MM-DD-incident-title.md`

---

## Escalation Path

| Level | Who | When | Contact |
|-------|-----|------|---------|
| L1 | On-call engineer | Always first | PagerDuty rotation |
| L2 | Platform lead | DB issues, rollback needed | Slack @platform-lead |
| L3 | CTO/VP Eng | P1 > 30 min, data loss | Phone + PagerDuty |
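
The triage rule in Phase 1 (200 means up, 5xx or timeout means incident) can be encoded so the on-call engineer does not have to interpret status codes under pressure. This is a sketch under assumptions: `classify_health` is a hypothetical name, the `investigate` verdict for other codes is an addition not in the runbook, and it relies on curl printing `000` for `%{http_code}` when no HTTP response was received.

```bash
# Hypothetical helper: map the health check's HTTP status code to a triage verdict.
# Usage: classify_health <http-code>
classify_health() {
  case "$1" in
    200)             echo "up" ;;
    5[0-9][0-9]|000) echo "incident" ;;      # 5xx, or no response at all
    *)               echo "investigate" ;;   # e.g. 401/404: endpoint or auth problem
  esac
}

# Typical use, with the health URL from this runbook:
#   code=$(curl -s -o /dev/null -w "%{http_code}" --max-time 10 https://your-app.vercel.app/api/health)
#   classify_health "$code"
```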

3. Database Maintenance Runbook

## Backup

```bash
# Full backup
pg_dump "$DATABASE_URL" \
  --format=custom \
  --compress=9 \
  --file="backup-$(date +%Y%m%d-%H%M%S).dump"
```

✅ Expected: File created, size > 0. `pg_restore --list backup.dump | head -20` shows tables.

Verify backup is restorable (test monthly):

```bash
pg_restore --dbname="$STAGING_DATABASE_URL" backup.dump
psql "$STAGING_DATABASE_URL" -c "SELECT count(*) FROM users;"
```

✅ Expected: Row count matches production.

## Migration

```bash
# Always test in staging first
DATABASE_URL=$STAGING_DATABASE_URL npx prisma migrate deploy

# Verify, then:
DATABASE_URL=$PROD_DATABASE_URL npx prisma migrate deploy
```

✅ Expected: `All migrations have been successfully applied.`

⚠️ For large table migrations (> 1M rows), use `pg_repack` or add column with DEFAULT separately to avoid table locks.

## Vacuum & Reindex

```bash
# Check bloat before deciding
# (pg_stat_user_tables exposes the table name as relname and its OID as relid)
psql "$DATABASE_URL" -c "
SELECT schemaname, relname,
  pg_size_pretty(pg_total_relation_size(relid)) AS total_size,
  n_dead_tup, n_live_tup,
  ROUND(n_dead_tup::numeric / NULLIF(n_live_tup + n_dead_tup, 0) * 100, 1) AS dead_ratio
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC LIMIT 10;"

# Vacuum high-bloat tables (non-blocking)
psql "$DATABASE_URL" -c "VACUUM ANALYZE users;"
psql "$DATABASE_URL" -c "VACUUM ANALYZE events;"

# Reindex (use CONCURRENTLY to avoid locks)
psql "$DATABASE_URL" -c "REINDEX INDEX CONCURRENTLY users_email_idx;"
```

✅ Expected: dead_ratio drops below 5% after vacuum.
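
The "file created, size > 0" check for backups is worth scripting so that an empty dump never passes unnoticed. The sketch below covers only that first check; `verify_backup` is a hypothetical helper, and listing the archive with `pg_restore --list` is still needed to confirm the contents.

```bash
# Hypothetical helper: fail loudly if a backup file is missing or empty.
# Usage: verify_backup <path-to-dump>
verify_backup() {
  file="$1"
  if [ ! -s "$file" ]; then       # -s: file exists and has size > 0
    echo "FAIL: $file is missing or empty" >&2
    return 1
  fi
  echo "OK: $file ($(wc -c < "$file" | tr -d ' ') bytes)"
}
```

Because the function returns non-zero on failure, it can gate later steps, for example `verify_backup backup.dump && pg_restore --list backup.dump | head -20`.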

Staleness Detection

Add a staleness header to every runbook:

CODEBLOCK17

Automation: Add a CI job that runs weekly and comments on the runbook doc if any referenced file was modified more recently than the runbook's "Last verified" date.
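
The weekly check described here reduces to one date comparison per referenced file. Below is a minimal sketch, assuming both dates are in `YYYY-MM-DD` form; `is_stale` is a hypothetical helper, and the commented usage (including the example date and file name) only illustrates how the modification date could be read from git.

```bash
# Hypothetical helper: succeed (exit 0) when the config changed after the runbook was verified.
# Usage: is_stale <last-verified YYYY-MM-DD> <last-modified YYYY-MM-DD>
is_stale() {
  verified=$(printf '%s' "$1" | tr -d '-')   # 2025-03-01 -> 20250301
  modified=$(printf '%s' "$2" | tr -d '-')
  [ "$modified" -gt "$verified" ]
}

# In CI, the modification date could come from git, for example:
#   modified=$(git log -1 --format=%cd --date=short -- vercel.json)
#   is_stale "2025-03-01" "$modified" && echo "STALE: vercel.json changed since last verification"
```

Stripping the dashes turns each date into an integer, so the comparison works in plain POSIX shell without relying on `date` parsing, which differs between GNU and BSD systems.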

Runbook Testing Methodology

Dry-Run in Staging

Before trusting a runbook in production, validate every step in staging:

CODEBLOCK18
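
One way to rehearse a runbook without side effects is to route every command through a wrapper that only prints it when a flag is set. This is a sketch of that idea, not a prescribed framework: the `run` function and the `DRY_RUN` variable are assumptions.

```bash
# Hypothetical wrapper: print the command when DRY_RUN=1, otherwise execute it.
# Usage: run <command> [args...]
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "[dry-run] $*"
  else
    "$@"
  fi
}

# Example: rehearse a migration step without touching the database.
DRY_RUN=1 run npx prisma migrate deploy
```

With `DRY_RUN=1` the example prints `[dry-run] npx prisma migrate deploy` and executes nothing. A first pass like this catches typos and missing variables; the staging pass then runs the same steps for real.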

Quarterly Review Cadence

Schedule a 1-hour review every quarter:

1. Run each command in staging: does it still work?
2. Check config drift: compare "Last Modified" dates vs "Last verified"
3. Test rollback procedures: actually roll back in staging
4. Update contact info: L1/L2/L3 may have changed
5. Add new failure modes discovered in the past quarter
6. Update "Last verified" date at top of runbook

Common Pitfalls

| Pitfall | Fix |
|---------|-----|
| Commands that require manual copy of dynamic values | Use env vars: `$DATABASE_URL`, not a hardcoded connection string |
| No expected output specified | Add ✅ with exact expected string after every verification step |
| Rollback steps missing | Every destructive step needs a corresponding undo |
| Runbooks that never get tested | Schedule quarterly staging dry-runs in team calendar |
| L3 escalation contact is the former CTO | Review contact info every quarter |
| Migration runbook doesn't mention table locks | Call out lock risk for large table operations explicitly |

Best Practices

1. Every command must be copy-pasteable: no placeholder text, use env vars
2. ✅ after every step: explicit expected output, not "it should work"
3. Time estimates are mandatory: engineers need to know if they have time to fix before SLA breach
4. Rollback before you deploy: plan the undo before executing
5. Runbooks live in the repo: `docs/runbooks/`, versioned with the code they describe
6. Postmortem → runbook update: every incident should improve a runbook
7. Link, don't duplicate: reference the canonical config file, don't copy its contents into the runbook
8. Test runbooks like you test code: untested runbooks are worse than no runbooks (false confidence)
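
Relying on env vars makes commands copy-pasteable only if those variables are actually set. A small guard at the top of a runbook script can fail fast when they are not. The sketch below assumes a helper named `require_env`, which is hypothetical.

```bash
# Hypothetical helper: stop early when a required variable is unset or empty.
# Usage: require_env VAR_NAME [VAR_NAME...]
require_env() {
  for name in "$@"; do
    eval "value=\${$name:-}"       # indirect lookup that works in POSIX sh
    if [ -z "$value" ]; then
      echo "Missing required env var: $name" >&2
      return 1
    fi
  done
}

# Example: guard a deployment script before any command runs.
#   require_env DATABASE_URL TEST_TOKEN || exit 1
```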
