Agent QA Gates
A field-tested validation system for AI agent output. Born from production failures, not theory.
Quick Start
Before any agent delivers output, run the Pre-Ship Checklist:
1. Accurate? Every number/date/metric has a source. Unsourced → prefix "estimated".
2. Complete? No missing pieces, no "I'll do that next".
3. Actionable? Ends with a clear next step or decision point.
4. Fits the channel? Check character limits for your delivery surface.
5. No leaks? No internal context, private data, or secrets.
6. Not a duplicate? Verify no recent identical send.
7. Would the human be embarrassed? If yes, don't ship.
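Some of the checklist items are mechanical enough to automate. The following is a minimal sketch of three of them (completeness, channel fit, duplicate detection), not a reference implementation. The class name `PreShipChecker`, the channel names and limits in `CHANNEL_LIMITS`, and the phrase list are all assumptions made for illustration. Accuracy, leaks, and the embarrassment test still need a reviewer or separate tooling.

```python
import hashlib
from dataclasses import dataclass, field

# Hypothetical per-channel character limits; substitute your own surfaces.
CHANNEL_LIMITS = {"chat": 2000, "sms": 160}

# Illustrative phrases that signal deferred work (the "Complete?" item).
DEFERRAL_PHRASES = ("i'll do that next", "todo", "tbd")


def _digest(text: str) -> str:
    """Stable fingerprint used for the duplicate check."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass
class PreShipChecker:
    sent_hashes: set = field(default_factory=set)

    def check(self, text: str, channel: str) -> list[str]:
        """Return the names of failed checklist items (empty list = no failures)."""
        failures = []
        lowered = text.lower()
        if any(phrase in lowered for phrase in DEFERRAL_PHRASES):
            failures.append("complete")
        if len(text) > CHANNEL_LIMITS[channel]:
            failures.append("fits_channel")
        if _digest(text) in self.sent_hashes:
            failures.append("duplicate")
        return failures

    def record_sent(self, text: str) -> None:
        """Remember a shipped message so an identical resend is flagged."""
        self.sent_hashes.add(_digest(text))
```

Call `check` before every send and `record_sent` only after a successful delivery, so a failed send does not poison the duplicate check.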
Gate Tiers
Four ascending tiers by risk level:
| Gate | Scope | Key Checks |
|---|---|---|
| Gate 0 | Internal (files, config, memory) | Mechanism changed not just text, no placeholders, file exists |
| Gate 1 | Human-facing (briefings, summaries) | Key info in first 2 lines, ≤3-line paragraphs, channel length limits |
| Gate 2 | External (email, public content, client materials) | No internal context leaked, recipient-appropriate tone, dedup check |
| Gate 3 | Code & technical | Builds clean, no secrets in code, error handling, tests pass |
See references/gates-detail.md for full gate checklists.
Severity Classification
Not all failures are equal:
- 🔴 BLOCK: cannot ship (secrets, privacy, hallucinated data, wrong recipient)
- 🟡 FIX: fix before shipping, <2 min (formatting, too long, missing citation)
- 🟢 NOTE: log and ship (style preference, minor optimization)
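The three severities reduce to a "worst finding wins" rule. A small sketch of that rule follows; the enum `Severity`, the function `ship_decision`, and the returned action strings are illustrative names, not part of the skill itself.

```python
from enum import IntEnum


class Severity(IntEnum):
    NOTE = 1   # log and ship
    FIX = 2    # fix before shipping
    BLOCK = 3  # cannot ship


def ship_decision(findings: list[Severity]) -> str:
    """Map the worst finding in a review to a single action."""
    worst = max(findings, default=None)
    if worst is None:
        return "ship"
    if worst is Severity.BLOCK:
        return "block"
    if worst is Severity.FIX:
        return "fix_then_ship"
    return "log_and_ship"
```

Ordering the enum by integer value keeps the comparison trivial: one BLOCK outweighs any number of NOTE or FIX findings.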
Protocol Gates
Recurring failure modes need dedicated gates. These are the most common:
Heartbeat / Periodic Check Output
- Binary output: alert text ONLY or status-OK ONLY. Never mixed.
- Every data point verified by current-session tool call. No hallucinated metrics.
- No stale data from previous cycles or pre-compaction sessions.
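The binary-output rule can be checked mechanically. The sketch below assumes the status-OK signal is a single fixed token, here the hypothetical `STATUS_OK`; substitute whatever your system actually emits. It does not cover the other two items, which depend on tool-call logs.

```python
STATUS_OK = "STATUS_OK"  # hypothetical token, not defined by this skill


def is_binary_heartbeat(output: str) -> bool:
    """True if output is exactly the OK token, or alert text with no OK token in it."""
    text = output.strip()
    if text == STATUS_OK:
        return True
    # Alert text must be non-empty and must not embed the OK token.
    return bool(text) and STATUS_OK not in text
```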
Post-Compaction / Context Reset
- Do not trust facts from the pre-reset session; verify them from files and tools.
- Rerun pending checks from scratch.
- Zero carryover for periodic checks.
Scheduled Job / Cron Changes
- Explicit timeout set
- Explicit model set
- Verify schedule after creation
- Output fits destination channel limits
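These four checks could be run as one function whenever a job is created or edited. This is a sketch under assumptions: `CronJobSpec` and its field names are hypothetical and need mapping to your scheduler's real config, and `stored_schedule` stands for whatever the scheduler reports back after creation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CronJobSpec:
    # Hypothetical field names; adapt them to your scheduler.
    schedule: str
    timeout_seconds: Optional[int] = None
    model: Optional[str] = None
    max_output_chars: Optional[int] = None


def cron_gate(requested: CronJobSpec, stored_schedule: str, channel_limit: int) -> list[str]:
    """Return a list of problems; an empty list means the change passes the gate."""
    problems = []
    if requested.timeout_seconds is None or requested.timeout_seconds <= 0:
        problems.append("timeout not set explicitly")
    if not requested.model:
        problems.append("model not set explicitly")
    # Compare field by field so whitespace differences do not matter.
    if stored_schedule.split() != requested.schedule.split():
        problems.append("stored schedule differs from requested schedule")
    if requested.max_output_chars is None or requested.max_output_chars > channel_limit:
        problems.append("output may exceed channel limit")
    return problems
```

Reading the schedule back from the scheduler, rather than trusting the value that was submitted, is what makes the third check a verification and not a restatement.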
Sub-Agent Output Review
- Does output match the brief's success criteria?
- Any uncertainty flags unresolved?
- Is the reasoning (not just the conclusion) sound?
Isolated Agent / Cron Output (real-world data)
For any cron or sub-agent that reports external data without orchestrator review:
- Did the agent make a verifiable live tool call? Is the raw response traceable?
- Any names, dates, amounts, or IDs that can't be traced to a tool result? → 🔴 BLOCK
- If the tool call failed: output must be `DATA_UNAVAILABLE — [reason]`, not fabricated data
- Does the cron prompt include the Real-World Data Verification Rule?
Severity: Fabricated real-world data = 🔴 BLOCK. Same as hallucinated metrics.
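Part of the traceability check can be approximated in code. The sketch below flags any digit-bearing token in the output (amounts, dates, IDs) that does not appear verbatim in the raw tool responses, and requires the `DATA_UNAVAILABLE` prefix when the tool call failed. The function names are illustrative, the substring match is deliberately coarse, and names without digits still need a separate check.

```python
import re

UNAVAILABLE_PREFIX = "DATA_UNAVAILABLE"

# Candidate tokens: runs of letters/digits that may contain . / : - inside.
TOKEN_RE = re.compile(r"[A-Za-z0-9][A-Za-z0-9./:-]*")


def untraceable_tokens(output: str, raw_tool_responses: list[str]) -> list[str]:
    """Return digit-bearing tokens from output that no raw tool response contains."""
    evidence = "\n".join(raw_tool_responses)
    tokens = [t.rstrip(".:,") for t in TOKEN_RE.findall(output)]
    return [t for t in tokens if any(c.isdigit() for c in t) and t not in evidence]


def isolated_output_gate(output: str, raw_tool_responses: list[str], tool_call_ok: bool) -> str:
    """Return "ship" or "block" for output that no orchestrator will review."""
    if not tool_call_ok:
        # After a failed call, only an explicit unavailability notice may ship.
        return "ship" if output.startswith(UNAVAILABLE_PREFIX) else "block"
    return "block" if untraceable_tokens(output, raw_tool_responses) else "ship"
```

This only works if the raw tool responses are kept alongside the output, which is also what makes the first checklist item answerable.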
Delegated Work Acceptance
For any non-trivial delegated task (especially builds, audits, config changes, or external deliverables):
- Does the handoff include a clear artifact path or proof object?
- Did the worker report exact commands run rather than vague claims?
- Did verification actually happen, with results stated?
- Is the output non-empty and specific, not just "done" or "completed successfully"?
- Are known gaps / next actions named explicitly?
- If the handoff is empty, artifact-free, or self-certifying without proof → 🔴 BLOCK
- Valid dispositions: `Done`, `Revision Needed`, `Blocked`, `Failed`, `Stale`
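The acceptance questions above can be encoded as a structured handoff plus a review function. This is a sketch, assuming a hypothetical `Handoff` record; the field names and the set of vague phrases are illustrative. It reports whether the handoff must be blocked and lists every problem found, leaving the final disposition to the reviewer.

```python
from dataclasses import dataclass, field
from typing import Optional

# Reports that assert completion without saying anything specific.
VAGUE_REPORTS = {"", "done", "completed successfully"}


@dataclass
class Handoff:
    # Hypothetical handoff shape; adapt to your own worker protocol.
    summary: str
    artifact_path: str = ""
    commands_run: list = field(default_factory=list)
    verification_result: str = ""
    known_gaps: Optional[list] = None  # None = not stated; [] = explicitly none


def review_handoff(h: Handoff) -> tuple[bool, list[str]]:
    """Return (must_block, problems) for a delegated-work handoff."""
    problems = []
    if h.summary.strip().lower().rstrip(".!") in VAGUE_REPORTS:
        problems.append("report is empty or vague")
    if not h.artifact_path:
        problems.append("no artifact path or proof object")
    if not h.verification_result:
        problems.append("verification not reported")
    # Empty, artifact-free, or self-certifying handoffs are blocking.
    must_block = bool(problems)
    if not h.commands_run:
        problems.append("no exact commands reported")
    if h.known_gaps is None:
        problems.append("known gaps / next actions not stated")
    return must_block, problems
```

Distinguishing `None` from an empty list forces the worker to state "no known gaps" explicitly instead of leaving the question unanswered.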
Silent Worker / Stale Task Classification
For delegated work that appears to be running:
- Was the spawn actually accepted? If not, it is not running.
- No start signal within 10 minutes after accepted spawn → `Stale`
- No materially new output for 30 minutes on active work → `Stale`, unless the task explicitly justifies a longer quiet window
- Stale work must be investigated, respawned, or escalated; never left indefinitely as in progress
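The two time thresholds translate directly into a classifier. In the sketch below only the 10-minute and 30-minute windows and the `Stale` label come from the rules above; the function name, its parameters, and the other three labels are assumptions for illustration.

```python
from datetime import datetime, timedelta
from typing import Optional

START_WINDOW = timedelta(minutes=10)  # max wait for a start signal
QUIET_WINDOW = timedelta(minutes=30)  # max gap between materially new outputs


def classify_task(
    now: datetime,
    spawn_accepted: bool,
    accepted_at: Optional[datetime],
    started_at: Optional[datetime],
    last_output_at: Optional[datetime],
    quiet_window: timedelta = QUIET_WINDOW,
) -> str:
    """Classify delegated work that appears to be running."""
    if not spawn_accepted:
        return "Not Running"
    if started_at is None:
        return "Stale" if now - accepted_at > START_WINDOW else "Pending Start"
    # Before the first output arrives, measure quiet time from the start signal.
    last_activity = last_output_at or started_at
    return "Stale" if now - last_activity > quiet_window else "Active"
```

A task that justifies a longer quiet period passes its own `quiet_window`, which keeps the exception explicit at the call site.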
Gate Evolution
Gates should evolve based on real failures, not imagination:
1. When a failure occurs → log it with root cause
2. Same failure class occurs 2+ times → add a gate item
3. Monthly: prune gates that haven't caught anything in 60 days
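Both evolution rules can be driven from two small records: a log of failure classes and the date each gate last caught something. The following sketch assumes those records exist in the shapes shown; the function names and data shapes are hypothetical.

```python
from collections import Counter
from datetime import date, timedelta


def classes_needing_gates(failure_log: list[str], existing_gates: set[str]) -> list[str]:
    """Failure classes seen 2+ times that no gate covers yet."""
    counts = Counter(failure_log)
    return sorted(c for c, n in counts.items() if n >= 2 and c not in existing_gates)


def gates_to_prune(last_caught: dict[str, date], today: date, max_idle_days: int = 60) -> list[str]:
    """Gates whose most recent catch is older than max_idle_days."""
    cutoff = today - timedelta(days=max_idle_days)
    return sorted(g for g, caught in last_caught.items() if caught < cutoff)
```

For a gate that has never caught anything, one option is to seed `last_caught` with its creation date so that new gates get a full 60 days before they become pruning candidates.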
Anti-Patterns
- Gates that sound good but never catch anything → kill them
- Per-agent checklists that duplicate general gates → merge or reference
- "ADHD-friendly" or "high-quality" as gate items → not testable, replace with mechanical checks
- Aspirational gates nobody runs → either automate or cut
Adapting to Your System
This skill provides the pattern. Adapt it:
1. Start with the Pre-Ship Checklist: it works for any agent system
2. Add Protocol Gates for your top 3 recurring failure modes
3. Set channel limits for your delivery surfaces
4. Map real failures to gates: if a failure isn't gated, add the gate
5. Kill gates that never fire: a shorter, sharper checklist wins
For the full reference implementation, see references/gates-detail.md.
For automation scripts, see scripts/qa-check.sh.