Agent QA Gates

A field-tested validation system for AI agent output. Born from production failures, not theory.

Quick Start

Before any agent delivers output, run the Pre-Ship Checklist:

1. Accurate? Every number, date, and metric has a source. Unsourced → prefix "estimated".
2. Complete? No missing pieces, no "I'll do that next".
3. Actionable? Ends with a clear next step or decision point.
4. Fits the channel? Check character limits for your delivery surface.
5. No leaks? No internal context, private data, or secrets.
6. Not a duplicate? Verify no recent identical send.
7. Would the human be embarrassed? If yes, don't ship.
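
The mechanical items on the checklist above can be automated. The following is a minimal sketch, not a reference implementation: the `Draft` type, its field names, and the phrase match used for "Complete?" are assumptions made for illustration. The remaining items (accuracy, actionability, leaks, embarrassment) need judgment or dedicated tooling and are left out.

```python
from dataclasses import dataclass, field


@dataclass
class Draft:
    """Hypothetical container for an outgoing agent message."""
    text: str
    channel_limit: int  # character limit of the delivery surface (assumed known)
    recent_sends: list = field(default_factory=list)  # texts sent recently


def complete(d: Draft) -> bool:
    # Crude proxy: reject drafts that defer work to later.
    return "i'll do that next" not in d.text.lower()


def fits_channel(d: Draft) -> bool:
    return len(d.text) <= d.channel_limit


def not_duplicate(d: Draft) -> bool:
    return d.text not in d.recent_sends


# Check names mirror the checklist wording; order follows the checklist.
CHECKS = [
    ("Complete?", complete),
    ("Fits the channel?", fits_channel),
    ("Not a duplicate?", not_duplicate),
]


def pre_ship_failures(d: Draft) -> list:
    """Return the names of the mechanical checks the draft fails."""
    return [name for name, check in CHECKS if not check(d)]
```

An empty result means only the judgment-based items remain to be reviewed by hand.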

Gate Tiers

Four ascending tiers by risk level:

Gate	Scope	Key Checks
Gate 0	Internal (files, config, memory)	Mechanism changed not just text, no placeholders, file exists
Gate 1

See references/gates-detail.md for full gate checklists.

Severity Classification

Not all failures are equal:

- 🔴 BLOCK: cannot ship (secrets, privacy, hallucinated data, wrong recipient)
- 🟡 FIX: fix before shipping, <2 min (formatting, too long, missing citation)
- 🟢 NOTE: log and ship (style preference, minor optimization)
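
One way to turn the three severities into a ship decision is to let the worst finding win. This is a sketch under that assumption; the `Severity` enum and the returned strings are illustrative names, not part of the original system.

```python
from enum import Enum


class Severity(Enum):
    BLOCK = "block"  # cannot ship
    FIX = "fix"      # fix before shipping
    NOTE = "note"    # log and ship


def disposition(findings) -> str:
    """Map a collection of Severity findings to a single decision."""
    if Severity.BLOCK in findings:
        return "do not ship"
    if Severity.FIX in findings:
        return "fix, then ship"
    # Only NOTE findings (or none): ship, logging any notes.
    return "ship"
```

For example, `disposition([Severity.NOTE, Severity.BLOCK])` resolves to "do not ship" because a single BLOCK overrides everything else.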

Protocol Gates

Recurring failure modes need dedicated gates. These are the most common:

Heartbeat / Periodic Check Output

- Binary output: alert text ONLY or status-OK ONLY. Never mixed.
- Every data point verified by current-session tool call. No hallucinated metrics.
- No stale data from previous cycles or pre-compaction sessions.
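
The binary-output rule lends itself to a mechanical check. The sketch below assumes the status-OK signal is a fixed token; `STATUS_OK` is a hypothetical value, so substitute whatever token your system uses.

```python
def is_binary_heartbeat(output: str, ok_token: str = "STATUS_OK") -> bool:
    """True if output is either the bare OK token or alert text without it."""
    text = output.strip()
    if not text:
        return False          # empty output is neither an alert nor an OK
    if text == ok_token:
        return True           # status-OK only
    return ok_token not in text  # alert only: must not embed the OK token
```

A message such as "STATUS_OK, but disk usage is high" fails the check because it mixes both forms.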

Post-Compaction / Context Reset

- Do not trust facts from the pre-reset session; verify them from files and tools.
- Rerun pending checks from scratch.
- Zero carryover for periodic checks.

Scheduled Job / Cron Changes

- Explicit timeout set
- Explicit model set
- Verify schedule after creation
- Output fits destination channel limits
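
The four items above can be checked after a job is created by reading its definition back. This sketch assumes the scheduler exposes the stored job as a dictionary; the keys `timeout_seconds`, `model`, and `schedule` are hypothetical and will differ per scheduler.

```python
def cron_change_failures(requested: dict, stored: dict,
                         sample_output: str, channel_limit: int) -> list:
    """Compare the job as stored against the gate; return failed items."""
    failures = []
    if not stored.get("timeout_seconds"):
        failures.append("explicit timeout missing")
    if not stored.get("model"):
        failures.append("explicit model missing")
    # Verify after creation: what was saved must match what was asked for.
    if stored.get("schedule") != requested.get("schedule"):
        failures.append("stored schedule differs from requested")
    if len(sample_output) > channel_limit:
        failures.append("output exceeds channel limit")
    return failures
```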

Sub-Agent Output Review

- Does output match the brief's success criteria?
- Any uncertainty flags unresolved?
- Is the reasoning (not just the conclusion) sound?

Isolated Agent / Cron Output (real-world data)

For any cron or sub-agent that reports external data without orchestrator review:

- Did the agent make a verifiable live tool call? Is the raw response traceable?
- Any names, dates, amounts, or IDs that can't be traced to a tool result? → 🔴 BLOCK
- If the tool call failed: output must be DATA_UNAVAILABLE — [reason], not fabricated data
- Does the cron prompt include the Real-World Data Verification Rule?

Severity: Fabricated real-world data = 🔴 BLOCK. Same as hallucinated metrics.
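
A minimal sketch of the two mechanical parts of this gate: emitting the DATA_UNAVAILABLE form when the tool call fails, and flagging reported values that do not appear in the raw tool response. The function names are hypothetical, and plain substring matching is an assumption; a real system would likely compare parsed fields instead.

```python
def cron_report(fetch) -> str:
    """Run the tool call; on failure return the DATA_UNAVAILABLE form."""
    try:
        return fetch()
    except Exception as exc:
        # "\u2014" is the dash used in the documented output format.
        return f"DATA_UNAVAILABLE \u2014 {exc}"


def untraceable_values(reported_values, raw_response: str) -> list:
    """Return reported names, dates, amounts, or IDs absent from the raw response."""
    return [v for v in reported_values if v not in raw_response]
```

Any non-empty result from `untraceable_values` corresponds to a 🔴 BLOCK under the rule above.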

Delegated Work Acceptance

For any non-trivial delegated task (especially builds, audits, config changes, or external deliverables):

- Does the handoff include a clear artifact path or proof object?
- Did the worker report exact commands run rather than vague claims?
- Did verification actually happen, with results stated?
- Is the output non-empty and specific, not just "done" or "completed successfully"?
- Are known gaps / next actions named explicitly?
- If the handoff is empty, artifact-free, or self-certifying without proof → 🔴 BLOCK
- Valid dispositions: Done, Revision Needed, Blocked, Failed, Stale
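
The explicit BLOCK conditions above (empty, artifact-free, or self-certifying handoffs) can be sketched as a small validator. The `Handoff` type and its field names are assumptions for illustration; the softer questions about exact commands and named gaps are left to the reviewer.

```python
from dataclasses import dataclass

# Summaries that claim success without saying anything specific.
VAGUE = {"", "done", "completed successfully"}


@dataclass
class Handoff:
    """Hypothetical shape of a worker's handoff report."""
    summary: str
    artifact_path: str = ""
    verification_result: str = ""


def blocking_reasons(h: Handoff) -> list:
    """Return the reasons this handoff must be blocked; empty if none."""
    reasons = []
    if h.summary.strip().lower().rstrip(".") in VAGUE:
        reasons.append("empty or vague summary")
    if not h.artifact_path:
        reasons.append("no artifact path or proof object")
    if not h.verification_result:
        reasons.append("no verification result stated")
    return reasons
```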

Silent Worker / Stale Task Classification

For delegated work that appears to be running:

- Was the spawn actually accepted? If not, it is not running.
- No start signal within 10 minutes after accepted spawn → Stale
- No materially new output for 30 minutes on active work → Stale unless the task explicitly justifies a longer quiet window
- Stale work must be investigated, respawned, or escalated, never left indefinitely as In Progress
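
The timing rules above can be expressed as a small classifier. This is a sketch: the 10-minute and 30-minute windows come from the rules, while the labels "Not Running", "Pending", and "Active" and the function signature are assumptions added for illustration.

```python
from datetime import datetime, timedelta

START_WINDOW = timedelta(minutes=10)  # start signal expected within this
QUIET_WINDOW = timedelta(minutes=30)  # default maximum silence on active work


def classify(now, spawn_accepted_at, started_at, last_output_at,
             quiet_window=QUIET_WINDOW) -> str:
    """Classify delegated work from its timestamps (None = never happened)."""
    if spawn_accepted_at is None:
        return "Not Running"  # spawn was never accepted
    if started_at is None:
        waited = now - spawn_accepted_at
        return "Stale" if waited >= START_WINDOW else "Pending"
    last_activity = last_output_at or started_at
    quiet_for = now - last_activity
    return "Stale" if quiet_for >= quiet_window else "Active"
```

Passing a larger `quiet_window` models a task that explicitly justifies a longer quiet period.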

Gate Evolution

Gates should evolve based on real failures, not imagination:

1. When a failure occurs → log it with root cause
2. Same failure class occurs 2+ times → add a gate item
3. Monthly: prune gates that haven't caught anything in 60 days
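
Both evolution rules reduce to simple bookkeeping over a failure log. The sketch below assumes failures are logged as class names and that each gate records the date it last caught something; both data shapes and the function names are hypothetical.

```python
from collections import Counter
from datetime import date, timedelta


def failure_classes_needing_gates(failure_log, gated_classes) -> list:
    """Failure classes seen 2+ times that no gate covers yet."""
    counts = Counter(failure_log)
    return sorted(c for c, n in counts.items()
                  if n >= 2 and c not in gated_classes)


def gates_to_prune(last_caught: dict, today: date, window_days: int = 60) -> list:
    """Gates that have caught nothing within the window (None = never)."""
    cutoff = today - timedelta(days=window_days)
    return sorted(g for g, last in last_caught.items()
                  if last is None or last < cutoff)
```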

Anti-Patterns

- Gates that sound good but never catch anything → kill them
- Per-agent checklists that duplicate general gates → merge or reference
- "ADHD-friendly" or "high-quality" as gate items → not testable, replace with mechanical checks
- Aspirational gates nobody runs → either automate or cut

Adapting to Your System

This skill provides the pattern. Adapt it:

1. Start with the Pre-Ship Checklist; it works for any agent system
2. Add Protocol Gates for your top 3 recurring failure modes
3. Set channel limits for your delivery surfaces
4. Map real failures to gates: if a failure isn't gated, add the gate
5. Kill gates that never fire: a shorter, sharper checklist wins

For the full reference implementation, see references/gates-detail.md.
For automation scripts, see scripts/qa-check.sh.
