Agent Scorecard Output Quality Framework

Configurable quality evaluation for AI agent outputs. Define criteria, run evaluations, track quality over time. No LLM-as-judge, no API calls, pattern-based automated checks.

Agent Scorecard gives you a structured, repeatable way to measure whether your AI agent is producing good output — and whether it's getting better or worse over time. No LLM-as-judge, no API calls, no external dependencies. Everything runs locally with pattern-based automated checks and optional human scoring.

The Problem

You changed your agent's system prompt. Is the output better now? You don't know. You added a new tool. Did response quality degrade? You have a feeling, but no data. Quality management for AI agents is mostly vibes.

Agent Scorecard replaces vibes with numbers.

What It Does

1. Define Quality Dimensions (`config_example.json`)

- Configure what "quality" means for your use case
- Set dimensions: accuracy, completeness, tone, format compliance, consistency, or your own
- Define rubrics (what does a 1 vs a 5 look like for each dimension?)
- Set weights (accuracy matters more than tone? Give it 2× weight)
- Set pass/fail thresholds per dimension
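
As a rough illustration of the settings listed above, a pair of dimension entries and a threshold check might look like the following. The key names (`weight`, `threshold`, `rubric`) and the helper `failed_dimensions` are assumptions made for this sketch, not the schema or API of `config_example.json` and `scorecard.py`.

```python
# Hypothetical dimension definitions. Key names are illustrative assumptions,
# not the actual schema used by config_example.json.
DIMENSIONS = {
    "accuracy": {
        "weight": 2.0,   # counts twice as much as tone
        "threshold": 3,  # scores below 3 fail this dimension
        "rubric": {1: "Mostly wrong or fabricated", 5: "Fully correct"},
    },
    "tone": {
        "weight": 1.0,
        "threshold": 3,
        "rubric": {1: "Hostile or sycophantic", 5: "Direct and professional"},
    },
}


def failed_dimensions(scores, dimensions):
    """Return the names of dimensions whose score is below its threshold."""
    return [
        name
        for name, spec in dimensions.items()
        if scores.get(name, 0) < spec["threshold"]
    ]


print(failed_dimensions({"accuracy": 4.0, "tone": 2.5}, DIMENSIONS))
```

Keeping rubric text next to the weight and threshold means a human scorer and the automated checks read from the same definition.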

2. Evaluate (`scorecard.py`)

- Automated mode: Pattern-based checks run instantly with zero API calls
  - Response length analysis (too short? too long?)
  - Format compliance (expected headers, lists, code blocks present?)
  - Sycophancy detection ("Great question!" markers)
  - Filler/hedge word density ("basically", "perhaps", "I think")
  - Required section verification
  - Style consistency (sentence length variation)
- Manual mode: Interactive rubric-guided human scoring
- Blended mode: Combine auto scores with human judgment (averaged)
- Aggregate scoring with configurable method (weighted average, minimum, geometric mean)
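
To make the pattern-based idea concrete, here is a minimal sketch of two of the automated checks: sycophancy markers and filler/hedge density. The marker lists and function names are assumptions for illustration. The real tool reads its own markers, thresholds, and penalties from the config file.

```python
import re

# Hypothetical marker lists. The actual lists live in the tool's config.
SYCOPHANCY_MARKERS = ["great question", "excellent question", "happy to help"]
FILLER_WORDS = ["basically", "perhaps", "actually", "i think"]


def sycophancy_hits(text):
    """Count occurrences of sycophancy markers, ignoring case."""
    lowered = text.lower()
    return sum(lowered.count(marker) for marker in SYCOPHANCY_MARKERS)


def filler_density(text):
    """Return filler/hedge occurrences per word (0.0 for empty text)."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    joined = " ".join(words)
    hits = sum(
        len(re.findall(r"\b" + re.escape(filler) + r"\b", joined))
        for filler in FILLER_WORDS
    )
    return hits / len(words)


sample = "Great question! Basically, I think the answer is perhaps 42."
print(sycophancy_hits(sample), round(filler_density(sample), 2))
```

Checks like these are cheap and deterministic, which is what makes them usable for regression tracking. They measure surface signals only, so they work best alongside manual scoring for dimensions such as accuracy.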

3. Track (`scorecard_track.py`)

- Append every evaluation to a JSONL history file
- Filter by agent, task type, time period
- Compute trends per dimension (improving, degrading, stable)
- Linear regression slope for quantified direction
- Sparkline visualisations in terminal
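
The trend and sparkline steps can be sketched with the standard library alone. This is one plausible implementation under stated assumptions: scores sit on a 1 to 5 scale, the slope is taken against the evaluation index, and the `tolerance` that separates "stable" from a real trend is a made-up default. None of the function names come from `scorecard_track.py`.

```python
BARS = "▁▂▃▄▅▆▇█"


def slope(values):
    """Least-squares slope of values against their index (0, 1, 2, ...)."""
    n = len(values)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def trend_label(values, tolerance=0.05):
    """Classify a score series by the sign and size of its slope."""
    s = slope(values)
    if s > tolerance:
        return "improving"
    if s < -tolerance:
        return "degrading"
    return "stable"


def sparkline(values, low=1.0, high=5.0):
    """Map scores on a low..high scale to block characters."""
    chars = []
    for v in values:
        frac = (min(max(v, low), high) - low) / (high - low)
        chars.append(BARS[round(frac * (len(BARS) - 1))])
    return "".join(chars)


scores = [3.0, 3.2, 3.6, 4.1, 4.4]
print(trend_label(scores), sparkline(scores))
```

A slope gives a single signed number per dimension, which is easier to compare across runs than eyeballing a list of scores.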

4. Compare (`scorecard_track.py`)

- Before/after comparison (last N evals vs previous N)
- Per-dimension delta with direction indicators
- Perfect for measuring the impact of config changes
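
A before/after comparison of this kind can be sketched as the difference between two window means. The record shape (a `scores` mapping per history line) and the name `compare_windows` are assumptions for illustration. The actual JSONL fields written by the tool may differ.

```python
import json
from statistics import mean


def compare_windows(records, n):
    """Compare mean scores of the last n records against the n before them.

    `records` is a list of dicts, oldest first, each holding a "scores"
    mapping of dimension name to score (an assumed record shape).
    """
    if n < 1 or len(records) < 2 * n:
        raise ValueError("need at least 2 * n evaluations, with n >= 1")
    previous, recent = records[-2 * n:-n], records[-n:]
    result = {}
    for dim in recent[-1]["scores"]:
        delta = mean(r["scores"][dim] for r in recent) - mean(
            r["scores"][dim] for r in previous
        )
        arrow = "↑" if delta > 0 else "↓" if delta < 0 else "→"
        result[dim] = (round(delta, 2), arrow)
    return result


# Four history lines as they might appear in a JSONL file.
history_text = """\
{"scores": {"accuracy": 3.0, "tone": 4.0}}
{"scores": {"accuracy": 3.5, "tone": 4.0}}
{"scores": {"accuracy": 4.0, "tone": 3.5}}
{"scores": {"accuracy": 4.5, "tone": 3.5}}
"""
records = [json.loads(line) for line in history_text.splitlines() if line.strip()]
print(compare_windows(records, 2))
```

In this sample, accuracy rises by a full point while tone drops by half a point, the kind of trade-off a single overall score would hide.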

5. Report (`scorecard_report.py`)

- Single evaluation reports (markdown or JSON)
- History summary reports with tables and sparklines
- Per-dimension breakdowns with rubric reference
- Export to files or stdout

Quick Start

CODEBLOCK0

Programmatic Usage

CODEBLOCK1

Use Cases

- Prompt engineering: Measure whether prompt changes improve output quality
- Model comparison: Same task, different models. Which scores higher?
- Agent regression testing: Catch quality degradation before it ships
- Team quality standards: Define shared rubrics for consistent evaluation
- Continuous monitoring: Track quality trends over days/weeks/months
- A/B testing: Quantified before/after comparisons

What's Included

File	Purpose
`scorecard.py`	Main evaluation engine: define, evaluate, score
`scorecard_track.py`

Requirements

- Python 3.8+
- No external dependencies (stdlib only)
- Works on any OS
- Platform-agnostic (works with any AI agent framework)

Configuration

See `config_example.json` for the complete reference. Key areas:

- DIMENSIONS: Quality dimensions with rubrics, weights, thresholds, and auto-checks
- AUTO_CHECKS: Tuning for each pattern-based check (markers, thresholds, penalties)
- AGGREGATE_METHOD: How to combine dimension scores ("weighted_average", "minimum", "geometric_mean")
- HISTORY_FILE: Where to store evaluation history
- REPORT_OUTPUT_DIR: Where reports are saved
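
The three aggregation methods behave quite differently, so a small sketch may help when choosing one. This is an illustrative implementation, not the tool's code: the function name `aggregate`, the underscore spelling of the method names, and the use of weights inside the geometric mean are all assumptions.

```python
import math


def aggregate(scores, weights, method="weighted_average"):
    """Combine per-dimension scores into one overall score.

    `scores` and `weights` map dimension names to numbers. Scores are
    assumed to be positive (for example on a 1 to 5 scale).
    """
    if not scores:
        raise ValueError("no scores to aggregate")
    total_weight = sum(weights[d] for d in scores)
    if method == "weighted_average":
        return sum(scores[d] * weights[d] for d in scores) / total_weight
    if method == "minimum":
        # The weakest dimension decides the overall score.
        return min(scores.values())
    if method == "geometric_mean":
        # Weighted geometric mean: low scores pull the result down harder
        # than they would in an arithmetic average.
        log_sum = sum(weights[d] * math.log(scores[d]) for d in scores)
        return math.exp(log_sum / total_weight)
    raise ValueError(f"unknown aggregate method: {method}")


scores = {"accuracy": 4.0, "tone": 2.0}
weights = {"accuracy": 2.0, "tone": 1.0}
for name in ("weighted_average", "minimum", "geometric_mean"):
    print(name, round(aggregate(scores, weights, name), 2))
```

With these sample inputs the weighted average is the most forgiving, the minimum is the strictest, and the geometric mean lands in between.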

License

MIT — See LICENSE file.

⚠️ Security Note — Config File

Configuration is loaded from a JSON file. This is safe to share — no code execution.

- Config path is validated for existence and size (1MB cap) before loading
- Must be a .json file; raises ValueError if given a non-JSON path
- Keep your config under version control; it defines your quality rubrics and scoring weights
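
The checks described above could be sketched roughly as follows. Only the `.json` requirement, the `ValueError`, and the 1MB cap come from this README. The function name `load_config_checked` and the exceptions raised for a missing or oversized file are assumptions.

```python
import json
import os

MAX_CONFIG_BYTES = 1024 * 1024  # 1MB cap, as described above


def load_config_checked(path):
    """Validate a config path, then parse it as JSON (no code execution)."""
    if not path.lower().endswith(".json"):
        raise ValueError(f"config must be a .json file: {path}")
    if not os.path.isfile(path):
        # Assumed behaviour: the README only says existence is validated.
        raise FileNotFoundError(path)
    if os.path.getsize(path) > MAX_CONFIG_BYTES:
        # Assumed behaviour: the README does not name the exception type.
        raise ValueError(f"config larger than {MAX_CONFIG_BYTES} bytes: {path}")
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

Because `json.load` only builds plain data structures, a shared config file cannot run code on the machine that loads it.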

⚠️ Disclaimer

This software is provided "AS IS", without warranty of any kind, express or implied.

USE AT YOUR OWN RISK.

- The author(s) are NOT liable for any damages, losses, or consequences arising from the use or misuse of this software, including but not limited to financial loss, data loss, security breaches, business interruption, or any indirect/consequential damages.
- This software does NOT constitute financial, legal, trading, or professional advice.
- Users are solely responsible for evaluating whether this software is suitable for their use case, environment, and risk tolerance.
- No guarantee is made regarding accuracy, reliability, completeness, or fitness for any particular purpose.
- The author(s) are not responsible for how third parties use, modify, or distribute this software after purchase.

By downloading, installing, or using this software, you acknowledge that you have read
this disclaimer and agree to use the software entirely at your own risk.

DATA DISCLAIMER: This software processes and stores data locally on your system.
The author(s) are not responsible for data loss, corruption, or unauthorized access
resulting from software bugs, system failures, or user error. Always maintain
independent backups of important data. This software does not transmit data externally
unless explicitly configured by the user.

Support & Links


🐛 Bug Reports	TheShadowyRose@proton.me
☕ Ko-fi

Built with OpenClaw — thank you for making this possible.

🛠️ Need something custom? Custom OpenClaw agents & skills starting at $500. If you can describe it, I can build it. → Hire me on Fiverr

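Agent Scorecard Output Quality Framework

Configurable quality evaluation for AI agent outputs. Define criteria, run evaluations, track quality trends over time. No LLM-as-judge, no API calls, pattern-based automated checks.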

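Agent Scorecard gives you a structured, repeatable way to measure whether your AI agent is producing good output, and whether its quality is rising or falling over time. No LLM-as-judge, no API calls, no external dependencies. Everything runs locally with pattern-based automated checks and optional human scoring.

The Problem

You changed your agent's system prompt. Did the output improve? You have no way of knowing. You added a new tool. Did response quality degrade? You have a feeling, but no data. Quality management for AI agents mostly relies on intuition.

Agent Scorecard replaces intuition with data.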

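What It Does

1. Define Quality Dimensions (`config_example.json`)

- Configure what "quality" means for your use case
- Set dimensions: accuracy, completeness, tone, format compliance, consistency, or your own
- Define rubrics (what do a 1 and a 5 mean for each dimension?)
- Set weights (accuracy matters more than tone? Give it 2× weight)
- Set pass/fail thresholds per dimension

2. Evaluate (`scorecard.py`)

- Automated mode: Pattern-based checks run instantly with zero API calls
  - Response length analysis (too short? too long?)
  - Format compliance (are the expected headers, lists, and code blocks present?)
  - Sycophancy detection ("Great question!" markers)
  - Filler/hedge word density ("basically", "perhaps", "I think")
  - Required section verification
  - Style consistency (sentence length variation)
- Manual mode: Interactive rubric-guided human scoring
- Blended mode: Combine auto scores with human judgment (averaged)
- Configurable aggregate scoring method (weighted average, minimum, geometric mean)

3. Track (`scorecard_track.py`)

- Append every evaluation to a JSONL history file
- Filter by agent, task type, time period
- Compute trends per dimension (improving, degrading, stable)
- Linear regression slope to quantify direction
- Sparkline visualisations in the terminal

4. Compare (`scorecard_track.py`)

- Before/after comparison (last N evaluations vs previous N)
- Per-dimension delta with direction indicators
- Well suited to measuring the impact of config changes

5. Report (`scorecard_report.py`)

- Single evaluation reports (Markdown or JSON)
- History summary reports with tables and sparklines
- Per-dimension breakdowns with rubric reference
- Export to files or stdout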

Quick Start

```bash
# 1. Configure
cp config_example.json scorecard_config.json
# Edit dimensions, thresholds, and weights for your use case

# 2. Evaluate a response
python3 scorecard.py --config scorecard_config.json --input response.txt

# 3. Evaluate and save to history
python3 scorecard.py --config scorecard_config.json --input response.txt --save history.jsonl

# 4. Manual scoring mode
python3 scorecard.py --config scorecard_config.json --input response.txt --manual --save history.jsonl

# 5. View trends
python3 scorecard_track.py --history history.jsonl --summary

# 6. Before/after comparison (last 10 vs previous 10)
python3 scorecard_track.py --history history.jsonl --compare 10

# 7. Generate a report
python3 scorecard_report.py --config scorecard_config.json --history history.jsonl
```

Programmatic Usage

```python
import json

from scorecard import Scorecard, load_config

cfg = load_config("scorecard_config.json")
sc = Scorecard(cfg)

# Read the agent output to evaluate
with open("agent_response.txt") as f:
    text = f.read()
result = sc.evaluate(text, agent="my-agent", task_type="code-review")

print(result.summary())
# Overall: 3.85/5 (PASS)
#   ✓ accuracy: 4.0/5 (threshold 3, weight 2.0) [auto]
#   ✓ completeness: 3.5/5 (threshold 3, weight 1.5) [auto]
#   ...

# Save for tracking
with open("history.jsonl", "a") as f:
    f.write(json.dumps(result.to_dict()) + "\n")
```

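Use Cases

- Prompt engineering: Measure whether prompt changes improve output quality
- Model comparison: Same task, different models. Which scores higher?
- Agent regression testing: Catch quality degradation before it ships
- Team quality standards: Define shared rubrics for consistent evaluation
- Continuous monitoring: Track quality trends over days/weeks/months
- A/B testing: Quantified before/after comparisons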

What's Included

File	Purpose
`scorecard.py`	Main evaluation engine: define, evaluate, score
`scorecard_track.py`

Requirements

- Python 3.8+
- No external dependencies (standard library only)
- Works on any OS
- Platform-agnostic (works with any AI agent framework)

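Configuration

See `config_example.json` for the complete reference. Key areas:

- DIMENSIONS: Quality dimensions with rubrics, weights, thresholds, and auto-checks
- AUTO_CHECKS: Tuning for each pattern-based check (markers, thresholds, penalties)
- AGGREGATE_METHOD: How to combine dimension scores (weighted average, minimum, geometric mean)
- HISTORY_FILE: Where to store evaluation history
- REPORT_OUTPUT_DIR: Where reports are saved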

License

MIT. See the LICENSE file.

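⚠️ Security Note: Config File

Configuration is loaded from a JSON file. It is safe to share, because loading it does not execute code.

- The config path is validated for existence and size (1MB cap) before loading
- Must be a .json file; raises ValueError if given a non-JSON path
- Keep your config under version control; it defines your quality rubrics and scoring weights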

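⚠️ Disclaimer

This software is provided "AS IS", without warranty of any kind, express or implied.

USE AT YOUR OWN RISK.

- The author(s) are not liable for any damages, losses, or consequences arising from the use or misuse of this software, including but not limited to financial loss, data loss, security breaches, business interruption, or any indirect/consequential damages.
- This software does not constitute financial, legal, trading, or professional advice.
- Users are solely responsible for evaluating whether this software is suitable for their use case, environment, and risk tolerance.
- No guarantee is made regarding accuracy, reliability, completeness, or fitness for any particular purpose.
- The author(s) are not responsible for how third parties use, modify, or distribute this software after purchase.

By downloading, installing, or using this software, you acknowledge that you have read this disclaimer and agree to use the software entirely at your own risk.

DATA DISCLAIMER: This software processes and stores data locally on your system. The author(s) are not responsible for data loss, corruption, or unauthorized access resulting from software bugs, system failures, or user error. Always maintain independent backups of important data. This software does not transmit data externally unless explicitly configured by the user.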

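Support & Links


🐛 Bug Reports	TheShadowyRose@proton.me
☕ Ko-fi

Built with OpenClaw. Thank you for making this possible.

🛠️ Need something custom? Custom OpenClaw agents & skills starting at $500. If you can describe it, I can build it. → Hire me on Fiverr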
