LLM Judge
Compare code implementations across multiple repositories using structured evaluation.
Usage
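bash
/beagle-analysis:llm-judge <spec> <repo1> <repo2> [repo3...] [--labels=...] [--weights=...] [--branch=...]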
Arguments
| Argument | Required | Description |
|---|---|---|
| spec | Yes | Path to spec/requirements document |
| repos | Yes | 2+ paths to repositories to compare |
| --labels | No | Comma-separated labels (default: directory names) |
| --weights | No | Override weights, e.g. functionality:40,security:30 |
| --branch | No | Branch to compare against main (default: main) |
Workflow
1. Parse $ARGUMENTS into spec_path, repo_paths, labels, weights, and branch.
2. Validate the spec file, each repo path, and the minimum repo count.
3. Read the spec document into memory.
4. Load this skill and the supporting reference files.
5. Spawn one Phase 1 repo agent per repository to gather facts only.
6. Validate the repo-agent JSON results before proceeding.
7. Spawn one Phase 2 judge agent per dimension.
8. Aggregate scores, compute weighted totals, rank repos, and write the report.
9. Display the markdown summary and verify the JSON report.
Command Workflow
Step 1: Parse Arguments
Parse $ARGUMENTS to extract:
- spec_path: first positional argument
- repo_paths: remaining positional arguments (must be 2+)
- labels: from --labels or derived from directory names
- weights: from --weights or defaults
- branch: from --branch or main
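The label rule above (explicit --labels values, otherwise directory names) can be sketched as follows. The function name `derive_labels` is hypothetical, and the checks that labels match the repo count and are unique are assumptions added for illustration, not behavior the command is documented to have.

```python
import os


def derive_labels(repo_paths, labels_arg=None):
    """Return one label per repo: explicit --labels values, else directory names."""
    if labels_arg:
        labels = [part.strip() for part in labels_arg.split(",")]
        # Assumption: one label is required per repository.
        if len(labels) != len(repo_paths):
            raise ValueError("--labels must supply one label per repository")
    else:
        # normpath drops a trailing slash so basename is never empty.
        labels = [os.path.basename(os.path.normpath(p)) for p in repo_paths]
    # Assumption: labels are used as dictionary keys later, so they must be unique.
    if len(set(labels)) != len(labels):
        raise ValueError("labels must be unique")
    return labels
```

For example, `derive_labels(["/work/impl-a/", "/work/impl-b"])` yields `["impl-a", "impl-b"]`.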
Default Weights:
json
{
  "functionality": 30,
  "security": 25,
  "tests": 20,
  "overengineering": 15,
  "dead_code": 10
}
Step 2: Validate Inputs
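bash
# The spec must be a regular file.
[ -f "$SPEC_PATH" ] || { echo "Error: Spec file not found: $SPEC_PATH"; exit 1; }
# Every repo path must contain a .git directory.
for repo in "${REPO_PATHS[@]}"; do
  [ -d "$repo/.git" ] || { echo "Error: Not a git repository: $repo"; exit 1; }
done
# A comparison needs at least two repositories.
[ "${#REPO_PATHS[@]}" -ge 2 ] || { echo "Error: Need at least 2 repositories to compare"; exit 1; }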
Step 3: Read Spec Document
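bash
SPEC_CONTENT=$(cat "$SPEC_PATH") || { echo "Error: Failed to read spec file: $SPEC_PATH"; exit 1; }
# Fail when the spec is empty; the -n form keeps the exit status zero on success.
[ -n "$SPEC_CONTENT" ] || { echo "Error: Spec file is empty: $SPEC_PATH"; exit 1; }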
Step 4: Load the Skill
Load the llm-judge skill: Skill(skill: beagle-analysis:llm-judge)
Step 5: Phase 1 - Spawn Repo Agents
Spawn one Task per repo:
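text
You are a Phase 1 Repo Agent for the LLM Judge evaluation.
Your Repo: $LABEL at $REPO_PATH
Spec Document:
$SPEC_CONTENT
Instructions:
1. Load the skill: Skill(skill: beagle-analysis:llm-judge)
2. Read references/repo-agent.md for detailed instructions
3. Read references/fact-schema.md for the output format
4. Load Skill(skill: beagle-core:llm-artifacts-detection) for analysis
Explore the repository and gather facts. Return ONLY valid JSON that follows the fact schema.
Do NOT score or judge. Only gather facts.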
Collect all repo outputs into ALL_FACTS.
Step 6: Validate Phase 1 Results
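bash
echo "$FACTS" | python3 -c "import json,sys; json.load(sys.stdin)" 2>/dev/null || { echo "Error: Invalid JSON from $LABEL"; exit 1; }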
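When the repo-agent outputs are already held in memory, the same check can be written in Python. This is a minimal sketch: `validate_facts` and the `raw_outputs` mapping (label to raw agent output string) are hypothetical names, and it only checks that each output parses as a JSON object, not that it matches references/fact-schema.md.

```python
import json


def validate_facts(raw_outputs):
    """Parse each repo agent's raw output; raise if any is not a JSON object.

    raw_outputs maps a repo label to the raw string that agent returned.
    """
    all_facts = {}
    for label, raw in raw_outputs.items():
        try:
            facts = json.loads(raw)
        except json.JSONDecodeError as exc:
            raise ValueError(f"Invalid JSON from {label}: {exc}") from exc
        if not isinstance(facts, dict):
            raise ValueError(f"Expected a JSON object from {label}")
        all_facts[label] = facts
    return all_facts
```

Stopping at the first invalid output mirrors the shell check, which exits before Phase 2 starts.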
Step 7: Phase 2 - Spawn Judge Agents
Spawn five judge agents, one per dimension:
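text
You are the $DIMENSION Judge for the LLM Judge evaluation.
Spec Document:
$SPEC_CONTENT
Facts from all repos:
$ALL_FACTS_JSON
Instructions:
1. Load the skill: Skill(skill: beagle-analysis:llm-judge)
2. Read references/judge-agents.md for detailed instructions
3. Read references/scoring-rubrics.md for the $DIMENSION rubric
Score each repo on $DIMENSION. Return ONLY valid JSON with scores and justifications.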
Step 8: Aggregate Scores
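python
# Assumes labels, dimensions, weights, and judge_outputs were built in earlier steps.
scores = {}
for repo_label in labels:
    scores[repo_label] = {}
    for dimension in dimensions:
        scores[repo_label][dimension] = judge_outputs[dimension]["scores"][repo_label]
    # Weights are percentages, so divide by 100 to stay on the 1-5 scale.
    weighted_total = sum(
        scores[repo_label][dim]["score"] * weights[dim] / 100
        for dim in dimensions
    )
    scores[repo_label]["weighted_total"] = round(weighted_total, 2)
# Highest weighted total first.
ranking = sorted(labels, key=lambda l: scores[l]["weighted_total"], reverse=True)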
Step 9: Generate Verdict
Name the winner, explain why they won, and note any close calls or trade-offs.
Step 10: Write JSON Report
bash
mkdir -p .beagle
Write .beagle/llm-judge-report.json with version, timestamp, repo metadata, weights, scores, ranking, and verdict.
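A minimal sketch of writing such a report is shown below. The listed contents (version, timestamp, repo metadata, weights, scores, ranking, verdict) come from the step above, but the exact key names, the version string, and the `write_report` helper are assumptions made for illustration; the real schema may differ.

```python
import json
import os
from datetime import datetime, timezone


def write_report(path, repos, weights, scores, ranking, verdict):
    """Write the judge report as JSON. Key names here are illustrative."""
    report = {
        "version": "1.0",  # hypothetical version string
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "repos": repos,
        "weights": weights,
        "scores": scores,
        "ranking": ranking,
        "verdict": verdict,
    }
    # Equivalent of `mkdir -p` for the report's parent directory.
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(report, fh, indent=2)
    return report
```

A call such as `write_report(".beagle/llm-judge-report.json", repos, weights, scores, ranking, verdict)` would run before the summary is displayed.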
Step 11: Display Summary
Render a markdown summary with the scores table, ranking, verdict, and detailed justifications.
Step 12: Verification
bash
python3 -c "import json; json.load(open('.beagle/llm-judge-report.json'))" && echo "Report valid"
Output Shape
The generated report should include:
- repo labels and paths
- per-dimension scores and justifications
- weighted totals and ranking
- a verdict explaining the winner
Reference Files
| File | Purpose |
|---|---|
| | Detailed rubrics for each dimension |
| references/repo-agent.md | Instructions for Phase 1 agents |
| references/judge-agents.md | Instructions for Phase 2 judges |
Scoring Model
| Dimension | Default Weight | Evaluates |
|---|---|---|
| Functionality | 30% | Spec compliance, test pass rate |
| Security | 25% | Vulnerabilities, security patterns |
| Test Quality | 20% | Coverage, DRY, mock boundaries |
| Overengineering | 15% | Unnecessary complexity |
| Dead Code | 10% | Unused code, TODOs |
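As a worked example of how these default weights combine, the sketch below computes one repo's weighted total from five dimension scores. The key names follow the default weights JSON; the sample scores and the `weighted_total` helper are made up for illustration.

```python
DEFAULT_WEIGHTS = {
    "functionality": 30,
    "security": 25,
    "tests": 20,
    "overengineering": 15,
    "dead_code": 10,
}


def weighted_total(dimension_scores, weights=DEFAULT_WEIGHTS):
    """Combine 1-5 dimension scores using percentage weights."""
    total = sum(dimension_scores[dim] * weight for dim, weight in weights.items())
    return round(total / 100, 2)


# Hypothetical scores for a single repo.
example = {"functionality": 4, "security": 3, "tests": 5, "overengineering": 2, "dead_code": 4}
# (4*30 + 3*25 + 5*20 + 2*15 + 4*10) / 100 = 365 / 100
print(weighted_total(example))  # 3.65
```

Because the weights sum to 100, the weighted total stays on the same 1 to 5 scale as the individual scores.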
Scoring Scale
| Score | Meaning |
|---|---|
| 5 | Excellent - Exceeds expectations |
| 4 | Good - Meets requirements, minor issues |
| 3 | Average - Functional but notable gaps |
| 2 | Below Average - Significant issues |
| 1 | Poor - Fails basic requirements |
Phase 1: Spawning Repo Agents
For each repository, spawn a Task agent with:
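text
You are a Phase 1 Repo Agent for the LLM Judge evaluation.
Your Repo: $REPO_LABEL at $REPO_PATH
Spec Document:
$SPEC_CONTENT
Instructions: Read @beagle:llm-judge references/repo-agent.md
Gather facts and return a JSON object following the schema in references/fact-schema.md.
Load @beagle:llm-artifacts-detection for dead code and overengineering analysis.
Return ONLY valid JSON, no markdown or explanations.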
Collect all repo-agent outputs into ALL_FACTS.
Phase 2: Spawning Judge Agents
After all Phase 1 agents complete, spawn 5 judge agents, one per dimension:
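text
You are the $DIMENSION Judge for the LLM Judge evaluation.
Spec Document:
$SPEC_CONTENT
Facts from all repos:
$ALL_FACTS_JSON
Instructions: Read @beagle:llm-judge references/judge-agents.md
Score each repo on $DIMENSION using the rubric in references/scoring-rubrics.md.
Return ONLY valid JSON following the judge output schema.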
Aggregation
1. Collect the five judge outputs.
2. Compute each repo's weighted total with the configured weights.
3. Rank repos by weighted total in descending order.
4. Generate a verdict that explains the result and any close calls.
5. Write .beagle/llm-judge-report.json.
Output
Display a markdown summary with scores, ranking, verdict, and detailed justifications.
Verification
Before completing:
1. Verify .beagle/llm-judge-report.json exists and is valid JSON.
2. Verify all repos have scores for all dimensions.
3. Verify weighted totals sum correctly.
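The three checks above could be automated along these lines. This is a sketch under assumptions: the per-repo layout (`scores[label][dimension]["score"]` and `scores[label]["weighted_total"]`) follows the aggregation step, while the top-level `weights` and `scores` keys, the `verify_report` name, and the rounding tolerance are hypothetical.

```python
import json


def verify_report(path, dimensions, tolerance=0.01):
    """Return a list of problems found in the report; an empty list means it passed."""
    # open() raises if the file is missing; json.load raises if it is not valid JSON.
    with open(path, encoding="utf-8") as fh:
        report = json.load(fh)
    problems = []
    weights = report["weights"]
    for label, repo_scores in report["scores"].items():
        missing = [dim for dim in dimensions if dim not in repo_scores]
        if missing:
            problems.append(f"{label}: missing scores for {missing}")
            continue
        # Recompute the weighted total and compare it with the stored value.
        expected = sum(repo_scores[dim]["score"] * weights[dim] for dim in dimensions) / 100
        if abs(expected - repo_scores["weighted_total"]) > tolerance:
            problems.append(f"{label}: weighted_total does not match")
    return problems
```

Returning a list rather than raising on the first mismatch lets every problem be reported in one pass.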
Rules
- Always validate inputs before proceeding
- Spawn Phase 1 agents in parallel, then wait before Phase 2
- Spawn Phase 2 agents in parallel, one per dimension
- Every score must have a justification
- Write the JSON report before displaying the summary
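The ordering rules for the two phases can be illustrated with a small orchestration sketch. It assumes the agent spawns are ordinary blocking callables (`run_repo_agent` and `run_judge_agent` are hypothetical stand-ins for Task spawns) and uses a thread pool only to show the parallel-then-wait structure.

```python
from concurrent.futures import ThreadPoolExecutor

DIMENSIONS = ["functionality", "security", "tests", "overengineering", "dead_code"]


def run_pipeline(repos, run_repo_agent, run_judge_agent):
    """repos maps label -> path; the two callables stand in for agent spawns."""
    with ThreadPoolExecutor() as pool:
        # Phase 1: one repo agent per repository, all submitted at once.
        repo_futures = {
            label: pool.submit(run_repo_agent, label, path)
            for label, path in repos.items()
        }
        # result() blocks, so Phase 2 cannot start until every repo agent is done.
        all_facts = {label: future.result() for label, future in repo_futures.items()}
        # Phase 2: one judge per dimension, each seeing the facts from all repos.
        judge_futures = {
            dim: pool.submit(run_judge_agent, dim, all_facts) for dim in DIMENSIONS
        }
        return {dim: future.result() for dim, future in judge_futures.items()}
```

Each judge receives the complete `all_facts` mapping, which is why Phase 1 must finish before any judge starts.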
LLM Judge
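Compare code implementations across multiple repositories using structured evaluation.
Usage
bash
/beagle-analysis:llm-judge <spec> <repo1> <repo2> [repo3...] [--labels=...] [--weights=...] [--branch=...]
Arguments
| Argument | Required | Description |
|---|---|---|
| spec | Yes | Path to spec/requirements document |
| repos | Yes | 2+ paths to repositories to compare |
| --labels | No | Comma-separated labels (default: directory names) |
| --weights | No | Override weights, e.g. functionality:40,security:30 |
| --branch | No | Branch to compare against main (default: main) |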
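Workflow
1. Parse $ARGUMENTS into spec_path, repo_paths, labels, weights, and branch.
2. Validate the spec file, each repo path, and the minimum repo count.
3. Read the spec document into memory.
4. Load this skill and the supporting reference files.
5. Spawn one Phase 1 repo agent per repository to gather facts only.
6. Validate the repo-agent JSON results before proceeding.
7. Spawn one Phase 2 judge agent per dimension.
8. Aggregate scores, compute weighted totals, rank repos, and write the report.
9. Display the markdown summary and verify the JSON report.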
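Command Workflow
Step 1: Parse Arguments
Parse $ARGUMENTS to extract:
- spec_path: first positional argument
- repo_paths: remaining positional arguments (must be 2+)
- labels: from --labels or derived from directory names
- weights: from --weights or defaults
- branch: from --branch or main
Default Weights:
json
{
  "functionality": 30,
  "security": 25,
  "tests": 20,
  "overengineering": 15,
  "dead_code": 10
}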
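The --weights override (for example functionality:40,security:30) could be merged onto these defaults as sketched below. The document does not say whether overridden weights must still sum to 100, so the rescaling step is an assumption, as is the `parse_weights` name.

```python
DEFAULT_WEIGHTS = {
    "functionality": 30,
    "security": 25,
    "tests": 20,
    "overengineering": 15,
    "dead_code": 10,
}


def parse_weights(weights_arg=None):
    """Merge a --weights string such as 'functionality:40,security:30' onto the defaults."""
    weights = dict(DEFAULT_WEIGHTS)
    if weights_arg:
        for item in weights_arg.split(","):
            name, _, value = item.partition(":")
            name = name.strip()
            if name not in DEFAULT_WEIGHTS:
                raise ValueError(f"Unknown dimension: {name}")
            weights[name] = float(value)
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    # Assumption: rescale so the weights sum to 100 and totals stay on the 1-5 scale.
    return {name: round(w * 100 / total, 2) for name, w in weights.items()}
```

With the example override, the raw weights sum to 115, so functionality becomes 40 * 100 / 115, about 34.78, after rescaling.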
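Step 2: Validate Inputs
bash
# The spec must be a regular file.
[ -f "$SPEC_PATH" ] || { echo "Error: Spec file not found: $SPEC_PATH"; exit 1; }
# Every repo path must contain a .git directory.
for repo in "${REPO_PATHS[@]}"; do
  [ -d "$repo/.git" ] || { echo "Error: Not a git repository: $repo"; exit 1; }
done
# A comparison needs at least two repositories.
[ "${#REPO_PATHS[@]}" -ge 2 ] || { echo "Error: Need at least 2 repositories to compare"; exit 1; }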
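Step 3: Read Spec Document
bash
SPEC_CONTENT=$(cat "$SPEC_PATH") || { echo "Error: Failed to read spec file: $SPEC_PATH"; exit 1; }
# Fail when the spec is empty; the -n form keeps the exit status zero on success.
[ -n "$SPEC_CONTENT" ] || { echo "Error: Spec file is empty: $SPEC_PATH"; exit 1; }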
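Step 4: Load the Skill
Load the llm-judge skill: Skill(skill: beagle-analysis:llm-judge)
Step 5: Phase 1 - Spawn Repo Agents
Spawn one Task per repo:
text
You are a Phase 1 Repo Agent for the LLM Judge evaluation.
Your Repo: $LABEL at $REPO_PATH
Spec Document:
$SPEC_CONTENT
Instructions:
1. Load the skill: Skill(skill: beagle-analysis:llm-judge)
2. Read references/repo-agent.md for detailed instructions
3. Read references/fact-schema.md for the output format
4. Load Skill(skill: beagle-core:llm-artifacts-detection) for analysis
Explore the repository and gather facts. Return ONLY valid JSON that follows the fact schema.
Do NOT score or judge. Only gather facts.
Collect all repo outputs into ALL_FACTS.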
Step 6: Validate Phase 1 Results
bash
echo "$FACTS" | python3 -c "import json,sys; json.load(sys.stdin)" 2>/dev/null || { echo "Error: Invalid JSON from $LABEL"; exit 1; }
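Step 7: Phase 2 - Spawn Judge Agents
Spawn five judge agents, one per dimension:
text
You are the $DIMENSION Judge for the LLM Judge evaluation.
Spec Document:
$SPEC_CONTENT
Facts from all repos:
$ALL_FACTS_JSON
Instructions:
1. Load the skill: Skill(skill: beagle-analysis:llm-judge)
2. Read references/judge-agents.md for detailed instructions
3. Read references/scoring-rubrics.md for the $DIMENSION rubric
Score each repo on $DIMENSION. Return ONLY valid JSON with scores and justifications.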
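The judge prompt uses shell-style placeholders, so one way to fill it once per dimension is Python's `string.Template`. This is a sketch only: the prompt text is abbreviated and paraphrased, the dimension identifiers are taken from the default weights keys, and `build_judge_prompts` is a hypothetical helper.

```python
import json
from string import Template

DIMENSIONS = ["functionality", "security", "tests", "overengineering", "dead_code"]

# Abbreviated version of the judge prompt; the placeholders match the prompt text.
JUDGE_PROMPT = Template(
    "You are the $DIMENSION Judge for the LLM Judge evaluation.\n"
    "Spec Document:\n$SPEC_CONTENT\n"
    "Facts from all repos:\n$ALL_FACTS_JSON\n"
    "Score each repo on $DIMENSION. "
    "Return ONLY valid JSON with scores and justifications."
)


def build_judge_prompts(spec_content, all_facts):
    """Return one filled-in prompt per dimension, keyed by dimension name."""
    all_facts_json = json.dumps(all_facts, indent=2)
    return {
        dim: JUDGE_PROMPT.substitute(
            DIMENSION=dim,
            SPEC_CONTENT=spec_content,
            ALL_FACTS_JSON=all_facts_json,
        )
        for dim in DIMENSIONS
    }
```

`substitute` does not rescan the inserted values, so a spec that itself contains `$` characters passes through unchanged.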
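Step 8: Aggregate Scores
python
# Assumes labels, dimensions, weights, and judge_outputs were built in earlier steps.
scores = {}
for repo_label in labels:
    scores[repo_label] = {}
    for dimension in dimensions:
        scores[repo_label][dimension] = judge_outputs[dimension]["scores"][repo_label]
    # Weights are percentages, so divide by 100 to stay on the 1-5 scale.
    weighted_total = sum(
        scores[repo_label][dim]["score"] * weights[dim] / 100
        for dim in dimensions
    )
    scores[repo_label]["weighted_total"] = round(weighted_total, 2)
# Highest weighted total first.
ranking = sorted(labels, key=lambda l: scores[l]["weighted_total"], reverse=True)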
Step 9: Generate Verdict
Name the winner, explain why it won, and note any close calls or trade-offs.
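What counts as a close call is not defined here. One possible heuristic, sketched below, flags neighbouring repos in the ranking whose weighted totals differ by no more than a margin. The 0.25 margin and the `find_close_calls` name are assumptions; the `scores` layout follows the aggregation step.

```python
def find_close_calls(scores, ranking, margin=0.25):
    """Return (higher, lower, gap) for adjacent ranked repos within the margin."""
    close = []
    for higher, lower in zip(ranking, ranking[1:]):
        gap = round(
            scores[higher]["weighted_total"] - scores[lower]["weighted_total"], 2
        )
        if gap <= margin:
            close.append((higher, lower, gap))
    return close
```

The verdict text could then mention each flagged pair alongside the dimension scores that separate them.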
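Step 10: Write JSON Report
bash
mkdir -p .beagle
Write .beagle/llm-judge-report.json with version, timestamp, repo metadata, weights, scores, ranking, and verdict.
Step 11: Display Summary
Render a markdown summary with the scores table, ranking, verdict, and detailed justifications.
Step 12: Verification
bash
python3 -c "import json; json.load(open('.beagle/llm-judge-report.json'))" && echo "Report valid"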
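Output Shape
The generated report should include:
- repo labels and paths
- per-dimension scores and justifications
- weighted totals and ranking
- a verdict explaining the winner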
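Reference Files
| File | Purpose |
|---|---|
| | Detailed rubrics for each dimension |
| references/repo-agent.md | Instructions for Phase 1 agents |
| references/judge-agents.md | Instructions for Phase 2 judges |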
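Scoring Model
| Dimension | Default Weight | Evaluates |
|---|---|---|
| Functionality | 30% | Spec compliance, test pass rate |
| Security | 25% | Vulnerabilities, security patterns |
| Test Quality | 20% | Coverage, DRY, mock boundaries |
| Overengineering | 15% | Unnecessary complexity |
| Dead Code | 10% | Unused code, TODOs |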
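Scoring Scale
| Score | Meaning |
|---|---|
| 5 | Excellent - Exceeds expectations |
| 4 | Good - Meets requirements, minor issues |
| 3 | Average - Functional but notable gaps |
| 2 | Below Average - Significant issues |
| 1 | Poor - Fails basic requirements |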
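Phase 1: Spawning Repo Agents
For each repository, spawn a Task agent with:
text
You are a Phase 1 Repo Agent for the LLM Judge evaluation.
Your Repo: $REPO_LABEL at $REPO_PATH
Spec Document:
$SPEC_CONTENT
Instructions: Read @beagle:llm-judge references/repo-agent.md
Gather facts and return a JSON object following the schema in references/fact-schema.md.
Load @beagle:llm-artifacts-detection for dead code and overengineering analysis.
Return ONLY valid JSON, no markdown or explanations.
Collect all repo-agent outputs into ALL_FACTS.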
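Phase 2: Spawning Judge Agents
After all Phase 1 agents complete, spawn 5 judge agents, one per dimension:
text
You are the $DIMENSION Judge for the LLM Judge evaluation.
Spec Document:
$SPEC_CONTENT
Facts from all repos:
$ALL_FACTS_JSON
Instructions: Read @beagle:llm-judge references/judge-agents.md
Score each repo on $DIMENSION using the rubric in references/scoring-rubrics.md.
Return ONLY valid JSON following the judge output schema.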
Aggregation
1.