LLM Judge

Compare code implementations across multiple repositories using structured evaluation.

Usage

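```bash
/beagle-analysis:llm-judge <spec> <repo1> <repo2> [repo3...] [--labels=...] [--weights=...] [--branch=...]
```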

Arguments

Argument	Required	Description
spec	Yes	Path to spec/requirements document
repos

Workflow

1. Parse $ARGUMENTS into spec_path, repo_paths, labels, weights, and branch.
2. Validate the spec file, each repo path, and the minimum repo count.
3. Read the spec document into memory.
4. Load this skill and the supporting reference files.
5. Spawn one Phase 1 repo agent per repository to gather facts only.
6. Validate the repo-agent JSON results before proceeding.
7. Spawn one Phase 2 judge agent per dimension.
8. Aggregate scores, compute weighted totals, rank repos, and write the report.
9. Display the markdown summary and verify the JSON report.

Command Workflow

Step 1: Parse Arguments

Parse $ARGUMENTS to extract:

- spec_path: first positional argument
- repo_paths: remaining positional arguments (must be 2+)
- labels: from --labels or derived from directory names
- weights: from --weights or defaults
- branch: from --branch or main

Default Weights:

```json
{
  "functionality": 30,
  "security": 25,
  "tests": 20,
  "overengineering": 15,
  "dead_code": 10
}
```
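The parsing rules in Step 1 could be sketched in Python roughly as follows. This is an illustration only: the helper name `parse_judge_args` is hypothetical, and the option formats (`--labels` as a comma-separated list, `--weights` as comma-separated `name:value` pairs) are assumptions, since the source does not specify them.

```python
import os
import shlex

DEFAULT_WEIGHTS = {
    "functionality": 30,
    "security": 25,
    "tests": 20,
    "overengineering": 15,
    "dead_code": 10,
}


def parse_judge_args(arguments: str) -> dict:
    """Split an $ARGUMENTS string into the five values Step 1 names."""
    positional, options = [], {}
    for token in shlex.split(arguments):
        if token.startswith("--") and "=" in token:
            key, value = token[2:].split("=", 1)
            options[key] = value
        else:
            positional.append(token)

    # One spec path plus at least two repo paths.
    if len(positional) < 3:
        raise ValueError("expected a spec path followed by at least 2 repo paths")
    spec_path, repo_paths = positional[0], positional[1:]

    # Assumed format: --labels=a,b ; otherwise derive labels from directory names.
    if "labels" in options:
        labels = options["labels"].split(",")
    else:
        labels = [os.path.basename(os.path.normpath(p)) for p in repo_paths]
    if len(labels) != len(repo_paths):
        raise ValueError("need exactly one label per repo")

    # Assumed format: --weights=functionality:40,security:20,...
    weights = dict(DEFAULT_WEIGHTS)
    if "weights" in options:
        for pair in options["weights"].split(","):
            name, value = pair.split(":", 1)
            weights[name] = float(value)

    return {
        "spec_path": spec_path,
        "repo_paths": repo_paths,
        "labels": labels,
        "weights": weights,
        "branch": options.get("branch", "main"),
    }
```

For example, `parse_judge_args("spec.md ./repos/alpha ./repos/beta --branch=dev")` yields the labels `alpha` and `beta`, the default weights, and the branch `dev`.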

Step 2: Validate Inputs

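```bash
# The spec file must exist.
[ -f "$SPEC_PATH" ] || { echo "Error: Spec file not found: $SPEC_PATH"; exit 1; }

# Every repo path must be a git repository.
for repo in "${REPO_PATHS[@]}"; do
  [ -d "$repo/.git" ] || { echo "Error: Not a git repository: $repo"; exit 1; }
done

# A comparison needs at least two repos.
[ "${#REPO_PATHS[@]}" -ge 2 ] || { echo "Error: Need at least 2 repositories to compare"; exit 1; }
```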

Step 3: Read Spec Document

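```bash
SPEC_CONTENT=$(cat "$SPEC_PATH") || { echo "Error: Failed to read spec file: $SPEC_PATH"; exit 1; }
[ -z "$SPEC_CONTENT" ] && { echo "Error: Spec file is empty: $SPEC_PATH"; exit 1; }
```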

Step 4: Load the Skill

Load the llm-judge skill: Skill(skill: "beagle-analysis:llm-judge")

Step 5: Phase 1 - Spawn Repo Agents

Spawn one Task per repo:

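```text
You are a Phase 1 Repo Agent for the LLM Judge evaluation.

Your repo: $LABEL at $REPO_PATH

Spec document:
$SPEC_CONTENT

Instructions:

1. Load the skill: Skill(skill: "beagle-analysis:llm-judge")
2. Read references/repo-agent.md for detailed instructions
3. Read references/fact-schema.md for the output format
4. Load Skill(skill: "beagle-core:llm-artifacts-detection") for analysis

Explore the repository and gather facts. Return ONLY valid JSON that follows the fact schema.

Do NOT score or judge. Only gather facts.
```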

Collect all repo outputs into ALL_FACTS.

Step 6: Validate Phase 1 Results

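```bash
echo "$FACTS" | python3 -c "import json,sys; json.load(sys.stdin)" 2>/dev/null || { echo "Error: Invalid JSON from $LABEL"; exit 1; }
```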

Step 7: Phase 2 - Spawn Judge Agents

Spawn five judge agents, one per dimension:

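```text
You are the $DIMENSION Judge for the LLM Judge evaluation.

Spec document:
$SPEC_CONTENT

Facts from all repos:
$ALL_FACTS_JSON

Instructions:

1. Load the skill: Skill(skill: "beagle-analysis:llm-judge")
2. Read references/judge-agents.md for detailed instructions
3. Read references/scoring-rubrics.md for the $DIMENSION rubric

Score each repo on $DIMENSION. Return ONLY valid JSON with scores and justifications.
```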

Step 8: Aggregate Scores

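```python
# labels, dimensions, weights, and judge_outputs come from the earlier steps.
scores = {}
for repo_label in labels:
    scores[repo_label] = {}
    for dimension in dimensions:
        scores[repo_label][dimension] = judge_outputs[dimension]["scores"][repo_label]

    # Weights are percentages, so divide by 100 to stay on the 1-5 scale.
    weighted_total = sum(
        scores[repo_label][dim]["score"] * weights[dim] / 100
        for dim in dimensions
    )
    scores[repo_label]["weighted_total"] = round(weighted_total, 2)

ranking = sorted(labels, key=lambda l: scores[l]["weighted_total"], reverse=True)
```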

Step 9: Generate Verdict

Name the winner, explain why they won, and note any close calls or trade-offs.
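One way to draft such a verdict from the aggregated scores is sketched below. The function name `build_verdict` and the 0.25-point `close_margin` threshold are assumptions made for illustration; the source does not define what counts as a close call. The `scores` layout matches the one built in Step 8 (`scores[label][dimension]["score"]` plus `scores[label]["weighted_total"]`).

```python
def build_verdict(scores, ranking, dimensions, close_margin=0.25):
    """Draft a one-paragraph verdict from aggregated scores (illustrative only)."""
    winner, runner_up = ranking[0], ranking[1]
    gap = round(
        scores[winner]["weighted_total"] - scores[runner_up]["weighted_total"], 2
    )

    # Dimensions where the winner leads, and where the runner-up leads instead.
    strengths = [
        d for d in dimensions
        if scores[winner][d]["score"] > scores[runner_up][d]["score"]
    ]
    tradeoffs = [
        d for d in dimensions
        if scores[winner][d]["score"] < scores[runner_up][d]["score"]
    ]

    parts = [f"{winner} wins with a weighted total of {scores[winner]['weighted_total']}."]
    if strengths:
        parts.append(f"It outscored {runner_up} on: {', '.join(strengths)}.")
    if tradeoffs:
        parts.append(f"Trade-off: {runner_up} scored higher on: {', '.join(tradeoffs)}.")
    if gap <= close_margin:  # assumed threshold for a "close call"
        parts.append(f"Close call: the margin over {runner_up} is only {gap}.")
    return " ".join(parts)
```

In practice the judge justifications would supply the "why"; this sketch only surfaces which dimensions drove the result.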

Step 10: Write JSON Report

```bash
mkdir -p .beagle
```

Write .beagle/llm-judge-report.json with version, timestamp, repo metadata, weights, scores, ranking, and verdict.
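A minimal sketch of writing that report in Python follows. The key names and the `"1.0"` version string are assumptions chosen to mirror the fields listed above; the actual report schema is not given in this document.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def write_report(repos, weights, scores, ranking, verdict,
                 path=".beagle/llm-judge-report.json"):
    """Write the judge report as JSON and return the path written."""
    report = {
        "version": "1.0",  # assumed version string
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "repos": repos,  # e.g. a list of {"label": ..., "path": ...} entries
        "weights": weights,
        "scores": scores,
        "ranking": ranking,
        "verdict": verdict,
    }
    target = Path(path)
    # Equivalent of `mkdir -p` for the report directory.
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(report, indent=2), encoding="utf-8")
    return target
```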

Step 11: Display Summary

Render a markdown summary with the scores table, ranking, verdict, and detailed justifications.
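The summary could be rendered along these lines. The layout (heading text, column order) is an assumption, and `render_summary` is a hypothetical helper; it expects each `scores[label][dimension]` entry to carry a `score` and a `justification`.

```python
def render_summary(scores, ranking, dimensions, verdict):
    """Build a markdown summary: scores table, verdict, then justifications."""
    header = "| Rank | Repo | " + " | ".join(dimensions) + " | Weighted Total |"
    divider = "|" + " --- |" * (len(dimensions) + 3)
    lines = ["## LLM Judge Summary", "", header, divider]

    # One table row per repo, in ranking order.
    for rank, label in enumerate(ranking, start=1):
        cells = [str(scores[label][d]["score"]) for d in dimensions]
        total = scores[label]["weighted_total"]
        lines.append(f"| {rank} | {label} | " + " | ".join(cells) + f" | {total:.2f} |")

    lines += ["", f"**Verdict:** {verdict}", "", "### Justifications"]
    for label in ranking:
        for d in dimensions:
            lines.append(f"- {label} / {d}: {scores[label][d]['justification']}")
    return "\n".join(lines)
```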

Step 12: Verification

```bash
python3 -c "import json; json.load(open('.beagle/llm-judge-report.json'))" && echo "Valid report"
```

Output Shape

The generated report should include:

- repo labels and paths
- per-dimension scores and justifications
- weighted totals and ranking
- a verdict explaining the winner

Reference Files

File	Purpose
references/fact-schema.md	JSON schema for Phase 1 facts
references/scoring-rubrics.md

Scoring Model

Dimension	Default Weight	Evaluates
Functionality	30%	Spec compliance, test pass rate
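Security	25%	Vulnerabilities, security patterns
Test Quality	20%	Coverage, DRY, mock boundaries
Overengineering	15%	Unnecessary complexity
Dead Code	10%	Unused code, TODOs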

Scoring Scale

Score	Meaning
5	Excellent - Exceeds expectations
4	Good - Meets requirements, minor issues
3	Average - Functional but notable gaps
2	Below Average - Significant issues
1	Poor - Fails basic requirements

Phase 1: Spawning Repo Agents

For each repository, spawn a Task agent with:

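```text
You are a Phase 1 Repo Agent for the LLM Judge evaluation.

Your repo: $REPO_LABEL at $REPO_PATH
Spec document:
$SPEC_CONTENT

Instructions: Read @beagle:llm-judge references/repo-agent.md

Gather facts and return a JSON object following the schema in references/fact-schema.md.

Load @beagle:llm-artifacts-detection for dead code and overengineering analysis.

Return ONLY valid JSON, no markdown or explanations.
```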

Collect all repo-agent outputs into ALL_FACTS.

Phase 2: Spawning Judge Agents

After all Phase 1 agents complete, spawn 5 judge agents, one per dimension:

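```text
You are the $DIMENSION Judge for the LLM Judge evaluation.

Spec document:
$SPEC_CONTENT

Facts from all repos:
$ALL_FACTS_JSON

Instructions: Read @beagle:llm-judge references/judge-agents.md

Score each repo on $DIMENSION using the rubric in references/scoring-rubrics.md.

Return ONLY valid JSON following the judge output schema.
```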

Aggregation

1. Collect the five judge outputs.
2. Compute each repo's weighted total with the configured weights.
3. Rank repos by weighted total in descending order.
4. Generate a verdict that explains the result and any close calls.
5. Write .beagle/llm-judge-report.json.
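As a worked example of steps 2 and 3, the sketch below applies the default weights to made-up scores for two hypothetical repos labeled `alpha` and `beta`. The scores are illustrative, not real results.

```python
WEIGHTS = {
    "functionality": 30,
    "security": 25,
    "tests": 20,
    "overengineering": 15,
    "dead_code": 10,
}


def weighted_total(dimension_scores, weights=WEIGHTS):
    """Weights are percentages, so the total stays on the 1-5 scale."""
    return round(sum(dimension_scores[d] * weights[d] / 100 for d in weights), 2)


# Hypothetical judge scores, for illustration only.
judged = {
    "alpha": {"functionality": 4, "security": 5, "tests": 3, "overengineering": 4, "dead_code": 2},
    "beta": {"functionality": 5, "security": 3, "tests": 4, "overengineering": 3, "dead_code": 4},
}

totals = {label: weighted_total(s) for label, s in judged.items()}
ranking = sorted(totals, key=totals.get, reverse=True)

# alpha: 1.20 + 1.25 + 0.60 + 0.60 + 0.20 = 3.85
# beta:  1.50 + 0.75 + 0.80 + 0.45 + 0.40 = 3.90
print(totals, ranking)
```

Here `beta` ranks first despite a lower security score, because functionality carries the largest weight.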

Output

Display a markdown summary with scores, ranking, verdict, and detailed justifications.

Verification

Before completing:

1. Verify .beagle/llm-judge-report.json exists and is valid JSON.
2. Verify all repos have scores for all dimensions.
3. Verify weighted totals sum correctly.
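These three checks could be automated with a short script such as the one below. It is a sketch under assumptions: the report is taken to have top-level `weights` and `scores` keys, with `scores[label][dimension]["score"]` and `scores[label]["weighted_total"]`, and `verify_report` is a hypothetical helper name.

```python
import json


def verify_report(path, dimensions, tolerance=0.01):
    """Return a list of problems found in the report; an empty list means it passed."""
    # Check 1: the file exists and parses (open/json.load raise otherwise).
    with open(path, encoding="utf-8") as fh:
        report = json.load(fh)

    problems = []
    weights = report["weights"]
    for label, repo_scores in report["scores"].items():
        # Check 2: every dimension has a score for this repo.
        missing = [d for d in dimensions if d not in repo_scores]
        if missing:
            problems.append(f"{label}: missing dimensions {missing}")
            continue

        # Check 3: the stored total matches a recomputation from the weights.
        expected = round(
            sum(repo_scores[d]["score"] * weights[d] / 100 for d in dimensions), 2
        )
        if abs(expected - repo_scores["weighted_total"]) > tolerance:
            problems.append(
                f"{label}: weighted_total {repo_scores['weighted_total']} != {expected}"
            )
    return problems
```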

Rules

- Always validate inputs before proceeding
- Spawn Phase 1 agents in parallel, then wait before Phase 2
- Spawn Phase 2 agents in parallel, one per dimension
- Every score must have a justification
- Write the JSON report before displaying the summary
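The ordering constraint in the two parallelism rules can be illustrated with Python threads. This only models the control flow: the real agents are spawned as Tasks, and `gather_facts` and `judge_dimension` are hypothetical stand-ins for those agent calls.

```python
from concurrent.futures import ThreadPoolExecutor


def run_phases(repos, dimensions, gather_facts, judge_dimension):
    """Run Phase 1 in parallel, wait for all of it, then run Phase 2 in parallel."""
    # Phase 1: one fact-gathering task per repo.
    with ThreadPoolExecutor() as pool:
        all_facts = dict(zip(repos, pool.map(gather_facts, repos)))
    # Leaving the `with` block guarantees every Phase 1 task has finished.

    # Phase 2: one judge per dimension, each seeing the facts from all repos.
    with ThreadPoolExecutor() as pool:
        judge_outputs = dict(
            zip(dimensions, pool.map(lambda d: judge_dimension(d, all_facts), dimensions))
        )
    return all_facts, judge_outputs
```

Each judge receives the complete `all_facts` mapping, which is why Phase 2 cannot start until every repo agent has returned.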
