LLM Benchmark Analyst
Overview
Use this skill to research benchmark evidence and write structured reports about:
- a single model's strengths and weaknesses
- best models in a capability domain
- what a benchmark measures and how trustworthy it is
- predecessor vs current-model progress
Default to the user's language. Never invent scores, ranks, dates, benchmark variants, or missing table values.
Core constraints
- Restrict the benchmark universe to references/benchmark-source.md. If a benchmark is not in that file, exclude it.
- Use references/core-dimensions.md to collapse scattered benchmarks into a small set of report dimensions.
- Follow references/search-playbook.md for routing, overlap expansion, evidence gathering, and comparison anchors.
- Follow references/report-template.md for output structure.
- Apply references/data-defect-warnings.md benchmark by benchmark, inline and again in the limitations section.
- Prefer official benchmark or benchmark-author pages. Use aggregators mainly to discover links and context.
- Record the evaluation mode exactly: benchmark version, split, difficulty, public/private, verified/original, with-tools/without-tools, pass@k, and any visible sub-score names.
- Keep score units exact. Do not average incompatible metrics into a fake composite.
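As a minimal sketch of the unit rule above, the helper below refuses to average scores whose units differ. The function name and the (value, unit) tuple shape are illustrative assumptions, not part of this skill's reference files. A shared unit is only a minimum check; it does not prove that two metrics are comparable.

```python
from statistics import mean


def mean_if_same_unit(scores):
    """Average (value, unit) pairs only when every pair shares one unit.

    Hypothetical helper: a shared unit is a minimum requirement, so callers
    should still confirm that the benchmark variant and mode match.
    """
    units = {unit for _, unit in scores}
    if len(units) != 1:
        # Mixed (or missing) units would produce a fake composite.
        raise ValueError(f"refusing to average mixed units: {sorted(units)}")
    return mean(value for value, _ in scores)
```

Raising an error instead of returning a number keeps an incompatible composite from slipping into a report unnoticed.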
Required workflow
1. Normalize the model identity before searching
   - Resolve the exact provider, family, generation, version suffix, and release label.
   - Put time and version first. Reject ambiguous aliases like "claude", "gemini pro", "gpt latest", or "qwen max" until you have the exact model string that appears in the leaderboard rows you are searching.
   - Capture the evaluation time point or access date for every key score.
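The alias check in this step could be sketched as follows. The alias set comes from the text above; the "no digit means no version" heuristic and the function name are assumptions made for illustration, and a real check would still need to confirm the string against the leaderboard row.

```python
import re

# The four aliases named in the step above.
AMBIGUOUS_ALIASES = {"claude", "gemini pro", "gpt latest", "qwen max"}


def is_ambiguous_alias(name: str) -> bool:
    """Return True when a model name is too vague to search with."""
    normalized = " ".join(name.lower().split())
    if normalized in AMBIGUOUS_ALIASES:
        return True
    # "latest" is a moving target, never an exact version.
    tokens = re.split(r"[^a-z0-9.]+", normalized)
    if "latest" in tokens:
        return True
    # Assumed heuristic: no digit means no visible version or generation.
    return not any(ch.isdigit() for ch in normalized)
```

A name that passes this check is only a candidate; it still has to match an actual leaderboard row before any score is recorded.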
2. Route the request through core dimensions before web crawling
   - Start with references/core-dimensions.md to select the primary dimension(s).
   - Then list candidate benchmarks inside those dimensions.
   - Only then start website-by-website retrieval.
   - Keep the first pass narrow and token-efficient: start from the best 3-6 benchmarks for the asked domain, then expand only if needed.
3. Expand beyond section labels
   - Do not let the source document's headings blind you.
   - After selecting the primary dimension, inspect benchmark descriptions and overlap tags to find relevant benchmarks that live in other sections.
   - Example: a coding analysis may need coding benchmarks, agentic coding benchmarks, general benchmarks with coding components, and research/math benchmarks with strong code components.
   - Example: a multimodal analysis may need vision benchmarks, OCR, GUI/computer-use, multimodal deep-research, and omni/video/audio benchmarks.
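One way to picture the overlap expansion in this step is a catalog where each benchmark carries both its section heading and a set of overlap tags. The data shape, the field names, and the entries below are placeholders invented for illustration; the real catalog lives in the reference files.

```python
from dataclasses import dataclass, field


@dataclass
class BenchmarkEntry:
    name: str
    section: str  # heading the benchmark is filed under
    overlap_tags: set = field(default_factory=set)


def candidates_for(dimension, catalog):
    """Return names filed under the dimension or tagged as overlapping it."""
    return [
        entry.name
        for entry in catalog
        if entry.section == dimension or dimension in entry.overlap_tags
    ]


# Placeholder entries, not real benchmarks.
catalog = [
    BenchmarkEntry("bench-a", "coding"),
    BenchmarkEntry("bench-b", "agentic", {"coding"}),
    BenchmarkEntry("bench-c", "math", {"coding"}),
    BenchmarkEntry("bench-d", "vision", {"ocr"}),
]
selected = candidates_for("coding", catalog)
```

Here `selected` picks up the two benchmarks filed under other sections because their tags mention coding, which is the behavior the step asks for.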
4. Collect evidence in this order
   - official leaderboard or benchmark site
   - benchmark paper or benchmark README
   - benchmark-author blog or release note
   - trusted aggregator
   - vendor blog only as secondary evidence, clearly labeled as vendor-reported if no independent leaderboard row exists
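The evidence order above can be sketched as a simple priority list. The source-type keys and the dictionary shape are assumptions for illustration; only the ordering and the vendor-reported label come from the text.

```python
# Highest-priority source type first, mirroring the order in this step.
SOURCE_PRIORITY = [
    "official_leaderboard",
    "paper_or_readme",
    "author_blog",
    "trusted_aggregator",
    "vendor_blog",
]


def best_source(candidates):
    """Pick the candidate whose source type ranks highest in the order above."""
    return min(candidates, key=lambda c: SOURCE_PRIORITY.index(c["source_type"]))


def provenance_label(source_type, has_independent_row):
    """Label vendor blog scores when no independent leaderboard row exists."""
    if source_type == "vendor_blog" and not has_independent_row:
        return "vendor-reported"
    return source_type
```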
5. Use multimodal extraction when the leaderboard is not machine-readable
   - If the page uses images, canvas, screenshots, or chart-only rendering and plain text extraction misses the table, inspect screenshots or page images.
   - Extract only values that are clearly visible.
   - Mark the provenance as image-extracted.
   - If the image is unreadable or partially occluded, say so instead of guessing.
6. Apply anchor comparisons
   - For code or agentic coding, compare against the latest available Claude Opus, latest Claude Sonnet, and latest GPT family model.
   - For multimodal analysis, compare against the latest available Gemini model. Add the latest GPT multimodal model if relevant.
   - For intelligence or reasoning analysis, compare against the latest available GPT family model.
   - Never assume which model is currently "latest". Search for that first.
7. Apply predecessor comparison
   - If data exists, compare the target model with its immediate predecessor or last broadly comparable prior generation from the same provider/family.
   - Only compare like-for-like benchmark variants. If the predecessor only appears under a different benchmark mode, say the comparison is not clean.
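A like-for-like guard for this step might look like the sketch below. The dictionary keys and the function name are illustrative assumptions; the point is that a delta is only computed when the benchmark, variant, and unit all match.

```python
def clean_delta(target, predecessor):
    """Return the score difference, or None when the comparison is not clean.

    Both arguments are dicts with "benchmark", "variant", "unit", and "score"
    keys (a hypothetical shape chosen for this example).
    """
    keys = ("benchmark", "variant", "unit")
    if any(target[k] != predecessor[k] for k in keys):
        return None  # different mode: report "not a clean comparison"
    return target["score"] - predecessor["score"]
```

Returning `None` rather than a number forces the report to state that the comparison is not clean instead of quoting a misleading delta.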
8. Attach defect warnings
   - Any benchmark with a known quality or methodology issue must carry an inline warning from references/data-defect-warnings.md.
   - If the report's conclusion depends heavily on warned benchmarks, lower confidence and say so explicitly.
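The confidence rule in the last step could be approximated as below. Using the share of warned evidence rows as a proxy for "depends heavily", the 0.5 threshold, and the label strings are all assumptions for illustration; the source does not define a numeric cutoff.

```python
def report_confidence(evidence, warned_share_limit=0.5):
    """Lower confidence when many cited rows carry a data-defect warning.

    `evidence` is a list of dicts; a truthy "warning" value marks a warned
    benchmark. The threshold is an assumed default, not a documented rule.
    """
    if not evidence:
        return "low"  # nothing to support a conclusion
    warned = sum(1 for row in evidence if row.get("warning"))
    if warned / len(evidence) >= warned_share_limit:
        return "lowered"
    return "normal"
```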
Decision rules
- When the user asks for "best models in a domain", do not use only one benchmark. Use a cluster of relevant benchmarks and explain why each one matters.
- When the user asks "what is this model good or bad at", synthesize at the core-dimension level first, then support with benchmark evidence.
- When benchmark scores conflict, prefer freshness, exact version match, official source quality, and the number of agreeing benchmarks over one standout score.
- Treat very small gaps as non-decisive when the benchmark is noisy, image-extracted, or known to be unstable.
- Always include one short clause describing what each benchmark actually tests.
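The small-gap rule above can be illustrated with a threshold check. The margin of 1.0 score points and the doubling for fragile sources are assumed values for the sketch; an appropriate margin depends on the benchmark.

```python
def gap_is_decisive(score_a, score_b, noise_margin=1.0, fragile=False):
    """Treat a gap as decisive only when it exceeds the noise margin.

    `fragile` marks scores from noisy, image-extracted, or unstable
    benchmarks, which widens the margin (an assumed policy).
    """
    margin = noise_margin * (2 if fragile else 1)
    return abs(score_a - score_b) > margin
```

For example, a 1.5 point gap counts as decisive under the default margin but not when either score is flagged as fragile.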
Minimum evidence to capture
For every benchmark you cite, capture:
- benchmark name
- what it tests in one short phrase
- exact model row name
- exact score and unit
- rank or relative placement if visible
- benchmark variant, split, or mode
- date or access time point
- source quality note if not official
- data warning if applicable
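The fields above map naturally onto a small record type. This is a sketch with assumed field names, not a schema defined by the skill; the score is kept as text so the value is stored exactly as the source displays it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvidenceRecord:
    benchmark: str
    what_it_tests: str
    model_row: str
    score: str  # kept as displayed, so no rounding is introduced
    unit: str
    variant: str
    accessed: str  # date or access time point
    rank: Optional[int] = None  # only if visible
    source_note: Optional[str] = None  # only if not official
    data_warning: Optional[str] = None  # only if applicable


REQUIRED = ("benchmark", "what_it_tests", "model_row", "score", "unit", "variant", "accessed")


def missing_fields(record):
    """List required fields that are empty, so gaps are reported, not guessed."""
    return [name for name in REQUIRED if not getattr(record, name)]
```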
Output expectations
Use the matching template in references/report-template.md.
At minimum, every substantive report must include:
- a scope and identity section
- a short executive summary
- strengths
- weaknesses or gaps
- evidence table
- comparison section
- data-defect warnings and confidence
- methodology or exclusions
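A draft report can be checked against this minimum list before it is sent. The section titles below are paraphrased from the list above and matched exactly, which is an assumption; the headings in references/report-template.md may be worded differently.

```python
REQUIRED_SECTIONS = [
    "scope and identity",
    "executive summary",
    "strengths",
    "weaknesses or gaps",
    "evidence table",
    "comparison",
    "data-defect warnings and confidence",
    "methodology or exclusions",
]


def missing_sections(headings):
    """Return required sections that do not appear among the report headings."""
    present = {h.strip().lower() for h in headings}
    return [s for s in REQUIRED_SECTIONS if s not in present]
```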
Resource map
- references/core-dimensions.md: benchmark routing and de-fragmentation map
- references/search-playbook.md: token-efficient search order, overlap expansion, and comparison rules
- references/data-defect-warnings.md: warning catalog and ready-to-use caution language
- references/report-template.md: output structures for single-model, domain-leader, and benchmark-explainer tasks
- references/benchmark-source.md: full allowed benchmark universe copied from the user's benchmark document
Example tasks
- Analyze gpt-5's coding and agentic coding strengths and weaknesses, and compare it with the latest claude opus, claude sonnet, and gpt models
- Use only the approved benchmark list to find the current best multimodal models, and briefly explain each benchmark
- Write a report on qwen's reasoning strengths, benchmark gaps, predecessor comparison, and all data-quality caveats
- Tell me which models lead in deep research and search, with benchmark-specific warnings and freshness notes