PinchBench Leaderboard
Fetches and formats the PinchBench leaderboard: AI agent benchmarks for LLMs on standardized OpenClaw coding tasks.
Workflow
1. Determine the query
Map the user's intent to script flags:
| User intent | Flags |
|---|---|
| "Show the leaderboard" / default | `--top 10` |
| "Top 5 models" | `--top 5` |
| "How does Claude perform?" | `--model claude` |
| "Cheapest models" | `--sort cost --top 10` |
| "Fastest models" | `--sort time --top 10` |
| "Compare Gemini and Claude" | Run twice, with `--model gemini` and `--model claude`, and present the results side by side |
| "Full leaderboard" | `--top 50` |
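The mapping in the table above can be sketched as a small helper. This is only an illustration: `build_flags` and its parameters are hypothetical names, not part of the skill, and it assumes the user's intent has already been parsed into optional `top`, `sort`, and `model` values.

```python
from typing import List, Optional

# Sort keys the script is documented to accept.
ALLOWED_SORTS = {"score", "cost", "time", "runs"}


def build_flags(top: Optional[int] = None,
                sort: Optional[str] = None,
                model: Optional[str] = None) -> List[str]:
    """Translate a parsed intent into script flags (hypothetical helper)."""
    if sort is not None and sort not in ALLOWED_SORTS:
        raise ValueError(f"unsupported sort key: {sort!r}")
    flags: List[str] = []
    if sort is not None:
        flags += ["--sort", sort]
    if model is not None:
        flags += ["--model", model]
    if top is not None:
        flags += ["--top", str(top)]
    # With no specific intent, fall back to the default query from the table.
    if not flags:
        flags = ["--top", "10"]
    return flags
```

For example, "Cheapest models" would correspond to `build_flags(sort="cost", top=10)`, and a plain "Show the leaderboard" to `build_flags()`.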
2. Run the script
```json
{
  "tool": "exec",
  "command": "python3 {baseDir}/scripts/fetch_leaderboard.py --top 10"
}
```
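The tool call is an `exec` request whose command runs `scripts/fetch_leaderboard.py` with the chosen flags. A minimal sketch of assembling that request is shown below; `make_exec_call` is a hypothetical helper, and the base directory passed to it stands in for whatever `{baseDir}` resolves to at runtime.

```python
import shlex
from typing import Dict, List


def make_exec_call(base_dir: str, flags: List[str]) -> Dict[str, str]:
    """Build the exec tool request for a given set of flags (illustrative only)."""
    script = f"{base_dir}/scripts/fetch_leaderboard.py"
    # shlex.join quotes any argument that contains spaces or shell metacharacters,
    # so a filter such as "claude opus" stays a single argument.
    command = shlex.join(["python3", script, *flags])
    return {"tool": "exec", "command": command}
```

Quoting the arguments this way is a design choice for safety: a user-supplied `--model` filter is passed through to a shell command, so it should never be concatenated unescaped.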
Available flags:
- `--top N`: number of models to show (default: 10)
- `--sort metric`: sort by `score`, `cost`, `time`, or `runs` (default: `score`)
- `--model filter`: show only models whose name contains this string (case-insensitive)
- `--json`: output raw JSON for further processing
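For orientation, a command-line interface with these four flags could be declared with `argparse` roughly as follows. This is a sketch based only on the flag descriptions above; the actual script's internals are not shown in this document, and `make_parser` and the `as_json` attribute name are assumptions.

```python
import argparse


def make_parser() -> argparse.ArgumentParser:
    """Declare the documented flags (hypothetical reconstruction of the CLI)."""
    parser = argparse.ArgumentParser(description="Fetch the PinchBench leaderboard")
    parser.add_argument("--top", type=int, default=10, metavar="N",
                        help="number of models to show")
    parser.add_argument("--sort", choices=["score", "cost", "time", "runs"],
                        default="score", help="metric to sort by")
    parser.add_argument("--model", default=None, metavar="FILTER",
                        help="case-insensitive substring to match in model names")
    # Stored as `as_json` so the attribute does not shadow the json module name.
    parser.add_argument("--json", action="store_true", dest="as_json",
                        help="output raw JSON instead of a table")
    return parser
```

Parsing `["--sort", "cost", "--top", "5"]` with this parser yields `top=5` and `sort="cost"`, while an empty argument list falls back to the documented defaults.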
3. Format the response
Present the output as-is in a code block. Add a brief one-line insight after the table:
- Highlight the top performer and its score
- If the user asked about a specific model, comment on its ranking relative to the field
- If sorting by cost, note the best value (score/cost ratio)
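The best-value insight can be computed from the `--json` output. The sketch below assumes that output is a JSON array of objects with `model`, `score`, and `cost` keys; that schema is an assumption, since the document does not specify it, and `best_value` is a hypothetical helper.

```python
import json
from typing import Optional


def best_value(raw_json: str) -> Optional[dict]:
    """Return the row with the highest score/cost ratio, or None if none is priced."""
    rows = json.loads(raw_json)
    # Skip rows with a missing or non-positive cost to avoid dividing by zero.
    priced = [r for r in rows if r.get("cost") and r["cost"] > 0]
    if not priced:
        return None
    return max(priced, key=lambda r: r["score"] / r["cost"])
```

Rows without a usable cost are excluded rather than treated as free, so a missing price cannot win the comparison by accident.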
4. Error handling
- If the script fails with a curl error → report the error and suggest checking network connectivity
- If the script fails to parse data → the site structure may have changed, inform the user
- If no models match the filter → say so and suggest a broader search
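The no-match case can be illustrated with a case-insensitive substring filter over parsed rows. This is a sketch under assumptions: the rows are taken to be dictionaries with a `model` key, and `filter_models` and `describe_matches` are hypothetical names rather than functions from the script.

```python
from typing import Dict, List


def filter_models(rows: List[Dict], needle: str) -> List[Dict]:
    """Keep rows whose model name contains the needle, ignoring case."""
    needle = needle.casefold()
    return [r for r in rows if needle in str(r.get("model", "")).casefold()]


def describe_matches(rows: List[Dict], needle: str) -> str:
    """Summarize the filter result, suggesting a broader search when it is empty."""
    matches = filter_models(rows, needle)
    if not matches:
        return f"No models match '{needle}'. Try a broader search term."
    return f"{len(matches)} model(s) match '{needle}'."
```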
Examples
| User says | Flags | Expected behavior |
|---|---|---|
| "Show me the PinchBench leaderboard" | `--top 10` | Show top 10 by score |
| "Which model is cheapest for OpenClaw?" | `--sort cost --top 10` | Show top 10 sorted by cost |
| "How does Claude compare to GPT?" | `--model claude`, then `--model gpt` | Show both, compare |
| "What's the fastest model on PinchBench?" | `--sort time --top 5` | Show top 5 by execution time |
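PinchBench Leaderboard
Fetches and formats the PinchBench leaderboard: AI agent benchmarks for LLMs on standardized OpenClaw coding tasks.
Workflow
1. Determine the query
Map the user's intent to script flags:
| User intent | Flags |
|---|---|
| "Show the leaderboard" / default | `--top 10` |
| "Top 5 models" | `--top 5` |
| "How does Claude perform?" | `--model claude` |
| "Cheapest models" | `--sort cost --top 10` |
| "Fastest models" | `--sort time --top 10` |
| "Compare Gemini and Claude" | Run twice, with `--model gemini` and `--model claude`, and present the results side by side |
| "Full leaderboard" | `--top 50` |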
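2. Run the script
```json
{
  "tool": "exec",
  "command": "python3 {baseDir}/scripts/fetch_leaderboard.py --top 10"
}
```
Available flags:
- `--top N`: number of models to show (default: 10)
- `--sort metric`: sort by `score`, `cost`, `time`, or `runs` (default: `score`)
- `--model filter`: show only models whose name contains this string (case-insensitive)
- `--json`: output raw JSON for further processing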
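3. Format the response
Present the output as-is in a code block. Add a brief one-line insight after the table:
- Highlight the top performer and its score
- If the user asked about a specific model, comment on its ranking relative to the field
- If sorting by cost, note the best value (score/cost ratio)
4. Error handling
- If the script fails with a curl error → report the error and suggest checking network connectivity
- If the script fails to parse data → the site structure may have changed, inform the user
- If no models match the filter → say so and suggest a broader search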
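Examples
| User says | Flags | Expected behavior |
|---|---|---|
| "Show me the PinchBench leaderboard" | `--top 10` | Show top 10 by score |
| "Which model is cheapest for OpenClaw?" | `--sort cost --top 10` | Show top 10 sorted by cost |
| "How does Claude compare to GPT?" | `--model claude`, then `--model gpt` | Show both, compare |
| "What's the fastest model on PinchBench?" | `--sort time --top 5` | Show top 5 by execution time |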