autoresearch
Autonomous prompt optimization for AI agent skills. Runs controlled experiments to find better prompt variants using the Karpathy autoresearch pattern: generate hypothesis, mutate prompt, evaluate, repeat.
When to use
- User says "optimize this skill's prompt" (or the Chinese equivalent, "优化一下这个 skill")
- User wants to benchmark the effect of different prompt variants
- User says "run autoresearch on X", "eval skill X", or "improve skill X"
- User is unhappy with skill output quality and wants systematic improvement
Do not use for:
- One-off prompt tweaks: just edit the prompt directly
- Debugging a specific failure case: investigate the root cause instead
- Bugs in the skill script itself: a code logic problem is not a prompt problem, so fix the code
Requirements
- Python 3.10+
- `autoresearch.py` script in the skill directory
- LLM API access (MiniMax, OpenAI, or Anthropic)
- Target skill must have a prompt file (SKILL.md, SYSTEM.md, or similar)
Procedure
Always follow these steps in order: (1) gather context, (2) create eval.json, (3) run the autoresearch command, (4) review results and apply the best prompt.
Step 1: Gather context
Before running, you need:
| Parameter | Description | Example |
|---|---|---|
| `--target` | Path to the skill directory or prompt file to optimize | `../workspace/skills/brain-search/SKILL.md` |
| `--evals` | Path to the eval definition JSON file | `eval.json` |
| `--provider` | LLM provider for running experiments | `minimax` (default), `openai`, `anthropic` |
| `--runs` | Number of runs per experiment (for statistical significance) | `5` (default) |
| `--max-experiments` | Maximum number of experiments before stopping | `30` (default) |
| `--dashboard` | Open live results dashboard in browser | flag, no value |
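For orientation, the flags in the table could be modeled with `argparse` roughly as follows. This is a hypothetical sketch that mirrors the documented names and defaults; the parser inside `autoresearch.py` is not shown in this document, and details such as which flags are required are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(prog="autoresearch.py")
    # Treating --target and --evals as required is an assumption.
    parser.add_argument("--target", required=True,
                        help="skill directory or prompt file to optimize")
    parser.add_argument("--evals", required=True,
                        help="path to the eval definition JSON file")
    parser.add_argument("--provider", default="minimax",
                        choices=["minimax", "openai", "anthropic"])
    parser.add_argument("--runs", type=int, default=5)
    parser.add_argument("--max-experiments", type=int, default=30)
    parser.add_argument("--dashboard", action="store_true")
    return parser


args = build_parser().parse_args(
    ["--target", "../workspace/skills/brain-search/SKILL.md", "--evals", "eval.json"]
)
print(args.provider, args.runs, args.max_experiments, args.dashboard)
```

With only `--target` and `--evals` given, the remaining values fall back to the defaults listed in the table.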
Step 2: Create eval.json
Define test inputs and evaluation criteria. Each eval is a binary pass/fail check.
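
    {
      "test_inputs": [
        "search for latest AI agent frameworks",
        "find news about LLM inference optimization",
        "搜一下 transformer 架构的最新进展"
      ],
      "evals": [
        {
          "name": "has_sources",
          "type": "rule",
          "rule": "regex",
          "pattern": "(https?://|Source:|来源:)"
        },
        {
          "name": "no_hallucinated_urls",
          "type": "rule",
          "rule": "banned_phrases",
          "phrases": ["example.com", "placeholder.url"]
        },
        {
          "name": "sufficient_detail",
          "type": "rule",
          "rule": "word_count",
          "min": 50,
          "max": 500
        },
        {
          "name": "contains_summary",
          "type": "rule",
          "rule": "contains",
          "values": ["summary", "key findings", "结论"]
        },
        {
          "name": "no_apology_prefix",
          "type": "rule",
          "rule": "not_contains",
          "values": ["I apologize", "I'm sorry, but"]
        },
        {
          "name": "actionable_output",
          "type": "llm",
          "question": "Does the response provide actionable information the user can immediately use (links, specific facts, concrete next steps)?",
          "pass_description": "The response contains specific actionable items like URLs, concrete facts, or clear next steps",
          "fail_description": "The response is vague, generic, or lacks specific actionable information"
        }
      ]
    }
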
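A quick structural check before a run can catch typos in eval.json early. The sketch below is hypothetical: `validate_evals` is not part of autoresearch, and the checks cover only the fields named in this document (`test_inputs`, `evals`, `name`, `type`, `rule`).

```python
import json


def validate_evals(config: dict) -> list[str]:
    """Return a list of problems found in an eval definition (empty if none)."""
    problems = []
    inputs = config.get("test_inputs")
    if not isinstance(inputs, list) or not inputs:
        problems.append("test_inputs must be a non-empty list")
    evals = config.get("evals")
    if not isinstance(evals, list) or not evals:
        problems.append("evals must be a non-empty list")
        evals = []
    seen = set()
    for i, ev in enumerate(evals):
        name = ev.get("name")
        if not name:
            problems.append(f"evals[{i}] has no name")
        elif name in seen:
            problems.append(f"duplicate eval name: {name}")
        seen.add(name)
        if ev.get("type") not in ("rule", "llm"):
            problems.append(f"evals[{i}] type must be 'rule' or 'llm'")
        elif ev["type"] == "rule" and "rule" not in ev:
            problems.append(f"evals[{i}] is a rule eval without a 'rule' field")
    return problems


# A deliberately broken definition: two evals share one name.
sample = json.loads("""
{
  "test_inputs": ["search for latest AI agent frameworks"],
  "evals": [
    {"name": "has_sources", "type": "rule", "rule": "regex", "pattern": "https?://"},
    {"name": "has_sources", "type": "llm", "question": "Is it actionable?"}
  ]
}
""")
print(validate_evals(sample))
```

For a file on disk, pass `json.load(open("eval.json", encoding="utf-8"))` instead of the inline sample.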
Rule types:
| Rule | Parameters | Description |
|---|---|---|
| `regex` | `pattern` | Pass if regex matches output |
| `banned_phrases` | `phrases` (list) | Pass if NONE of the phrases appear |
| `word_count` | `min`, `max` (optional) | Pass if word count is within range |
| `contains` | `values` (list), optional `match`: `"any"` (default) or `"all"` | Pass if any/all values appear in output (case-insensitive) |
| `not_contains` | `values` (list) | Pass if NONE of the values appear in output (case-insensitive) |
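To make the rule semantics concrete, here is a minimal sketch of how each rule type could be checked against one output. It is an illustration, not the code in `autoresearch.py`: the function name `check_rule` is hypothetical, and several details are assumptions (words are counted by whitespace splitting, `min` defaults to 0, and `banned_phrases` is matched case-insensitively).

```python
import re


def check_rule(ev: dict, output: str) -> bool:
    """Evaluate one rule eval against a skill output; True means pass."""
    rule = ev["rule"]
    text = output.lower()
    if rule == "regex":
        return re.search(ev["pattern"], output) is not None
    if rule == "banned_phrases":
        # Case-insensitive matching here is an assumption.
        return not any(p.lower() in text for p in ev["phrases"])
    if rule == "word_count":
        # Whitespace-separated tokens; the real counting method is not documented.
        n = len(output.split())
        return n >= ev.get("min", 0) and ("max" not in ev or n <= ev["max"])
    if rule == "contains":
        hits = [v.lower() in text for v in ev["values"]]
        return all(hits) if ev.get("match", "any") == "all" else any(hits)
    if rule == "not_contains":
        return not any(v.lower() in text for v in ev["values"])
    raise ValueError(f"unknown rule: {rule}")


output = "Summary: see https://example.org/post for details."
print(check_rule({"rule": "regex", "pattern": r"(https?://|Source:)"}, output))
print(check_rule({"rule": "contains", "values": ["summary", "key findings"]}, output))
print(check_rule({"rule": "word_count", "min": 50}, output))
```

The sample output passes the `regex` and `contains` checks and fails `word_count`, since it is far shorter than 50 words.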
LLM eval type:
| Field | Description |
|---|---|
| `type` | Must be `"llm"` |
| `name` | Unique name for this eval |
| `question` | What to ask the judge LLM about the output |
| `pass_description` | Description of what a passing output looks like |
| `fail_description` | Description of what a failing output looks like |
See eval-guide.md for detailed guidance on writing effective evals.
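One way to picture how the three text fields of an LLM eval fit together is as parts of a single grading prompt. The sketch below is an assumption about the general shape only: the prompt wording, `build_judge_prompt`, and `parse_verdict` are hypothetical and not taken from `autoresearch.py`.

```python
def build_judge_prompt(ev: dict, output: str) -> str:
    """Assemble a pass/fail question for a judge LLM (hypothetical wording)."""
    return (
        "You are grading the output of an AI agent skill.\n\n"
        f"Output to grade:\n{output}\n\n"
        f"Question: {ev['question']}\n"
        f"PASS if: {ev['pass_description']}\n"
        f"FAIL if: {ev['fail_description']}\n\n"
        "Answer with exactly one word: PASS or FAIL."
    )


def parse_verdict(reply: str) -> bool:
    """Map the judge's reply to a binary result; anything but PASS counts as fail."""
    return reply.strip().upper().startswith("PASS")


ev = {
    "name": "structured_output",
    "type": "llm",
    "question": "Is the output well-structured with clear sections?",
    "pass_description": "Output uses clear structure like bullets or headers",
    "fail_description": "Output is a wall of text without clear structure",
}
print(build_judge_prompt(ev, "- item one\n- item two"))
print(parse_verdict("pass\n"))
```

Writing `pass_description` and `fail_description` as concrete, observable properties keeps the judge's decision close to binary, which is what each eval is expected to be.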
Step 3: Run autoresearch
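
    python autoresearch.py \
      --target ../workspace/skills/brain-search/SKILL.md \
      --evals eval.json \
      --provider minimax \
      --runs 5 \
      --max-experiments 30 \
      --dashboard
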
Step 4: Review results and apply changes
The script writes results to results.tsv in the working directory. Each row is one experiment:

    experiment_id  parent_id  mutation_description  avg_score  pass_rate  evals_detail  prompt_diff

Find the best performing variant:
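
    # Skip the header row, split on tabs, and sort by column 4 (avg_score), highest first
    tail -n +2 results.tsv | sort -t$'\t' -k4,4nr | head -5
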
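The same ranking can be done in Python, which avoids field-splitting surprises when a description contains spaces. This is a sketch under assumptions: it expects a header row with a numeric `avg_score` column, the helper `top_variants` is hypothetical, and the sample rows are made up for illustration.

```python
import csv
import io


def top_variants(tsv_text: str, n: int = 5) -> list[dict]:
    """Return the n rows with the highest avg_score from results.tsv content."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    return sorted(rows, key=lambda r: float(r["avg_score"]), reverse=True)[:n]


# Made-up rows for illustration; real runs produce their own values.
sample = (
    "experiment_id\tparent_id\tmutation_description\tavg_score\tpass_rate\tevals_detail\tprompt_diff\n"
    "0\t\tbaseline\t0.40\t0.40\t{}\t\n"
    "1\t0\trequire a source URL per claim\t0.75\t0.80\t{}\t+1 line\n"
    "2\t1\tadd summary section\t0.60\t0.70\t{}\t+2 lines\n"
)
for row in top_variants(sample, n=2):
    print(row["experiment_id"], row["avg_score"], row["mutation_description"])
```

To rank a real run, read the file with `open("results.tsv", newline="", encoding="utf-8").read()` and pass the text to `top_variants`.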
Apply the winning prompt to your skill by copying the optimized prompt text to replace the original.
Example: optimizing brain-search
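User: "brain-search results often lack source links, help me optimize it"
Full workflow:
1. Create eval.json:

       {
         "test_inputs": [
           "search for latest news on OpenAI",
           "搜一下最新的 AI 芯片进展",
           "find recent papers on RAG optimization",
           "what happened with Anthropic this week",
           "查查 GPU 价格趋势"
         ],
         "evals": [
           {
             "name": "has_urls",
             "type": "rule",
             "rule": "regex",
             "pattern": "https?://[^\\s]+"
           },
           {
             "name": "min_2_sources",
             "type": "rule",
             "rule": "regex",
             "pattern": "https?://[^\\s]+.*https?://[^\\s]+"
           },
           {
             "name": "structured_output",
             "type": "llm",
             "question": "Is the output well-structured with clear sections?",
             "pass_description": "Output uses clear structure like bullets or headers",
             "fail_description": "Output is a wall of text without clear structure"
           }
         ]
       }

2. Run the command:

       python autoresearch.py \
         --target ../workspace/skills/brain-search/SKILL.md \
         --evals eval.json \
         --runs 5 \
         --max-experiments 20

3. Review and apply the results:
   - Check results.tsv for the highest-scoring variant
   - Read mutation_description to understand the key change
   - Apply the best prompt to the original SKILL.md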
Failure handling
| Issue | Action |
|---|---|
| LLM API rate limit | Script auto-retries with backoff; if persistent, reduce `--runs` |
| Target file not found | Check the path; it must point to a readable prompt/skill file |
| All experiments score 0 | Evals may be too strict; review the eval definitions and loosen the criteria |
| Script crashes mid-run | Results already written to results.tsv are preserved; re-run to continue |
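For readers unfamiliar with retry-with-backoff, the pattern looks roughly like this. It is a generic sketch of exponential backoff, not the script's actual retry logic: `call_with_backoff` is a hypothetical helper, and `RuntimeError` stands in for whatever rate-limit error a provider client raises.

```python
import time


def call_with_backoff(fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**attempt seconds and retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RuntimeError:  # stand-in for a provider's rate-limit error
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * 2 ** attempt)


attempts = []
delays = []


def flaky():
    """Fail twice, then succeed, to exercise the retry path."""
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError("rate limited")
    return "ok"


# Record the delays instead of sleeping so the demo finishes instantly.
result = call_with_backoff(flaky, sleep=delays.append)
print(result, delays)
```

If rate limits persist even with retries, lowering `--runs` reduces the number of calls per experiment, as the table suggests.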
Gotchas
- Each experiment makes many LLM calls (runs x test_inputs x llm_evals), so watch API usage
- LLM evals are noisy; set `--runs` to 5 or more for results to be statistically meaningful
- Rule evals are more stable and cheaper than LLM evals; prefer them
- A very low baseline score (< 20%) suggests the eval definitions are at fault; fix the evals first
- Prompt optimization cannot fix architectural issues (for example, a search API that itself returns poor results)
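The call count in the first gotcha is easy to estimate before a run. The helper below is a hypothetical sketch that multiplies out only the product named above; any calls the script makes to produce each output or to propose mutations would come on top.

```python
def judge_calls(runs: int, test_inputs: int, llm_evals: int, experiments: int = 1) -> int:
    """LLM judge calls implied by runs x test_inputs x llm_evals per experiment."""
    return runs * test_inputs * llm_evals * experiments


# 5 runs, 3 test inputs, 1 LLM eval: per experiment, then across 30 experiments
print(judge_calls(runs=5, test_inputs=3, llm_evals=1))                  # 15
print(judge_calls(runs=5, test_inputs=3, llm_evals=1, experiments=30))  # 450
```

Replacing an LLM eval with a rule eval removes its factor from the product, which is one reason to prefer rule evals where they can express the check.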