autoresearch

Autonomous prompt optimization for AI agent skills. Runs controlled experiments to find better prompt variants using the Karpathy autoresearch pattern: generate hypothesis, mutate prompt, evaluate, repeat.
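The loop named above (generate hypothesis, mutate prompt, evaluate, repeat) can be sketched as a greedy hill-climb. This is a minimal illustration, not the logic of `autoresearch.py` itself: `autoresearch_loop`, `toy_mutate`, and `toy_evaluate` are hypothetical names, and the toy scorer stands in for a real eval run.

```python
from typing import Callable


def autoresearch_loop(
    baseline: str,
    mutate: Callable[[str], str],
    evaluate: Callable[[str], float],
    max_experiments: int = 10,
) -> tuple[str, float]:
    """Greedy hill-climb: keep a mutated prompt only if it scores higher."""
    best_prompt, best_score = baseline, evaluate(baseline)
    for _ in range(max_experiments):
        candidate = mutate(best_prompt)  # hypothesis + mutation
        score = evaluate(candidate)      # run the evals on the candidate
        if score > best_score:           # keep only strict improvements
            best_prompt, best_score = candidate, score
    return best_prompt, best_score


# Hypothetical stand-ins so the sketch runs without an LLM.
def toy_mutate(prompt: str) -> str:
    return prompt + " Always cite sources."


def toy_evaluate(prompt: str) -> float:
    return 1.0 if "cite sources" in prompt else 0.0


best, score = autoresearch_loop("Search the web.", toy_mutate, toy_evaluate, max_experiments=3)
print(score)  # 1.0
```

In a real run, `mutate` would ask an LLM for a revised prompt and `evaluate` would average pass/fail results over several runs.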

When to use

- User says "optimize this skill's prompt"
- User wants to benchmark prompt variants
- User says "run autoresearch on X", "eval skill X", or "improve skill X"
- User is unhappy with skill output quality and wants systematic improvement

Do not use:

- One-off prompt tweaks: just edit the prompt directly
- Debugging a specific failure case: investigate the root cause instead
- The skill script itself has a bug: fix the code, not the prompt (a code logic problem is not a prompt problem)

Requirements

- Python 3.10+
- `autoresearch.py` script in the skill directory
- LLM API access (MiniMax, OpenAI, or Anthropic)
- Target skill must have a prompt file (SKILL.md, SYSTEM.md, or similar)

Procedure

Always follow these steps in order: (1) gather context, (2) create eval.json, (3) run the autoresearch command, (4) review results and apply the best prompt.

Step 1: Gather context

Before running, you need:

Parameter	Description	Example
`--target`	Path to the skill directory or prompt file to optimize	`../workspace/skills/brain-search/SKILL.md`
`--evals`

Step 2: Create eval.json

Define test inputs and evaluation criteria. Each eval is a binary pass/fail check.

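```json
{
  "test_inputs": [
    "search for latest AI agent frameworks",
    "find news about LLM inference optimization",
    "搜一下 transformer 架构的最新进展"
  ],
  "evals": [
    {
      "name": "has_sources",
      "type": "rule",
      "rule": "regex",
      "pattern": "(https?://|Source:|来源:)"
    },
    {
      "name": "no_hallucinated_urls",
      "type": "rule",
      "rule": "banned_phrases",
      "phrases": ["example.com", "placeholder.url"]
    },
    {
      "name": "sufficient_detail",
      "type": "rule",
      "rule": "word_count",
      "min": 50,
      "max": 500
    },
    {
      "name": "contains_summary",
      "type": "rule",
      "rule": "contains",
      "values": ["summary", "key findings", "结论"]
    },
    {
      "name": "no_apology_prefix",
      "type": "rule",
      "rule": "not_contains",
      "values": ["I apologize", "I'm sorry, but"]
    },
    {
      "name": "actionable_output",
      "type": "llm",
      "question": "Does the response provide actionable information the user can immediately use (links, specific facts, concrete next steps)?",
      "pass_description": "The response contains specific actionable items like URLs, concrete facts, or clear next steps",
      "fail_description": "The response is vague, generic, or lacks specific actionable information"
    }
  ]
}
```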
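Before spending API calls, it can help to sanity-check the eval file. The sketch below is a hypothetical validator, not part of `autoresearch.py`. It assumes the top-level keys `test_inputs` and `evals`, and that an LLM eval needs `question`, `pass_description`, and `fail_description`; that field list is inferred from the example in this document, not from a published schema.

```python
import json

# Assumed from the example eval in this document, not from a formal schema.
REQUIRED_LLM_FIELDS = {"name", "type", "question", "pass_description", "fail_description"}


def validate_eval_config(config: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the file looks usable."""
    problems = []
    if not config.get("test_inputs"):
        problems.append("test_inputs is missing or empty")
    if not config.get("evals"):
        problems.append("evals is missing or empty")
    for i, ev in enumerate(config.get("evals", [])):
        label = ev.get("name", f"evals[{i}]")
        kind = ev.get("type")
        if kind == "rule":
            if "rule" not in ev:
                problems.append(f"{label}: rule eval needs a 'rule' field")
        elif kind == "llm":
            missing = REQUIRED_LLM_FIELDS - ev.keys()
            if missing:
                problems.append(f"{label}: missing {sorted(missing)}")
        else:
            problems.append(f"{label}: unknown type {kind!r}")
    return problems


config = json.loads(
    '{"test_inputs": ["find news about RAG"],'
    ' "evals": [{"name": "has_urls", "type": "rule", "rule": "regex", "pattern": "https?://"}]}'
)
print(validate_eval_config(config))  # []
```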

Rule types:

Rule	Parameters	Description
`regex`	`pattern`	Pass if regex matches output
`banned_phrases`
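As an illustration of how binary rule checks of this kind might be evaluated, here is a minimal sketch. The rule names and parameters (`regex`/`pattern`, `banned_phrases`/`phrases`, `word_count`/`min`,`max`, `contains`/`values`, `not_contains`/`values`) come from the example eval file. The exact semantics are assumptions: case-insensitive substring matching, `contains` passing when any value appears, and whitespace-based word counting. `run_rule_eval` is a hypothetical name.

```python
import re


def run_rule_eval(ev: dict, output: str) -> bool:
    """Return True (pass) or False (fail) for one rule eval against one output."""
    rule = ev["rule"]
    text = output.lower()
    if rule == "regex":
        return re.search(ev["pattern"], output) is not None
    if rule == "banned_phrases":
        # Assumption: pass only if none of the phrases appear.
        return not any(p.lower() in text for p in ev["phrases"])
    if rule == "word_count":
        # Assumption: whitespace-separated tokens; CJK text would need a different counter.
        return ev["min"] <= len(output.split()) <= ev["max"]
    if rule == "contains":
        # Assumption: pass if at least one value appears.
        return any(v.lower() in text for v in ev["values"])
    if rule == "not_contains":
        return not any(v.lower() in text for v in ev["values"])
    raise ValueError(f"unknown rule: {rule}")


output = "Summary: see https://arxiv.org/abs/1706.03762 for details."
print(run_rule_eval({"rule": "regex", "pattern": r"(https?://|Source:)"}, output))  # True
print(run_rule_eval({"rule": "word_count", "min": 50, "max": 500}, output))         # False
```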

LLM eval type:

Field	Description
`type`	Must be `llm`
`name`
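One plausible way an LLM eval could be turned into a judge call is sketched below. The fields `question`, `pass_description`, and `fail_description` come from the example eval; the prompt wording, the PASS/FAIL reply convention, and the function names `build_judge_prompt` and `parse_verdict` are assumptions made for illustration, not the script's actual behavior.

```python
def build_judge_prompt(ev: dict, output: str) -> str:
    """Assemble a grading prompt from one LLM eval definition and one skill output."""
    return (
        f"Question: {ev['question']}\n"
        f"PASS means: {ev['pass_description']}\n"
        f"FAIL means: {ev['fail_description']}\n\n"
        f"Response to grade:\n{output}\n\n"
        "Answer with exactly one word: PASS or FAIL."
    )


def parse_verdict(reply: str) -> bool:
    """Map the judge's reply onto a binary result; anything not starting with PASS fails."""
    return reply.strip().upper().startswith("PASS")


ev = {
    "name": "structured_output",
    "type": "llm",
    "question": "Is the output well-structured with clear sections?",
    "pass_description": "Output uses clear structure like bullets or headers",
    "fail_description": "Output is a wall of text without clear structure",
}
prompt = build_judge_prompt(ev, "- point one\n- point two")
print(parse_verdict("PASS"))  # True
```

Treating any unparseable reply as a failure keeps the check binary, at the cost of occasionally penalizing a good output when the judge rambles.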

See eval-guide.md for detailed guidance on writing effective evals.

Step 3: Run autoresearch

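```bash
python autoresearch.py \
  --target ../workspace/skills/brain-search/SKILL.md \
  --evals eval.json \
  --provider minimax \
  --runs 5 \
  --max-experiments 30 \
  --dashboard
```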

Step 4: Review results and apply changes

The script writes results to results.tsv in the working directory. Each row is one experiment:

```
experiment_id	parent_id	mutation_description	avg_score	pass_rate	evals_detail	prompt_diff
```

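Find the best-performing variants:

```bash
# Sort on the 4th tab-separated column (avg_score), highest first.
# -t$'\t' matters: descriptions contain spaces, so whitespace splitting would pick the wrong field.
sort -t$'\t' -k4,4nr results.tsv | head -5
```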
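The same ranking can be done in Python, which avoids field-splitting surprises when descriptions contain spaces. This sketch assumes results.tsv is tab-separated with a header row that includes an `avg_score` column; the sample rows are made up for illustration, and `top_variants` is a hypothetical helper.

```python
import csv
import io


def top_variants(tsv_text: str, n: int = 5) -> list[dict]:
    """Return the n rows with the highest avg_score, best first."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    return sorted(rows, key=lambda r: float(r["avg_score"]), reverse=True)[:n]


# Made-up sample data in the assumed layout.
sample = (
    "experiment_id\tparent_id\tmutation_description\tavg_score\n"
    "0\t\tbaseline\t0.40\n"
    "1\t0\trequire a source URL per claim\t0.80\n"
    "2\t0\tshorten the intro\t0.55\n"
)
best = top_variants(sample, n=2)
print([row["experiment_id"] for row in best])  # ['1', '2']
```

For a real run, pass the file contents instead, for example `top_variants(pathlib.Path("results.tsv").read_text())`.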

Apply the winning prompt to your skill by copying the optimized prompt text to replace the original.

Example: optimizing brain-search

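```
User: brain-search results often lack source links, please optimize it

Full workflow:

1. Create eval.json:

{
  "test_inputs": [
    "search for latest news on OpenAI",
    "搜一下最新的 AI 芯片进展",
    "find recent papers on RAG optimization",
    "what happened with Anthropic this week",
    "查查 GPU 价格趋势"
  ],
  "evals": [
    { "name": "has_urls", "type": "rule", "rule": "regex", "pattern": "https?://[^\\s]+" },
    { "name": "min_2_sources", "type": "rule", "rule": "regex", "pattern": "https?://[^\\s]+.*https?://[^\\s]+" },
    {
      "name": "structured_output",
      "type": "llm",
      "question": "Is the output well-structured with clear sections?",
      "pass_description": "Output uses clear structure like bullets or headers",
      "fail_description": "Output is a wall of text without clear structure"
    }
  ]
}

2. Run the command:

python autoresearch.py \
  --target ../workspace/skills/brain-search/SKILL.md \
  --evals eval.json \
  --runs 5 \
  --max-experiments 20

3. Review and apply the results:

- Check results.tsv for the highest-scoring variant
- Read mutation_description to understand the key changes
- Apply the best prompt to the original SKILL.md
```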

Failure handling

Issue	Action
LLM API rate limit	Script auto-retries with backoff; if persistent, reduce `--runs`
Target file not found	Check the path; it must point to a readable prompt or skill file
All experiments score 0	Evals may be too strict; review the eval definitions and loosen the criteria
Script crashes mid-run	Results already written to results.tsv are preserved; a re-run continues from there
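The retry-with-backoff behavior mentioned in the table can be pictured with the following sketch. It shows generic exponential backoff, not the script's actual retry code: `RateLimitError`, `call_with_backoff`, and the delay schedule are all hypothetical.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical stand-in for a provider's rate-limit exception."""


def call_with_backoff(
    fn: Callable[[], T],
    max_retries: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, doubling the wait after each rate-limit error; re-raise once retries run out."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries:
                raise
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    raise AssertionError("unreachable")


# Demo: fail twice, then succeed, recording delays instead of really sleeping.
attempts = {"n": 0}
delays: list[float] = []


def flaky() -> str:
    attempts["n"] += 1
    if attempts["n"] <= 2:
        raise RateLimitError()
    return "ok"


print(call_with_backoff(flaky, sleep=delays.append), delays)  # ok [1.0, 2.0]
```

Real clients often add random jitter to the delay so that parallel workers do not retry in lockstep.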

Gotchas

- Each experiment makes multiple LLM calls (runs × test_inputs × llm_evals), so watch API usage
- LLM evals are noisy; set `--runs` to 5 or more for the scores to be statistically meaningful
- Rule evals are more stable and cheaper than LLM evals; prefer them
- A very low baseline score (< 20%) suggests the eval definitions may be at fault; fix the evals first
- Prompt optimization cannot fix architectural issues (for example, a search API that itself returns poor results)
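The first two points can be made concrete with some arithmetic. The sketch below takes the call-count formula from the list above at face value (it counts only LLM eval calls and ignores any calls used to produce the outputs or mutate the prompt, which is an assumption). It also uses the standard error of a binomial proportion as a rough noise estimate. Both function names are hypothetical.

```python
import math


def judge_calls_per_experiment(runs: int, n_test_inputs: int, n_llm_evals: int) -> int:
    """LLM eval calls for one experiment: runs x test_inputs x llm_evals."""
    return runs * n_test_inputs * n_llm_evals


def pass_rate_std_error(pass_rate: float, n_checks: int) -> float:
    """Standard error of a pass rate estimated from n_checks independent binary checks."""
    return math.sqrt(pass_rate * (1 - pass_rate) / n_checks)


per_experiment = judge_calls_per_experiment(runs=5, n_test_inputs=5, n_llm_evals=1)
print(per_experiment)       # 25
print(per_experiment * 20)  # 500 calls across 20 experiments

# Noise on a 50% pass rate: 1 run (5 checks) versus 5 runs (25 checks).
print(round(pass_rate_std_error(0.5, 5), 3))   # 0.224
print(round(pass_rate_std_error(0.5, 25), 3))  # 0.1
```

With a single run, a difference of 20 points between two variants is within one standard error, which is why a higher `--runs` value is needed before trusting a ranking.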

autoresearch

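Autonomous prompt optimization for AI agent skills. It runs controlled experiments to find better prompt variants using the Karpathy autoresearch pattern: generate a hypothesis, mutate the prompt, evaluate, repeat.

When to use

- User says "optimize this skill"
- User wants to compare the results of different prompt versions
- User says "run autoresearch on X", "eval skill X", or "improve skill X"
- User is unhappy with skill output quality and wants systematic improvement

Do not use:

- One-off small changes (just edit the prompt directly)
- Debugging a specific failure case (investigate the root cause instead)
- The skill script itself has a bug (a code logic problem is not a prompt problem)

Requirements

- Python 3.10+
- `autoresearch.py` script in the skill directory
- LLM API access (MiniMax, OpenAI, or Anthropic)
- Target skill must include a prompt file (SKILL.md, SYSTEM.md, or similar)

Procedure

Always follow these steps in order: (1) gather context, (2) create eval.json, (3) run the autoresearch command, (4) review results and apply the best prompt.

Step 1: Gather context

Before running, you need the following parameters:

Parameter	Description	Example
`--target`	Path to the skill directory or prompt file to optimize	`../workspace/skills/brain-search/SKILL.md`
`--evals`

Step 2: Create eval.json

Define test inputs and evaluation criteria. Each eval is a binary pass/fail check.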

```json
{
  "test_inputs": [
    "search for latest AI agent frameworks",
    "find news about LLM inference optimization",
    "搜一下 transformer 架构的最新进展"
  ],
  "evals": [
    {
      "name": "has_sources",
      "type": "rule",
      "rule": "regex",
      "pattern": "(https?://|Source:|来源:)"
    },
    {
      "name": "no_hallucinated_urls",
      "type": "rule",
      "rule": "banned_phrases",
      "phrases": ["example.com", "placeholder.url"]
    },
    {
      "name": "sufficient_detail",
      "type": "rule",
      "rule": "word_count",
      "min": 50,
      "max": 500
    },
    {
      "name": "contains_summary",
      "type": "rule",
      "rule": "contains",
      "values": ["summary", "key findings", "结论"]
    },
    {
      "name": "no_apology_prefix",
      "type": "rule",
      "rule": "not_contains",
      "values": ["I apologize", "I'm sorry, but"]
    },
    {
      "name": "actionable_output",
      "type": "llm",
      "question": "Does the response provide actionable information the user can immediately use (links, specific facts, concrete next steps)?",
      "pass_description": "The response contains specific actionable items like URLs, concrete facts, or clear next steps",
      "fail_description": "The response is vague, generic, or lacks specific actionable information"
    }
  ]
}
```

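Rule types:

Rule	Parameters	Description
`regex`	`pattern`	Pass if regex matches output
`banned_phrases`

LLM eval type:

Field	Description
`type`	Must be `llm`
`name`

See eval-guide.md for detailed guidance on writing effective evals.

Step 3: Run autoresearch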

```bash
python autoresearch.py \
  --target ../workspace/skills/brain-search/SKILL.md \
  --evals eval.json \
  --provider minimax \
  --runs 5 \
  --max-experiments 30 \
  --dashboard
```

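Step 4: Review results and apply changes

The script writes results to results.tsv in the working directory. Each row is one experiment:

```
experiment_id	parent_id	mutation_description	avg_score	pass_rate	evals_detail	prompt_diff
```

Find the best-performing variants:

```bash
# Sort on the 4th tab-separated column (avg_score), highest first.
# -t$'\t' matters: descriptions contain spaces, so whitespace splitting would pick the wrong field.
sort -t$'\t' -k4,4nr results.tsv | head -5
```

Apply the winning prompt to your skill by copying the optimized prompt text to replace the original.

Example: optimizing brain-search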

```
User: brain-search results often lack source links, please optimize it

Full workflow:

1. Create eval.json:

{
  "test_inputs": [
    "search for latest news on OpenAI",
    "搜一下最新的 AI 芯片进展",
    "find recent papers on RAG optimization",
    "what happened with Anthropic this week",
    "查查 GPU 价格趋势"
  ],
  "evals": [
    { "name": "has_urls", "type": "rule", "rule": "regex", "pattern": "https?://[^\\s]+" },
    { "name": "min_2_sources", "type": "rule", "rule": "regex", "pattern": "https?://[^\\s]+.*https?://[^\\s]+" },
    {
      "name": "structured_output",
      "type": "llm",
      "question": "Is the output well-structured with clear sections?",
      "pass_description": "Output uses clear structure like bullets or headers",
      "fail_description": "Output is a wall of text without clear structure"
    }
  ]
}

2. Run the command:

python autoresearch.py \
  --target ../workspace/skills/brain-search/SKILL.md \
  --evals eval.json \
  --runs 5 \
  --max-experiments 20

3. Review and apply the results:

- Check results.tsv for the highest-scoring variant
- Read mutation_description to understand the key changes
- Apply the best prompt to the original SKILL.md
```

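Failure handling

Issue	Action
LLM API rate limit	Script auto-retries with backoff; if persistent, reduce `--runs`
Target file not found	Check the path; it must point to a readable prompt or skill file
All experiments score 0	Evals may be too strict; review the eval definitions and loosen the criteria
Script crashes mid-run	Results already written to results.tsv are preserved; a re-run continues from there

Gotchas

- Each experiment makes multiple LLM calls (runs × test_inputs × llm_evals), so watch API usage
- LLM evals are noisy; set `--runs` to 5 or more for the scores to be statistically meaningful
- Rule evals are more stable and cheaper than LLM evals; prefer rules
- A very low baseline score (< 20%) suggests the eval definitions may be at fault; fix the evals first
- Prompt optimization cannot fix architectural issues (for example, a search API that itself returns poor results)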
