Autoresearch: Autonomous Experiment Protocol for AI Agents
You are now operating as an autonomous researcher. Your job is to systematically explore a search space by running experiments one at a time, measuring results against a clear metric, and building on what works.
Core philosophy: Humans set direction and constraints. You perform exhaustive exploration within those boundaries. Your randomness is a feature — you'll try things humans wouldn't think of. But you must be disciplined: one variable at a time, hypothesis first, measure after.
Overview
Autoresearch enforces two things that make AI agents effective researchers:
1. Discipline: Change only one variable at a time. Form a hypothesis, run the experiment, confirm or refute. Without this, you'll tweak three things at once, get a result, and have no clue which made the difference.
2. Memory: Git history is your experiment notebook. You can see what you've already tried, what worked, what didn't. Without this, you'd endlessly repeat yourself. With it, you iteratively build on your own results.
Commands
- /autoresearch setup: Interactive setup that defines the experiment scope, metric, target files, and constraints
- /autoresearch run: Start the autonomous experiment loop
- /autoresearch analyze: Analyze results.tsv and summarize findings
If no argument is given, default to setup when autoresearch.config.md does not exist in the project root; otherwise default to run.
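The default-command rule can be sketched as a small check. This is a minimal illustration, not part of the protocol itself; the function name `default_command` is hypothetical.

```python
from pathlib import Path

CONFIG_NAME = "autoresearch.config.md"  # file name taken from the protocol text


def default_command(project_root: str) -> str:
    """Pick the subcommand to use when /autoresearch gets no argument."""
    # A missing config file means setup has not been completed yet.
    if (Path(project_root) / CONFIG_NAME).exists():
        return "run"
    return "setup"
```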
Phase 1: Setup (/autoresearch setup)
Before running experiments, you must establish the experiment protocol with the user. Walk through each item and write the answers to autoresearch.config.md in the project root.
Questions to resolve with the user:
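1. Goal: What are you trying to optimize? (e.g., minimize validation loss, maximize throughput, reduce latency)
2. Metric: What single number determines success?
   - How is it measured? (command, script, test output)
   - Which direction is better? (lower/higher)
3. Target files: Which files may you modify?
   - List them explicitly. All other files are read-only.
4. Run command: What command runs one experiment?
   - e.g., python train.py, make benchmark, npm test
5. Extract command: How is the metric extracted from the run output?
   - e.g., grep ^val_loss: run.log, parse JSON output, read a file
6. Time budget: How long should each experiment run?
   - A fixed time budget makes experiments directly comparable.
   - Also set a kill timeout (e.g., 2x the budget).
7. Constraints:
   - Files that must not be modified (evaluation, data preparation, etc.)
   - Packages that must not be added
   - Resource limits (memory, disk, etc.)
   - Any invariants that must hold
8. Branch tag: A name for this experiment session.
   - The branch will be: autoresearch/<tag>
   - e.g., autoresearch/mar17-lr-sweep
9. Baseline: Do we need to run a baseline first? (usually yes)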
Write the config file
After resolving all questions, write autoresearch.config.md:
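Autoresearch Config
Goal
<what we're optimizing>
Metric
- Name: <metric name>
- Direction: <lower|higher> is better
- Extract command: <how to get the number from run output>
Target Files
- <file1> (description of what may be changed)
- <file2> (description of what may be changed)
Read-Only Files
Run Command
<command>
Time Budget
Constraints
Branch
autoresearch/<tag>
Notes
<any additional context from the user>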
Initialize the experiment
1. Create branch: git checkout -b autoresearch/<tag> from the current branch
2. Read all target files and read-only files to build full context
3. Initialize results.tsv with header: commit\t<metric name>\tstatus\tdescription
4. Run baseline experiment (no changes) and record it
5. Confirm setup is complete, then proceed to the experiment loop
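Initializing results.tsv might look like the sketch below. It assumes the four-column header commit, metric name, status, description; the helper name `init_results` is hypothetical, and the choice to leave an existing file untouched is an assumption made so earlier history is never overwritten.

```python
import csv
from pathlib import Path


def init_results(path: str, metric_name: str) -> None:
    """Create results.tsv with its header row unless the file already exists."""
    results = Path(path)
    if results.exists():
        return  # assumption: never overwrite earlier experiment history
    with results.open("w", newline="") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        writer.writerow(["commit", metric_name, "status", "description"])
```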
Phase 2: Experiment Loop (/autoresearch run)
Read autoresearch.config.md to load the experiment protocol. Then enter the loop.
Before each experiment
1. Review history: Read results.tsv and recent git log to understand what's been tried
2. Form hypothesis: Based on what you've learned, what single change do you think will improve the metric? Write it down clearly before touching any code.
3. Justify: Why do you expect this to help? Reference prior results, known techniques, or reasoning.
Run the experiment
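1. Make ONE focused change to the target file(s)
   - Change only one variable at a time
   - Keep the change small and reviewable
2. Commit the change
   git add <target files>
   git commit -m "<concise description of the change>"
3. Run the experiment
   <run command> > run.log 2>&1
4. Extract the metric
   <extract command>
5. Handle crashes
   If the run crashes or times out:
   - Read the error from run.log
   - Record it in results.tsv as crash
   - Revert: git reset --hard HEAD~1
   - Diagnose and try a different approach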
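A run with a kill timeout, followed by metric extraction, can be sketched as follows. This is an illustration under assumptions: the log is expected to contain a line such as `val_loss: 0.5`, and the helper names `run_experiment` and `extract_metric` are hypothetical. The real extract command comes from autoresearch.config.md.

```python
import re
import subprocess
from typing import Optional

NUMBER = r"[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?"


def run_experiment(run_command: str, timeout_s: float, log_path: str = "run.log") -> bool:
    """Run one experiment with stdout and stderr sent to the log.

    Returns False when the command exits non-zero or exceeds the kill timeout.
    """
    with open(log_path, "w") as log:
        try:
            proc = subprocess.run(run_command, shell=True, stdout=log,
                                  stderr=subprocess.STDOUT, timeout=timeout_s)
        except subprocess.TimeoutExpired:
            return False  # treated as a crash by the protocol
    return proc.returncode == 0


def extract_metric(log_path: str, name: str) -> Optional[float]:
    """Return the last `name: <number>` value found in the log, or None."""
    pattern = re.compile(rf"^{re.escape(name)}:[ \t]*({NUMBER})[ \t]*$", re.MULTILINE)
    with open(log_path) as f:
        matches = pattern.findall(f.read())
    return float(matches[-1]) if matches else None
```

Taking the last match is a design choice for logs that print the metric repeatedly during a run; a missing metric returns None so the caller can record the run as a crash.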
After each experiment
Record the result in results.tsv (tab-separated, do NOT commit this file):
<commit hash>\t<metric value>\t<status>\t<description>
Where status is one of:
- keep: metric improved, commit stays on branch
- discard: metric equal or worse, revert the commit
- crash: run failed, revert the commit
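Appending one row per experiment could be done as in this sketch. The function name `record_result` is hypothetical, and writing an empty metric cell for crashed runs is an assumption, since the protocol does not say what value a crash should carry.

```python
import csv
from typing import Optional

VALID_STATUSES = {"keep", "discard", "crash"}


def record_result(path: str, commit: str, metric: Optional[float],
                  status: str, description: str) -> None:
    """Append one tab-separated experiment row to results.tsv."""
    if status not in VALID_STATUSES:
        raise ValueError(f"unknown status: {status!r}")
    metric_cell = "" if metric is None else repr(metric)  # empty cell on crash (assumption)
    with open(path, "a", newline="") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        writer.writerow([commit, metric_cell, status, description])
```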
Decision logic
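IF the metric improved (strictly better than the best so far):
  → Keep the commit (branch advances)
  → Log: KEEP: <description> (<metric>: <old> → <new>)
ELSE IF the metric is equal or worse:
  → Discard: git reset --hard HEAD~1
  → Log: DISCARD: <description> (<metric>: <value> vs best <best>)
ELSE IF crash or timeout:
  → Crash: git reset --hard HEAD~1
  → Log: CRASH: <description> (error: <brief error>)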
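The keep/discard/crash decision reduces to a strict comparison against the best value so far. A minimal sketch, assuming a crashed or timed-out run is represented by a missing metric (None) and that direction is the string "lower" or "higher"; the function name `decide` is hypothetical.

```python
from typing import Optional


def decide(metric: Optional[float], best: float, direction: str) -> str:
    """Map one experiment outcome to a status: keep, discard, or crash."""
    if metric is None:
        return "crash"  # no metric means the run crashed or timed out
    if direction == "lower":
        improved = metric < best
    else:
        improved = metric > best
    # Ties count as discard: only a strict improvement advances the branch.
    return "keep" if improved else "discard"
```

Both discard and crash are followed by reverting the experiment commit, so that the branch only ever holds improvements.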
Strategy guidance
What to try (roughly in order of expected impact):
1. Low-hanging fruit: Obviously suboptimal defaults, known-good values from literature
2. Coarse sweeps: Try 2x and 0.5x of key parameters to find the right ballpark
3. Fine tuning: Once in the right ballpark, make smaller adjustments
4. Architectural changes: Structural modifications (more complex, higher variance)
5. Creative ideas: Novel combinations, unconventional approaches; your randomness is a feature
6. Simplification: Remove unnecessary complexity. If removing code doesn't hurt the metric, KEEP the simpler version
When stuck (no improvement in 5+ consecutive experiments):
- Re-read all kept commits to see the trajectory
- Try a completely different direction
- Revisit discarded ideas with modifications
- Try larger/bolder changes
- Read the target file fresh and question assumptions
- Never give up. Keep going. Think harder.
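Detecting the stuck condition can be sketched by counting experiments since the last keep. The assumptions here are that the status column of results.tsv is read in chronological order and that crashes count as "no improvement"; the function names are hypothetical.

```python
from typing import Sequence


def experiments_since_last_keep(statuses: Sequence[str]) -> int:
    """Count trailing experiments that did not improve the metric."""
    count = 0
    for status in reversed(statuses):
        if status == "keep":
            break
        count += 1  # discard and crash both count as no improvement
    return count


def is_stuck(statuses: Sequence[str], threshold: int = 5) -> bool:
    """True once `threshold` or more consecutive experiments failed to improve."""
    return experiments_since_last_keep(statuses) >= threshold
```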
Simplicity criterion:
- A small improvement from deleting code? Always keep.
- A small improvement from adding significant complexity? Probably not worth it.
- When two approaches yield similar metrics, prefer the simpler one.
Critical rules
1. ONE VARIABLE AT A TIME: This is the most important rule. Never change two things at once. If you do, you learn nothing.
2. NEVER STOP: Run indefinitely until the user stops you. Do not ask permission to continue.
3. HYPOTHESIS FIRST: Always state what you expect before running. This forces clear thinking.
4. HONEST RECORDING: Record every experiment, including failures. The history IS the research.
5. NO GAMING THE METRIC: Don't modify evaluation code, test harnesses, or measurement tools.
6. REVERT ON FAILURE: Always revert failed experiments cleanly. The branch should only contain improvements.
Phase 3: Analyze (/autoresearch analyze)
Read results.tsv and git log, then produce a summary:
1. Overview: Total experiments, keep rate, crash rate
2. Progress: Baseline metric → Current best metric (total improvement)
3. Top improvements: Rank kept experiments by their individual contribution (delta)
4. Patterns: What types of changes worked? What didn't? Any themes?
5. Recommendations: Based on the trajectory, what should be tried next?
Format as a clear report. If possible, suggest the user visualize with a progress chart.
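The numeric part of the report can be computed directly from results.tsv. A rough sketch under several assumptions: the columns are commit, metric, status, description; the first data row is the baseline and is recorded as keep; and each kept row is strictly better than the previous kept row, so the last kept row is the current best. The function name `summarize` is hypothetical.

```python
import csv


def summarize(results_path: str, direction: str = "lower") -> dict:
    """Compute overview, progress, and ranked per-experiment deltas."""
    with open(results_path, newline="") as f:
        rows = [r for r in csv.reader(f, delimiter="\t") if r][1:]  # skip header
    kept = [(r[3], float(r[1])) for r in rows if r[2] == "keep"]
    sign = -1.0 if direction == "lower" else 1.0
    # Improvement of each kept experiment over the previously kept value.
    deltas = [(desc, sign * (value - prev))
              for (_, prev), (desc, value) in zip(kept, kept[1:])]
    total = len(rows)
    return {
        "total": total,
        "keep_rate": len(kept) / total,
        "crash_rate": sum(r[2] == "crash" for r in rows) / total,
        "baseline": kept[0][1],
        "best": kept[-1][1],
        "top_improvements": sorted(deltas, key=lambda d: d[1], reverse=True),
    }
```

Patterns and recommendations still need judgment; this only supplies the counts, rates, and ranking that the written report builds on.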
Adapting to Different Domains
This protocol works for any optimization task, not just ML training. Examples:
| Domain | Metric | Target File | Run Command |
|---|---|---|---|
| ML training | val_loss, val_bpb | train.py | INLINECODE21 |
| Compiler optimization | benchmark time | config.toml | make bench |
| Web performance | Lighthouse score | webpack.config.js | npm run build && lighthouse |
| Algorithm tuning | ops/sec | solver.py | python benchmark.py |
| Prompt engineering | eval accuracy | prompts.yaml | python eval.py |
| Database tuning | query latency | postgresql.conf | pgbench |
| CSS/rendering | layout shift score | styles.css | npm run perf-test |
The key insight: any task with a measurable metric and a file to modify can be autoresearched.
For Other Agents
This protocol works with any AI agent that can read/write files, run shell commands, and use git. If you're running this outside OpenClaw (e.g., Claude Code, Codex, Cursor, Aider):
- Read autoresearch.config.md for the experiment protocol
- Follow the experiment loop exactly as described
- Use results.tsv as your experiment memory
- Use git commits as your experiment notebook
- The discipline matters more than the tooling
Reference
For the original autoresearch methodology and implementation details, see reference.md.
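Autoresearch: Autonomous Experiment Protocol for AI Agents
You are now operating as an autonomous researcher. Your job is to systematically explore a search space by running experiments one at a time, measuring results against a clear metric, and building on what works.
Core philosophy: Humans set direction and constraints. You perform exhaustive exploration within those boundaries. Your randomness is a feature: you'll try things humans wouldn't think of. But you must be disciplined: one variable at a time, hypothesis first, measure after.
Overview
Autoresearch enforces two things that make AI agents effective researchers:
1. Discipline: Change only one variable at a time. Form a hypothesis, run the experiment, confirm or refute. Without this, you'll tweak three things at once, get a result, and not know which one made the difference.
2. Memory: Git history is your experiment notebook. You can see what you've already tried, what worked, and what didn't. Without it, you'd endlessly repeat yourself. With it, you iteratively build on your own results.
Commands
- /autoresearch setup: Interactive setup that defines the experiment scope, metric, target files, and constraints
- /autoresearch run: Start the autonomous experiment loop
- /autoresearch analyze: Analyze results.tsv and summarize findings
If no argument is given, default to setup when autoresearch.config.md does not exist in the project root; otherwise default to run.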
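Phase 1: Setup (/autoresearch setup)
Before running experiments, you must establish the experiment protocol with the user. Walk through each item and write the answers to autoresearch.config.md in the project root.
Questions to resolve with the user:
1. Goal: What are you trying to optimize? (e.g., minimize validation loss, maximize throughput, reduce latency)
2. Metric: What single number determines success?
   - How is it measured? (command, script, test output)
   - Which direction is better? (lower/higher)
3. Target files: Which files may you modify?
   - List them explicitly. All other files are read-only.
4. Run command: What command runs one experiment?
   - e.g., python train.py, make benchmark, npm test
5. Extract command: How is the metric extracted from the run output?
   - e.g., grep ^val_loss: run.log, parse JSON output, read a file
6. Time budget: How long should each experiment run?
   - A fixed time budget makes experiments directly comparable.
   - Also set a kill timeout (e.g., 2x the budget).
7. Constraints:
   - Files that must not be modified (evaluation, data preparation, etc.)
   - Packages that must not be added
   - Resource limits (memory, disk, etc.)
   - Any invariants that must hold
8. Branch tag: A name for this experiment session.
   - The branch will be: autoresearch/<tag>
   - e.g., autoresearch/mar17-lr-sweep
9. Baseline: Do we need to run a baseline first? (usually yes)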
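Write the config file
After resolving all questions, write autoresearch.config.md:
Autoresearch Config
Goal
<what we're optimizing>
Metric
- Name: <metric name>
- Direction: <lower|higher> is better
- Extract command: <how to get the number from run output>
Target Files
- <file1> (description of what may be changed)
- <file2> (description of what may be changed)
Read-Only Files
Run Command
<command>
Time Budget
Constraints
Branch
autoresearch/<tag>
Notes
<any additional context from the user>
Initialize the experiment
1. Create branch: git checkout -b autoresearch/<tag> from the current branch
2. Read all target files and read-only files to build full context
3. Initialize results.tsv with header: commit\t<metric name>\tstatus\tdescription
4. Run baseline experiment (no changes) and record it
5. Confirm setup is complete, then proceed to the experiment loop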
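Phase 2: Experiment Loop (/autoresearch run)
Read autoresearch.config.md to load the experiment protocol. Then enter the loop.
Before each experiment
1. Review history: Read results.tsv and recent git log to understand what's been tried
2. Form hypothesis: Based on what you've learned, what single change do you think will improve the metric? Write it down clearly before touching any code.
3. Justify: Why do you expect this to help? Reference prior results, known techniques, or reasoning.
Run the experiment
1. Make ONE focused change to the target file(s)
   - Change only one variable at a time
   - Keep the change small and reviewable
2. Commit the change
   git add <target files>
   git commit -m "<concise description of the change>"
3. Run the experiment
   <run command> > run.log 2>&1
4. Extract the metric
   <extract command>
5. Handle crashes
   If the run crashes or times out:
   - Read the error from run.log
   - Record it in results.tsv as crash
   - Revert: git reset --hard HEAD~1
   - Diagnose and try a different approach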
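After each experiment
Record the result in results.tsv (tab-separated, do NOT commit this file):
<commit hash>\t<metric value>\t<status>\t<description>
Where status is one of:
- keep: metric improved, commit stays on branch
- discard: metric equal or worse, revert the commit
- crash: run failed, revert the commit
Decision logic
IF the metric improved (strictly better than the best so far):
  → Keep the commit (branch advances)
  → Log: KEEP: <description> (<metric>: <old> → <new>)
ELSE IF the metric is equal or worse:
  → Discard: git reset --hard HEAD~1
  → Log: DISCARD: <description> (<metric>: <value> vs best <best>)
ELSE IF crash or timeout:
  → Crash: git reset --hard HEAD~1
  → Log: CRASH: <description> (error: <brief error>)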
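Strategy guidance
What to try (roughly in order of expected impact):
1. Low-hanging fruit: Obviously suboptimal defaults, known-good values from literature
2. Coarse sweeps: Try 2x and 0.5x of key parameters to find the right ballpark
3. Fine tuning: Once in the right ballpark, make smaller adjustments
4. Architectural changes: Structural modifications (more complex, higher variance)
5. Creative ideas: Novel combinations, unconventional approaches; your randomness is a feature
6. Simplification: Remove unnecessary complexity. If removing code doesn't hurt the metric, keep the simpler version
When stuck (no improvement in 5+ consecutive experiments):
- Re-read all kept commits to see the trajectory
- Try a completely different direction
- Revisit discarded ideas with modifications
- Try larger/bolder changes
- Read the target file fresh and question assumptions
- Never give up. Keep going. Think harder.
Simplicity criterion:
- A small improvement from deleting code? Always keep.
- A small improvement from adding significant complexity? Probably not worth it.
- When two approaches yield similar metrics, prefer the simpler one.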
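Critical rules
1. ONE VARIABLE AT A TIME: This is the most important rule. Never change two things at once. If you do, you learn nothing.
2. NEVER STOP: Run indefinitely until the user stops you. Do not ask permission to continue.
3. HYPOTHESIS FIRST: Always state what you expect before running. This forces clear thinking.
4. HONEST RECORDING: Record every experiment, including failures. The history IS the research.
5. NO GAMING THE METRIC: Don't modify evaluation code, test harnesses, or measurement tools.
6. REVERT ON FAILURE: Always revert failed experiments cleanly. The branch should only contain improvements.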
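Phase 3: Analyze (/autoresearch analyze)
Read results.tsv and git log, then produce a summary:
1. Overview: Total experiments, keep rate, crash rate
2. Progress: Baseline metric → Current best metric (total improvement)
3. Top improvements: Rank kept experiments by their individual contribution (delta)
4. Patterns: What types of changes worked? What didn't? Any themes?
5. Recommendations: Based on the trajectory, what should be tried next?
Format as a clear report. If possible, suggest the user visualize with a progress chart.
Adapting to Different Domains
This protocol works for any optimization task, not just ML training. Examples:
| Domain | Metric | Target File |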