Autoresearch Agent

You sleep. The agent experiments. You wake up to results.

Autonomous experiment loop inspired by Karpathy's autoresearch. The agent edits one file, runs a fixed evaluation, keeps improvements, discards failures, and loops indefinitely.

Not one guess — fifty measured attempts, compounding.

Slash Commands

Command	What it does
/ar:setup	Set up a new experiment interactively
/ar:run

When This Skill Activates

Recognize these patterns from the user:

- "Make this faster / smaller / better"
- "Optimize [file] for [metric]"
- "Improve my [headlines / copy / prompts]"
- "Run experiments overnight"
- "I want to get [metric] from X to Y"
- Any request involving: optimize, benchmark, improve, experiment loop, autoresearch

If the user describes a target file + a way to measure success → this skill applies.

Setup

First Time — Create the Experiment

Run the setup script. The user decides where experiments live:

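Project-level (inside repo, git-tracked, shareable with team):

    python scripts/setup_experiment.py \
        --domain engineering \
        --name api-speed \
        --target src/api/search.py \
        --eval "pytest bench.py --tb=no -q" \
        --metric p50_ms \
        --direction lower \
        --scope project

User-level (personal, in ~/.autoresearch/):

    python scripts/setup_experiment.py \
        --domain marketing \
        --name medium-ctr \
        --target content/titles.md \
        --eval "python evaluate.py" \
        --metric ctr_score \
        --direction higher \
        --evaluator llm_judge_content \
        --scope user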

The --scope flag determines where .autoresearch/ lives:

- project (default) → .autoresearch/ in the repo root. Experiment definitions are git-tracked. Results are gitignored.
- user → ~/.autoresearch/ in the home directory. Everything is personal.

What Setup Creates

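    .autoresearch/
    ├── config.yaml              ← global settings
    ├── .gitignore               ← ignores results.tsv, *.log
    └── {domain}/{experiment-name}/
        ├── program.md           ← objectives, constraints, strategy
        ├── config.cfg           ← target, eval command, metric, direction
        ├── results.tsv          ← experiment log (gitignored)
        └── evaluate.py          ← evaluation script (if --evaluator was used)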

results.tsv columns: commit | metric | status | description

- commit: short git hash
- metric: float value, or "N/A" for crashes
- status: keep | discard | crash
- description: what changed, or why it crashed
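As a rough illustration of how this log can be read back, the sketch below parses results.tsv text and picks the best kept run. It assumes the file starts with a header row naming the four columns; the sample rows and the helper names `load_results` and `best_kept` are hypothetical and not part of the skill's scripts.

```python
import csv
import io

# Assumed layout: a header row, then tab-separated rows with the four
# columns named in the text. The sample data is invented for illustration.
SAMPLE_TSV = (
    "commit\tmetric\tstatus\tdescription\n"
    "a1b2c3d\t42.5\tkeep\tbaseline\n"
    "e4f5a6b\t40.1\tkeep\tcache parsed query\n"
    "c7d8e9f\tN/A\tcrash\tmissing import\n"
    "0a1b2c3\t44.0\tdiscard\tswitch to regex tokenizer\n"
)


def load_results(text):
    """Parse results.tsv text into dicts; crashed rows get metric=None."""
    rows = []
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        raw = row["metric"]
        row["metric"] = None if raw == "N/A" else float(raw)
        rows.append(row)
    return rows


def best_kept(rows, direction="lower"):
    """Return the kept row with the best metric, or None if nothing was kept."""
    kept = [r for r in rows if r["status"] == "keep" and r["metric"] is not None]
    if not kept:
        return None
    pick = min if direction == "lower" else max
    return pick(kept, key=lambda r: r["metric"])


rows = load_results(SAMPLE_TSV)
print(best_kept(rows)["commit"])
```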

Domains

Domain	Use Cases
engineering	Code speed, memory, bundle size, test pass rate, build time
marketing

If `program.md` Already Exists

The user may have written their own program.md. If found in the experiment directory, read it. It overrides the template. Only ask for what's missing.

Agent Protocol

You are the loop. The scripts handle setup and evaluation — you handle the creative work.

Before Starting

1. Read .autoresearch/{domain}/{name}/config.cfg to get:
   - target: the file you edit
   - evaluate_cmd: the command that measures your changes
   - metric: the metric name to look for in eval output
   - metric_direction: whether "lower" or "higher" is better
   - time_budget_minutes: max time per evaluation
2. Read program.md for strategy, constraints, and what you can and cannot change
3. Read results.tsv for experiment history (columns: commit, metric, status, description)
4. Check out the experiment branch: git checkout autoresearch/{domain}/{name}

Each Iteration

1. Review results.tsv: what worked? What failed? What hasn't been tried?
2. Decide ONE change to the target file. One variable per experiment.
3. Edit the target file
4. Commit: git add {target} && git commit -m "experiment: {description}"
5. Evaluate: python scripts/run_experiment.py --experiment {domain}/{name} --single
6. Read the output: it prints KEEP, DISCARD, or CRASH with the metric value
7. Go to step 1

What the Script Handles (you don't)

- Running the eval command with timeout
- Parsing the metric from eval output
- Comparing to previous best
- Reverting the commit on failure (git reset --hard HEAD~1)
- Logging the result to results.tsv
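The comparison step can be pictured with a small sketch. This is not the code in run_experiment.py. The function name `decide` is hypothetical, and treating a tie as a discard is an assumption, since the text does not say how ties are handled.

```python
def decide(metric, best, direction):
    """Return 'keep', 'discard', or 'crash' for one evaluation result.

    metric:    parsed value, or None if the eval crashed or printed no metric
    best:      best metric so far, or None if there is no baseline yet
    direction: "lower" or "higher" (which way is better)
    """
    if metric is None:
        return "crash"
    if best is None:
        return "keep"  # the first successful run becomes the baseline
    if direction == "lower":
        improved = metric < best
    elif direction == "higher":
        improved = metric > best
    else:
        raise ValueError(f"unknown direction: {direction!r}")
    # Assumption: a tie is not an improvement, so it is discarded.
    return "keep" if improved else "discard"


print(decide(40.1, 42.5, "lower"))
print(decide(7.2, 8.0, "higher"))
```

On a discard or crash, the script's revert (git reset --hard HEAD~1) returns the target file to the last kept state, so each new experiment starts from the current best.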

Starting an Experiment

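    # Single iteration (the agent calls this repeatedly)
    python scripts/run_experiment.py --experiment engineering/api-speed --single

    # Dry run (test the setup before starting)
    python scripts/run_experiment.py --experiment engineering/api-speed --dry-run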

Strategy Escalation

- Runs 1-5: Low-hanging fruit (obvious improvements, simple optimizations)
- Runs 6-15: Systematic exploration (vary one parameter at a time)
- Runs 16-30: Structural changes (algorithm swaps, architecture shifts)
- Runs 30+: Radical experiments (completely different approaches)
- If no improvement in 20+ runs: update program.md Strategy section
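The escalation schedule above can be written as a lookup from run number to phase. This is only a sketch with a hypothetical `strategy_phase` helper. Because the list names run 30 in two phases, the sketch assumes it belongs to the structural phase.

```python
# Last run number of each phase, taken from the list above.
# Assumption: run 30 is treated as structural, run 31 onward as radical.
PHASES = [
    (5, "low-hanging fruit"),
    (15, "systematic exploration"),
    (30, "structural changes"),
]


def strategy_phase(run_number):
    """Map a 1-based run number to the escalation phase it falls in."""
    if run_number < 1:
        raise ValueError("run_number starts at 1")
    for last_run, name in PHASES:
        if run_number <= last_run:
            return name
    return "radical experiments"


for n in (1, 6, 16, 31):
    print(n, strategy_phase(n))
```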

Self-Improvement

After every 10 experiments, review results.tsv for patterns. Update the Strategy section of program.md with what you learned (e.g., "caching changes consistently improve by 5-10%", "refactoring attempts never improve the metric"). Future iterations benefit from this accumulated knowledge.

Stopping

- Run until interrupted by the user, context limit reached, or goal in program.md is met
- Before stopping: ensure results.tsv is up to date
- On context limit: the next session can resume, since results.tsv and the git log persist

Rules

- One change per experiment. Don't change 5 things at once. You won't know what worked.
- Simplicity criterion. A small improvement that adds ugly complexity is not worth it. Equal performance with simpler code is a win. Removing code while getting the same results is the best outcome.
- Never modify the evaluator. evaluate.py is the ground truth. Modifying it invalidates all comparisons. Hard stop if you catch yourself doing this.
- Timeout. If a run exceeds 2.5× the time budget, kill it and treat it as a crash.
- Crash handling. If it's a typo or missing import, fix and re-run. If the idea is fundamentally broken, revert, log "crash", and move on. 5 consecutive crashes → pause and alert.
- No new dependencies. Only use what's already available in the project.
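The timeout rule can be sketched with the standard library's subprocess.run, which kills the child process once its timeout expires. The helper name `run_eval` and the choice to return None for a crash are assumptions made for this example, not the behavior of run_experiment.py.

```python
import subprocess
import sys

TIMEOUT_FACTOR = 2.5  # from the timeout rule above


def run_eval(cmd, time_budget_minutes):
    """Run an eval command (a list of arguments).

    Returns the command's stdout, or None when it times out or exits with a
    nonzero status. Both cases would be logged as a crash.
    """
    limit_seconds = time_budget_minutes * 60 * TIMEOUT_FACTOR
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=limit_seconds
        )
    except subprocess.TimeoutExpired:
        # subprocess.run has already killed and reaped the child here.
        return None
    return proc.stdout if proc.returncode == 0 else None


# Stand-in eval command: a one-line Python program that prints a metric.
output = run_eval([sys.executable, "-c", "print('p50_ms: 12.5')"], time_budget_minutes=1)
print(output)
```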

Evaluators

Ready-to-use evaluation scripts. Copied into the experiment directory during setup with --evaluator.

Free Evaluators (no API cost)

Evaluator	Metric	Use Case
benchmark_speed	p50_ms (lower)	Function/API execution time
benchmark_size

LLM Judge Evaluators (uses your subscription)

Evaluator	Metric	Use Case
llm_judge_content	ctr_score 0-10 (higher)	Headlines, titles, descriptions
INLINECODE50

LLM judges call the CLI tool the user is already running (Claude, Codex, Gemini). The evaluation prompt is locked inside evaluate.py — the agent cannot modify it. This prevents the agent from gaming its own evaluator.

The user's existing subscription covers the cost:

- Claude Code Max → unlimited Claude calls for evaluation
- Codex CLI (ChatGPT Pro) → unlimited Codex calls
- Gemini CLI (free tier) → free evaluation calls

Custom Evaluators

If no built-in evaluator fits, the user writes their own evaluate.py. Only requirement: it must print metric_name: value to stdout.

CODEBLOCK4
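A custom evaluator along these lines might look like the sketch below. It times a stand-in function and prints a single `p50_ms: value` line to stdout. The `work` function, the repeat count, and the `parse_metric` helper (which shows how such a line could be read back) are all hypothetical.

```python
import re
import statistics
import time


def work():
    """Stand-in for the code under test (hypothetical)."""
    return sum(i * i for i in range(10_000))


def measure_p50_ms(fn, repeats=15):
    """Median wall-clock time of fn() in milliseconds over several runs."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def format_metric(name, value):
    """Render the one line the loop looks for: `metric_name: value`."""
    return f"{name}: {value:.4f}"


def parse_metric(output, name):
    """Find a `name: value` line in eval output; None if it is absent."""
    pattern = rf"^{re.escape(name)}:\s*(-?\d+(?:\.\d+)?)\s*$"
    match = re.search(pattern, output, re.MULTILINE)
    return float(match.group(1)) if match else None


if __name__ == "__main__":
    print(format_metric("p50_ms", measure_p50_ms(work)))
```

Printing other diagnostic lines is harmless in this sketch, because the parser only matches a line that starts with the metric name.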

Viewing Results

CODEBLOCK5

Dashboard Output

CODEBLOCK6

Export Formats

- TSV: default, tab-separated (compatible with spreadsheets)
- CSV: comma-separated, with proper quoting
- Markdown: formatted table, readable in GitHub/docs
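To make the difference between the formats concrete, here is a minimal conversion sketch using only the standard library. It is not the skill's export code. The function names are hypothetical and the input is assumed to begin with a header row.

```python
import csv
import io


def tsv_to_csv(tsv_text):
    """Re-emit tab-separated rows as CSV; csv.writer adds quoting where needed."""
    rows = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()


def tsv_to_markdown(tsv_text):
    """Render tab-separated rows as a pipe table (first row is the header)."""
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    if not rows:
        return ""

    def line(cells):
        # Escape literal pipes so they do not split a cell.
        return "| " + " | ".join(c.replace("|", "\\|") for c in cells) + " |"

    header, body = rows[0], rows[1:]
    lines = [line(header), line(["---"] * len(header))]
    lines += [line(r) for r in body]
    return "\n".join(lines) + "\n"


sample = (
    "commit\tmetric\tstatus\tdescription\n"
    "a1b2c3d\t40.1\tkeep\tcache, then reuse\n"
)
print(tsv_to_csv(sample))
print(tsv_to_markdown(sample))
```

The description field in the sample contains a comma, so the CSV writer wraps it in quotes, which is the case the "proper quoting" note refers to.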

Proactive Triggers

Flag these without being asked:

- No evaluation command works → Test it before starting the loop. Run once, verify output.
- Target file not in git → git init && git add . && git commit -m 'initial' first.
- Metric direction unclear → Ask: is lower or higher better? Must know before starting.
- Time budget too short → If eval takes longer than budget, every run crashes.
- Agent modifying evaluate.py → Hard stop. This invalidates all comparisons.
- 5 consecutive crashes → Pause the loop. Alert the user. Don't keep burning cycles.
- No improvement in 20+ runs → Suggest changing strategy in program.md or trying a different approach.
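The last two triggers depend only on the status column of results.tsv, so they can be checked mechanically. The sketch below is one possible reading. The function names and alert strings are hypothetical, and "no improvement" is taken to mean no keep among the most recent runs.

```python
def trailing_run(statuses, predicate):
    """Count how many of the most recent statuses satisfy predicate in a row."""
    count = 0
    for status in reversed(statuses):
        if not predicate(status):
            break
        count += 1
    return count


def triggers(statuses, crash_limit=5, stall_limit=20):
    """Return alerts for the crash and stall triggers (oldest status first)."""
    alerts = []
    if trailing_run(statuses, lambda s: s == "crash") >= crash_limit:
        alerts.append("pause: consecutive crashes")
    # Assumption: a run improves the metric only when its status is "keep".
    if trailing_run(statuses, lambda s: s != "keep") >= stall_limit:
        alerts.append("stalled: suggest a strategy change in program.md")
    return alerts


history = ["keep", "discard", "crash", "crash", "crash", "crash", "crash"]
print(triggers(history))
```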

Installation

One-liner (any tool)

CODEBLOCK7

Multi-tool install

CODEBLOCK8

OpenClaw

clawhub install cs-autoresearch-agent

Related Skills

- self-improving-agent: improves an agent's own memory/rules over time. NOT for structured experiment loops.
- senior-ml-engineer: ML architecture decisions. Complementary: use for initial design, then autoresearch for optimization.
- tdd-guide: test-driven development. Complementary: tests can be the evaluation function.
- skill-security-auditor: audit skills before publishing. NOT for optimization loops.
