Autoresearch Agent
You sleep. The agent experiments. You wake up to results.
Autonomous experiment loop inspired by Karpathy's autoresearch. The agent edits one file, runs a fixed evaluation, keeps improvements, discards failures, and loops indefinitely.
Not one guess — fifty measured attempts, compounding.
Slash Commands
| Command | What it does |
|---|---|
| /ar:setup | Set up a new experiment interactively |
| /ar:run | Run a single experiment iteration |
| /ar:loop | Start autonomous loop with configurable interval (10m, 1h, daily, weekly, monthly) |
| /ar:status | Show dashboard and results |
| /ar:resume | Resume a paused experiment |
When This Skill Activates
Recognize these patterns from the user:
- "Make this faster / smaller / better"
- "Optimize [file] for [metric]"
- "Improve my [headlines / copy / prompts]"
- "Run experiments overnight"
- "I want to get [metric] from X to Y"
- Any request involving: optimize, benchmark, improve, experiment loop, autoresearch
If the user describes a target file + a way to measure success → this skill applies.
Setup
First Time — Create the Experiment
Run the setup script. The user decides where experiments live:
Project-level (inside repo, git-tracked, shareable with team):
CODEBLOCK0
User-level (personal, in ~/.autoresearch/):
CODEBLOCK1
The --scope flag determines where .autoresearch/ lives:
- project (default) → .autoresearch/ in the repo root. Experiment definitions are git-tracked. Results are gitignored.
- user → ~/.autoresearch/ in the home directory. Everything is personal.
What Setup Creates
CODEBLOCK2
results.tsv columns: commit | metric | status | description
- commit: short git hash
- metric: float value or "N/A" for crashes
- status: keep | discard | crash
- description: what changed or why it crashed
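To make the column layout concrete, here is a minimal Python sketch that reads rows in this shape and picks the best kept run. It assumes the file begins with a header row naming the four columns, which the description above does not confirm. The helper best_kept and the sample rows are hypothetical, not part of the skill's scripts.

```python
import csv
import io


def best_kept(tsv_text, direction="lower"):
    """Return the best row with status 'keep' as a dict, or None if none was kept.

    Assumes a header row: commit, metric, status, description.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    # Crash rows carry "N/A" as metric, so only kept rows are converted to float.
    kept = [row for row in reader if row["status"] == "keep"]
    if not kept:
        return None
    pick = min if direction == "lower" else max
    return pick(kept, key=lambda row: float(row["metric"]))


# Hypothetical log contents, for illustration only.
sample = (
    "commit\tmetric\tstatus\tdescription\n"
    "a1b2c3d\t42.0\tkeep\tbaseline\n"
    "e4f5a6b\tN/A\tcrash\tmissing import\n"
    "c7d8e9f\t38.5\tkeep\tcache lookups\n"
    "0a1b2c3\t45.1\tdiscard\tinline helper\n"
)
print(best_kept(sample)["commit"])
```

Filtering on status before converting the metric keeps the "N/A" crash rows from raising a ValueError.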
Domains
| Domain | Use Cases |
|---|---|
| engineering | Code speed, memory, bundle size, test pass rate, build time |
| marketing | Headlines, social copy, email subjects, ad copy, engagement |
| content | Article structure, SEO descriptions, readability, CTR |
| prompts | System prompts, chatbot tone, agent instructions |
| custom | Anything else with a measurable metric |
If program.md Already Exists
The user may have written their own program.md. If found in the experiment directory, read it. It overrides the template. Only ask for what's missing.
Agent Protocol
You are the loop. The scripts handle setup and evaluation — you handle the creative work.
Before Starting
1. Read .autoresearch/{domain}/{name}/config.cfg to get:
   - target: the file you edit
   - evaluate_cmd: the command that measures your changes
   - metric: the metric name to look for in eval output
   - metric_direction: "lower" or "higher", whichever is better
   - time_budget_minutes: max time per evaluation
2. Read program.md for strategy, constraints, and what you can/cannot change
3. Read results.tsv for experiment history (columns: commit, metric, status, description)
4. Check out the experiment branch: git checkout autoresearch/{domain}/{name}
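The exact syntax of config.cfg is not shown in this document, so the following is only a sketch under the assumption that the file holds one key/value pair per line, separated by ":" or "=". The helper parse_config is hypothetical; it illustrates which fields the agent needs before it starts, not how the setup script actually writes them.

```python
import re

REQUIRED = ("target", "evaluate_cmd", "metric", "metric_direction", "time_budget_minutes")
# Assumed line shape: key, then ':' or '=', then the value.
LINE = re.compile(r"^\s*([A-Za-z_]+)\s*[:=]\s*(.*?)\s*$")


def parse_config(text):
    """Parse assumed 'key: value' lines and validate the fields listed above."""
    config = {}
    for raw in text.splitlines():
        if not raw.strip() or raw.lstrip().startswith("#"):
            continue  # skip blank lines and comments
        match = LINE.match(raw)
        if match:
            config[match.group(1)] = match.group(2)
    missing = [key for key in REQUIRED if key not in config]
    if missing:
        raise ValueError("config.cfg is missing: " + ", ".join(missing))
    if config["metric_direction"] not in ("lower", "higher"):
        raise ValueError("metric_direction must be 'lower' or 'higher'")
    config["time_budget_minutes"] = float(config["time_budget_minutes"])
    return config


# Hypothetical file contents, for illustration only.
example = """\
target: src/api/search.py
evaluate_cmd: pytest bench.py --tb=no -q
metric: p50_ms
metric_direction: lower
time_budget_minutes: 5
"""
print(parse_config(example)["evaluate_cmd"])
```

Failing early on a missing field or an unknown direction avoids starting a loop whose results cannot be compared.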
Each Iteration
1. Review results.tsv: what worked? What failed? What hasn't been tried?
2. Decide ONE change to the target file. One variable per experiment.
3. Edit the target file
4. Commit: git add {target} && git commit -m "experiment: {description}"
5. Evaluate: python scripts/run_experiment.py --experiment {domain}/{name} --single
6. Read the output: it prints KEEP, DISCARD, or CRASH with the metric value
7. Go to step 1
What the Script Handles (you don't)
- Running the eval command with timeout
- Parsing the metric from eval output
- Comparing to previous best
- Reverting the commit on failure (git reset --hard HEAD~1)
- Logging the result to results.tsv
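As a rough illustration of the first three steps (not the actual run_experiment.py, whose internals are not shown here), the sketch below runs a command with a timeout, looks for a "metric: value" line on stdout, and compares it to the previous best. The names parse_metric, decide, and evaluate are hypothetical, and treating a missing metric as a crash is an assumption. The revert and the logging steps are deliberately left out.

```python
import re
import subprocess
import sys


def parse_metric(output, metric):
    """Return the last 'metric: value' found in the output, or None if absent."""
    number = r"[-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?"
    pattern = re.compile(
        r"^[ \t]*" + re.escape(metric) + r"[ \t]*:[ \t]*(" + number + r")[ \t]*$",
        re.MULTILINE,
    )
    matches = pattern.findall(output)
    return float(matches[-1]) if matches else None


def decide(value, best, direction):
    """Map a measured value to KEEP, DISCARD, or CRASH."""
    if value is None:
        return "CRASH"  # assumption: no parsable metric counts as a crash
    if best is None:
        return "KEEP"  # first successful run becomes the baseline
    improved = value < best if direction == "lower" else value > best
    return "KEEP" if improved else "DISCARD"


def evaluate(cmd, metric, best, direction, timeout_s):
    """Run cmd (a list of arguments) and return (status, value)."""
    try:
        done = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return "CRASH", None
    value = parse_metric(done.stdout, metric)
    return decide(value, best, direction), value


# Stand-in eval command: a Python one-liner that prints a metric line.
status, value = evaluate(
    [sys.executable, "-c", "print('p50_ms: 12.5')"],
    metric="p50_ms", best=14.0, direction="lower", timeout_s=30,
)
print(status, value)
```

Keeping decide free of side effects makes the comparison rule easy to check on its own, separately from process handling.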
Starting an Experiment
CODEBLOCK3
Strategy Escalation
- Runs 1-5: Low-hanging fruit (obvious improvements, simple optimizations)
- Runs 6-15: Systematic exploration (vary one parameter at a time)
- Runs 16-30: Structural changes (algorithm swaps, architecture shifts)
- Runs 30+: Radical experiments (completely different approaches)
- If no improvement in 20+ runs: update program.md Strategy section
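The escalation schedule above can be sketched as a small lookup. This is only an illustration with hypothetical function names. Since run 30 appears in two of the ranges, the sketch assumes it still belongs to the structural phase, and it counts any run after the last "keep" as a run without improvement.

```python
def strategy_phase(run_number):
    """Return the phase name for a 1-based run number."""
    if run_number < 1:
        raise ValueError("run_number starts at 1")
    if run_number <= 5:
        return "low-hanging fruit"
    if run_number <= 15:
        return "systematic exploration"
    if run_number <= 30:  # assumption: run 30 is still structural
        return "structural changes"
    return "radical experiments"


def needs_strategy_update(statuses, threshold=20):
    """True when the most recent `threshold` or more runs produced no 'keep'."""
    since_keep = 0
    for status in reversed(statuses):
        if status == "keep":
            break
        since_keep += 1
    return since_keep >= threshold


print(strategy_phase(12))
print(needs_strategy_update(["keep"] + ["discard"] * 20))
```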
Self-Improvement
After every 10 experiments, review results.tsv for patterns. Update the
Strategy section of program.md with what you learned (e.g., "caching changes
consistently improve by 5-10%", "refactoring attempts never improve the metric").
Future iterations benefit from this accumulated knowledge.
Stopping
- Run until interrupted by the user, context limit reached, or goal in program.md is met
- Before stopping: ensure results.tsv is up to date
- On context limit: the next session can resume — results.tsv and git log persist
Rules
- One change per experiment. Don't change 5 things at once. You won't know what worked.
- Simplicity criterion. A small improvement that adds ugly complexity is not worth it. Equal performance with simpler code is a win. Removing code while getting the same results is the best outcome.
- Never modify the evaluator. evaluate.py is the ground truth. Modifying it invalidates all comparisons. Hard stop if you catch yourself doing this.
- Timeout. If a run exceeds 2.5× the time budget, kill it and treat it as a crash.
- Crash handling. If it's a typo or missing import, fix and re-run. If the idea is fundamentally broken, revert, log "crash", move on. 5 consecutive crashes → pause and alert.
- No new dependencies. Only use what's already available in the project.
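Two of these rules are numeric, so a short sketch may help pin them down. It is an illustration with hypothetical names: one helper converts the time budget into the kill threshold, the other counts crashes at the end of the status history.

```python
from itertools import takewhile

TIMEOUT_FACTOR = 2.5          # from the Timeout rule
MAX_CONSECUTIVE_CRASHES = 5   # from the Crash handling rule


def kill_after_seconds(time_budget_minutes):
    """Seconds after which a run is killed and treated as a crash."""
    return time_budget_minutes * 60 * TIMEOUT_FACTOR


def crash_streak(statuses):
    """Number of consecutive 'crash' entries at the end of the history."""
    return sum(1 for _ in takewhile(lambda s: s == "crash", reversed(statuses)))


def should_pause(statuses):
    """True when the loop should pause and alert the user."""
    return crash_streak(statuses) >= MAX_CONSECUTIVE_CRASHES


print(kill_after_seconds(5))
print(should_pause(["keep", "crash", "crash"]))
```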
Evaluators
Ready-to-use evaluation scripts. Copied into the experiment directory during setup with --evaluator.
Free Evaluators (no API cost)
| Evaluator | Metric | Use Case |
|---|---|---|
| benchmark_speed | p50_ms (lower) | Function/API execution time |
| benchmark_size | size_bytes (lower) | File, bundle, Docker image size |
| test_pass_rate | pass_rate (higher) | Test suite pass percentage |
| build_speed | build_seconds (lower) | Build/compile/Docker build time |
| memory_usage | peak_mb (lower) | Peak memory during execution |
LLM Judge Evaluators (uses your subscription)
| Evaluator | Metric | Use Case |
|---|---|---|
| llm_judge_content | ctr_score 0-10 (higher) | Headlines, titles, descriptions |
| INLINECODE50 | quality_score 0-100 (higher) | System prompts, agent instructions |
| llm_judge_copy | engagement_score 0-10 (higher) | Social posts, ad copy, emails |
LLM judges call the CLI tool the user is already running (Claude, Codex, Gemini). The evaluation prompt is locked inside evaluate.py — the agent cannot modify it. This prevents the agent from gaming its own evaluator.
The user's existing subscription covers the cost:
- Claude Code Max → unlimited Claude calls for evaluation
- Codex CLI (ChatGPT Pro) → unlimited Codex calls
- Gemini CLI (free tier) → free evaluation calls
Custom Evaluators
If no built-in evaluator fits, the user writes their own evaluate.py. Only requirement: it must print metric_name: value to stdout.
CODEBLOCK4
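As one possible shape for such a script, the sketch below times a stand-in function and prints a single "metric_name: value" line. Everything in it is an assumption for illustration: work is a hypothetical placeholder for the code under test, and p50_ms is just the metric name chosen here.

```python
import statistics
import time


def work():
    # Hypothetical stand-in for the code being measured.
    return sum(i * i for i in range(10_000))


def p50_ms(func, repeats=15):
    """Median wall-clock time of func() in milliseconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def metric_line(name, value):
    """Format the one line the loop looks for on stdout."""
    return f"{name}: {value:.3f}"


print(metric_line("p50_ms", p50_ms(work)))
```

Using the median of several repeats rather than a single timing reduces the chance that one noisy run flips a KEEP into a DISCARD.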
Viewing Results
CODEBLOCK5
Dashboard Output
CODEBLOCK6
Export Formats
- TSV: default, tab-separated (compatible with spreadsheets)
- CSV: comma-separated, with proper quoting
- Markdown: formatted table, readable in GitHub/docs
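The three formats differ only in how the same rows are serialized. The sketch below shows one way to convert TSV text into the other two with the standard library; it is not the skill's own export code, and the function names are hypothetical.

```python
import csv
import io


def tsv_to_rows(tsv_text):
    """Split TSV text into rows of cells without interpreting quote characters."""
    return list(csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE))


def to_csv(rows):
    """Serialize rows as CSV; fields containing commas or quotes get quoted."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


def to_markdown(rows):
    """Serialize rows as a Markdown table, escaping pipes inside cells."""
    header, *body = rows
    lines = ["| " + " | ".join(header) + " |", "|" + "---|" * len(header)]
    for row in body:
        cells = [cell.replace("|", "\\|") for cell in row]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)


# Hypothetical log contents, for illustration only.
rows = tsv_to_rows(
    "commit\tmetric\tstatus\tdescription\n"
    "a1b2c3d\t42.0\tkeep\tbaseline, no cache\n"
)
print(to_csv(rows))
print(to_markdown(rows))
```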
Proactive Triggers
Flag these without being asked:
- Evaluation command doesn't work → Test it before starting the loop. Run once, verify output.
- Target file not in git → git init && git add . && git commit -m 'initial' first.
- Metric direction unclear → Ask: is lower or higher better? Must know before starting.
- Time budget too short → If eval takes longer than budget, every run crashes.
- Agent modifying evaluate.py → Hard stop. This invalidates all comparisons.
- 5 consecutive crashes → Pause the loop. Alert the user. Don't keep burning cycles.
- No improvement in 20+ runs → Suggest changing strategy in program.md or trying a different approach.
Installation
One-liner (any tool)
CODEBLOCK7
Multi-tool install
CODEBLOCK8
OpenClaw
clawhub install cs-autoresearch-agent
Related Skills
- self-improving-agent: improves an agent's own memory/rules over time. NOT for structured experiment loops.
- senior-ml-engineer: ML architecture decisions. Complementary: use for initial design, then autoresearch for optimization.
- tdd-guide: test-driven development. Complementary: tests can be the evaluation function.
- skill-security-auditor: audit skills before publishing. NOT for optimization loops.
Autoresearch Agent
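You sleep. The agent experiments. You wake up to results.
An autonomous experiment loop inspired by Karpathy's autoresearch. The agent edits one file, runs a fixed evaluation, keeps improvements, discards failures, and loops indefinitely.
Not one guess, but fifty planned attempts that compound.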
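Slash Commands
| Command | What it does |
|---|---|
| /ar:setup | Set up a new experiment interactively |
| /ar:run | Run a single experiment iteration |
| /ar:loop | Start autonomous loop with configurable interval (10 minutes, 1 hour, daily, weekly, monthly) |
| /ar:status | Show dashboard and results |
| /ar:resume | Resume a paused experiment |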
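When This Skill Activates
Recognize these patterns from the user:
- "Make this faster / smaller / better"
- "Optimize [file] for [metric]"
- "Improve my [headlines / copy / prompts]"
- "Run experiments overnight"
- "I want to get [metric] from X to Y"
- Any request involving: optimize, benchmark, improve, experiment loop, autoresearch
If the user describes a target file plus a way to measure success → this skill applies.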
Setup
First Time: Create the Experiment
Run the setup script. The user decides where experiments live:
Project-level (inside the repo, tracked by Git, shareable with the team):
```bash
python scripts/setup_experiment.py \
  --domain engineering \
  --name api-speed \
  --target src/api/search.py \
  --eval "pytest bench.py --tb=no -q" \
  --metric p50_ms \
  --direction lower \
  --scope project
```
User-level (personal, in ~/.autoresearch/):
```bash
python scripts/setup_experiment.py \
  --domain marketing \
  --name medium-ctr \
  --target content/titles.md \
  --eval "python evaluate.py" \
  --metric ctr_score \
  --direction higher \
  --evaluator llm_judge_content \
  --scope user
```
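The --scope flag determines where .autoresearch/ lives:
- project (default) → .autoresearch/ in the repo root. Experiment definitions are tracked by Git. Results are gitignored.
- user → ~/.autoresearch/ in the home directory. Everything is personal.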
What Setup Creates
```
.autoresearch/
├── config.yaml                  ← Global settings
├── .gitignore                   ← Ignores results.tsv, *.log
└── {domain}/{experiment-name}/
    ├── program.md               ← Objectives, constraints, strategy
    ├── config.cfg               ← Target, eval command, metric, direction
    ├── results.tsv              ← Experiment log (gitignored)
    └── evaluate.py              ← Evaluation script (if --evaluator was used)
```
results.tsv columns: commit | metric | status | description
- commit: short Git hash
- metric: float value, or N/A for crashes
- status: keep | discard | crash
- description: what changed or why it crashed
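Domains
| Domain | Use Cases |
|---|---|
| engineering | Code speed, memory, bundle size, test pass rate, build time |
| marketing | Headlines, social copy, email subjects, ad copy, engagement |
| content | Article structure, SEO descriptions, readability, CTR |
| prompts | System prompts, chatbot tone, agent instructions |
| custom | Anything else with a measurable metric |
If program.md Already Exists
The user may have written their own program.md. If one is found in the experiment directory, read it. It overrides the template. Only ask for what's missing.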
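Agent Protocol
You are the loop. The scripts handle setup and evaluation; you handle the creative work.
Before Starting
1. Read .autoresearch/{domain}/{name}/config.cfg to get:
   - target: the file you edit
   - evaluate_cmd: the command that measures your changes
   - metric: the metric name to look for in eval output
   - metric_direction: lower or higher, whichever is better
   - time_budget_minutes: max time per evaluation
2. Read program.md for strategy, constraints, and what you can/cannot change
3. Read results.tsv for experiment history (columns: commit, metric, status, description)
4. Check out the experiment branch: git checkout autoresearch/{domain}/{name}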
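Each Iteration
1. Review results.tsv: what worked? What failed? What hasn't been tried?
2. Decide on one change to the target file. Change only one variable per experiment.
3. Edit the target file
4. Commit: git add {target} && git commit -m "experiment: {description}"
5. Evaluate: python scripts/run_experiment.py --experiment {domain}/{name} --single
6. Read the output: it prints KEEP, DISCARD, or CRASH with the metric value
7. Go to step 1
What the Script Handles (you don't)
- Running the eval command with timeout
- Parsing the metric from eval output
- Comparing to previous best
- Reverting the commit on failure (git reset --hard HEAD~1)
- Logging the result to results.tsv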
Starting an Experiment
```bash
# Single iteration (the agent calls this repeatedly)
python scripts/run_experiment.py --experiment engineering/api-speed --single

# Dry run (test the setup before starting)
python scripts/run_experiment.py --experiment engineering/api-speed --dry-run
```
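Strategy Escalation
- Runs 1-5: Low-hanging fruit (obvious improvements, simple optimizations)
- Runs 6-15: Systematic exploration (vary one parameter at a time)
- Runs 16-30: Structural changes (algorithm swaps, architecture shifts)
- Runs 30+: Radical experiments (completely different approaches)
- If no improvement in 20+ runs: update the Strategy section of program.md
Self-Improvement
After every 10 experiments, review results.tsv for patterns. Update the Strategy section of program.md with what you learned (e.g., caching changes consistently improve by 5-10%, refactoring attempts never improve the metric). Future iterations benefit from this accumulated knowledge.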
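Stopping
- Run until interrupted by the user, the context limit is reached, or the goal in program.md is met
- Before stopping: ensure results.tsv is up to date
- On context limit: the next session can resume, because results.tsv and the Git log persist
Rules
- One change per experiment. Don't change 5 things at once. You won't know which one worked.
- Simplicity criterion. A small improvement that adds ugly complexity is not worth it. Equal performance with simpler code is a win. Removing code while getting the same results is the best outcome.
- Never modify the evaluator. evaluate.py is the ground truth. Modifying it invalidates all comparisons. Stop immediately if you catch yourself doing this.
- Timeout. If a run exceeds 2.5× the time budget, kill it and treat it as a crash.
- Crash handling. If it's a typo or missing import, fix and re-run. If the idea is fundamentally broken, revert, log crash, and move on. 5 consecutive crashes → pause and alert.
- No new dependencies. Only use what's already available in the project.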
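Evaluators
Ready-to-use evaluation scripts. Copied into the experiment directory during setup with --evaluator.
Free Evaluators (no API cost)
| Evaluator | Metric | Use Case |
|---|---|---|
| benchmark_speed | p50_ms (lower) | Function/API execution time |
| benchmark_size | size_bytes (lower) | File, bundle, Docker image size |
| test_pass_rate | pass_rate (higher) | Test suite pass percentage |
| build_speed | build_seconds (lower) | Build/compile/Docker build time |
| memory_usage | peak_mb (lower) | Peak memory during execution |
LLM Judge Evaluators (uses your subscription)
| Evaluator | Metric | Use Case |
|---|---|---|
| llm_judge_content | ctr_score 0-10 (higher) | Headlines,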