autoresearch
Autonomously optimize any OpenClaw skill by running it repeatedly, scoring outputs against binary evals, mutating the prompt, and keeping improvements. Based on Karpathy's autoresearch methodology.
Triggers
Use when: optimize this skill, improve this skill, run autoresearch on, make this skill better, self-improve skill, benchmark skill, eval my skill, run evals on.
Description
Autonomous prompt/strategy optimization using Karpathy's autoresearch pattern. Mutate → evaluate → keep improvements. Works on anything with a measurable score: trading strategies, content scripts, thumbnails, ad copy, email subjects.
How It Works
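```
┌───────────────┐     ┌───────────────┐     ┌───────────────┐     ┌───────────────┐
│ 1. BASELINE   │────▶│ 2. MUTATE     │────▶│ 3. EVALUATE   │────▶│ 4. DECIDE     │
│ Score current │     │ Change ONE    │     │ Run scoring   │     │ Better?       │
│ version       │     │ thing         │     │ function      │     │ Keep : Revert │
└───────────────┘     └───────────────┘     └───────────────┘     └───────┬───────┘
                                                                          │
                                                        Loop back to step 2
```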
Instructions
Step 1: Identify the Mutable File
The mutable file is the thing you're optimizing. It can be:
- A SKILL.md prompt/instructions
- A trading strategy config (thresholds, parameters)
- A content template (YouTube script format, ad copy structure)
- Any text file where changes produce measurable differences
Create or identify this file. Example:
```
my-skill/
├── SKILL.md          ← This is your mutable file
├── eval/
│   ├── test_cases.json
│   └── score.py
```
Step 2: Create an Evaluation Function
Your eval function must:
1. Take the current mutable file as input
2. Run it against test cases
3. Return a numeric score (higher = better)
The eval can be anything:
- LLM-as-judge: Send output to an LLM, ask it to score 1-100
- Backtest: Run a strategy against historical data, measure Sharpe/returns
- A/B metrics: CTR, engagement, conversion rate
- Binary pass/fail: Count how many test cases pass out of N
Template eval function (customize for your domain):
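```python
# eval/score.py
import json
import sys


def run_and_score(current_version: str, case: dict) -> float:
    """Placeholder: replace with your domain-specific scoring."""
    raise NotImplementedError("implement scoring for your domain")


def evaluate(mutable_file_path: str, test_cases_path: str) -> float:
    """
    Score the current version of the mutable file.
    Returns a float; higher is better.
    """
    with open(mutable_file_path) as f:
        current_version = f.read()

    with open(test_cases_path) as f:
        test_cases = json.load(f)

    scores = []
    for case in test_cases:
        # YOUR SCORING LOGIC HERE
        # Example: run the prompt, compare output to the expected result
        score = run_and_score(current_version, case)
        scores.append(score)

    return sum(scores) / len(scores)


if __name__ == "__main__":
    score = evaluate(sys.argv[1], sys.argv[2])
    print(f"Score: {score}")
```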
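For the binary pass/fail option listed above, a minimal sketch might look like the following. The check functions and the `binary_score` name are hypothetical and not part of the skill; each check is assumed to take one output string and return `True` or `False`.

```python
from typing import Callable, List


# Hypothetical checks for illustration; real evals would encode domain rules.
def has_number(output: str) -> bool:
    """Pass if the output contains at least one digit."""
    return any(ch.isdigit() for ch in output)


def at_most_six_words(output: str) -> bool:
    """Pass if the output has six words or fewer."""
    return len(output.split()) <= 6


def binary_score(outputs: List[str], checks: List[Callable[[str], bool]]) -> float:
    """Return the fraction of (output, check) pairs that pass, from 0.0 to 1.0."""
    results = [check(out) for out in outputs for check in checks]
    if not results:
        raise ValueError("need at least one output and one check")
    return sum(results) / len(results)


outputs = ["5 ways to save time", "A very long subject line that rambles on and on"]
print(binary_score(outputs, [has_number, at_most_six_words]))  # 0.5
```

The first output passes both checks and the second fails both, so two of four checks pass. Because the result is a plain number where higher is better, it can serve as the score the loop compares.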
Step 3: Run the Autoresearch Loop
The loop follows this exact pattern:
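```
1. Git init (if not already done); each experiment is a commit
2. Run eval on the current version → get baseline score
3. For each experiment (1..N):
   a. Read the current mutable file
   b. Generate a mutation (change ONE thing: a threshold, a phrase, a rule)
   c. Write the mutated version
   d. Run eval → get new score
   e. If new score > baseline:
      - Git commit with message: "exp-{N}: {description} | score: {baseline} → {new score}"
      - Update baseline = new score
      - Log: ✅ KEPT (improvement)
   f. If new score <= baseline:
      - Git checkout the mutable file (revert)
      - Log: ❌ REVERTED (no improvement)
4. Print final summary: experiments run, improvements found, final score
```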
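The loop can be sketched in Python roughly as follows. This is an in-memory illustration, not the skill's actual implementation: `autoresearch`, `toy_mutate`, and `toy_eval` are hypothetical names, the git commit/revert steps are reduced to keeping or dropping a candidate string, and the toy problem (a number that should reach 10) only stands in for a real mutable file and eval.

```python
import random
from typing import Callable, List, Tuple


def autoresearch(
    initial: str,
    mutate: Callable[[str], str],
    evaluate: Callable[[str], float],
    n_experiments: int = 20,
) -> Tuple[str, float, List[dict]]:
    """Mutate, evaluate, then keep or revert, for n_experiments rounds."""
    best, best_score = initial, evaluate(initial)  # baseline
    log = []
    for n in range(1, n_experiments + 1):
        candidate = mutate(best)      # change one thing
        score = evaluate(candidate)
        kept = score > best_score     # strict improvement only
        log.append({"exp": n, "score": score, "kept": kept})
        if kept:
            # Corresponds to the git commit branch.
            best, best_score = candidate, score
        # Otherwise the candidate is discarded (the git checkout branch).
    return best, best_score, log


# Toy problem (assumption): the "file" holds an integer as text,
# and the score peaks when that integer equals 10.
rng = random.Random(0)


def toy_mutate(text: str) -> str:
    return str(int(text) + rng.choice([-1, 1]))


def toy_eval(text: str) -> float:
    return -abs(int(text) - 10)


best, best_score, log = autoresearch("0", toy_mutate, toy_eval, n_experiments=50)
print(best, best_score, sum(e["kept"] for e in log))
```

Comparing each candidate against the best score so far, rather than the original baseline, is what makes improvements accumulate.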
Agent Instructions for Running the Loop
When the user says "run autoresearch on X", follow this procedure:
1. Locate the mutable file: ask the user or infer it from context
2. Locate or create the eval function: the user must have a way to score
3. Initialize git tracking in the project directory
4. Run the baseline eval and record the starting score
5. Begin the experiment loop:
   - Read the mutable file
   - Think about what single change might improve the score
   - Make the change (be specific: change ONE thing per experiment)
   - Run eval
   - Keep or revert based on score
   - Log the result
6. Continue for N experiments (default: 20, or until the user stops)
7. Report results:
   - Starting score → Final score
   - Number of experiments run
   - Number of improvements kept
   - Summary of what changes worked
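The final report could be assembled from the experiment log along these lines. This is a sketch under assumptions: `summarize` and the record layout (`change`, `score`, `kept`) are hypothetical, and the sample values are taken from the trading example later in this document.

```python
from typing import List


def summarize(baseline: float, log: List[dict]) -> str:
    """Build the end-of-run report from a list of experiment records.

    Each record is assumed to look like
    {"change": str, "score": float, "kept": bool}.
    """
    kept = [e for e in log if e["kept"]]
    # Kept scores only ever increase, so the last kept score is the final one.
    final = kept[-1]["score"] if kept else baseline
    lines = [
        f"Score: {baseline} -> {final}",
        f"Experiments run: {len(log)}",
        f"Improvements kept: {len(kept)}",
        "Changes that worked:",
    ]
    lines += [f"  - {e['change']} ({e['score']})" for e in kept]
    return "\n".join(lines)


log = [
    {"change": "lower stop-loss from 2% to 1.5%", "score": 2.3, "kept": True},
    {"change": "increase EMA fast period 12 to 15", "score": 1.9, "kept": False},
]
print(summarize(2.1, log))
```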
Mutation Strategy
Good mutations change ONE thing at a time:
- Numeric parameters: Adjust thresholds, weights, window sizes
- Prompt wording: Rephrase instructions, add/remove constraints
- Structure: Reorder sections, add examples, remove redundancy
- Rules: Add a new rule, tighten an existing one, relax a constraint
Bad mutations change everything at once — you can't learn what worked.
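For numeric parameters, a one-change-at-a-time mutation might be sketched like this. `mutate_one`, the 10% step, and the parameter names in `config` are all assumptions for illustration; the skill does not prescribe how a mutation is chosen.

```python
import copy
import random


def mutate_one(config: dict, rng: random.Random, step: float = 0.1) -> tuple:
    """Return (new_config, description) with exactly one numeric value changed.

    The chosen value is scaled up or down by `step` (10% by default).
    A value of 0 would stay 0 under scaling and needs separate handling.
    """
    numeric_keys = [
        k for k, v in config.items()
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]
    if not numeric_keys:
        raise ValueError("config has no numeric parameters to mutate")
    key = rng.choice(numeric_keys)
    factor = 1 + rng.choice([-step, step])
    new_config = copy.deepcopy(config)
    new_config[key] = config[key] * factor
    return new_config, f"{key}: {config[key]} -> {new_config[key]}"


# Hypothetical parameter names, for illustration only.
config = {"stop_loss_pct": 2.0, "ema_fast": 12, "ema_slow": 26, "name": "demo"}
new_config, description = mutate_one(config, random.Random(42))
print(description)
```

Returning a description alongside the new config gives each experiment a ready-made log entry. Integer-valued parameters such as EMA periods would need rounding after scaling.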
Step 4: Git Tracking
Every experiment MUST be tracked in git:
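```bash
# Before starting
git init
git add -A
git commit -m "baseline: score {X}"

# After each successful mutation
git add -A
git commit -m "exp-{N}: {what changed} | {old_score} → {new_score}"

# After each failed mutation
git checkout -- {mutable_file}
```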
This gives you:
- Full history of every kept change
- Ability to diff any two versions
- Easy rollback if something breaks
- A record of which mutations worked (reverted mutations leave no commit, so record those in the experiment log)
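When the loop is driven from Python, the keep and revert steps could be wrapped roughly as below. This is a sketch only: the helper names are hypothetical, it assumes `git` is on the PATH and the working directory is already a repository, and it uses only `git add -A`, `git commit -m`, and `git checkout --`.

```python
import subprocess


def commit_message(n: int, change: str, old: float, new: float) -> str:
    """Format a message as 'exp-{N}: {change} | {old} → {new}'."""
    return f"exp-{n}: {change} | {old} → {new}"


def git(*args: str, cwd: str = ".") -> None:
    """Run a git command and raise CalledProcessError if it fails."""
    subprocess.run(["git", *args], cwd=cwd, check=True)


def keep(n: int, change: str, old: float, new: float, cwd: str = ".") -> None:
    """Commit a successful mutation."""
    git("add", "-A", cwd=cwd)
    git("commit", "-m", commit_message(n, change, old, new), cwd=cwd)


def revert(mutable_file: str, cwd: str = ".") -> None:
    """Discard a failed mutation by restoring the last committed version."""
    git("checkout", "--", mutable_file, cwd=cwd)


print(commit_message(1, "add 'use numbers in subject lines'", 72.4, 74.1))
```

Using `check=True` makes a failed git command stop the run instead of letting the loop continue with an untracked change.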
Proven Results
Case Study 1: Gold Trading Strategy
- Task: Optimize XAUUSD trading parameters
- Mutable file: Strategy config (EMA periods, momentum threshold, position sizing)
- Eval function: Backtest on historical data → Sharpe ratio
- Baseline: Sharpe 5.80
- Experiments: 86 in 25 minutes
- Final: Sharpe 12.23 (+111%)
- Key discoveries: Momentum threshold 0.003→0, EMA 8/24→5/11, position sizing optimization
- See: `references/gold-results.md`
Case Study 2: YouTube Shorts Scripts
- Task: Optimize script-writing prompt for higher quality scores
- Mutable file: SKILL.md prompt instructions
- Eval function: LLM judge scoring 1-100
- Baseline: 94.3/100
- Experiments: 11
- Final: 96.7/100 (+2.5%)
- Key discoveries: Atomic sentences, strict 40-50 word range, stronger negative examples
- See: `references/youtube-results.md`
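The percentages in both case studies are relative gains over the baseline. A short check of that arithmetic, with `relative_gain` as a hypothetical helper name:

```python
def relative_gain(baseline: float, final: float) -> float:
    """Percentage change from baseline to final."""
    return (final - baseline) / baseline * 100


print(round(relative_gain(5.80, 12.23), 1))  # 110.9
print(round(relative_gain(94.3, 96.7), 1))   # 2.5
```

These round to the +111% and +2.5% quoted above. The second case is a 2.4-point gain on a 100-point scale, starting from a baseline that was already close to the maximum.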
Example Usage
User: "Run autoresearch on my email subject line skill"
Agent workflow:
1. Read the skill's SKILL.md (mutable file)
2. Create eval: generate 20 test emails → score subject lines with LLM judge (1-100 on open-rate prediction)
3. Baseline: 72.4/100
4. Experiment 1: Add "use numbers in subject lines" → 74.1 ✅ KEPT
5. Experiment 2: Add "max 6 words" → 71.8 ❌ REVERTED
6. Experiment 3: Add "start with a verb" → 75.3 ✅ KEPT
7. ... continue for 20 experiments
8. Final: 79.2/100 (+9.4%)
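The keep/revert decisions in this walkthrough follow from comparing each score with the best score so far, not with the original baseline. A small replay of the three experiments shown, with `replay` as a hypothetical helper:

```python
def replay(baseline: float, experiments: list) -> list:
    """Apply the keep-if-strictly-better rule to (change, score) pairs."""
    best = baseline
    decisions = []
    for change, score in experiments:
        if score > best:
            decisions.append((change, score, "KEPT"))
            best = score  # later experiments must beat this score
        else:
            decisions.append((change, score, "REVERTED"))
    return decisions


experiments = [
    ("use numbers in subject lines", 74.1),
    ("max 6 words", 71.8),
    ("start with a verb", 75.3),
]
for change, score, decision in replay(72.4, experiments):
    print(f"{score:>5}  {decision:<8}  {change}")
```

Experiment 3 is kept because 75.3 beats 74.1, the score of the last kept change. Beating the 72.4 baseline alone would not have been enough.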
User: "Optimize my trading strategy config"
Agent workflow:
1. Read strategy.json (mutable file)
2. Eval: run backtest script → Sharpe ratio
3. Baseline: Sharpe 2.1
4. Experiment 1: Lower stop-loss from 2% to 1.5% → Sharpe 2.3 ✅
5. Experiment 2: Increase EMA fast period 12→15 → Sharpe 1.9 ❌
6. ... continue
7. Final: Sharpe 3.8 (+81%)
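If the backtest script reports per-period returns rather than a Sharpe ratio, the score could be derived with a sketch like the one below. It assumes the common definition (mean excess return divided by its standard deviation, annualized by the square root of the number of periods per year). `sharpe_ratio` and the sample returns are illustrative; the backtest behind the numbers above may use different conventions.

```python
import math
import statistics
from typing import Sequence


def sharpe_ratio(
    returns: Sequence[float],
    risk_free: float = 0.0,
    periods_per_year: int = 252,
) -> float:
    """Annualized Sharpe ratio from per-period returns.

    Assumes `risk_free` is expressed per period and uses the sample
    standard deviation; backtesting tools differ on both conventions.
    """
    excess = [r - risk_free for r in returns]
    spread = statistics.stdev(excess)  # needs at least two returns
    if spread == 0:
        raise ValueError("returns have zero variance; Sharpe ratio is undefined")
    return statistics.mean(excess) / spread * math.sqrt(periods_per_year)


# Made-up daily returns, for illustration only.
daily_returns = [0.01, -0.005, 0.007, 0.002, -0.003, 0.004]
print(round(sharpe_ratio(daily_returns), 2))
```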