autoresearch

Autonomously optimize any OpenClaw skill by running it repeatedly, scoring outputs against binary evals, mutating the prompt, and keeping improvements. Based on Karpathy's autoresearch methodology.

Triggers

Use when the user says: "optimize this skill", "improve this skill", "run autoresearch on", "make this skill better", "self-improve skill", "benchmark skill", "eval my skill", "run evals on".

Description

Autonomous prompt/strategy optimization using Karpathy's autoresearch pattern. Mutate → evaluate → keep improvements. Works on anything with a measurable score: trading strategies, content scripts, thumbnails, ad copy, email subjects.

How It Works

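    ┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
    │ 1. BASELINE │────▶│ 2. MUTATE   │────▶│ 3. EVALUATE │────▶│ 4. DECIDE   │
    │ Score the   │     │ Change one  │     │ Run the     │     │ Better?     │
    │ current     │     │ thing       │     │ scoring     │     │ Keep, else  │
    │ version     │     │             │     │ function    │     │ revert      │
    └─────────────┘     └─────────────┘     └─────────────┘     └──────┬──────┘
                                                                       │
                                                              Loop back to step 2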

Instructions

Step 1: Identify the Mutable File

The mutable file is the thing you're optimizing. It can be:

- A SKILL.md prompt/instructions
- A trading strategy config (thresholds, parameters)
- A content template (YouTube script format, ad copy structure)
- Any text file where changes produce measurable differences

Create or identify this file. Example:
    my-skill/
    ├── SKILL.md              ← This is your mutable file
    ├── eval/
    │   ├── test_cases.json
    │   └── score.py

Step 2: Create an Evaluation Function

Your eval function must:

1. Take the current mutable file as input
2. Run it against test cases
3. Return a numeric score (higher = better)

The eval can be anything:

- LLM-as-judge: Send output to an LLM, ask it to score 1-100
- Backtest: Run a strategy against historical data, measure Sharpe/returns
- A/B metrics: CTR, engagement, conversion rate
- Binary pass/fail: Count how many test cases pass out of N

Template eval function (customize for your domain):
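    # eval/score.py
    import json
    import sys


    def run_and_score(current_version: str, case: dict) -> float:
        """Placeholder: run the prompt on one test case and return its score."""
        raise NotImplementedError("Add your domain-specific scoring logic here")


    def evaluate(mutable_file_path: str, test_cases_path: str) -> float:
        """
        Score the current version of the mutable file.
        Returns a float; higher is better.
        """
        with open(mutable_file_path) as f:
            current_version = f.read()

        with open(test_cases_path) as f:
            test_cases = json.load(f)

        scores = []
        for case in test_cases:
            # Your scoring logic goes here.
            # Example: run the prompt and compare the output to the expected result.
            score = run_and_score(current_version, case)
            scores.append(score)

        return sum(scores) / len(scores)


    if __name__ == "__main__":
        score = evaluate(sys.argv[1], sys.argv[2])
        print(f"Score: {score}")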
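As one illustration of the binary pass/fail option listed above, the sketch below scores a single output by counting how many yes/no checks it passes. The check names, thresholds, and sample text are hypothetical and would be replaced by checks that fit your own skill.

```python
from typing import Callable

# Hypothetical binary checks for a short script; each returns True (pass) or False (fail).
CHECKS: dict[str, Callable[[str], bool]] = {
    "within_50_words": lambda text: len(text.split()) <= 50,
    "ends_with_question": lambda text: text.strip().endswith("?"),
    "no_exclamation_marks": lambda text: "!" not in text,
}


def binary_score(output: str) -> float:
    """Return the fraction of checks that pass (0.0 to 1.0, higher is better)."""
    passed = sum(1 for check in CHECKS.values() if check(output))
    return passed / len(CHECKS)


sample = "Gold just hit a record high. Do you know why?"
print(binary_score(sample))
```

Averaging this fraction over all test cases gives a single number that fits the "higher = better" contract of the eval function.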

Step 3: Run the Autoresearch Loop

The loop follows this exact pattern:

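    1. Git init (if not already done): every experiment is a commit
    2. Run eval on the current version → get the baseline score
    3. For each experiment (1..N):
       a. Read the current mutable file
       b. Generate ONE mutation (change one thing: a threshold, a phrase, a rule)
       c. Write the mutated version
       d. Run eval → get the new score
       e. If new score > baseline score:
          - Git commit with message: exp-{N}: {description} | score: {baseline} → {new score}
          - Update baseline score = new score
          - Log: ✅ KEPT (improvement)
       f. If new score <= baseline score:
          - Git checkout the mutable file (revert)
          - Log: ❌ REVERTED (no improvement)
    4. Print final summary: experiments run, improvements found, final score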
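The keep-or-revert logic of the loop can be sketched in a few lines of Python. This is a minimal in-memory version, assuming the mutable file is a string and that `evaluate` and `mutate` are supplied by the caller. The names `autoresearch`, `toy_eval`, and `toy_mutate` are hypothetical, and the `best_text` variable stands in for what git commit and checkout do in the real loop.

```python
import random
from typing import Callable


def autoresearch(
    baseline_text: str,
    evaluate: Callable[[str], float],
    mutate: Callable[[str, random.Random], str],
    n_experiments: int = 20,
    seed: int = 0,
) -> tuple[str, float, list[str]]:
    """Keep a mutation only if it beats the best score seen so far."""
    rng = random.Random(seed)
    best_text = baseline_text
    best_score = evaluate(best_text)
    log = [f"baseline: {best_score:.2f}"]

    for n in range(1, n_experiments + 1):
        candidate = mutate(best_text, rng)
        new_score = evaluate(candidate)
        if new_score > best_score:
            log.append(f"exp-{n}: KEPT | {best_score:.2f} -> {new_score:.2f}")
            best_text, best_score = candidate, new_score
        else:
            # "Revert" here simply means the candidate is discarded.
            log.append(f"exp-{n}: REVERTED | {new_score:.2f} <= {best_score:.2f}")

    return best_text, best_score, log


# Toy problem (hypothetical): the score rewards texts close to 10 characters long.
def toy_eval(text: str) -> float:
    return float(-abs(len(text) - 10))


def toy_mutate(text: str, rng: random.Random) -> str:
    # Change ONE thing: append or drop a single character.
    return text + "x" if rng.random() < 0.5 else text[:-1]


final_text, final_score, history = autoresearch("abc", toy_eval, toy_mutate)
print(final_score, len(history))
```

Because a candidate is kept only on a strict improvement, the best score never decreases from one experiment to the next.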

Agent Instructions for Running the Loop

When the user says "run autoresearch on X", follow this procedure:

1. Locate the mutable file: ask the user or infer from context
2. Locate or create the eval function: the user must have a way to score
3. Initialize git tracking in the project directory
4. Run baseline eval and record the starting score
5. Begin experiment loop:
   - Read the mutable file
   - Think about what single change might improve the score
   - Make the change (be specific: change ONE thing per experiment)
   - Run eval
   - Keep or revert based on score
   - Log the result
6. Continue for N experiments (default: 20, or until user stops)
7. Report results:
   - Starting score → Final score
   - Number of experiments run
   - Number of improvements kept
   - Summary of what changes worked
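The final report can be produced by a small helper like the one below. It is a sketch, assuming a nonzero baseline score. The function name `summarize` is hypothetical, and the sample call reuses scores from the email subject line example in this document, with the kept-score list abbreviated to three values that appear there.

```python
def summarize(start_score: float, kept_scores: list[float], n_experiments: int) -> str:
    """Build the final report from the baseline, the scores of kept experiments, and the run count."""
    # If nothing was kept, the final score is the baseline itself.
    final_score = kept_scores[-1] if kept_scores else start_score
    change_pct = (final_score - start_score) / start_score * 100
    return (
        f"{start_score} -> {final_score} ({change_pct:+.1f}%) | "
        f"{n_experiments} experiments, {len(kept_scores)} kept"
    )


print(summarize(72.4, [74.1, 75.3, 79.2], 20))
```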

Mutation Strategy

Good mutations change ONE thing at a time:

- Numeric parameters: Adjust thresholds, weights, window sizes
- Prompt wording: Rephrase instructions, add/remove constraints
- Structure: Reorder sections, add examples, remove redundancy
- Rules: Add a new rule, tighten an existing one, relax a constraint

Bad mutations change everything at once, so you can't tell which change worked.
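For numeric parameters, a one-thing-at-a-time mutation might look like the sketch below. It assumes the config is a flat dictionary. The function name `mutate_one_parameter`, the step size, and the sample keys and values are all hypothetical.

```python
import copy
import random


def mutate_one_parameter(config: dict, rng: random.Random, step: float = 0.1) -> tuple[dict, str]:
    """Return a copy of `config` with exactly one numeric value nudged up or down."""
    numeric_keys = sorted(
        k for k, v in config.items()
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    )
    key = rng.choice(numeric_keys)
    value = config[key]
    direction = rng.choice([-1, 1])
    if isinstance(value, int):
        new_value = value + direction  # integer parameters move by one
    else:
        new_value = round(value * (1 + direction * step), 6)  # floats move by a relative step
    mutated = copy.deepcopy(config)
    mutated[key] = new_value
    return mutated, f"{key}: {value} -> {new_value}"


# Hypothetical strategy config; the keys and values are illustrative only.
config = {"ema_fast": 12, "ema_slow": 26, "stop_loss_pct": 2.0}
new_config, description = mutate_one_parameter(config, random.Random(42))
changed = [k for k in config if config[k] != new_config[k]]
print(description, changed)
```

Returning a description alongside the mutated config makes it easy to reuse the same text in the experiment log and the commit message.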

Step 4: Git Tracking

Every experiment MUST be tracked in git:
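    # Before starting
    git init
    git add -A
    git commit -m "baseline: score {X}"

    # After each successful mutation
    git add -A
    git commit -m "exp-{N}: {what changed} | {old score} → {new score}"

    # After each failed mutation
    git checkout -- {mutable_file}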

This gives you:

- Full history of every experiment
- Ability to diff any two versions
- Easy rollback if something breaks
- A log of what mutations worked vs didn't
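If the loop is driven from Python, the git steps can be built as argument lists first and executed separately. The sketch below only constructs the commands and does not run them. The helper names `commit_message` and `git_commands` are hypothetical, and the sample values come from the email subject line example in this document.

```python
def commit_message(n: int, change: str, old_score: float, new_score: float) -> str:
    """Format the commit message for a kept experiment."""
    return f"exp-{n}: {change} | {old_score} → {new_score}"


def git_commands(kept: bool, mutable_file: str, message: str = "") -> list[list[str]]:
    """Return the git commands for one experiment as argument lists (not executed here)."""
    if kept:
        return [["git", "add", "-A"], ["git", "commit", "-m", message]]
    # A failed mutation is discarded by restoring the last committed version.
    return [["git", "checkout", "--", mutable_file]]


msg = commit_message(1, "use numbers in subject lines", 72.4, 74.1)
for cmd in git_commands(True, "SKILL.md", msg):
    print(cmd)
```

Each list can then be passed to `subprocess.run(cmd, check=True)` from the standard library, which raises an error if git exits with a nonzero status.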

Proven Results

Case Study 1: Gold Trading Strategy

- Task: Optimize XAUUSD trading parameters
- Mutable file: Strategy config (EMA periods, momentum threshold, position sizing)
- Eval function: Backtest on historical data → Sharpe ratio
- Baseline: Sharpe 5.80
- Experiments: 86 in 25 minutes
- Final: Sharpe 12.23 (+111%)
- Key discoveries: Momentum threshold 0.003→0, EMA 8/24→5/11, position sizing optimization
- See: references/gold-results.md

Case Study 2: YouTube Shorts Scripts

- Task: Optimize script-writing prompt for higher quality scores
- Mutable file: SKILL.md prompt instructions
- Eval function: LLM judge scoring 1-100
- Baseline: 94.3/100
- Experiments: 11
- Final: 96.7/100 (+2.5%)
- Key discoveries: Atomic sentences, strict 40-50 word range, stronger negative examples
- See: references/youtube-results.md

Example Usage

User: "Run autoresearch on my email subject line skill"

Agent workflow:

1. Read the skill's SKILL.md (mutable file)
2. Create eval: generate 20 test emails → score subject lines with LLM judge (1-100 on open-rate prediction)
3. Baseline: 72.4/100
4. Experiment 1: Add "use numbers in subject lines" → 74.1 ✅ KEPT
5. Experiment 2: Add "max 6 words" → 71.8 ❌ REVERTED
6. Experiment 3: Add "start with a verb" → 75.3 ✅ KEPT
7. ... continue for 20 experiments
8. Final: 79.2/100 (+9.4%)

User: "Optimize my trading strategy config"

Agent workflow:

1. Read strategy.json (mutable file)
2. Eval: run backtest script → Sharpe ratio
3. Baseline: Sharpe 2.1
4. Experiment 1: Lower stop-loss from 2% to 1.5% → Sharpe 2.3 ✅
5. Experiment 2: Increase EMA fast period 12→15 → Sharpe 1.9 ❌
6. ... continue
7. Final: Sharpe 3.8 (+81%)

