Autoimprove

Autonomous optimization loop. Two phases: interactive setup (confirms scope, reviews tests, establishes baseline) then headless execution (modify, check, keep or discard, repeat). First runs are always interactive. Only run headless after you've verified the setup works.

When to Use

- User wants to optimize something measurable (speed, size, accuracy, cost)
- User has an improve.md file
- User says "autoimprove", "optimize", "improve", "make faster/smaller/better"
- User wants overnight autonomous optimization

Commands

| Command | What it does |
|---|---|
| `/autoimprove` | Auto-detect what's needed, set up missing pieces, then run the loop |
| `/autoimprove <path>` | Same, but use a specific improve.md |
| `/autoimprove --export` | Generate agent-agnostic program.md |

Individual setup steps can also be run standalone:

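| Command | What it does |
|---|---|
| `/autoimprove init` | Scaffold an improve.md (auto-detects repo type) |
| `/autoimprove init --type <type>` | Scaffold an improve.md from the given template type instead of auto-detecting |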

But you don't need to run these separately. /autoimprove detects what's missing and walks you through it.

The `improve.md` Format

A single markdown file — part config, part prompt:

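```markdown
# autoimprove: <name>

## Change
scope: <natural-language description, explicit paths, or globs>
exclude: <paths never to modify (optional)>

## Check
test: <command that verifies correctness; must pass for any experiment to be kept>
test-files: <paths to test files; read-only during the loop, mutable only during bootstrap>
run: <command that produces the score; only runs if tests pass>
score: <how to extract the number: SCORE: {value}, a regex, or jq>
goal:
guard: <optional; regex and threshold for a metric that must not regress, e.g. error_rate: ([\d.]+) < 0.05>
keepifequal:
timeout: <max time per experiment>

## Stop
budget: <total wall-clock time limit; counted from the first experiment, not from setup>
rounds: <max number of experiments>
target: <stop when the score reaches this value>
stale: <stop after N consecutive experiments fail to improve>

## Agent
provider:
model: <model to use>

## Instructions
<free-form: what to try, what to avoid, domain knowledge>
```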

The Loop

When invoked, follow this protocol exactly:

Step 1: Readiness Check (auto-guided setup)

Run through these checks in order. If anything is missing, offer to fix it inline
rather than stopping. The user should be able to go from zero to running with a
single /autoimprove invocation.

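```
Checking readiness...

1. improve.md
   ✓ Found improve.md
   -- or --
   ✗ No improve.md found.
   → Detect repo type, offer to scaffold:
     "This looks like a [type] project. Create improve.md? [y/n]"
   → Run the init protocol inline
   → Continue to next check

2. Scope resolution
   ✓ Resolved scope "template parsing engine" to:
     - lib/liquid/parser.rb
     - lib/liquid/lexer.rb
     - lib/liquid/variable.rb
     These are the ONLY files that will be modified. Confirm? [y/n]
   -- or --
   ✗ Scope resolved to 0 files.
   → Ask the user to clarify the scope

3. Eval harness
   ✓ Check command runs successfully
   -- or --
   ✗ No check command, or it fails on unmodified code.
   → Detect whether the domain needs a golden set (rag, prompt, automl)
   → If yes: run the eval-init protocol inline (interactive eval + golden set scaffolding)
   → If no: help the user write a check command
   → Continue to next check

4. Test suite
   ✓ Tests pass (16 passed in 2.1s)
   -- or --
   ✗ No test command, or tests fail.
   → Run the bootstrap protocol inline (goal-aware test generation)
   → Present tests for review, commit them
   → Continue to next check

5. Git state
   ✓ Working tree is clean
   -- or --
   ✗ Uncommitted changes.
   → "You have uncommitted changes. Commit or stash them before autoimprove starts."
   → This is the only blocker that can't be auto-fixed. Stop and wait.

6. Backup
   → git branch autoimprove-backup-$(date +%Y%m%d-%H%M)
   → Print: "Backup branch created."

7. Baseline
   ✓ Baseline score: 0.4398 (error rate: 0.0%)
   -- or --
   ✗ Error rate > 20%
   → List the failing queries. "Fix these before starting, or the score will be unreliable."
   → Stop and wait.

Ready. Starting optimization loop.
```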

The key principle: detect, offer, fix, continue. Don't stop and tell the user to run a different command. Walk them through setup inline and only block on things that require human judgment (uncommitted changes, golden set labeling).

Step 1a: Parse

Read the improve.md file. Extract all structured fields from the headers. Everything after ## Instructions is the domain prompt.
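A minimal sketch of this parsing step, assuming the file has a `# autoimprove: <name>` title, `## Section` headers, and one `key: value` field per line. The function name `parse_improve_md` and the returned dictionary layout are illustrative choices, not part of the skill.

```python
import re


def parse_improve_md(text: str) -> dict:
    """Split an improve.md into {section: {field: value}} plus the free-form prompt."""
    config: dict = {"name": None, "instructions": ""}
    section = None
    lines = text.splitlines()
    for i, line in enumerate(lines):
        title = re.match(r"#\s+autoimprove:\s*(.+)", line)
        header = re.match(r"##\s+(\w+)\s*$", line)
        if title:
            config["name"] = title.group(1).strip()
        elif header:
            section = header.group(1)
            if section == "Instructions":
                # Everything after this header is the domain prompt, kept verbatim.
                config["instructions"] = "\n".join(lines[i + 1:]).strip()
                break
            config[section] = {}
        elif section and ":" in line:
            # Split on the first colon only, so values may contain colons
            # (for example a guard such as "error_rate: ([\d.]+) < 0.05").
            key, _, value = line.partition(":")
            config[section][key.strip()] = value.strip()
    return config
```

Splitting on the first colon only keeps regex-bearing values such as guards intact.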

Step 1b: Resolve Scope

Read the Change.scope field:

- If it contains explicit paths or globs: resolve directly
- If it contains natural language: scan the codebase, identify matching files, present for confirmation
- Apply Change.exclude to filter
- Apply default safety excludes (unless explicitly included in scope):
  - Secrets: .env, .env.*, *.pem, *.key, credentials.*, auth.*, secrets.*
  - Infrastructure: .git/, .autoimprove/, node_modules/, vendor/
  - CI/CD: .github/workflows/, .gitlab-ci.yml, Jenkinsfile
  - Eval artifacts: paths matching eval/, golden_set, test fixtures
- Once confirmed, the resolved file list is LOCKED for the entire loop
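The filtering rules above could be sketched roughly as follows. This is an illustration under assumptions: the helper names are hypothetical, paths are repo-relative with forward slashes, and the "test fixtures" exclusion is not modeled because the source does not say how fixtures are detected.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath

SECRET_GLOBS = [".env", ".env.*", "*.pem", "*.key", "credentials.*", "auth.*", "secrets.*"]
PROTECTED_DIRS = {".git", ".autoimprove", "node_modules", "vendor", "eval"}
PROTECTED_PATHS = [".github/workflows", ".gitlab-ci.yml", "Jenkinsfile"]


def is_default_excluded(path: str) -> bool:
    """True if a repo-relative path falls under the default safety excludes."""
    p = PurePosixPath(path)
    if any(fnmatch(p.name, glob) for glob in SECRET_GLOBS):
        return True
    if any(part in PROTECTED_DIRS for part in p.parts[:-1]):
        return True
    if "golden_set" in path:
        return True
    return any(path == x or path.startswith(x + "/") for x in PROTECTED_PATHS)


def resolve_scope(candidates, user_excludes=(), explicit=()):
    """Apply Change.exclude globs, then default excludes unless explicitly included."""
    resolved = []
    for path in candidates:
        if any(fnmatch(path, glob) for glob in user_excludes):
            continue
        if is_default_excluded(path) and path not in explicit:
            continue
        resolved.append(path)
    # Returned as a tuple so the confirmed list cannot be mutated mid-loop.
    return tuple(sorted(resolved))
```

Returning an immutable tuple is one simple way to reflect the "locked for the entire loop" rule in code.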

Step 2.5: Bootstrap (when invoked via `/autoimprove bootstrap`)

This is a separate pre-loop phase. It generates the test suite needed for safe optimization.

Tests are mutable during bootstrap, immutable during the loop. These two phases never mix.

Bootstrap Protocol

1. Analyze: Read all files in the resolved Change.scope AND the improve.md. Identify:

- The optimization goal (Check.goal) and what the agent will be chasing
- Public API surface (exported functions, methods, classes)
- Critical code paths (hot loops, core logic, data transformations)
- Edge cases (nil/null handling, empty inputs, boundary values)
- Existing test coverage (check Check.test-files if specified)

2. Goal-aware threat modeling: The optimization goal predicts what the agent will try, which predicts what it will break. Generate tests that guard against the failure modes of THAT specific goal:

When goal = faster (lower latency, fewer allocations):
The agent will skip work, take shortcuts, and remove safety checks.
- Unicode/multibyte input still works (fast paths assume ASCII)
- Empty, nil, zero-length inputs don't crash (nil checks removed for speed)
- Error messages are still correct and informative (error formatting skipped)
- Concurrent access is safe (locks removed for throughput)
- Large inputs don't OOM or hang (buffer reuse may break on edge sizes)
- Output is bit-for-bit identical to baseline (fast path might truncate or round)

When goal = smaller (image size, bundle size):
The agent will remove things, strip features, and swap dependencies.
- All features still work at runtime (removed code might have been needed)
- Lazy-loaded routes/components still render (code splitting may break references)
- Runtime dependencies are present (dev deps removed, but some were runtime)
- Static assets still load (paths may change with restructuring)
- App starts successfully and passes health checks

When goal = higher accuracy (ML, prompt quality):
The agent will overfit, leak data, and add complexity.
- No test/validation data leaks into training (train/test split integrity)
- Pipeline is reproducible with fixed seeds (randomness controlled)
- Missing values, outliers, and edge inputs handled correctly
- Model outputs are valid (probabilities sum to 1, no NaN predictions)
- Feature engineering doesn't introduce future data leakage (no lookahead)
- Predictions work on single samples, not just batches

When goal = lower cost (infra, compute):
The agent will downsize, reduce redundancy, and cut resources.
- Service still handles expected load (reduced replicas may not suffice)
- Failover still works (redundancy removed)
- Latency stays within acceptable bounds (smaller instances = slower)
- Health checks pass under load, not just at idle
- Data durability unchanged (storage reductions may lose backups)

When goal = higher score (prompt engineering, config tuning):
The agent will game the metric and overfit to the eval set.
- Output format is consistent (not just correct for eval cases)
- Edge inputs produce reasonable output (not just golden set inputs)
- Output length/verbosity stays reasonable (gaming token limits)
- No hardcoded responses for known eval cases

3. Gap analysis: Cross-reference existing tests with the goal-aware threat model:

- Critical gaps: Failure modes with NO test coverage; these must be filled
- Weak coverage: Tests exist but don't cover edge cases for this goal
- Sufficient: Already tested well for this optimization direction

4. Generate (when --generate is passed):

- Write test files to the paths specified in Check.test-files
- Prioritize tests for critical gaps first, weak coverage second
- Run the tests to confirm they pass on the current unmodified code
- If any test fails, fix the test: tests must pass on the CURRENT code before optimization
- Present the generated tests for review, grouped by threat category

5. Commit: After human review, commit the test files:

INLINECODE40

6. Report: Print a readiness summary:

CODEBLOCK2

What makes a good test suite for autoimprove

The tests don't need to be exhaustive — they need to catch the specific kinds of breakage
that an agent chasing YOUR optimization goal is likely to introduce.

The key insight: the goal tells you what the agent will try, and that tells you what to test.

Tests SHOULD:

- Guard against the failure modes predicted by the goal (see threat model above)
- Verify output equivalence: same input produces same output regardless of internals
- Cover edge cases the agent's fast paths will skip: empty, nil, unicode, huge, concurrent
- Assert on API contracts: signatures, return types, side effects
- Be fast: slow tests make the loop slower and waste experiment budget

Tests should NOT:

- Assert on performance (that's what the score is for)
- Assert on internal implementation details (the agent SHOULD change those)
- Be flaky or timing-dependent (false failures poison the loop)
- Be so numerous they dominate experiment time (keep test suite under 30s)
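As an illustration of these guidelines for a "faster" goal, the sketch below pins outputs recorded from the unmodified code and checks the return type, without asserting on timing or internals. The function `normalize` is a hypothetical stand-in for whatever code is in scope, and the recorded cases are invented for the example.

```python
# Hypothetical system under test: the function the agent is allowed to optimize.
def normalize(text: str) -> str:
    return " ".join(text.split()).lower()


# Outputs recorded from the unmodified code; the agent must reproduce them exactly.
BASELINE = {
    "": "",                               # empty input
    "  Hello   World ": "hello world",    # whitespace handling
    "ÀÉÎ\tõü": "àéî õü",                  # multibyte input (fast paths often assume ASCII)
    "a" * 10_000: "a" * 10_000,           # large input
}


def test_output_matches_baseline():
    # Output equivalence: same input, same output, regardless of internals.
    for given, expected in BASELINE.items():
        assert normalize(given) == expected


def test_api_contract():
    # Contract check only; nothing here depends on speed or implementation details.
    assert isinstance(normalize("x"), str)
```

Both tests run in well under a second, so they add almost nothing to each experiment's cost.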

Step 2.6: Eval Init (when invoked via `/autoimprove eval-init`)

This phase scaffolds the evaluation harness — the check command and scoring mechanism.
Not every domain needs this. Use the table below to decide.

Which domains need eval scaffolding?

| Domain | Needs golden set? | Needs eval script? | Why |
|---|---|---|---|
| INLINECODE42 | Yes | Yes | "Good answer" is subjective; needs labeled Q&A pairs |
| INLINECODE43 | Yes | Yes | Output quality requires human-labeled expected outputs |
| automl | Yes | Yes | Model accuracy needs labeled train/test data |
| ml | No | Usually exists | Training scripts typically emit loss/metrics already |
| perf | No | No | Benchmarks produce objective numbers |
| docker | No | No | Image size in bytes is objective |
| frontend | No | No | Bundle size in bytes is objective |
| ci | No | No | Build time is objective |
| sql | No | No | Query time is objective |
| k8s | No | No | Pod health is objective |

Eval Init Protocol (for domains that need it)

1. Detect domain: Read the codebase and improve.md to determine the domain type.

2. Scaffold eval script: Generate a skeleton evaluation script that:

- Imports the system under test
- Loads a golden set from a JSON file
- Runs each test case through the system
- Computes appropriate metrics for the domain:
  - Search/RAG: precision, recall, MRR, NDCG, answer relevancy
  - ML/AutoML: AUC-ROC, F1, accuracy, confusion matrix
  - Prompts: F1, exact match, semantic similarity
- Prints INLINECODE53

3. Build golden set interactively:

- Run the system with sample inputs from the domain
- Present the outputs to the user
- Ask: "Is this a good result? What should the correct output be?"
- Save labeled results as the golden set
- Aim for 20-50 test cases (enough for statistical significance, few enough to run fast)

4. Validate golden set: After creation, verify that:

- Every expected result actually exists in the data/system (no impossible expectations)
- The eval script runs without errors on the current code
- The baseline score is reasonable (not 0.0 or 1.0; both suggest a broken eval)
- Error rate on the baseline run is below 20% (above that, the eval is unreliable)

5. Report:

CODEBLOCK3
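The skeleton described in step 2 of the protocol above might look like the sketch below. It is illustrative only: the golden set layout (`input` and `expected` keys), the file name `golden_set.json`, and the placeholder `system_under_test` are assumptions, and exact match is used simply because it is the easiest of the listed metrics to show. The `SCORE:` line follows the convention described under Score Extraction.

```python
import json


def system_under_test(question: str) -> str:
    """Placeholder for the real system; replace with an import of your own code."""
    return question.strip().lower()


def evaluate(golden_set: list[dict]) -> tuple[float, float]:
    """Return (exact-match score, error rate) over the golden set."""
    if not golden_set:
        raise ValueError("golden set is empty")
    correct = errors = 0
    for case in golden_set:
        try:
            answer = system_under_test(case["input"])
        except Exception:
            # A crashing case counts toward the error rate, not the score.
            errors += 1
            continue
        correct += answer == case["expected"]
    total = len(golden_set)
    return correct / total, errors / total


def main(path: str = "golden_set.json") -> None:
    with open(path, encoding="utf-8") as fh:
        golden_set = json.load(fh)
    score, error_rate = evaluate(golden_set)
    print(f"error_rate: {error_rate:.4f}")
    print(f"SCORE: {score:.4f}")
```

Printing the error rate alongside the score lets the same output feed both the baseline error check and a guard metric.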

Step 3: Baseline

Run the check command. Extract the score.

Pre-loop error check: If the eval produces any errors (non-zero error rate), report them:
CODEBLOCK4

If error rate > 20%, REFUSE to start the loop. Too many failures make the score unreliable.
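A small sketch of this gate, combining the 20% error-rate threshold with the 0.0/1.0 sanity check from golden set validation. The function name is hypothetical, and the 0.0/1.0 check assumes a score normalized to the range 0 to 1, which does not hold for every domain (for example latency or image size).

```python
def baseline_problems(score: float, error_rate: float,
                      max_error_rate: float = 0.20) -> list[str]:
    """Return reasons the baseline should block the loop; an empty list means go."""
    problems = []
    if error_rate > max_error_rate:
        problems.append(
            f"error rate {error_rate:.1%} exceeds {max_error_rate:.0%}; score is unreliable"
        )
    # Assumption: normalized score. A perfect or zero baseline suggests a broken eval.
    if score in (0.0, 1.0):
        problems.append(f"baseline score {score} suggests a broken eval")
    return problems
```

For instance, `baseline_problems(0.4398, 0.0)` returns an empty list, so the loop may start.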

Save baseline to .autoimprove/baseline.json:

CODEBLOCK5

Print: INLINECODE55

The budget timer starts NOW, not during setup. Bootstrap, eval-init, and baseline establishment are excluded from the budget.

Step 4: Loop

CODEBLOCK6

Step 5: Summary

When the loop ends, print:

CODEBLOCK7

Stopping Conditions

The loop stops when ANY of these is true:

- budget time has elapsed (measured from first experiment, not from setup)
- INLINECODE57 experiments have been run
- Score has reached INLINECODE58
- INLINECODE59 consecutive experiments failed to improve
- Manual interrupt (Ctrl+C or agent termination)
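The automatic conditions above can be expressed as a single check run after every experiment. This is a sketch under assumptions: `StopConfig`, its field names, and the returned reason strings are hypothetical, and manual interrupts are left to the runtime.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StopConfig:
    budget_seconds: Optional[float] = None   # wall-clock limit, from first experiment
    max_rounds: Optional[int] = None         # maximum number of experiments
    target: Optional[float] = None           # stop once the score reaches this value
    stale_limit: Optional[int] = None        # consecutive non-improving experiments
    higher_is_better: bool = True


def stop_reason(cfg: StopConfig, elapsed_seconds: float, rounds_run: int,
                best_score: float, stale_count: int) -> Optional[str]:
    """Return the first stopping condition that holds, or None to keep going."""
    if cfg.budget_seconds is not None and elapsed_seconds >= cfg.budget_seconds:
        return "budget"
    if cfg.max_rounds is not None and rounds_run >= cfg.max_rounds:
        return "rounds"
    if cfg.target is not None:
        if cfg.higher_is_better:
            reached = best_score >= cfg.target
        else:
            reached = best_score <= cfg.target
        if reached:
            return "target"
    if cfg.stale_limit is not None and stale_count >= cfg.stale_limit:
        return "stale"
    return None
```

Unset limits are simply skipped, so a config with only a budget behaves as a pure time box.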

Score Extraction

Try these in order:

1. Convention: look for SCORE: <number> in stdout
2. Regex: if the score: field contains a regex pattern (has parentheses), apply it to stdout
3. jq: if the score: field starts with ., treat it as a jq expression applied to stdout as JSON
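The fallback order above could be sketched like this. Note one deliberate simplification: instead of invoking jq, the third branch walks a plain dotted path such as `.metrics.auc` through the parsed JSON, which covers only the simplest jq expressions. The function name `extract_score` is hypothetical.

```python
import json
import re
from typing import Optional

SCORE_LINE = re.compile(r"^SCORE:\s*(-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)", re.MULTILINE)


def extract_score(stdout: str, score_spec: Optional[str] = None) -> Optional[float]:
    """Try the SCORE: convention, then a regex spec, then a dotted JSON path."""
    # 1. Convention: a line of the form "SCORE: <number>".
    match = SCORE_LINE.search(stdout)
    if match:
        return float(match.group(1))
    if not score_spec:
        return None
    # 2. Regex: a spec with parentheses is applied to stdout; group 1 is the score.
    if "(" in score_spec and ")" in score_spec:
        match = re.search(score_spec, stdout)
        return float(match.group(1)) if match else None
    # 3. jq stand-in (assumption): only plain dotted paths are supported here.
    if score_spec.startswith("."):
        try:
            value = json.loads(stdout)
            for key in score_spec.strip(".").split("."):
                if key:
                    value = value[key]
            return float(value)
        except (ValueError, KeyError, TypeError):
            return None
    return None
```

A real implementation would likely shell out to jq for the third branch so that filters, array indexing, and pipes work as users expect.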

Guard Metrics

Optional secondary metrics that must not regress. Format in improve.md:

CODEBLOCK8

This means: extract error_rate from stdout using the regex, and reject the experiment if the value exceeds 0.05. Guard failures are logged as "guard_failed" and count toward consecutive failures.

Use guards to prevent the agent from improving one metric by tanking another. Common examples:

- guard: error_rate: ([\d.]+) < 0.05 (search errors stay below 5%)
- INLINECODE67 (tail latency stays under 500ms)
- INLINECODE68 (memory usage stays under 1GB)
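One way to evaluate a guard of this shape is to split it into a regex, a comparison operator, and a threshold, then apply the regex to stdout. The sketch below is an assumption-laden illustration: only `<` appears in the examples above, so support for `<=`, `>`, and `>=` is an extension, and treating a missing metric as a failure is a conservative choice, not documented behavior.

```python
import operator
import re

# Assumed guard shape: "<regex with one capture group> <op> <threshold>"
GUARD_SPEC = re.compile(r"^(?P<pattern>.+?)\s*(?P<op><=|>=|<|>)\s*(?P<limit>-?[\d.]+)\s*$")
OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge}


def guard_passes(guard: str, stdout: str) -> bool:
    """True if the guarded metric found in stdout satisfies the threshold."""
    spec = GUARD_SPEC.match(guard)
    if spec is None:
        raise ValueError(f"unrecognized guard: {guard!r}")
    found = re.search(spec["pattern"], stdout)
    if found is None:
        # Assumption: a metric that cannot be found counts as a guard failure.
        return False
    return OPS[spec["op"]](float(found.group(1)), float(spec["limit"]))
```

With the guard `error_rate: ([\d.]+) < 0.05`, output containing `error_rate: 0.02` passes and `error_rate: 0.08` is rejected.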

Experiment Logging

Each experiment is saved as JSON in .autoimprove/experiments/:

CODEBLOCK9

The `supersedes` field

When an experiment fundamentally replaces the approach from a previous experiment, set supersedes to the list of experiment IDs that are now obsolete:

CODEBLOCK10

When reading past experiments, skip any whose ID appears in a later kept experiment's supersedes list. This prevents wasting rounds on variations of discarded approaches.
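The skipping rule can be sketched as a single backwards pass over the log. The record layout here (`id`, `status` with the value `"kept"`, and `supersedes`) is an assumption made for illustration; adapt the keys to the actual experiment JSON.

```python
def active_experiments(experiments: list[dict]) -> list[dict]:
    """Drop experiments superseded by a later kept experiment.

    `experiments` must be ordered oldest to newest.
    """
    superseded: set = set()
    active = []
    # Walk newest to oldest so `superseded` only ever reflects later experiments.
    for exp in reversed(experiments):
        if exp["id"] not in superseded:
            active.append(exp)
        if exp.get("status") == "kept":
            superseded.update(exp.get("supersedes") or [])
    active.reverse()
    return active
```

Only kept experiments contribute to the skip set, so a discarded experiment that claimed to supersede something does not hide earlier history.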

Prerequisites and security

Runtime requirements: git is required. The check commands in your improve.md determine what else is needed (go, python, npm, docker, kubectl, psql, etc.). Verify these are installed before starting.

Credentials: The agent runs arbitrary shell commands from your improve.md. It inherits whatever credentials are available to the process (AWS keys, DB creds, kubeconfigs, API tokens). Run autoimprove with least-privilege credentials. Strip environment variables you don't want the agent to access.

First run: Always interactive. The readiness check (Step 1) confirms scope, reviews generated tests, and establishes a baseline before the loop starts. Don't run headless until you've verified one interactive run works correctly.

Backup: Before headless runs, the readiness check creates a backup branch automatically. The loop uses git commits and resets for rollback, but the backup branch protects against edge cases.

Scope enforcement: The rules below (NEVER modify files outside scope) are policy constraints, not technical enforcement. The agent follows them in practice, but there is no sandbox preventing out-of-scope edits. For sensitive repos, run in a cloned fork or container where damage is reversible.

Rules

- NEVER modify files outside the resolved INLINECODE73
- NEVER modify test files during the optimization loop: tests are immutable guardrails
- NEVER modify the check command, eval script, golden set, or scoring mechanism
- NEVER skip the git commit before running the check
- NEVER proceed to eval without verifying HEAD changed (confirms commit succeeded)
- ALWAYS log every experiment, including failures and errors
- ALWAYS read past experiments before proposing a new change
- ALWAYS skip superseded experiments when reading the log
- ALWAYS git reset discarded experiments, leaving the tree clean
- Test failure means immediate discard: no exceptions, no "but the score improved"
- Guard failure means immediate discard: secondary metrics must not regress
- If many experiments fail tests, the change strategy is wrong: try a different approach
- Complexity must pay for itself: 20 lines of hack for 0.001 improvement is NOT worth keeping
- Deleting code for equal results IS worth keeping (use keepifequal: true)
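The keep-or-discard rules above reduce to a small decision function. This is a sketch, not the skill's implementation: the function name and the status strings other than `guard_failed` are hypothetical, and the "complexity must pay for itself" judgment is left to the agent rather than encoded.

```python
from typing import Optional


def decide(tests_passed: bool, guards_passed: bool, score: Optional[float],
           best_score: float, higher_is_better: bool = True,
           keep_if_equal: bool = False) -> str:
    """Classify one experiment; anything other than "kept" must be git-reset."""
    if not tests_passed:
        return "test_failed"      # immediate discard, even if the score improved
    if not guards_passed:
        return "guard_failed"     # secondary metric regressed
    if score is None:
        return "error"            # the check produced no usable score
    improved = score > best_score if higher_is_better else score < best_score
    if improved or (keep_if_equal and score == best_score):
        return "kept"
    return "discarded"
```

Checking tests and guards before the score makes it impossible for a better number to rescue a broken change.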

Init Templates

When the user runs /autoimprove init, detect the repo type and suggest the right template. If --type is specified, use that directly.

Detection heuristics:

- go.mod or Cargo.toml or Makefile with benchmark targets → INLINECODE79
- INLINECODE80 or *.ipynb with torch/tensorflow imports → INLINECODE82
- INLINECODE83 with sklearn/xgboost/lightgbm/catboost imports → INLINECODE84
- Python files with langchain/llama_index/chromadb/pinecone/weaviate/qdrant imports → INLINECODE85
- INLINECODE86 → INLINECODE87
- INLINECODE88 with kind: Deployment → INLINECODE90
- INLINECODE91 directory or eval/ with score outputs → INLINECODE93
- INLINECODE94 files → INLINECODE95
- INLINECODE96 with build script → INLINECODE97
- INLINECODE98 → INLINECODE99

For domains that need eval scaffolding (rag, prompt, automl), also suggest running /autoimprove eval-init after init.

See references/examples.md for all templates.

Export Mode

When /autoimprove --export is invoked, generate a program.md file that any agent can follow without the skill. This makes autoimprove agent-agnostic.

See references/protocol.md for the exported protocol template.

Multi-Shot Examples

The following examples demonstrate how autoimprove applies across domains. Use these to understand the SHAPE of good experiments — what kinds of changes to try, how to structure the check command, and what good improve.md instructions look like.

Example 1: API Latency

CODEBLOCK11

Example 2: Docker Image Size

CODEBLOCK12

Example 3: LLM Prompt Quality

CODEBLOCK13

Example 4: Build/CI Speed

CODEBLOCK14

Example 5: ML Training (Karpathy-style)

CODEBLOCK15

Example 6: Kubernetes Cluster Health

CODEBLOCK16

Example 7: SQL Query Performance

CODEBLOCK17

Example 8: Frontend Bundle Size

CODEBLOCK18

Example 9: Tabular ML / AutoML (Churn, Fraud, Scoring)

This is one of the most common ML tasks in industry: predicting outcomes on structured data. Traditional AutoML (auto-sklearn, FLAML, AutoGluon) searches a predefined space of models and hyperparameters. Autoimprove goes further: it can engineer new features, rewrite preprocessing, swap model architectures, and delete dead code.

CODEBLOCK19

This example works for any tabular prediction task — swap "churn" for fraud detection,
credit scoring, lead conversion, demand forecasting, or insurance pricing. The structure
is the same: features in columns, a target to predict, a metric to maximize.

Example 10: RAG Pipeline Optimization

RAG (Retrieval-Augmented Generation) pipelines have many interacting knobs — chunking,
embedding, retrieval, reranking, prompt template, context window management. Small
changes compound: better chunking improves retrieval which improves generation quality.
Autoimprove can explore this space much faster than manual tuning.

CODEBLOCK20

This works for any RAG system — internal knowledge bases, customer support bots,
documentation search, legal document retrieval, or code search. The score metric
can be swapped: use faithfulness to reduce hallucination, context_precision to
improve retrieval, or a composite RAGAS score for overall quality.
