Autoimprove
Autonomous optimization loop. Two phases: interactive setup (confirms scope, reviews tests, establishes baseline) then headless execution (modify, check, keep or discard, repeat). First runs are always interactive. Only run headless after you've verified the setup works.
When to Use
- User wants to optimize something measurable (speed, size, accuracy, cost)
- User has an improve.md file
- User says "autoimprove", "optimize", "improve", "make faster/smaller/better"
- User wants overnight autonomous optimization
Commands
| Command | What it does |
|---|---|
| INLINECODE1 | Auto-detect what's needed, set up missing pieces, then run the loop |
| INLINECODE2 | Same, but use a specific improve.md |
| /autoimprove --export | Generate agent-agnostic program.md |
Individual setup steps can also be run standalone:
| Command | What it does |
|---|---|
| INLINECODE5 | Scaffold an improve.md (auto-detects repo type) |
| INLINECODE6 | Scaffold for a specific domain |
| /autoimprove eval-init | Scaffold eval script and golden set |
| /autoimprove bootstrap | Analyze codebase, generate goal-aware tests |
| /autoimprove bootstrap --generate | Create the test files |
But you don't need to run these separately. /autoimprove detects what's missing and walks you through it.
The improve.md Format
A single markdown file — part config, part prompt:
CODEBLOCK0
The Loop
When invoked, follow this protocol exactly:
Step 1: Readiness Check (auto-guided setup)
Run through these checks in order. If anything is missing, offer to fix it inline
rather than stopping. The user should be able to go from zero to running with a
single /autoimprove invocation.
CODEBLOCK1
The key principle: detect, offer, fix, continue. Don't stop and tell the user to run a different command. Walk them through setup inline and only block on things that require human judgment (uncommitted changes, golden set labeling).
Step 1a: Parse
Read the improve.md file. Extract all structured fields from the headers. Everything after ## Instructions is the domain prompt.
Step 1b: Resolve Scope
Read the Change.scope field:
- If it contains explicit paths or globs: resolve directly
- If it contains natural language: scan the codebase, identify matching files, present for confirmation
- Apply Change.exclude to filter
- Apply default safety excludes (unless explicitly included in scope):
  - Secrets: .env, .env.*, *.pem, *.key, credentials.*, auth.*, secrets.*
  - Infrastructure: .git/, .autoimprove/, node_modules/, vendor/
  - CI/CD: .github/workflows/, .gitlab-ci.yml, Jenkinsfile
  - Eval artifacts: paths matching eval/, golden_set, test fixtures
- Once confirmed, the resolved file list is LOCKED for the entire loop
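The filtering rules above can be sketched in a few lines. This is a minimal illustration, not the skill's actual resolver: the function names, the split into name patterns versus directory prefixes, and the choice to treat a path listed verbatim in the scope as "explicitly included" are all assumptions. The vague "test fixtures" category is left out because the text does not define a pattern for it.

```python
from fnmatch import fnmatchcase

# Default safety excludes, copied from the list above.
EXCLUDED_NAMES = [".env", ".env.*", "*.pem", "*.key", "credentials.*",
                  "auth.*", "secrets.*", ".gitlab-ci.yml", "Jenkinsfile"]
EXCLUDED_DIRS = [".git", ".autoimprove", "node_modules", "vendor",
                 ".github/workflows", "eval"]


def is_default_excluded(path):
    """True if a POSIX-style relative path hits a default safety exclude."""
    name = path.rsplit("/", 1)[-1]
    if any(fnmatchcase(name, pattern) for pattern in EXCLUDED_NAMES):
        return True
    wrapped = "/" + path + "/"
    if any("/" + directory + "/" in wrapped for directory in EXCLUDED_DIRS):
        return True
    return "golden_set" in path


def resolve_scope(candidates, scope_globs, exclude_globs=()):
    """Return the locked file list as a sorted, immutable tuple.

    Note: fnmatch's "*" also matches "/", so "src/*" covers nested files.
    """
    resolved = []
    for path in candidates:
        if not any(fnmatchcase(path, g) for g in scope_globs):
            continue
        if any(fnmatchcase(path, g) for g in exclude_globs):
            continue
        # Assumption: a path written out verbatim in the scope counts as
        # "explicitly included" and bypasses the default excludes.
        if is_default_excluded(path) and path not in scope_globs:
            continue
        resolved.append(path)
    return tuple(resolved if False else sorted(resolved))


files = ["src/app.py", "src/auth.py", "src/.env", "src/util.py", "eval/run.py"]
print(resolve_scope(files, ["src/*"], exclude_globs=["src/util.py"]))
```

Returning a tuple is one simple way to reflect the "LOCKED" rule: the loop can compare every proposed edit against it without the list changing underneath.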
Step 2.5: Bootstrap (when invoked via /autoimprove bootstrap)
This is a separate pre-loop phase. It generates the test suite needed for safe optimization.
Tests are mutable during bootstrap, immutable during the loop. These two phases never mix.
Bootstrap Protocol
1. Analyze: Read all files in the resolved Change.scope AND the improve.md. Identify:
   - The optimization goal (Check.goal) and what the agent will be chasing
   - Public API surface (exported functions, methods, classes)
   - Critical code paths (hot loops, core logic, data transformations)
   - Edge cases (nil/null handling, empty inputs, boundary values)
   - Existing test coverage (check Check.test-files if specified)
2. Goal-aware threat modeling: The optimization goal predicts what the agent will try, which predicts what it will break. Generate tests that guard against the failure modes of THAT specific goal:
When goal = faster (lower latency, fewer allocations):
The agent will skip work, take shortcuts, and remove safety checks.
- Unicode/multibyte input still works (fast paths assume ASCII)
- Empty, nil, zero-length inputs don't crash (nil checks removed for speed)
- Error messages are still correct and informative (error formatting skipped)
- Concurrent access is safe (locks removed for throughput)
- Large inputs don't OOM or hang (buffer reuse may break on edge sizes)
- Output is bit-for-bit identical to baseline (fast path might truncate or round)
When goal = smaller (image size, bundle size):
The agent will remove things, strip features, and swap dependencies.
- All features still work at runtime (removed code might have been needed)
- Lazy-loaded routes/components still render (code splitting may break references)
- Runtime dependencies are present (dev deps removed, but some were runtime)
- Static assets still load (paths may change with restructuring)
- App starts successfully and passes health checks
When goal = higher accuracy (ML, prompt quality):
The agent will overfit, leak data, and add complexity.
- No test/validation data leaks into training (train/test split integrity)
- Pipeline is reproducible with fixed seeds (randomness controlled)
- Missing values, outliers, and edge inputs handled correctly
- Model outputs are valid (probabilities sum to 1, no NaN predictions)
- Feature engineering doesn't introduce future data leakage (no lookahead)
- Predictions work on single samples, not just batches
When goal = lower cost (infra, compute):
The agent will downsize, reduce redundancy, and cut resources.
- Service still handles expected load (reduced replicas may not suffice)
- Failover still works (redundancy removed)
- Latency stays within acceptable bounds (smaller instances = slower)
- Health checks pass under load, not just at idle
- Data durability unchanged (storage reductions may lose backups)
When goal = higher score (prompt engineering, config tuning):
The agent will game the metric and overfit to the eval set.
- Output format is consistent (not just correct for eval cases)
- Edge inputs produce reasonable output (not just golden set inputs)
- Output length/verbosity stays reasonable (gaming token limits)
- No hardcoded responses for known eval cases
3. Gap analysis: Cross-reference existing tests with the goal-aware threat model:
   - Critical gaps: Failure modes with NO test coverage. These must be filled
   - Weak coverage: Tests exist but don't cover edge cases for this goal
   - Sufficient: Already tested well for this optimization direction
4. Generate (when --generate is passed):
   - Write test files to the paths specified in Check.test-files
   - Prioritize tests for critical gaps first, weak coverage second
   - Run the tests to confirm they pass on the current unmodified code
   - If any test fails, fix the test: tests must pass on the CURRENT code before optimization
   - Present the generated tests for review, grouped by threat category
5. Commit: After human review, commit the test files:
   INLINECODE40
6. Report: Print a readiness summary:
CODEBLOCK2
What makes a good test suite for autoimprove
The tests don't need to be exhaustive — they need to catch the specific kinds of breakage
that an agent chasing YOUR optimization goal is likely to introduce.
The key insight: the goal tells you what the agent will try, and that tells you what to test.
Tests SHOULD:
- Guard against the failure modes predicted by the goal (see threat model above)
- Verify output equivalence: same input produces same output regardless of internals
- Cover edge cases the agent's fast paths will skip: empty, nil, unicode, huge, concurrent
- Assert on API contracts: signatures, return types, side effects
- Be fast — slow tests make the loop slower and waste experiment budget
Tests should NOT:
- Assert on performance (that's what the score is for)
- Assert on internal implementation details (the agent SHOULD change those)
- Be flaky or timing-dependent (false failures poison the loop)
- Be so numerous they dominate experiment time (keep test suite under 30s)
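As an illustration of these guidelines, here is a small sketch of tests for a "faster" goal. The function `normalize_title` is hypothetical and stands in for whatever code is in scope; the tests assert only on observable behavior (empty input, multibyte input, recorded input/output pairs), never on timing or internals.

```python
import unicodedata


def normalize_title(text):
    """Hypothetical in-scope function; the agent may rewrite its internals."""
    collapsed = " ".join(text.split())
    return unicodedata.normalize("NFC", collapsed).casefold()


def test_empty_input_does_not_crash():
    # Guards against nil/empty checks being removed for speed.
    assert normalize_title("") == ""


def test_multibyte_input_is_preserved():
    # Guards against fast paths that assume ASCII.
    assert normalize_title("  Crème   BRÛLÉE ") == "crème brûlée"


def test_output_matches_recorded_baseline():
    # Output equivalence: pairs recorded from the unmodified code.
    recorded = {"Hello  World": "hello world", "\tA\nB ": "a b"}
    for given, expected in recorded.items():
        assert normalize_title(given) == expected


if __name__ == "__main__":
    test_empty_input_does_not_crash()
    test_multibyte_input_is_preserved()
    test_output_matches_recorded_baseline()
    print("all guard tests passed")
```

The functions follow the `test_*` naming that pytest collects, but nothing here depends on pytest; any runner that fails on an assertion error would do.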
Step 2.6: Eval Init (when invoked via /autoimprove eval-init)
This phase scaffolds the evaluation harness — the check command and scoring mechanism.
Not every domain needs this. Use the table below to decide.
Which domains need eval scaffolding?
| Domain | Needs golden set? | Needs eval script? | Why |
|---|---|---|---|
| INLINECODE42 | Yes | Yes | "Good answer" is subjective, so it needs labeled Q&A pairs |
| INLINECODE43 | Yes | Yes | Output quality requires human-labeled expected outputs |
| automl | Yes | Yes | Model accuracy needs labeled train/test data |
| ml | No | Usually exists | Training scripts typically emit loss/metrics already |
| perf | No | No | Benchmarks produce objective numbers |
| docker | No | No | Image size in bytes is objective |
| frontend | No | No | Bundle size in bytes is objective |
| ci | No | No | Build time is objective |
| sql | No | No | Query time is objective |
| k8s | No | No | Pod health is objective |
Eval Init Protocol (for domains that need it)
1. Detect domain: Read the codebase and improve.md to determine the domain type.
2. Scaffold eval script: Generate a skeleton evaluation script that:
   - Imports the system under test
   - Loads a golden set from a JSON file
   - Runs each test case through the system
   - Computes appropriate metrics for the domain:
     - Search/RAG: precision, recall, MRR, NDCG, answer relevancy
     - ML/AutoML: AUC-ROC, F1, accuracy, confusion matrix
     - Prompts: F1, exact match, semantic similarity
   - Prints INLINECODE53
3. Build golden set interactively:
   - Run the system with sample inputs from the domain
   - Present the outputs to the user
   - Ask: "Is this a good result? What should the correct output be?"
   - Save labeled results as the golden set
   - Aim for 20-50 test cases (enough for statistical significance, few enough to run fast)
4. Validate golden set: After creation, verify that:
   - Every expected result actually exists in the data/system (no impossible expectations)
   - The eval script runs without errors on the current code
   - The baseline score is reasonable (not 0.0 or 1.0; both suggest a broken eval)
   - Error rate on the baseline run is below 20% (above that, the eval is unreliable)
5. Report:
CODEBLOCK3
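A skeleton eval script of the kind described in this protocol might look like the sketch below. Everything specific in it is an assumption made for illustration: the `search` function stands in for the system under test, the golden-set keys `query` and `expected` are hypothetical, recall@k is just one of the metrics listed for Search/RAG, and the golden set is inlined as a JSON string where a real script would read it from a file. The final line follows the `SCORE: <number>` stdout convention.

```python
import json

# Inlined so the sketch is self-contained; a real script would load a file.
GOLDEN_JSON = """[
  {"query": "reset password", "expected": ["doc-7"]},
  {"query": "billing", "expected": ["doc-9"]}
]"""


def search(query):
    """Hypothetical system under test: returns ranked document IDs."""
    index = {"reset password": ["doc-7", "doc-2"], "billing": ["doc-4"]}
    return index.get(query, [])


def evaluate(golden_set, k=3):
    """Mean recall@k over the golden set, plus the share of cases that raised.

    Assumes each case lists at least one expected ID.
    """
    if not golden_set:
        raise ValueError("golden set is empty")
    recalls, errors = [], 0
    for case in golden_set:
        try:
            results = search(case["query"])[:k]
        except Exception:
            errors += 1
            continue
        expected = set(case["expected"])
        recalls.append(len(expected & set(results)) / len(expected))
    score = sum(recalls) / len(recalls) if recalls else 0.0
    return {"score": score, "error_rate": errors / len(golden_set)}


if __name__ == "__main__":
    report = evaluate(json.loads(GOLDEN_JSON))
    print(f"error_rate: {report['error_rate']:.2f}")
    print(f"SCORE: {report['score']:.4f}")
```

Printing the error rate alongside the score keeps both the pre-loop error check and any guard regex working from the same stdout.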
Step 3: Baseline
Run the check command. Extract the score.
Pre-loop error check: If the eval produces any errors (non-zero error rate), report them:
CODEBLOCK4
If error rate > 20%, REFUSE to start the loop. Too many failures make the score unreliable.
Save baseline to .autoimprove/baseline.json:
CODEBLOCK5
Print: INLINECODE55
The budget timer starts NOW, not during setup. Bootstrap, eval-init, and baseline establishment are excluded from the budget.
Step 4: Loop
CODEBLOCK6
Step 5: Summary
When the loop ends, print:
CODEBLOCK7
Stopping Conditions
The loop stops when ANY of these is true:
- budget time has elapsed (measured from first experiment, not from setup)
- INLINECODE57 experiments have been run
- Score has reached INLINECODE58
- INLINECODE59 consecutive experiments failed to improve
- Manual interrupt (Ctrl+C or agent termination)
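The first four conditions reduce to a single check run between experiments. The sketch below assumes hypothetical field names (`budget_seconds`, `max_experiments`, `target_score`, `max_stale`), since the actual improve.md keys are not spelled out here, and treats an unset limit as "no limit". Manual interrupts are left to the caller, for example by catching `KeyboardInterrupt` around the loop.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Limits:
    # Hypothetical names; the real improve.md keys may differ.
    budget_seconds: Optional[float] = None
    max_experiments: Optional[int] = None
    target_score: Optional[float] = None
    max_stale: Optional[int] = None
    higher_is_better: bool = True


@dataclass
class LoopState:
    elapsed_seconds: float  # measured from the first experiment, not setup
    experiments_run: int
    best_score: float
    consecutive_non_improving: int


def stop_reason(limits, state):
    """Return the name of the first stopping condition met, or None."""
    if limits.budget_seconds is not None and state.elapsed_seconds >= limits.budget_seconds:
        return "budget"
    if limits.max_experiments is not None and state.experiments_run >= limits.max_experiments:
        return "max_experiments"
    if limits.target_score is not None:
        if limits.higher_is_better:
            reached = state.best_score >= limits.target_score
        else:
            reached = state.best_score <= limits.target_score
        if reached:
            return "target"
    if limits.max_stale is not None and state.consecutive_non_improving >= limits.max_stale:
        return "stale"
    return None


limits = Limits(budget_seconds=3600, max_experiments=50, target_score=0.95, max_stale=10)
print(stop_reason(limits, LoopState(120.0, 4, 0.91, 2)))
```

Returning the reason rather than a bare boolean makes it easy to include in the final summary.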
Score Extraction
Try these in order:
1. Convention: look for SCORE: <number> in stdout
2. Regex: if score: field contains a regex pattern (has parentheses), apply it to stdout
3. jq: if score: field starts with ., treat as jq expression applied to stdout as JSON
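The fallback order can be sketched as one function. This is an illustration under stated assumptions, not the skill's implementation: the name `extract_score` is hypothetical, the regex is assumed to carry the number in its first capture group, and the jq step is replaced by a plain dotted-path lookup (such as `.metrics.auc`) so the example runs without the jq binary. Real jq expressions are far more expressive than this stand-in.

```python
import json
import re

# A line consisting of "SCORE: <number>", anywhere in stdout.
SCORE_LINE = re.compile(
    r"^SCORE:\s*(-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)\s*$", re.MULTILINE
)


def extract_score(stdout, score_field=None):
    """Try convention, then regex, then a jq-like path; None if all fail."""
    # 1. Convention: SCORE: <number>
    match = SCORE_LINE.search(stdout)
    if match:
        return float(match.group(1))
    if not score_field:
        return None
    try:
        # 2. Regex: a score field with parentheses is treated as a pattern.
        if "(" in score_field and ")" in score_field:
            found = re.search(score_field, stdout)
            return float(found.group(1)) if found else None
        # 3. jq stand-in: a leading "." is treated as a dotted key path.
        if score_field.startswith("."):
            value = json.loads(stdout)
            for key in filter(None, score_field.split(".")):
                value = value[key]
            return float(value)
    except (ValueError, KeyError, TypeError, IndexError):
        return None
    return None


print(extract_score("warming up\nSCORE: 0.8731\n"))
print(extract_score("p95=412ms", r"p95=(\d+)ms"))
print(extract_score('{"metrics": {"auc": 0.91}}', ".metrics.auc"))
```

One design point worth noting: a jq expression can itself contain parentheses, so checking for a regex first (as the listed order implies) means such expressions would be treated as patterns in this sketch.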
Guard Metrics
Optional secondary metrics that must not regress. Format in improve.md:
CODEBLOCK8
This means: extract error_rate from stdout using the regex, and reject the experiment unless the value is below 0.05. Guard failures are logged as "guard_failed" and count toward consecutive failures.
Use guards to prevent the agent from improving one metric by tanking another. Common examples:
- guard: error_rate: ([\d.]+) < 0.05 (search errors stay below 5%)
- INLINECODE67 (tail latency stays under 500ms)
- INLINECODE68 (memory usage stays under 1GB)
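A guard of the form shown in the first example, `<regex> < <threshold>`, could be evaluated roughly as follows. The function name `check_guard` is hypothetical, only `<` appears in the text (the other comparison operators are an assumption), and treating a metric that is missing from stdout as a failure is a conservative choice made for this sketch.

```python
import operator
import re

OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge}

# "<regex> <op> <number>": the regex is everything before the final comparison.
GUARD_SPEC = re.compile(
    r"^(?P<pattern>.+?)\s+(?P<op><=|>=|<|>)\s+(?P<limit>-?\d+(?:\.\d+)?)\s*$"
)


def check_guard(spec, stdout):
    """Return True if the guard holds for this run's stdout."""
    parsed = GUARD_SPEC.match(spec)
    if parsed is None:
        raise ValueError(f"unparseable guard: {spec!r}")
    found = re.search(parsed["pattern"], stdout)
    if found is None:
        # Assumption: a metric that cannot be found counts as a guard failure.
        return False
    value = float(found.group(1))
    return OPS[parsed["op"]](value, float(parsed["limit"]))


spec = r"error_rate: ([\d.]+) < 0.05"
print(check_guard(spec, "error_rate: 0.03\nSCORE: 0.91"))
print(check_guard(spec, "error_rate: 0.07\nSCORE: 0.97"))
```

In the second call the score is higher but the guard fails, which is exactly the trade the guard exists to reject.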
Experiment Logging
Each experiment is saved as JSON in .autoimprove/experiments/:
CODEBLOCK9
The supersedes field
When an experiment fundamentally replaces the approach from a previous experiment, set supersedes to the list of experiment IDs that are now obsolete:
CODEBLOCK10
When reading past experiments, skip any whose ID appears in a later kept experiment's supersedes list. This avoids spending rounds on variations of approaches that have already been replaced.
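The skip rule can be sketched as a single pass over the log from newest to oldest. The record keys `id` and `status` and the status value "kept" are assumptions made for this illustration; only the `supersedes` field is named in the text.

```python
def active_experiments(experiments):
    """Filter a chronological experiment log down to non-superseded entries.

    An entry is dropped if its ID appears in the supersedes list of a
    later experiment that was kept.
    """
    obsolete = set()
    newest_first = []
    for exp in reversed(experiments):
        if exp["id"] not in obsolete:
            newest_first.append(exp)
        # Only kept experiments can retire earlier ones.
        if exp.get("status") == "kept":
            obsolete.update(exp.get("supersedes") or [])
    return list(reversed(newest_first))


log = [
    {"id": "exp-001", "status": "kept"},
    {"id": "exp-002", "status": "discarded"},
    {"id": "exp-003", "status": "kept", "supersedes": ["exp-001", "exp-002"]},
    {"id": "exp-004", "status": "discarded", "supersedes": ["exp-003"]},
]
print([exp["id"] for exp in active_experiments(log)])
```

Walking backwards enforces the "later" part of the rule: a supersedes list can only affect experiments that come before it, and the list on the discarded `exp-004` is ignored.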
Prerequisites and security
Runtime requirements: git is required. The check commands in your improve.md determine what else is needed (go, python, npm, docker, kubectl, psql, etc.). Verify these are installed before starting.
Credentials: The agent runs arbitrary shell commands from your improve.md. It inherits whatever credentials are available to the process (AWS keys, DB creds, kubeconfigs, API tokens). Run autoimprove with least-privilege credentials. Strip environment variables you don't want the agent to access.
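One way to strip environment variables is to launch the check command with an allowlisted environment instead of the inherited one. The sketch below is an illustration only: the allowlist contents and the helper names are assumptions, and a real setup would need to add whatever variables its toolchain requires.

```python
import os
import subprocess

# Hypothetical allowlist; extend it with what your check command needs.
SAFE_VARS = ("PATH", "HOME", "LANG", "TMPDIR")


def scrubbed_env(extra_allowed=()):
    """Copy of the current environment limited to allowlisted names."""
    allowed = set(SAFE_VARS) | set(extra_allowed)
    return {name: value for name, value in os.environ.items() if name in allowed}


def run_check(command, extra_allowed=()):
    """Run the check command (a list of arguments) without inherited secrets."""
    return subprocess.run(
        command,
        env=scrubbed_env(extra_allowed),
        capture_output=True,
        text=True,
        check=False,
    )


os.environ["AWS_SECRET_ACCESS_KEY"] = "example-only"
print("AWS_SECRET_ACCESS_KEY" in scrubbed_env())
```

An allowlist is used rather than a denylist because new secrets tend to appear under names nobody thought to block.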
First run: Always interactive. The readiness check (Step 1) confirms scope, reviews generated tests, and establishes a baseline before the loop starts. Don't run headless until you've verified one interactive run works correctly.
Backup: Before headless runs, the readiness check creates a backup branch automatically. The loop uses git commits and resets for rollback, but the backup branch protects against edge cases.
Scope enforcement: The rules below (NEVER modify files outside scope) are policy constraints, not technical enforcement. The agent follows them in practice, but there is no sandbox preventing out-of-scope edits. For sensitive repos, run in a cloned fork or container where damage is reversible.
Rules
- NEVER modify files outside the resolved INLINECODE73
- NEVER modify test files during the optimization loop — tests are immutable guardrails
- NEVER modify the check command, eval script, golden set, or scoring mechanism
- NEVER skip the git commit before running the check
- NEVER proceed to eval without verifying HEAD changed (confirms commit succeeded)
- ALWAYS log every experiment, including failures and errors
- ALWAYS read past experiments before proposing a new change
- ALWAYS skip superseded experiments when reading the log
- ALWAYS git reset discarded experiments — leave the tree clean
- Test failure means immediate discard — no exceptions, no "but the score improved"
- Guard failure means immediate discard — secondary metrics must not regress
- If many experiments fail tests, the change strategy is wrong — try a different approach
- Complexity must pay for itself: 20 lines of hack for 0.001 improvement is NOT worth keeping
- Deleting code for equal results IS worth keeping (use keepifequal: true)
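The mechanical part of these rules (tests first, guards second, score last) can be condensed into a small decision function. This is a sketch under assumptions: the function name, the parameter spelling `keep_if_equal`, and every reason label other than "guard_failed" are hypothetical. The "complexity must pay for itself" rule is a judgment call and is deliberately not encoded.

```python
def decide(tests_passed, guards_passed, score, best_score,
           higher_is_better=True, keep_if_equal=False):
    """Return (keep, reason) for one experiment."""
    # Test failure means immediate discard, whatever the score says.
    if not tests_passed:
        return (False, "tests_failed")
    # Guard failure means immediate discard as well.
    if not guards_passed:
        return (False, "guard_failed")
    improved = score > best_score if higher_is_better else score < best_score
    if improved:
        return (True, "improved")
    # Equal results are kept only when opted in, e.g. for code deletions.
    if keep_if_equal and score == best_score:
        return (True, "equal")
    return (False, "no_improvement")


print(decide(tests_passed=False, guards_passed=True, score=0.99, best_score=0.90))
print(decide(tests_passed=True, guards_passed=True, score=0.90, best_score=0.90,
             keep_if_equal=True))
```

Whatever the outcome, the experiment is still logged, and a discard is followed by a git reset so the tree stays clean.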
Init Templates
When the user runs /autoimprove init, detect the repo type and suggest the right template. If --type is specified, use that directly.
Detection heuristics:
- go.mod or Cargo.toml or Makefile with benchmark targets → INLINECODE79
- INLINECODE80 or *.ipynb with torch/tensorflow imports → INLINECODE82
- INLINECODE83 with sklearn/xgboost/lightgbm/catboost imports → INLINECODE84
- Python files with langchain/llama_index/chromadb/pinecone/weaviate/qdrant imports → INLINECODE85
- INLINECODE86 → INLINECODE87
- INLINECODE88 with kind: Deployment → INLINECODE90
- INLINECODE91 directory or eval/ with score outputs → INLINECODE93
- INLINECODE94 files → INLINECODE95
- INLINECODE96 with build script → INLINECODE97
- INLINECODE98 → INLINECODE99
For domains that need eval scaffolding (rag, prompt, automl), also suggest running /autoimprove eval-init after init.
See references/examples.md for all templates.
Export Mode
When /autoimprove --export is invoked, generate a program.md file that any agent can follow without the skill. This makes autoimprove agent-agnostic.
See references/protocol.md for the exported protocol template.
Multi-Shot Examples
The following examples demonstrate how autoimprove applies across domains. Use these to understand the SHAPE of good experiments — what kinds of changes to try, how to structure the check command, and what good improve.md instructions look like.
Example 1: API Latency
CODEBLOCK11
Example 2: Docker Image Size
CODEBLOCK12
Example 3: LLM Prompt Quality
CODEBLOCK13
Example 4: Build/CI Speed
CODEBLOCK14
Example 5: ML Training (Karpathy-style)
CODEBLOCK15
Example 6: Kubernetes Cluster Health
CODEBLOCK16
Example 7: SQL Query Performance
CODEBLOCK17
Example 8: Frontend Bundle Size
CODEBLOCK18
Example 9: Tabular ML / AutoML (Churn, Fraud, Scoring)
This is the most common ML task across companies: predicting outcomes on structured data. Traditional AutoML (AutoSklearn, FLAML, AutoGluon) searches a predefined space of models and hyperparameters. Autoimprove goes further: it can engineer new features, rewrite preprocessing, swap model architectures, and delete dead code.
CODEBLOCK19
This example works for any tabular prediction task — swap "churn" for fraud detection,
credit scoring, lead conversion, demand forecasting, or insurance pricing. The structure
is the same: features in columns, a target to predict, a metric to maximize.
Example 10: RAG Pipeline Optimization
RAG (Retrieval-Augmented Generation) pipelines have many interacting knobs — chunking,
embedding, retrieval, reranking, prompt template, context window management. Small
changes compound: better chunking improves retrieval which improves generation quality.
Autoimprove can explore this space much faster than manual tuning.
CODEBLOCK20
This works for any RAG system — internal knowledge bases, customer support bots,
documentation search, legal document retrieval, or code search. The score metric
can be swapped: use faithfulness to reduce hallucination, context_precision to
improve retrieval, or a composite RAGAS score for overall quality.