Agent Cost Strategy
Use the cheapest model that can reliably do the job. Most tasks don't need your most powerful model.
The Three Tiers
| Tier | When to Use | Examples |
|---|---|---|
| Fast/Cheap | Sub-agents, background tasks, automated fixes, simple lookups, short replies | Claude Haiku, GPT-4o-mini, Gemini Flash |
| Mid-tier | Main session dialogue, moderate reasoning, multi-step tasks | Claude Sonnet, GPT-4o, Gemini Pro |
| Powerful | Architecture decisions, deep reviews, hard problems, after cheaper models fail twice | Claude Opus, GPT-4.5, Gemini Ultra |
Task → Tier Routing
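```text
Fix a failing test          → Fast/Cheap
Write boilerplate code      → Fast/Cheap
Research / search           → Fast/Cheap
Cron / scheduled tasks      → Fast/Cheap (always)
Short replies (hi, ok)      → Fast/Cheap (always)
Background monitoring       → Fast/Cheap (always)
Build a new feature         → Mid-tier
Review a PR                 → Mid-tier
Main assistant dialogue     → Mid-tier (default)
Architecture decisions      → Powerful
Deep code review            → Powerful
Stuck after two attempts    → Escalate one tier
```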
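The routing rules in this section can be sketched as a small lookup with an escalation step. This is only an illustration: the task labels, the `route` function, and the mid-tier fallback for unknown tasks are assumptions, not part of any platform API.

```python
from enum import Enum


class Tier(Enum):
    FAST = "fast/cheap"
    MID = "mid-tier"
    POWERFUL = "powerful"


# Hypothetical task labels mapped to the tiers described in this section.
ROUTES = {
    "fix_failing_test": Tier.FAST,
    "boilerplate": Tier.FAST,
    "research": Tier.FAST,
    "cron": Tier.FAST,
    "short_reply": Tier.FAST,
    "background_monitoring": Tier.FAST,
    "build_feature": Tier.MID,
    "review_pr": Tier.MID,
    "main_dialogue": Tier.MID,
    "architecture": Tier.POWERFUL,
    "deep_code_review": Tier.POWERFUL,
}

# Tasks that stay on the cheapest tier no matter what.
ALWAYS_FAST = {"cron", "short_reply", "background_monitoring"}

ORDER = [Tier.FAST, Tier.MID, Tier.POWERFUL]


def route(task: str, failed_attempts: int = 0) -> Tier:
    """Pick a tier for a task, escalating one tier after two failed attempts."""
    if task in ALWAYS_FAST:
        return Tier.FAST
    tier = ROUTES.get(task, Tier.MID)  # assumption: unknown tasks default to mid-tier
    if failed_attempts >= 2:
        tier = ORDER[min(ORDER.index(tier) + 1, len(ORDER) - 1)]
    return tier
```

For example, `route("fix_failing_test", failed_attempts=2)` escalates from Fast/Cheap to Mid-tier, while `route("cron", failed_attempts=5)` stays on Fast/Cheap.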
Heartbeat / Cron Model Rule
Always specify the cheapest model for scheduled and background tasks — they run frequently and costs add up fast. Check your platform's config for how to set a model per cron/heartbeat job.
For heartbeat intervals: set them just under your provider's cache TTL to keep the prompt cache warm and pay cache-read rates instead of full input rates. Check your provider's docs for the exact TTL.
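A minimal sketch of the interval calculation, assuming you already know your provider's cache TTL in seconds. The function name and the 30-second safety margin are illustrative choices, not provider recommendations.

```python
def heartbeat_interval(cache_ttl_s: float, safety_margin_s: float = 30.0) -> float:
    """Return a heartbeat interval just under the cache TTL so the cache stays warm."""
    if cache_ttl_s <= safety_margin_s:
        raise ValueError("cache TTL must be longer than the safety margin")
    return cache_ttl_s - safety_margin_s
```

With a hypothetical 300-second TTL, `heartbeat_interval(300)` returns `270.0`. Substitute the TTL from your provider's documentation.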
Communication Pattern Rule
One-word and short conversational messages (hi, thanks, ok, sure, yes, no) should always route to Fast/Cheap. Never burn a mid-tier or powerful model on an acknowledgment.
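One way to apply this rule is a cheap pre-check before model selection. The sketch below is an assumption about how such a check might look: the word list comes from the examples above, and the three-word cutoff is arbitrary.

```python
import re

# Acknowledgment words taken from the examples in the rule above.
ACKS = {"hi", "thanks", "ok", "sure", "yes", "no"}


def is_acknowledgment(message: str, max_words: int = 3) -> bool:
    """True when the message consists only of a few acknowledgment words."""
    words = re.findall(r"[a-z']+", message.lower())
    return 0 < len(words) <= max_words and all(w in ACKS for w in words)
```

Messages such as "Thanks!" or "ok, sure" pass the check and can go to Fast/Cheap. "Yes, delete the old branch" does not, so it falls through to normal routing.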
Cache Optimization
Prompt caching cuts costs 50-90% on repeated context. Cache writes cost ~25% more but pay off after just 1-2 reuses. See references/cache-optimization.md for patterns and break-even math.
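The break-even claim can be checked with a few lines of arithmetic. This sketch assumes a 1.25x cache-write multiplier and a 0.10x cache-read multiplier, matching the rough figures quoted in this document. Actual multipliers vary by provider, so treat the defaults as placeholders.

```python
def relative_cost(reuses: int, cached: bool,
                  write_mult: float = 1.25, read_mult: float = 0.10) -> float:
    """Cost of sending the same context once plus `reuses` more times,
    measured in units of one uncached send."""
    if cached:
        return write_mult + reuses * read_mult
    return 1.0 + reuses


def break_even_reuses(write_mult: float = 1.25, read_mult: float = 0.10) -> int:
    """Smallest number of reuses at which caching is strictly cheaper."""
    if read_mult >= 1.0:
        raise ValueError("cache reads must be cheaper than uncached input")
    reuses = 0
    while relative_cost(reuses, True, write_mult, read_mult) >= relative_cost(reuses, False):
        reuses += 1
    return reuses
```

Under these assumed multipliers, a single reuse already costs 1.35 units with caching versus 2.0 without, so `break_even_reuses()` returns `1`.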
Batch API (Non-Urgent Tasks)
For cron jobs, scheduled analysis, or anything else that doesn't need an immediate response, use the Batch API (Anthropic and OpenAI both offer one). You get a 50% discount in exchange for asynchronous delivery (results within 24h). Never use the real-time API for background work that can wait.
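The decision can be reduced to a single comparison against the batch turnaround time. The sketch below is illustrative only: `choose_api` and `job_cost` are hypothetical helpers, and the 24-hour window and 50% discount are the figures quoted above, not values read from any provider API.

```python
from datetime import timedelta

BATCH_TURNAROUND = timedelta(hours=24)  # "results within 24h", per the text above
BATCH_DISCOUNT = 0.50                   # 50% discount, per the text above


def choose_api(can_wait: timedelta) -> str:
    """Use batch whenever the job can tolerate the full batch turnaround."""
    return "batch" if can_wait >= BATCH_TURNAROUND else "realtime"


def job_cost(realtime_cost: float, can_wait: timedelta) -> float:
    """Estimated cost of a job given how long its result can wait."""
    if choose_api(can_wait) == "batch":
        return realtime_cost * (1 - BATCH_DISCOUNT)
    return realtime_cost
```

A nightly report that can wait two days would go to batch at half price; a reply needed within five minutes stays on the real-time API.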
Sub-Agent Model Rule (Critical)
Always explicitly set the model when spawning sub-agents. Never rely on defaults: the default inherits the parent session model (the expensive mid-tier one). In one month where sub-agents defaulted to Sonnet, 96% of costs went to Sonnet, when the split should have been roughly 80/20 Haiku/Sonnet.
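```text
sessions_spawn → always include model: claude-haiku-4-5-20251001 (or an equivalent fast, cheap model)
```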
Default sub-agent tasks to Haiku for cost efficiency. Override with a stronger model when task complexity or accuracy requirements justify it.
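A thin wrapper can make the explicit model impossible to forget. The payload shape below is hypothetical: this document does not specify the parameters of `sessions_spawn`, so `build_spawn_request` only shows the idea of always attaching a model before the spawn call is made.

```python
from typing import Any, Dict, Optional

# Fast/cheap default for sub-agents; swap in your platform's equivalent.
DEFAULT_SUBAGENT_MODEL = "claude-haiku-4-5-20251001"


def build_spawn_request(task: str, model: Optional[str] = None) -> Dict[str, Any]:
    """Build a sub-agent spawn payload that always carries an explicit model.

    The {"task", "model"} keys are illustrative, not a documented schema.
    """
    return {"task": task, "model": model or DEFAULT_SUBAGENT_MODEL}
```

Callers that need a stronger model pass it explicitly, for example `build_spawn_request("audit auth flow", model="<stronger-model-id>")`. Everything else falls back to the cheap default instead of inheriting the parent's model.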
New Session / Machine Cold Start Cost
When starting a fresh session (new machine, new session after /new), the cache is empty. The first few messages will write the entire context (skills, workspace files, memory) to cache at 1.25x the normal input rate. This is unavoidable but temporary — it pays off within 2-3 messages once the cache warms up.
Don't panic at the first few messages being expensive on a new machine. The cache write cost is a one-time investment that makes every subsequent message ~90% cheaper.
Signs You're Over-Spending
- Running powerful models on tasks Fast/Cheap can handle
- No caching on repeated system prompts
- Heartbeat/cron jobs using the default (expensive) model
- Sub-agents spawned without explicit model = biggest cost leak
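To spot these signs in practice, it helps to look at each model's share of total spend. The sketch below assumes you can export usage as `(model_name, cost)` pairs; that record shape and the `cost_share` helper are illustrative, not a feature of any billing API.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def cost_share(records: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Return each model's fraction of total spend from (model, cost) records."""
    totals: Dict[str, float] = defaultdict(float)
    for model, cost in records:
        totals[model] += cost
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {model: cost / grand_total for model, cost in totals.items()}
```

If the mid-tier model's share comes out near 0.96 when most of the work is sub-agent tasks, that points to sub-agents running without an explicit model.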
Session & Cache Management
Keep sessions alive when possible — longer sessions build cache and reduce costs. Only end sessions when context is genuinely full or for privacy reasons.
Anthropic's prompt cache builds from repeated context within a live session. When a session starts fresh, all context (system prompt, workspace files, skills) loads cold — typically 400-600k tokens at full cost. Once cached, subsequent messages cost ~10% of that.
The math:
- Cold session start: 600k tokens × full price = expensive
- After cache warms up: 600k tokens × 10% cache price = ~90% cheaper per message
- Ending a session destroys the cache and forces a full cold reload next time
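The effect of ending sessions early can be estimated by counting one cold cache write per session. This is a simplified model under stated assumptions: a fixed 600k-token context, a hypothetical price of 3.0 per million input tokens, the 1.25x write and 0.10x read multipliers mentioned in this document, and no cache expiry or context growth within a session.

```python
def context_cost(messages: int, sessions: int,
                 context_tokens: int = 600_000,
                 price_per_mtok: float = 3.0,   # hypothetical price per million tokens
                 write_mult: float = 1.25,
                 read_mult: float = 0.10) -> float:
    """Input cost of resending the context across `messages` messages
    that are split over `sessions` separate sessions."""
    if sessions < 1 or messages < sessions:
        raise ValueError("need at least one message per session")
    per_send = context_tokens / 1_000_000 * price_per_mtok
    cold_sends = sessions             # the first message of each session writes the cache
    warm_sends = messages - sessions  # every later message reads from the cache
    return per_send * (cold_sends * write_mult + warm_sends * read_mult)
```

Under these assumptions, 20 messages in one session cost about 5.67, while the same 20 messages split across four sessions cost about 11.88, roughly double.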
Rules:
- Let sessions run as long as possible for cost efficiency
- Only start a new session (/new) when context is genuinely full (>80%) or when you need a fresh privacy boundary
- Ending sessions should be intentional: for privacy/data-retention reasons, not routine cost management
- The longer a session runs, the cheaper each message gets
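These rules amount to a two-condition check. The sketch below assumes you can read the session's current token usage and context window size; the function and parameter names are illustrative.

```python
def should_start_new_session(used_tokens: int, context_window: int,
                             needs_privacy_boundary: bool = False,
                             full_threshold: float = 0.80) -> bool:
    """Start a new session only when context is over the threshold
    or a fresh privacy boundary is required."""
    if context_window <= 0:
        raise ValueError("context_window must be positive")
    return needs_privacy_boundary or used_tokens / context_window > full_threshold
```

With a hypothetical 200k-token window, 170k tokens used (85%) returns `True`, while 100k used (50%) returns `False` unless a privacy boundary is requested.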
Privacy & Cache Note: Cached context may include workspace files and memory — avoid caching sessions containing secrets or sensitive PII. If a session will cache sensitive data, plan to end it when done.
Delegation rule (keep main agent lean):
- Main agent (Sonnet/mid-tier) = conversational only: planning, coordination, reviewing results
- Sub-agents (Haiku/fast-cheap) = all actual doing: file edits, research, builds, data tasks
- Keeping the main agent conversational reduces its context growth and keeps cache hits high