Agent Cost Strategy

Use the cheapest model that can reliably do the job. Most tasks don't need your most powerful model.

The Three Tiers

Tier	When to Use	Examples
Fast/Cheap	Sub-agents, background tasks, automated fixes, simple lookups, short replies	Claude Haiku, GPT-4o-mini, Gemini Flash
Mid-tier

Main session dialogue, moderate reasoning, multi-step tasks | Claude Sonnet, GPT-4o, Gemini Pro | | Powerful | Architecture decisions, deep reviews, hard problems, after cheaper models fail twice | Claude Opus, GPT-4.5, Gemini Ultra |

Task → Tier Routing

CODEBLOCK0

Heartbeat / Cron Model Rule

Always specify the cheapest model for scheduled and background tasks — they run frequently and costs add up fast. Check your platform's config for how to set a model per cron/heartbeat job.

For heartbeat intervals: set them just under your provider's cache TTL to keep the prompt cache warm and pay cache-read rates instead of full input rates. Check your provider's docs for the exact TTL.

Communication Pattern Rule

One-word and short conversational messages (hi, thanks, ok, sure, yes, no) should always route to Fast/Cheap. Never burn a mid-tier or powerful model on an acknowledgment.

Cache Optimization

Prompt caching cuts costs 50-90% on repeated context. Cache writes cost ~25% more but pay off after just 1-2 reuses. See references/cache-optimization.md for patterns and break-even math.

Batch API (Non-Urgent Tasks)

For cron jobs, scheduled analysis, or anything that doesn't need an immediate response — use the Batch API (Anthropic/OpenAI both offer it). 50% discount in exchange for async delivery (results within 24h). Never use real-time API for background work that can wait.

Sub-Agent Model Rule (Critical)

Always explicitly set the model when spawning sub-agents. Never rely on defaults — the default inherits the parent session model (expensive mid-tier). One month of sub-agents defaulting to Sonnet = 96% of costs going to Sonnet when it should be split ~80/20 Haiku/Sonnet.

CODEBLOCK1

Default sub-agent tasks to Haiku for cost efficiency. Override with a stronger model when task complexity or accuracy requirements justify it.

New Session / Machine Cold Start Cost

When starting a fresh session (new machine, new session after /new), the cache is empty. The first few messages will write the entire context (skills, workspace files, memory) to cache at 1.25x the normal input rate. This is unavoidable but temporary — it pays off within 2-3 messages once the cache warms up.

Don't panic at the first few messages being expensive on a new machine. The cache write cost is a one-time investment that makes every subsequent message ~90% cheaper.

Signs You're Over-Spending

- Running powerful models on tasks Fast/Cheap can handle
No caching on repeated system prompts
Heartbeat/cron jobs using the default (expensive) model
Sub-agents spawned without explicit model = biggest cost leak

Session & Cache Management

Keep sessions alive when possible — longer sessions build cache and reduce costs. Only end sessions when context is genuinely full or for privacy reasons.

Anthropic's prompt cache builds from repeated context within a live session. When a session starts fresh, all context (system prompt, workspace files, skills) loads cold — typically 400-600k tokens at full cost. Once cached, subsequent messages cost ~10% of that.

The math:

- Cold session start: 600k tokens × full price = expensive
After cache warms up: 600k tokens × 10% cache price = ~90% cheaper per message
Ending a session destroys the cache and forces a full cold reload next time

Rules:

- Let sessions run as long as possible for cost efficiency
Only start a new session (/new) when context is genuinely full (>80%) or when you need a fresh privacy boundary
Ending sessions should be intentional — for privacy/data-retention reasons, not routine cost management
The longer a session runs, the cheaper each message gets

Privacy & Cache Note: Cached context may include workspace files and memory — avoid caching sessions containing secrets or sensitive PII. If a session will cache sensitive data, plan to end it when done.

Delegation rule (keep main agent lean):

- Main agent (Sonnet/mid-tier) = conversational only: planning, coordination, reviewing results
Sub-agents (Haiku/fast-cheap) = all actual doing: file edits, research, builds, data tasks
Keeping the main agent conversational reduces its context growth and keeps cache hits high

Agent Cost Strategy

使用能够可靠完成任务的最便宜模型。大多数任务不需要使用你最强大的模型。

三个层级

层级	使用场景	示例
快速/廉价	子代理、后台任务、自动修复、简单查询、简短回复	Claude Haiku、GPT-4o-mini、Gemini Flash
中端

任务 → 层级路由

text
修复失败的测试 → 快速/廉价
编写样板代码 → 快速/廉价
研究/搜索 → 快速/廉价
定时/计划任务 → 快速/廉价（始终）
简短回复（嗨、好的） → 快速/廉价（始终）
后台监控 → 快速/廉价（始终）
构建新功能 → 中端
审查PR → 中端
主助手对话 → 中端（默认）
架构决策 → 强大
深度代码审查 → 强大
两次尝试后卡住 → 升级一个层级

心跳/定时任务模型规则

始终为定时和后台任务指定最便宜的模型——它们运行频繁，成本会迅速累积。查看你的平台配置，了解如何为每个定时/心跳任务设置模型。

对于心跳间隔：将其设置得略低于你的提供商缓存TTL，以保持提示缓存温暖，并支付缓存读取费率而非完整输入费率。查看你的提供商文档了解确切的TTL。

通信模式规则

单字和简短的对话消息（嗨、谢谢、好的、当然、是、否）应始终路由到快速/廉价。绝不要在确认消息上消耗中端或强大模型。

缓存优化

提示缓存可将重复上下文的成本降低50-90%。缓存写入成本约高25%，但仅需1-2次重用即可收回成本。参见 references/cache-optimization.md 了解模式和盈亏平衡计算。

批量API（非紧急任务）

对于定时任务、计划分析或任何不需要立即响应的任务——使用批量API（Anthropic/OpenAI都提供）。50%折扣，以换取异步交付（24小时内出结果）。绝不要为可以等待的后台工作使用实时API。

子代理模型规则（关键）

生成子代理时始终明确指定模型。 切勿依赖默认值——默认值会继承父会话模型（昂贵的中端）。一个月内子代理默认使用Sonnet = 96%的成本流向Sonnet，而本应大约80/20分配给Haiku/Sonnet。

text
sessions_spawn → 始终包含 model: claude-haiku-4-5-20251001（或等效的快速廉价模型）

默认将子代理任务分配给Haiku以提高成本效率。当任务复杂度或准确性要求证明有必要时，再覆盖为更强的模型。

新会话/机器冷启动成本

启动新会话（新机器、/new后的新会话）时，缓存为空。前几条消息将以正常输入费率1.25倍的价格将整个上下文（技能、工作区文件、记忆）写入缓存。这是不可避免但暂时的——一旦缓存预热，2-3条消息内即可收回成本。

不要因为新机器上前几条消息昂贵而惊慌。 缓存写入成本是一次性投资，使后续每条消息便宜约90%。

过度支出的迹象

- 在快速/廉价可以处理的任务上运行强大模型
重复系统提示没有缓存
心跳/定时任务使用默认（昂贵）模型
子代理生成时未指定模型 = 最大的成本漏洞

会话与缓存管理

尽可能保持会话活跃——更长的会话建立缓存并降低成本。仅在上下文真正满时或出于隐私原因才结束会话。

Anthropic的提示缓存通过实时会话中的重复上下文构建。当会话全新启动时，所有上下文（系统提示、工作区文件、技能）冷加载——通常400-600k tokens按全价计费。一旦缓存，后续消息成本约为其10%。

计算方式：

- 冷会话启动：600k tokens × 全价 = 昂贵
缓存预热后：600k tokens × 10%缓存价格 = 每条消息便宜约90%
结束会话会销毁缓存，下次强制完全冷加载

规则：

- 让会话尽可能长时间运行以提高成本效率
仅在上下文真正满（>80%）或需要新的隐私边界时才启动新会话（/new）
结束会话应是刻意的——出于隐私/数据保留原因，而非常规成本管理
会话运行时间越长，每条消息越便宜

隐私与缓存说明： 缓存的上下文可能包含工作区文件和记忆——避免缓存包含秘密或敏感PII的会话。如果会话将缓存敏感数据，计划在完成后结束它。

委派规则（保持主代理精简）：

- 主代理（Sonnet/中端）= 仅对话：规划、协调、审查结果
子代理（Haiku/快速廉价）= 所有实际操作：文件编辑、研究、构建、数据任务
保持主代理对话性可减少其上下文增长并保持高缓存命中率

agent-cost-strategy智能成本策略