A.I. Smart-Router
Intelligently route requests to the optimal AI model using tiered classification with automatic fallback handling and cost optimization.
How It Works (Silent by Default)
The router operates transparently—users send messages normally and get responses from the best model for their task. No special commands needed.
Optional visibility: Include [show routing] in any message to see the routing decision.
Tiered Classification System
The router uses a three-tier decision process:
CODEBLOCK0
Intent Detection Patterns
CODE Intent
- Keywords: write, code, debug, fix, refactor, implement, function, class, script, API, bug, error, compile, test, PR, commit
- File extensions mentioned: .py, .js, .ts, .go, .rs, .java, etc.
- Code blocks in input
ANALYSIS Intent
- Keywords: analyze, explain, compare, research, understand, why, how does, evaluate, assess, review, investigate, examine
- Long-form questions
- "Help me understand..."
CREATIVE Intent
- Keywords: write (story/poem/essay), create, brainstorm, imagine, design, draft, compose
- Fiction/narrative requests
- Marketing/copy requests
REALTIME Intent
- Keywords: now, today, current, latest, trending, news, happening, live, price, score, weather
- X/Twitter mentions
- Stock/crypto tickers
- Sports scores
GENERAL Intent (Default)
- Simple Q&A
- Translations
- Summaries
- Conversational
MIXED Intent (Multiple Intents Detected)
When a request contains multiple clear intents (e.g., "Write code to analyze this data and explain it creatively"):
1. Identify the primary intent: what's the main deliverable?
2. Route to the highest-capability model: mixed tasks need versatility.
3. Default to COMPLEX complexity: multi-intent means multi-step.
Examples:
- "Write code AND explain how it works" → CODE (primary) + ANALYSIS → Route to Opus
- "Summarize this AND what's the latest news on it" → REALTIME takes precedence → Grok
- "Creative story using real current events" → REALTIME + CREATIVE → Grok (real-time wins)
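The patterns and precedence rules above can be sketched as a small keyword classifier. This is only an illustration under stated assumptions: the names `INTENT_KEYWORDS`, `detect_intents`, and `primary_intent` are hypothetical, the ambiguous keyword "write" is left out because it appears under both CODE and CREATIVE, and treating the first detected intent as primary is a simplification of "identify the main deliverable".

```python
import re

# Keyword lists taken from the patterns above. "write" is omitted on purpose:
# it appears under both CODE and CREATIVE, so it cannot disambiguate alone.
INTENT_KEYWORDS = {
    "CODE": ["code", "debug", "fix", "refactor", "implement", "function", "class",
             "script", "api", "bug", "error", "compile", "test", "pr", "commit"],
    "ANALYSIS": ["analyze", "explain", "compare", "research", "understand", "why",
                 "how does", "evaluate", "assess", "review", "investigate", "examine"],
    "CREATIVE": ["story", "poem", "essay", "create", "brainstorm", "imagine",
                 "design", "draft", "compose"],
    "REALTIME": ["now", "today", "current", "latest", "trending", "news",
                 "happening", "live", "price", "score", "weather"],
}
FILE_EXT = re.compile(r"\.(py|js|ts|go|rs|java)\b")
FENCE = "`" * 3  # marker for a code block in the input


def detect_intents(text: str) -> list[str]:
    """Return every intent whose keywords (or code signals) appear in text."""
    lowered = text.lower()
    found = []
    for intent, keywords in INTENT_KEYWORDS.items():
        if any(re.search(rf"\b{re.escape(k)}\b", lowered) for k in keywords):
            found.append(intent)
    # File extensions and code blocks are language-agnostic CODE signals.
    if "CODE" not in found and (FILE_EXT.search(lowered) or FENCE in text):
        found.insert(0, "CODE")
    return found


def primary_intent(text: str) -> str:
    """Pick one intent: REALTIME wins when present, GENERAL is the default."""
    intents = detect_intents(text)
    if not intents:
        return "GENERAL"
    if "REALTIME" in intents:
        return "REALTIME"
    return intents[0]  # assumption: first detected intent is the primary one
```

Whole-word matching keeps "latest" from triggering the CODE keyword "test". A production router would likely need a stronger signal than keywords for the MIXED case.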
Language Handling
Non-English requests are handled normally — all supported models have multilingual capabilities:
| Model | Non-English Support |
|---|---|
| Opus/Sonnet/Haiku | Excellent (100+ languages) |
| GPT-5 | Excellent (100+ languages) |
| Gemini Pro/Flash | Excellent (100+ languages) |
| Grok | Good (major languages) |
Intent detection still works because:
- Keyword patterns include common non-English equivalents
- Code intent detected by file extensions, code blocks (language-agnostic)
- Complexity estimated by query length (works across languages)
Edge case: If the intent is unclear because of the language, default to GENERAL intent with MEDIUM complexity.
Complexity Signals
Simple Complexity ($)
- Short query (<50 words)
- Single question mark
- "Quick question", "Just tell me", "Briefly"
- Yes/no format
- Unit conversions, definitions
Medium Complexity ($$)
- Moderate query (50-200 words)
- Multiple aspects to address
- "Explain", "Describe", "Compare"
- Some context provided
Complex Complexity ($$$)
- Long query (>200 words) or complex task
- "Step by step", "Thoroughly", "In detail"
- Multi-part questions
- Critical/important qualifier
- Research, analysis, or creative work
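A minimal sketch of how the length and phrase signals above might combine. The function name `estimate_complexity` and the tie-breaking order (complex phrases first, then simple phrases over medium phrases for short queries) are assumptions; the source lists the signals but not how they are weighed.

```python
SIMPLE_PHRASES = ("quick question", "just tell me", "briefly")
MEDIUM_PHRASES = ("explain", "describe", "compare")
COMPLEX_PHRASES = ("step by step", "thoroughly", "in detail")


def estimate_complexity(text: str) -> str:
    """Classify a query as SIMPLE, MEDIUM, or COMPLEX from length and phrasing."""
    lowered = text.lower()
    words = len(text.split())
    has = lambda phrases: any(p in lowered for p in phrases)

    # Long queries or explicit depth requests are always COMPLEX.
    if words > 200 or has(COMPLEX_PHRASES):
        return "COMPLEX"
    # Short queries are SIMPLE unless they only carry a medium-depth verb.
    if words < 50 and (has(SIMPLE_PHRASES) or not has(MEDIUM_PHRASES)):
        return "SIMPLE"
    return "MEDIUM"
```

Word counts are computed by whitespace splitting, which undercounts languages written without spaces; a character-based threshold may be needed there.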
Routing Matrix
| Intent | Simple | Medium | Complex |
|---|---|---|---|
| CODE | Sonnet | Opus | Opus |
| ANALYSIS | Flash | GPT-5 | Opus |
| CREATIVE | Sonnet | Opus | Opus |
| REALTIME | Grok | Grok | Grok-3 |
| GENERAL | Flash | Sonnet | Opus |
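The matrix maps directly onto a nested dictionary lookup. The sketch below assumes string labels for intents and complexity levels, and the `route` helper is hypothetical; unknown inputs fall back to GENERAL intent and MEDIUM complexity, mirroring the edge-case rule in Language Handling.

```python
ROUTING_MATRIX = {
    "CODE":     {"SIMPLE": "Sonnet", "MEDIUM": "Opus",   "COMPLEX": "Opus"},
    "ANALYSIS": {"SIMPLE": "Flash",  "MEDIUM": "GPT-5",  "COMPLEX": "Opus"},
    "CREATIVE": {"SIMPLE": "Sonnet", "MEDIUM": "Opus",   "COMPLEX": "Opus"},
    "REALTIME": {"SIMPLE": "Grok",   "MEDIUM": "Grok",   "COMPLEX": "Grok-3"},
    "GENERAL":  {"SIMPLE": "Flash",  "MEDIUM": "Sonnet", "COMPLEX": "Opus"},
}


def route(intent: str, complexity: str) -> str:
    """Look up the preferred model; unknown inputs default to GENERAL/MEDIUM."""
    row = ROUTING_MATRIX.get(intent, ROUTING_MATRIX["GENERAL"])
    return row.get(complexity, row["MEDIUM"])
```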
Token Exhaustion & Automatic Model Switching
When a model becomes unavailable mid-session (token quota exhausted, rate limit hit, API error), the router automatically switches to the next best available model and notifies the user.
Notification Format
When a model switch occurs due to exhaustion, the user receives a notification:
CODEBLOCK1
Switch Reasons
| Reason | Description |
|---|---|
| `token quota exhausted` | Daily/monthly token limit reached |
| `rate limit exceeded` | Too many requests per minute |
| context window exceeded | Input too large for model |
| API timeout | Model took too long to respond |
| API error | Provider returned an error |
| model unavailable | Model temporarily offline |
Implementation
CODEBLOCK2
Fallback Priority for Token Exhaustion
When a model is exhausted, the router selects the next best model for the same task type:
| Original Model | Fallback Priority (same capability) |
|---|---|
| Opus | Sonnet → GPT-5 → Grok-3 → Gemini Pro |
| Sonnet | GPT-5 → Grok-3 → Opus → Haiku |
| GPT-5 | Sonnet → Opus → Grok-3 → Gemini Pro |
| Gemini Pro | Flash → GPT-5 → Opus → Sonnet |
| Grok-2/3 | (warn: no real-time fallback available) |
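One way to encode the priority table is a mapping from each model to its ordered fallbacks. In this sketch `FALLBACK_PRIORITY` and `next_available` are hypothetical names, and returning `None` stands in for the "warn: no real-time fallback available" case.

```python
from typing import Optional

FALLBACK_PRIORITY = {
    "Opus":       ["Sonnet", "GPT-5", "Grok-3", "Gemini Pro"],
    "Sonnet":     ["GPT-5", "Grok-3", "Opus", "Haiku"],
    "GPT-5":      ["Sonnet", "Opus", "Grok-3", "Gemini Pro"],
    "Gemini Pro": ["Flash", "GPT-5", "Opus", "Sonnet"],
    "Grok-2":     [],  # no real-time fallback available
    "Grok-3":     [],  # no real-time fallback available
}


def next_available(model: str, unavailable: set) -> Optional[str]:
    """Return the first fallback for model that is not in unavailable."""
    for candidate in FALLBACK_PRIORITY.get(model, []):
        if candidate not in unavailable:
            return candidate
    return None  # caller should warn the user that no fallback exists
```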
User Acknowledgment
After a model switch, the agent should note in the response:
1. That the original model was unavailable
2. Which model actually completed the request
3. That the response quality may differ from the original model's typical output
This ensures transparency and sets appropriate expectations.
Streaming Responses with Fallback
When using streaming responses, fallback handling requires special consideration:
CODEBLOCK3
Key insight: Wait for the first chunk before committing to a model. If the first chunk times out, fall back before any partial response is shown to the user.
Retry Timing Configuration
CODEBLOCK4
Circuit breaker: If a model fails 3 times in 5 minutes, skip it entirely for the next 5 minutes. This prevents repeatedly hitting a down service.
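The circuit-breaker rule can be sketched with a sliding window of failure timestamps per model. The class name, method names, and the injectable `clock` parameter are assumptions made for illustration and testability; only the thresholds (3 failures, 5-minute window, 5-minute skip) come from the text.

```python
import time
from collections import defaultdict, deque


class CircuitBreaker:
    """Skip a model for cooldown_s after max_failures failures within window_s."""

    def __init__(self, max_failures=3, window_s=300.0, cooldown_s=300.0,
                 clock=time.monotonic):
        self.max_failures = max_failures
        self.window_s = window_s
        self.cooldown_s = cooldown_s
        self.clock = clock
        self._failures = defaultdict(deque)  # model -> recent failure times
        self._open_until = {}                # model -> time it may be retried

    def record_failure(self, model: str) -> None:
        now = self.clock()
        recent = self._failures[model]
        recent.append(now)
        # Drop failures that fell out of the sliding window.
        while recent and now - recent[0] > self.window_s:
            recent.popleft()
        if len(recent) >= self.max_failures:
            self._open_until[model] = now + self.cooldown_s
            recent.clear()

    def is_available(self, model: str) -> bool:
        return self.clock() >= self._open_until.get(model, float("-inf"))
```

A router would call `is_available(model)` before each attempt and `record_failure(model)` in its exception handlers.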
Fallback Chains
When the preferred model fails (rate limit, API down, error), cascade to the next option:
Code Tasks
CODEBLOCK5
Analysis Tasks
CODEBLOCK6
Creative Tasks
CODEBLOCK7
Real-time Tasks
CODEBLOCK8
General Tasks
CODEBLOCK9
Long Context (Tiered by Size)
CODEBLOCK10
Implementation:
CODEBLOCK11
Example Error Output:
CODEBLOCK12
Dynamic Model Discovery
The router auto-detects available providers at runtime:
CODEBLOCK13
Example: If only Anthropic and Google are configured:
- Code tasks → Opus (Anthropic available ✓)
- Real-time tasks → ⚠️ No Grok → Fall back to Opus + warn user
- Long docs → Gemini Pro (Google available ✓)
Cost Optimization
The router considers cost when complexity is SIMPLE:
| Model | Cost Tier | Use When |
|---|---|---|
| Gemini Flash | $ | Simple tasks, high volume |
| Claude Haiku | $ | Simple tasks, quick responses |
| Claude Sonnet | $$ | Medium complexity |
| Grok 2 | $$ | Real-time needs only |
| GPT-5 | $$ | General fallback |
| Gemini Pro | $$$ | Long context needs |
| Claude Opus | $$$$ | Complex/critical tasks |
Rule: Never use Opus ($$$$) for tasks that Flash ($) can handle.
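As a rough illustration of the rule, the cost tiers can be encoded as integers and the cheapest capable model chosen with `min`. Deciding which models are capable of a task is outside this sketch; the `capable` argument and the `cheapest` helper are hypothetical.

```python
# Cost tiers from the table above, expressed as the number of "$" signs.
COST_TIER = {
    "Gemini Flash": 1,
    "Claude Haiku": 1,
    "Claude Sonnet": 2,
    "Grok 2": 2,
    "GPT-5": 2,
    "Gemini Pro": 3,
    "Claude Opus": 4,
}


def cheapest(capable: list[str]) -> str:
    """Return the lowest-cost model among those able to handle the task."""
    if not capable:
        raise ValueError("no capable model supplied")
    # min() keeps the first model on ties, so list order acts as preference.
    return min(capable, key=lambda model: COST_TIER[model])
```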
User Controls
Show Routing Decision
Add [show routing] to any message:
[show routing] What's the weather in NYC?
Output includes:
CODEBLOCK15
Force Specific Model
Explicit overrides:
- "use grok: ..." → Forces Grok
- "use claude: ..." → Forces Opus
- "use gemini: ..." → Forces Gemini Pro
- "use flash: ..." → Forces Gemini Flash
- "use gpt: ..." → Forces GPT-5
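The override prefixes could be parsed with a single regular expression. This sketch assumes the prefix must appear at the start of the message and is case-insensitive; neither detail is stated in the source, and `parse_override` is a hypothetical helper.

```python
import re

# Prefix → forced model, as listed above.
FORCED_MODELS = {
    "grok": "Grok",
    "claude": "Opus",
    "gemini": "Gemini Pro",
    "flash": "Gemini Flash",
    "gpt": "GPT-5",
}
OVERRIDE = re.compile(r"^\s*use\s+(\w+)\s*:\s*(.*)$", re.IGNORECASE | re.DOTALL)


def parse_override(message: str):
    """Return (forced_model, remaining_text); forced_model is None if absent."""
    match = OVERRIDE.match(message)
    if match and match.group(1).lower() in FORCED_MODELS:
        return FORCED_MODELS[match.group(1).lower()], match.group(2)
    return None, message
```

Unknown prefixes such as "use llama: ..." are passed through unchanged so that normal routing still applies.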
Check Router Status
Ask: "router status" or "/router" to see:
- Available providers
- Configured models
- Current routing table
- Recent routing decisions
Implementation Notes
For Agent Implementation
When processing a request:
CODEBLOCK16
Cost-Aware Routing Flow (Critical Order)
CODEBLOCK17
Cost Optimization: Two Approaches
CODEBLOCK18
Spawning with Different Models
Use sessions_spawn for model routing:
CODEBLOCK19
Security
- Never send sensitive data to untrusted models
- API keys handled via environment/auth profiles only
- See references/security.md for full security guidance
Model Details
See references/models.md for detailed capabilities and pricing.
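A.I. Smart-Router
Intelligently route requests to the optimal AI model using a tiered classification system, with automatic fallback handling and cost optimization.
How It Works (Silent by Default)
The router operates transparently: users send messages normally and get a response from the best model for their task. No special commands are needed.
Optional visibility: Include [show routing] in any message to see the routing decision.
Tiered Classification System
The router uses a three-tier decision process: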
┌─────────────────────────────────────────────────────────────────────┐
│ TIER 1: INTENT DETECTION                                            │
│ Classify the primary purpose of the request                         │
├─────────────────────────────────────────────────────────────────────┤
│ CODE        │ ANALYSIS    │ CREATIVE    │ REALTIME    │ GENERAL     │
│ Write/debug │ Research    │ Writing     │ News/live   │ Q&A/chat    │
│ Refactor    │ Explain     │ Stories     │ X/Twitter   │ Translation │
│ Review      │ Compare     │ Brainstorm  │ Prices      │ Summaries   │
└──────┬──────┴──────┬──────┴──────┬──────┴──────┬──────┴──────┬──────┘
       │             │             │             │             │
       ▼             ▼             ▼             ▼             ▼
┌─────────────────────────────────────────────────────────────────────┐
│ TIER 2: COMPLEXITY ESTIMATION                                       │
├─────────────────────────────────────────────────────────────────────┤
│ SIMPLE ($)           │ MEDIUM ($$)           │ COMPLEX ($$$)        │
│ • Single-step task   │ • Multi-step task     │ • Deep reasoning     │
│ • Short answer OK    │ • Some nuance needed  │ • Extensive output   │
│ • Factual lookup     │ • Moderate context    │ • Critical task      │
│ → Haiku/Flash        │ → Sonnet/Grok/GPT     │ → Opus/GPT-5         │
└──────────────────────┴───────────────────────┴──────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────┐
│ TIER 3: SPECIAL CASE OVERRIDES                                      │
├────────────────────────────────────────┬────────────────────────────┤
│ Condition                              │ Override to                │
├────────────────────────────────────────┼────────────────────────────┤
│ Context >100K tokens                   │ → Gemini Pro (1M context)  │
│ Context >500K tokens                   │ → Gemini Pro only          │
│ Needs real-time data                   │ → Grok (always)            │
│ Image/vision input                     │ → Opus or Gemini Pro       │
│ User explicit override                 │ → Requested model          │
└────────────────────────────────────────┴────────────────────────────┘
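Intent Detection Patterns
CODE Intent
- Keywords: write, code, debug, fix, refactor, implement, function, class, script, API, bug/error, compile, test, PR, commit
- File extensions mentioned: .py, .js, .ts, .go, .rs, .java, etc.
- Code blocks in input
ANALYSIS Intent
- Keywords: analyze, explain, compare, research, understand, why, how does, evaluate, review, investigate, examine
- Long-form questions
- "Help me understand..."
CREATIVE Intent
- Keywords: write (story/poem/essay), create, brainstorm, imagine, design, draft, compose
- Fiction/narrative requests
- Marketing/copy requests
REALTIME Intent
- Keywords: now, today, current, latest, trending, news, happening, live, price, score, weather
- X/Twitter mentions
- Stock/crypto tickers
- Sports scores
GENERAL Intent (Default)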
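MIXED Intent (Multiple Intents Detected)
When a request contains multiple clear intents (e.g., "Write code to analyze this data and explain it creatively"):
1. Identify the primary intent: what's the main deliverable?
2. Route to the highest-capability model: mixed tasks need versatility.
3. Default to COMPLEX complexity: multi-intent means multi-step.
Examples:
- "Write code and explain how it works" → CODE (primary) + ANALYSIS → Route to Opus
- "Summarize this, and what's the latest news on it" → REALTIME takes precedence → Grok
- "Creative story using real current events" → REALTIME + CREATIVE → Grok (real-time wins)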
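Language Handling
Non-English requests are handled normally; all supported models have multilingual capabilities:
| Model | Non-English Support |
|---|---|
| Opus/Sonnet/Haiku | Excellent (100+ languages) |
| GPT-5 | Excellent (100+ languages) |
| Gemini Pro/Flash | Excellent (100+ languages) |
| Grok | Good (major languages) |
Intent detection still works because:
- Keyword patterns include common non-English equivalents
- Code intent is detected by file extensions and code blocks (language-agnostic)
- Complexity is estimated by query length (works across languages)
Edge case: If the intent is unclear because of the language, default to GENERAL intent with MEDIUM complexity.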
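Complexity Signals
Simple Complexity ($)
- Short query (<50 words)
- Single question mark
- "Quick question", "Just tell me", "Briefly"
- Yes/no format
- Unit conversions, definitions
Medium Complexity ($$)
- Moderate query (50-200 words)
- Multiple aspects to address
- "Explain", "Describe", "Compare"
- Some context provided
Complex Complexity ($$$)
- Long query (>200 words) or complex task
- "Step by step", "Thoroughly", "In detail"
- Multi-part questions
- Critical/important qualifier
- Research, analysis, or creative work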
Routing Matrix
| Intent | Simple | Medium | Complex |
|---|---|---|---|
| CODE | Sonnet | Opus | Opus |
| ANALYSIS | Flash | GPT-5 | Opus |
| CREATIVE | Sonnet | Opus | Opus |
| REALTIME | Grok | Grok | Grok-3 |
| GENERAL | Flash | Sonnet | Opus |
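Token Exhaustion & Automatic Model Switching
When a model becomes unavailable mid-session (token quota exhausted, rate limit hit, API error), the router automatically switches to the next best available model and notifies the user.
Notification Format
When a model switch occurs due to exhaustion, the user receives a notification: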
┌─────────────────────────────────────────────────────────────────┐
│ ⚠️ MODEL SWITCH NOTICE                                          │
│                                                                 │
│ Your request could not be completed on claude-opus-4-5          │
│ (reason: token quota exhausted).                                │
│                                                                 │
│ ✅ Completed using: anthropic/claude-sonnet-4-5                 │
│                                                                 │
│ The response below was generated by the fallback model.         │
└─────────────────────────────────────────────────────────────────┘
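Switch Reasons
| Reason | Description |
|---|---|
| token quota exhausted | Daily/monthly token limit reached |
| rate limit exceeded | Too many requests per minute |
| context window exceeded | Input too large for model |
| API timeout | Model took too long to respond |
| API error | Provider returned an error |
| model unavailable | Model temporarily offline |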
Implementation
python
def execute_with_fallback(primary_model: str, fallback_chain: list[str], request: str) -> Response:
    """Execute a request with automatic fallback and user notification."""
    attempted_models = []
    switch_reason = None
    # Try the primary model first, then each fallback in order
    models_to_try = [primary_model] + fallback_chain
    for model in models_to_try:
        try:
            response = call_model(model, request)
            # If we switched models, prepend the notification
            if attempted_models:
                notification = build_switch_notification(
                    failed_model=attempted_models[0],
                    reason=switch_reason,
                    success_model=model
                )
                return Response(
                    content=notification + "\n\n---\n\n" + response.content,
                    model_used=model,
                    switched=True
                )
            return Response(content=response.content, model_used=model, switched=False)
        except TokenQuotaExhausted:
            attempted_models.append(model)
            switch_reason = "token quota exhausted"
            log_fallback(model, switch_reason)
            continue
        except RateLimitExceeded:
            attempted_models.append(model)
            switch_reason = "rate limit exceeded"
            log_fallback(model, switch_reason)
            continue
        except ContextWindowExceeded:
attempted_