Inception Token Optimizer
Reduce Inception API token consumption through prompt engineering, context management, and budget enforcement.
Free-Tier Limits (Inception Labs)
| Metric | Cap |
|---|
| Requests/min | 100 |
| Input tokens/min |
100,000 |
| Output tokens/min | 10,000 |
Core Strategies
1. Prompt Compression
- - Remove redundant instructions, filler words, and repeated context.
- Use short system prompts: "Concise answers. French." beats a 200-word persona block.
- Avoid re-sending unchanged context — only send deltas.
- Ask for short replies: "Réponds en < 100 mots."
2. Context Pruning
- - Before sending, estimate tokens:
len(text) // 4 (rough heuristic). - If total context > target budget, drop oldest messages and replace with a 1-2 sentence summary.
- Use
references/pruning-strategies.md for detailed patterns.
3. Caching
- - Identical prompts → reuse previous response. Do not re-call.
- Hash the prompt; if seen recently (within session), return cached reply.
- INLINECODE2 provides a drop-in LRU cache (256 items default).
4. Model Selection
- - Use cheaper/faster models for simple tasks (summarisation, classification).
- Reserve Mercury (or flagship) for complex reasoning only.
- Batch trivial queries into a single prompt instead of multiple calls.
5. Output Budgeting
- - Set
max_tokens explicitly — never leave it open-ended. - Target 150-200 output tokens for conversational replies.
- Use
temperature=0.7 to reduce verbose wandering.
Token Budget Guard
INLINECODE5 enforces per-minute caps using a sliding window:
CODEBLOCK0
Blocks until a slot is available. Use before every Inception API call.
When to Use This Skill
- - Before sending a prompt to Inception → compress & prune first.
- When monitoring costs → check token estimates.
- When near free-tier limits → activate budget guard.
- When building automation → integrate caching + bucket guard.
Inception Token Optimizer
通过提示工程、上下文管理和预算控制,降低Inception API的令牌消耗。
免费层级限制(Inception Labs)
100,000 |
| 输出令牌数/分钟 | 10,000 |
核心策略
1. 提示压缩
- - 移除冗余指令、填充词和重复上下文。
- 使用简短系统提示:简洁回答。法语。优于200字的人物设定模块。
- 避免重复发送未变化的上下文——仅发送差异部分。
- 要求简短回复:用少于100词回答。
2. 上下文修剪
- - 发送前估算令牌数:len(text) // 4(粗略估算)。
- 若总上下文超出目标预算,丢弃最早的消息,并用1-2句话的摘要替代。
- 详细模式请参考 references/pruning-strategies.md。
3. 缓存
- - 相同提示 → 复用先前响应。不重复调用。
- 对提示进行哈希处理;若近期(会话内)出现过,返回缓存回复。
- scripts/lru_cache.py 提供即插即用的LRU缓存(默认256项)。
4. 模型选择
- - 对简单任务(摘要、分类)使用更便宜/更快的模型。
- 仅将Mercury(或旗舰模型)保留给复杂推理。
- 将琐碎查询批量整合到单个提示中,而非多次调用。
5. 输出预算
- - 明确设置 max_tokens——绝不保持开放状态。
- 对话回复的目标输出令牌数为150-200。
- 使用 temperature=0.7 减少冗余发散。
令牌预算守卫
scripts/token_bucket.py 使用滑动窗口强制执行每分钟上限:
python
from scripts.token_bucket import TokenBucket
bucket = TokenBucket(reqpermin=100, intokpermin=100000, outtokpermin=10000)
bucket.waitforslot(intokens=500, outtokens=200)
继续执行API调用
在有可用槽位前保持阻塞。每次Inception API调用前使用。
何时使用此技能
- - 向Inception发送提示前 → 先压缩和修剪。
- 监控成本时 → 检查令牌估算值。
- 接近免费层级限制时 → 激活预算守卫。
- 构建自动化时 → 集成缓存和桶守卫。