Free Scaling
$0 test-time scaling infrastructure using the NVIDIA NIM free tier.
Three patterns, one API key:
CODEBLOCK0
Setup
1. Get a free API key at build.nvidia.com
2. `export NVIDIA_API_KEY=nvapi-...`
3. No pip install: stdlib only (Python 3.10+)
Core API
scale(question, context, k, answer_patterns) → CascadeResult
Classification via ensemble voting. Ask k models, majority wins.
CODEBLOCK1
Parameters:
- `question`: what to judge (should end with "Answer X or Y")
- `context`: material to evaluate (placed in the system message)
- `k`: number of models to query: 1, 3, 5, or "auto" (smart cascade)
- `answer_patterns`: expected answers (e.g. ["YES", "NO"])
- `models`: override model selection (list of aliases)
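The "majority wins" rule can be illustrated with a few lines of standard-library Python. This is a minimal sketch, not the library's implementation: the helper name `majority_vote` is hypothetical, and treating confidence as the winning answer's share of the votes is an assumption.

```python
from collections import Counter


def majority_vote(votes: list[str]) -> tuple[str, float]:
    """Return the most common vote and its share of all votes cast.

    Hypothetical helper: the confidence definition (vote share) is an assumption.
    """
    if not votes:
        raise ValueError("need at least one vote")
    counts = Counter(votes)
    answer, top = counts.most_common(1)[0]
    return answer, top / len(votes)


answer, confidence = majority_vote(["YES", "YES", "NO"])
print(answer, round(confidence, 2))
```

Under this definition a unanimous panel of three yields a confidence of 1.0, and a 2-to-1 split yields roughly 0.67.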
generate(question, context, k) → GenerateResult
Best-of-k generation with cross-evaluation. Round 1: k models generate. Round 2: k different models judge which is best.
CODEBLOCK2
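The second round can be pictured as a tally over the judges' picks. The sketch below assumes each judge returns the zero-based index of the candidate it prefers; `pick_winner` is a hypothetical name, and the library may encode judge votes differently.

```python
from collections import Counter


def pick_winner(candidates: list[str], judge_votes: list[int]) -> tuple[int, str]:
    """Pick the candidate most judges preferred.

    Assumption: judge_votes holds one zero-based candidate index per judge.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    # Ignore votes that do not point at a real candidate.
    valid = [v for v in judge_votes if 0 <= v < len(candidates)]
    if not valid:
        return 0, candidates[0]  # no usable votes: fall back to the first candidate
    index, _ = Counter(valid).most_common(1)[0]
    return index, candidates[index]


index, text = pick_winner(["summary A", "summary B", "summary C"], [2, 2, 2])
print(index, text)
```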
scale_batch(items, k) / generate_batch(items, k)
Parallel batch versions. Each item is a dict with question, context, answer_patterns.
CODEBLOCK3
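One way to run a batch in parallel while keeping a single bad item from aborting the rest is to wrap each call in its own try/except. This is a sketch under assumptions: `run_batch` and the result-record shape are hypothetical and are not taken from the library.

```python
from concurrent.futures import ThreadPoolExecutor


def run_batch(items, worker, max_workers=8):
    """Apply worker to every item in parallel.

    A failure is captured as a per-item error record instead of being raised.
    """
    def safe(item):
        try:
            return {"ok": True, "result": worker(item)}
        except Exception as exc:  # isolate the failure to this one item
            return {"ok": False, "error": repr(exc)}

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of the input items.
        return list(pool.map(safe, items))


# The middle item fails; the other two still produce results.
results = run_batch([1, 0, 2], lambda x: 10 // x)
print([r["ok"] for r in results])
```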
health(models=None) → dict
Probe models. Returns status per model (ok/dead/slow/error + latency).
CODEBLOCK4
Dead models are auto-skipped in subsequent calls and retried after 5 minutes.
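The skip-then-retry behavior can be sketched as a small tracker with a time-to-live. The class name `DeadModelTracker` and its methods are hypothetical; only the 5-minute retry window comes from the text above.

```python
import time


class DeadModelTracker:
    """Remember when a model was marked dead and allow a retry after ttl_s seconds."""

    def __init__(self, ttl_s: float = 300.0, clock=time.monotonic):
        self.ttl_s = ttl_s
        self._clock = clock  # injectable so the TTL can be tested without sleeping
        self._dead_since: dict[str, float] = {}

    def mark_dead(self, model: str) -> None:
        self._dead_since[model] = self._clock()

    def is_available(self, model: str) -> bool:
        since = self._dead_since.get(model)
        if since is None:
            return True
        if self._clock() - since >= self.ttl_s:
            del self._dead_since[model]  # TTL expired: allow a retry
            return True
        return False


tracker = DeadModelTracker()
tracker.mark_dead("example-model")
print(tracker.is_available("example-model"))
```

Injecting the clock is a design choice for testability; a real tracker would probably also need a lock if it is shared across threads.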
Online Learning (v3.3)
Models self-select through deployment data. No manual benchmarking needed.
CODEBLOCK5
How it works:
- Consensus: models that agree with majority get +ELO (K=16)
- Override: user feedback is 4× stronger (K=64)
- Shadow challenger: 1 extra model per call for free A/B data
- Evolution: top-3 by ELO become champion panel (requires 30+ calls/model)
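The rules above can be sketched with the standard Elo update. The text does not say how consensus is turned into matches, so this sketch assumes that every model in the majority "beats" every model in the minority. All function names here are hypothetical.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of A against B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_pair(ratings: dict, winner: str, loser: str, k: float = 16.0) -> None:
    """Move k * (1 - expected) rating points from loser to winner."""
    delta = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
    ratings[winner] += delta
    ratings[loser] -= delta


def record_consensus(ratings: dict, votes: dict, majority: str, k: float = 16.0) -> None:
    """Assumption: each agreeing model wins one match against each dissenting model."""
    agree = [m for m, v in votes.items() if v == majority]
    dissent = [m for m, v in votes.items() if v != majority]
    for winner in agree:
        for loser in dissent:
            update_pair(ratings, winner, loser, k)


def champion_panel(ratings: dict, calls: dict, size: int = 3, min_calls: int = 30) -> list:
    """Top models by rating among those with at least min_calls recorded calls."""
    eligible = [m for m in ratings if calls.get(m, 0) >= min_calls]
    return sorted(eligible, key=lambda m: ratings[m], reverse=True)[:size]


ratings = {"a": 1500.0, "b": 1500.0, "c": 1500.0}
record_consensus(ratings, {"a": "YES", "b": "YES", "c": "NO"}, majority="YES")
print(ratings)
```

Passing `k=64` to `update_pair` gives the 4× stronger user-feedback update: between two equally rated models it moves 32 points instead of 8.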
Smart Features
- Online learning: ELO-based model scoring from deployment data (see above)
- A/B testing: shadow challengers run alongside panel for competitive signal
- Auto-heal: 404/410 models get marked dead, substituted with same-tier alternatives, retried after 5min TTL
- Context routing: `context` goes in system message, `question` stays in user message
- Parallel short-circuit: submits all k models in parallel, cancels remaining when first 2 agree
- Task classification: `k="auto"` classifies the question type and routes to the best expert
- Copilot integration: `cp-*` aliases route automatically through GitHub Copilot API
- User feedback loop: Discord reaction → ELO update (👍 confirm, 🅰️🅱️ A/B, 🔴🟡⚪ override)
- Error isolation: batch functions catch per-item failures without killing the batch
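The parallel short-circuit item in the list above can be sketched with `concurrent.futures`. The function name and the `ask` callback are hypothetical stand-ins for a real model call. Note that Python can only cancel futures that have not started yet, so calls already in flight run to completion in the background.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor, as_completed


def vote_with_short_circuit(models, ask, agree_needed=2):
    """Query all models in parallel and stop once agree_needed votes match.

    Returns (answer, votes_counted); answer is None if every call failed.
    """
    counts = Counter()
    pool = ThreadPoolExecutor(max_workers=max(1, len(models)))
    try:
        futures = [pool.submit(ask, model) for model in models]
        for future in as_completed(futures):
            try:
                answer = future.result()
            except Exception:
                continue  # a failed model simply casts no vote
            counts[answer] += 1
            if counts[answer] >= agree_needed:
                return answer, sum(counts.values())
        if not counts:
            return None, 0
        return counts.most_common(1)[0][0], sum(counts.values())
    finally:
        # Do not wait for stragglers; drop any calls that have not started.
        pool.shutdown(wait=False, cancel_futures=True)


canned = {"a": "YES", "b": "YES", "c": "NO"}
answer, counted = vote_with_short_circuit(list(canned), canned.__getitem__)
print(answer)
```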
13 Models Included
| Tier | Models | Latency |
|---|---|---|
| Fast | llama-3.3 70B, gemma-27b, nemotron-super-49b, dracarys-70b, jamba-mini | <1s |
| Medium | mistral-large 675B, kimi-k2, qwen-397b, llama-405b, mistral-medium | 1-3s |
| Thinking | deepseek-v3.1, minimax-m2.5 🧠, kimi-k2.5 🧠 | 3s+ |
All free via NVIDIA NIM. One API key covers everything.
CLI
CODEBLOCK6
Capability Profiling (optional)
Profile models on your tasks for data-driven routing:
CODEBLOCK7
Generates capability_map.json — the cascade loads it automatically.
Architecture
CODEBLOCK8
Requirements
- `NVIDIA_API_KEY` environment variable (free at build.nvidia.com)
- Python 3.10+ (stdlib only, no pip dependencies)
- Optional: GitHub Copilot token for `cp-*` model aliases
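Free Scaling
$0 test-time scaling infrastructure using the NVIDIA NIM free tier.
Three patterns, one API key:
```python
from free_scaling import scale, generate, health

# Classify: vote on labels
result = scale("Is this safe?", context=code, k=3,
               answer_patterns=["SAFE", "VULNERABLE"])

# Generate: best-of-k with cross-evaluation
result = generate("Summarize this paper.", context=paper, k=3)

# Verify: just call scale() with source + output as context
check = scale("Are there any hallucinated claims?",
              context=f"Source:\n{src}\n\nOutput:\n{draft}",
              k=3, answer_patterns=["YES", "NO"])
```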
Setup
1. Get a free API key at build.nvidia.com
2. `export NVIDIA_API_KEY=nvapi-...`
3. No pip install: stdlib only (Python 3.10+)
Core API
scale(question, context, k, answer_patterns) → CascadeResult
Classification via ensemble voting. Ask k models; the majority wins.
```python
result = scale(
    "Is this email urgent? Answer URGENT, NORMAL, or IGNORE.",
    context=email_body,
    k=3,
    answer_patterns=["URGENT", "NORMAL", "IGNORE"]
)
result.answer       # "NORMAL"
result.confidence   # 1.0
result.calls_made   # 3
result.elapsed_s    # 1.8
```
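Parameters:
- `question`: what to judge (should end with "Answer X or Y")
- `context`: material to evaluate (placed in the system message)
- `k`: number of models to query: 1, 3, 5, or "auto" (smart cascade)
- `answer_patterns`: expected answers (e.g. ["YES", "NO"])
- `models`: override model selection (list of aliases)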
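generate(question, context, k) → GenerateResult
Best-of-k generation with cross-evaluation. Round 1: k models generate. Round 2: k different models judge which output is best.
```python
result = generate(
    "Summarize this email in two sentences.",
    context=email_text,
    k=3,
    max_tokens=200,
)
result.output        # the winning summary
result.all_outputs   # all 3 summaries
result.winner_model  # "llama-3.3"
result.judge_votes   # [2, 2, 2]
result.total_calls   # 6 (3 generations + 3 judgments)
```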
scale_batch(items, k) / generate_batch(items, k)
Parallel batch versions. Each item is a dict with question, context, and answer_patterns.
```python
results = scale_batch([
    {"question": "Urgent?", "context": e, "answer_patterns": ["YES", "NO"]}
    for e in emails
], k=3)
```
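health(models=None) → dict
Probe models. Returns status per model (ok/dead/slow/error + latency).
```python
status = health()                                     # all models
status = health(models=["llama-3.3", "gemma-27b"])    # specific models
```
Dead models are auto-skipped in subsequent calls and retried after 5 minutes.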
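Online Learning (v3.3)
Models self-select through deployment data. No manual benchmarking needed.
```python
from free_scaling import elo, feedback
from free_scaling.evolve import evolve, report

# Every scale() call automatically:
#   1. Records votes to the ELO tracker
#   2. Runs 1 shadow challenger for A/B data
#   3. Logs the result for user feedback resolution

# View current rankings
print(elo.summary())

# User feedback (4× stronger than the consensus signal)
feedback.resolve_by_reaction("discord-msg-id", "👍")   # confirm
feedback.resolve_by_reaction("discord-msg-id", "🅱️")   # panel B wins
feedback.resolve_by_reaction("discord-msg-id", "🔴")   # override to URGENT

# Weekly panel evolution
result = evolve(dry_run=True)    # check whether the panel should change
result = evolve(dry_run=False)   # apply the change
```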
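How it works:
- Consensus: models that agree with the majority get +ELO (K=16)
- Override: user feedback is 4× stronger (K=64)
- Shadow challenger: 1 extra model per call for free A/B data
- Evolution: top-3 by ELO become the champion panel (requires 30+ calls per model)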
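Smart Features
- Online learning: ELO-based model scoring from deployment data (see above)
- A/B testing: shadow challengers run alongside the panel for competitive signal
- Auto-heal: 404/410 models get marked dead, substituted with same-tier alternatives, and retried after a 5-minute TTL
- Context routing: `context` goes in the system message, `question` stays in the user message
- Parallel short-circuit: submits all k models in parallel, cancels the remaining calls when the first 2 agree
- Task classification: `k="auto"` classifies the question type and routes to the best expert
- Copilot integration: `cp-*` aliases route automatically through the GitHub Copilot API
- User feedback loop: Discord reaction → ELO update (👍 confirm, 🅰️🅱️ A/B, 🔴🟡⚪ override)
- Error isolation: batch functions catch per-item failures without killing the batch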
13 Models Included
| Tier | Models | Latency |
|---|---|---|
| Fast | llama-3.3 70B, gemma-27b, nemotron-super-49b, dracarys-70b, jamba-mini | <1s |
| Medium | mistral-large 675B, kimi-k2, qwen-397b, llama-405b, mistral-medium | 1-3s |
| Thinking | deepseek-v3.1, minimax-m2.5 🧠, kimi-k2.5 🧠 | 3s+ |

All free via NVIDIA NIM. One API key covers everything.
CLI
```bash
python3 -m nim_ensemble.cli scale "Is this safe?" -k 3 --answers SAFE,VULNERABLE
python3 -m nim_ensemble.cli models    # list available models
python3 -m nim_ensemble.cli panels    # list panels
```
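Capability Profiling (optional)
Profile models on your tasks for data-driven routing:
```bash
python3 -m nim_ensemble.capability_map --models llama-3.3 gemma-27b mistral-large --trials 3
```
Generates capability_map.json, which the cascade loads automatically.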
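Architecture
```
nim_ensemble/
├── __init__.py        # Exports: scale, generate, health, scale_batch, generate_batch
├── cascade.py         # scale(), scale_batch(), smart cascade
├── generate.py        # generate(), generate_batch(), best-of-k
├── voter.py           # Core voting engine, NIM + Copilot backends
├── health.py          # Model probing, dead-model tracking, substitution
├── models.py          # Model registry, panels
├── parser.py          # Answer extraction (thinking models, negation, word boundaries)
├── elo.py             # Online ELO scoring, model rankings
├── feedback.py        # User feedback loop (reaction → ELO update)
├── evolve.py          # Weekly panel evolution (promote/demote by ELO)
├── cli.py             # CLI interface
├── benchmark.py       # Single-trial profiling
└── capability_map.py  # Multi-trial profiling with error correlation
```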
Requirements
- `NVIDIA_API_KEY` environment variable (free at build.nvidia.com)
- Python 3.10+ (stdlib only, no pip dependencies)
- Optional: GitHub Copilot token for `cp-*` model aliases