Distributed Inference
A coordination layer for distributed inference across heterogeneous machines. Each node is autonomous — it runs its own Ollama, manages its own models, and works fine standalone. The distributed inference coordinator routes requests to the optimal node using a multi-signal distributed inference scoring function and records every distributed inference decision for analysis.
Install Distributed Inference
CODEBLOCK0
Package: ollama-herd | Repo: github.com/geeks-accelerator/ollama-herd
Distributed Inference Architecture
CODEBLOCK1
Distributed inference nodes discover the coordinator via mDNS (_fleet-manager._tcp.local.) or connect explicitly with --router-url. Each distributed inference node sends heartbeats every 5 seconds containing: CPU utilization, memory usage and pressure classification, disk metrics, loaded models with context lengths, available models, and an optional capacity score from the behavioral model.
Distributed Inference Scoring Function
The distributed inference coordinator evaluates every online node for every request using 7 weighted signals:
| Distributed Inference Signal | Max Weight | What it measures |
|---|
| Thermal state | +50 | Is the model already loaded in GPU memory? Hot (+50), warm (+30), cold (+10) |
| Memory fit |
+20 | Available distributed inference memory headroom relative to model size |
| Queue depth | -30 | Pending + in-flight distributed inference requests on this node:model pair |
| Wait time | -25 | Estimated distributed inference wait based on p75 historical latency × queue depth |
| Role affinity | +15 | Large models prefer high-memory distributed inference nodes |
| Availability trend | +10 | Capacity learner's prediction of distributed inference node availability |
| Context fit | +15 | Does the loaded model's context window fit the estimated distributed inference token count? |
Distributed inference nodes with insufficient memory, critical pressure, or missing models are eliminated before scoring. The highest-scoring distributed inference node wins.
Adaptive Distributed Inference Capacity
Distributed inference nodes optionally learn usage patterns and constrain their availability:
- - 168-slot behavioral model — one slot per hour of the week, learns when the distributed inference machine is typically free
- Dynamic memory ceiling — maps availability score to how much RAM the distributed inference coordinator can use
Enable with FLEET_NODE_ENABLE_CAPACITY_LEARNING=true on the distributed inference node agent.
Context-aware Distributed Inference Model Placement
The distributed inference coordinator protects against a known Ollama behavior where changing num_ctx at runtime triggers a full model reload. For an 89GB model, this causes multi-minute hangs.
- -
num_ctx ≤ loaded context → stripped from the distributed inference request - INLINECODE6 > loaded context → searches loaded models across all distributed inference nodes for sufficient context
- Configurable: INLINECODE7
Distributed Inference API
Distributed Inference Coordinator State
CODEBLOCK2
Distributed Inference (OpenAI-compatible)
CODEBLOCK3
Distributed Inference (Ollama-native)
CODEBLOCK4
Distributed Inference Model Fallback Chains
CODEBLOCK5
Distributed Inference Trace Analysis
CODEBLOCK6
Distributed Inference Node Performance
CODEBLOCK7
Distributed Inference Health and Capacity
CODEBLOCK8
Distributed Inference Model Lifecycle
CODEBLOCK9
Distributed Inference Fault Tolerance
| Mechanism | Distributed Inference Behavior |
|---|
| Auto-retry | If a distributed inference node fails before the first chunk, re-score and retry on next-best node |
| Holding queue |
When all distributed inference nodes are saturated, requests queue for up to 30 seconds |
| Zombie reaper | Background task reclaims stuck distributed inference in-flight slots |
| VRAM fallback | Routes to a loaded model in the same category rather than cold-loading |
| Auto-pull | Pulls missing models onto the distributed inference node with the most available memory |
| Graceful drain | SIGTERM triggers drain: in-flight distributed inference requests finish, pending redistribute |
Distributed Inference Data Model
All distributed inference state is in SQLite at ~/.fleet-manager/latency.db:
CODEBLOCK10
Structured distributed inference logs at ~/.fleet-manager/logs/herd.jsonl — daily rotation, 30-day retention.
Distributed Inference Dashboard
INLINECODE10 — eight tabs covering distributed inference fleet overview, trends, model insights, per-app analytics, benchmarks, health checks, model recommendations, and settings.
Distributed Inference Constraints
- - Never restart distributed inference services or modify
~/.fleet-manager/ without explicit user confirmation. - Distributed inference model pull/delete operations require user confirmation (10-100+ GB transfers).
- If the distributed inference coordinator is unreachable, suggest
herd or uv run herd. - If no distributed inference nodes are online, suggest
herd-node on target machines.
分布式推理
跨异构机器的分布式推理协调层。每个节点都是自治的——它运行自己的Ollama,管理自己的模型,并且可以独立正常工作。分布式推理协调器使用多信号分布式推理评分函数将请求路由到最优节点,并记录每个分布式推理决策以供分析。
安装分布式推理
bash
pip install ollama-herd
herd # 启动分布式推理协调器
herd-node # 在每个节点上启动分布式推理代理
包:ollama-herd | 仓库:github.com/geeks-accelerator/ollama-herd
分布式推理架构
分布式推理协调器 (:11435) 节点代理
┌──────────────────────┐ ┌──────────────────┐
│ 分布式评分 │◄────│ 心跳 + 指标 │ (mDNS 或显式 URL)
│ 推理队列管理器 │ │ 容量学习器 │
│ 流式代理 │ └──────────────────┘
│ 追踪存储 │ ┌──────────────────┐
│ 延迟存储 │ │ 心跳 + 指标 │ (N 个节点)
└──────────────────────┘ └──────────────────┘
│
▼
Ollama 实例 (每个分布式推理节点一个)
分布式推理节点通过 mDNS (fleet-manager.tcp.local.) 发现协调器,或使用 --router-url 显式连接。每个分布式推理节点每5秒发送一次心跳,包含:CPU利用率、内存使用和压力分类、磁盘指标、已加载模型及其上下文长度、可用模型,以及来自行为模型的可选容量评分。
分布式推理评分函数
分布式推理协调器对每个请求使用7个加权信号评估每个在线节点:
| 分布式推理信号 | 最大权重 | 测量内容 |
|---|
| 热状态 | +50 | 模型是否已加载到GPU内存?热(+50)、温(+30)、冷(+10) |
| 内存适配 |
+20 | 相对于模型大小的可用分布式推理内存余量 |
| 队列深度 | -30 | 此节点:模型对上的待处理+进行中分布式推理请求数 |
| 等待时间 | -25 | 基于p75历史延迟×队列深度估算的分布式推理等待时间 |
| 角色亲和性 | +15 | 大模型偏好高内存分布式推理节点 |
| 可用性趋势 | +10 | 容量学习器对分布式推理节点可用性的预测 |
| 上下文适配 | +15 | 已加载模型的上下文窗口是否适配估算的分布式推理token数量? |
内存不足、压力临界或缺少模型的分布式推理节点在评分前被排除。得分最高的分布式推理节点胜出。
自适应分布式推理容量
分布式推理节点可选地学习使用模式并约束其可用性:
- - 168槽位行为模型 — 每周每小时一个槽位,学习分布式推理机器通常空闲的时间
- 动态内存上限 — 将可用性评分映射到分布式推理协调器可使用的RAM量
在分布式推理节点代理上启用 FLEETNODEENABLECAPACITYLEARNING=true。
上下文感知的分布式推理模型放置
分布式推理协调器防范一个已知的Ollama行为:在运行时更改 num_ctx 会触发完整的模型重新加载。对于89GB的模型,这会导致数分钟的卡顿。
- - numctx ≤ 已加载上下文 → 从分布式推理请求中移除
- numctx > 已加载上下文 → 在所有分布式推理节点上搜索具有足够上下文的已加载模型
- 可配置:FLEETCONTEXTPROTECTION=strip|warn|passthrough
分布式推理API
分布式推理协调器状态
bash
distributedinferencefleet_state — 完整分布式推理拓扑
curl -s http://localhost:11435/fleet/status | python3 -m json.tool
distributedinferencemodels — 所有分布式推理节点上的模型
curl -s http://localhost:11435/api/tags | python3 -m json.tool
distributedinferencehot_models — GPU内存中的模型
curl -s http://localhost:11435/api/ps | python3 -m json.tool
分布式推理 (兼容OpenAI)
bash
distributedinferencechat — 通过分布式推理评分路由
curl -s http://localhost:11435/v1/chat/completions \
-H Content-Type: application/json \
-d {model:llama3.3:70b,messages:[{role:user,content:通过分布式推理打个招呼}]}
分布式推理 (Ollama原生)
bash
curl -s http://localhost:11435/api/chat \
-d {model:llama3.3:70b,messages:[{role:user,content:通过分布式推理打个招呼}]}
分布式推理模型回退链
bash
curl -s http://localhost:11435/v1/chat/completions \
-H Content-Type: application/json \
-d {model:llama3.3:70b,fallback_models:[qwen2.5:32b,qwen2.5:7b],messages:[{role:user,content:使用分布式推理回退打个招呼}]}
分布式推理追踪分析
bash
distributedinferencetraces — 最近的路由决策
curl -s http://localhost:11435/dashboard/api/traces?limit=20 | python3 -m json.tool
distributedinferencescore_breakdown
sqlite3 ~/.fleet-manager/latency.db SELECT request
id, model, nodeid, score, scores
breakdown FROM requesttraces ORDER BY timestamp DESC LIMIT 1
分布式推理节点性能
bash
sqlite3 ~/.fleet-manager/latency.db SELECT node
id, model, COUNT() as n, ROUND(AVG(latencyms)/1000.0, 1) as avgs, ROUND(AVG(COALESCE(completiontokens,0) 1000.0 / NULLIF(latency
ms,0)), 1) as tokper
s FROM requesttraces WHERE status=completed GROUP BY node
id, model HAVING n > 10 ORDER BY tokper_s DESC
分布式推理健康与容量
bash
curl -s http://localhost:11435/dashboard/api/health | python3 -m json.tool
curl -s http://localhost:11435/dashboard/api/recommendations | python3 -m json.tool
curl -s http://localhost:11435/dashboard/api/usage | python3 -m json.tool
分布式推理模型生命周期
bash
distributedinferencemodel_inventory
curl -s http://localhost:11435/dashboard/api/model-management | python3 -m json.tool
拉取模型到分布式推理节点
curl -s -X POST http://localhost:11435/dashboard/api/pull \
-H Content-Type: application/json \
-d {model: llama3.3:70b, node_id: mac-studio}
从分布式推理节点移除模型
curl -s -X POST http://localhost:11435/dashboard/api/delete \
-H Content-Type: application/json \
-d {model: old-model:7b, node_id: mac-studio}
分布式推理容错
| 机制 | 分布式推理行为 |
|---|
| 自动重试 | 如果分布式推理节点在第一个数据块之前失败,重新评分并在次优节点上重试 |
| 保持队列 |
当所有分布式推理节点饱和时,请求排队最多30秒 |
| 僵尸回收器 | 后台任务回收卡住的分布式推理进行中槽位 |
| VRAM回退 | 路由到同一类别中已加载的模型,而不是冷加载 |
| 自动拉取 | 将缺失的模型拉取到可用内存最多的分布式推理节点上 |
| 优雅排空 | SIGTERM触发排空:进行中的分布式推理请求完成,待处理的重新分配 |
分布式推理数据模型
所有分布式推理状态存储在SQLite中,位于 ~/.fleet-manager/latency.db:
sql
-- 分布式推理请求追踪(每个路由决策)
SELECT * FROM request_traces LIMIT 1;
结构化分布式推理日志位于 ~/.fleet-manager/logs/herd.jsonl — 每日轮转,保留30天。
分布式推理仪表板
http://localhost:11435/dashboard — 八个标签页,涵盖分布式推理集群概览、趋势、模型洞察、按应用分析、