Distributed Inference

A coordination layer for distributed inference across heterogeneous machines. Each node is autonomous — it runs its own Ollama, manages its own models, and works fine standalone. The distributed inference coordinator routes requests to the optimal node using a multi-signal distributed inference scoring function and records every distributed inference decision for analysis.

Install Distributed Inference

CODEBLOCK0

Package: ollama-herd | Repo: github.com/geeks-accelerator/ollama-herd

Distributed Inference Architecture

CODEBLOCK1

Distributed inference nodes discover the coordinator via mDNS (_fleet-manager._tcp.local.) or connect explicitly with --router-url. Each distributed inference node sends heartbeats every 5 seconds containing: CPU utilization, memory usage and pressure classification, disk metrics, loaded models with context lengths, available models, and an optional capacity score from the behavioral model.

Distributed Inference Scoring Function

The distributed inference coordinator evaluates every online node for every request using 7 weighted signals:

Distributed Inference Signal	Max Weight	What it measures
Thermal state	+50	Is the model already loaded in GPU memory? Hot (+50), warm (+30), cold (+10)
Memory fit

Distributed inference nodes with insufficient memory, critical pressure, or missing models are eliminated before scoring. The highest-scoring distributed inference node wins.

Adaptive Distributed Inference Capacity

Distributed inference nodes optionally learn usage patterns and constrain their availability:

- 168-slot behavioral model — one slot per hour of the week, learns when the distributed inference machine is typically free
Dynamic memory ceiling — maps availability score to how much RAM the distributed inference coordinator can use

Enable with FLEET_NODE_ENABLE_CAPACITY_LEARNING=true on the distributed inference node agent.

Context-aware Distributed Inference Model Placement

The distributed inference coordinator protects against a known Ollama behavior where changing num_ctx at runtime triggers a full model reload. For an 89GB model, this causes multi-minute hangs.

- num_ctx ≤ loaded context → stripped from the distributed inference request
INLINECODE6 > loaded context → searches loaded models across all distributed inference nodes for sufficient context
Configurable: INLINECODE7

Distributed Inference API

Distributed Inference Coordinator State

CODEBLOCK2

Distributed Inference (OpenAI-compatible)

CODEBLOCK3

Distributed Inference (Ollama-native)

CODEBLOCK4

Distributed Inference Model Fallback Chains

CODEBLOCK5

Distributed Inference Trace Analysis

CODEBLOCK6

Distributed Inference Node Performance

CODEBLOCK7

Distributed Inference Health and Capacity

CODEBLOCK8

Distributed Inference Model Lifecycle

CODEBLOCK9

Distributed Inference Fault Tolerance

Mechanism	Distributed Inference Behavior
Auto-retry	If a distributed inference node fails before the first chunk, re-score and retry on next-best node
Holding queue

When all distributed inference nodes are saturated, requests queue for up to 30 seconds | | Zombie reaper | Background task reclaims stuck distributed inference in-flight slots | | VRAM fallback | Routes to a loaded model in the same category rather than cold-loading | | Auto-pull | Pulls missing models onto the distributed inference node with the most available memory | | Graceful drain | SIGTERM triggers drain: in-flight distributed inference requests finish, pending redistribute |

Distributed Inference Data Model

All distributed inference state is in SQLite at ~/.fleet-manager/latency.db:

CODEBLOCK10

Structured distributed inference logs at ~/.fleet-manager/logs/herd.jsonl — daily rotation, 30-day retention.

Distributed Inference Dashboard

INLINECODE10 — eight tabs covering distributed inference fleet overview, trends, model insights, per-app analytics, benchmarks, health checks, model recommendations, and settings.

Distributed Inference Constraints

- Never restart distributed inference services or modify ~/.fleet-manager/ without explicit user confirmation.
Distributed inference model pull/delete operations require user confirmation (10-100+ GB transfers).
If the distributed inference coordinator is unreachable, suggest herd or uv run herd.
If no distributed inference nodes are online, suggest herd-node on target machines.

分布式推理

跨异构机器的分布式推理协调层。每个节点都是自治的——它运行自己的Ollama，管理自己的模型，并且可以独立正常工作。分布式推理协调器使用多信号分布式推理评分函数将请求路由到最优节点，并记录每个分布式推理决策以供分析。

安装分布式推理

bash
pip install ollama-herd
herd # 启动分布式推理协调器
herd-node # 在每个节点上启动分布式推理代理

包：ollama-herd | 仓库：github.com/geeks-accelerator/ollama-herd

分布式推理架构

分布式推理协调器 (:11435) 节点代理
┌──────────────────────┐ ┌──────────────────┐
│ 分布式评分 │◄────│ 心跳 + 指标 │ (mDNS 或显式 URL)
│ 推理队列管理器 │ │ 容量学习器 │
│ 流式代理 │ └──────────────────┘
│ 追踪存储 │ ┌──────────────────┐
│ 延迟存储 │ │ 心跳 + 指标 │ (N 个节点)
└──────────────────────┘ └──────────────────┘
│
▼
Ollama 实例 (每个分布式推理节点一个)

分布式推理节点通过 mDNS (fleet-manager.tcp.local.) 发现协调器，或使用 --router-url 显式连接。每个分布式推理节点每5秒发送一次心跳，包含：CPU利用率、内存使用和压力分类、磁盘指标、已加载模型及其上下文长度、可用模型，以及来自行为模型的可选容量评分。

分布式推理评分函数

分布式推理协调器对每个请求使用7个加权信号评估每个在线节点：

分布式推理信号	最大权重	测量内容
热状态	+50	模型是否已加载到GPU内存？热(+50)、温(+30)、冷(+10)
内存适配

内存不足、压力临界或缺少模型的分布式推理节点在评分前被排除。得分最高的分布式推理节点胜出。

自适应分布式推理容量

分布式推理节点可选地学习使用模式并约束其可用性：

- 168槽位行为模型 — 每周每小时一个槽位，学习分布式推理机器通常空闲的时间
动态内存上限 — 将可用性评分映射到分布式推理协调器可使用的RAM量

在分布式推理节点代理上启用 FLEETNODEENABLECAPACITYLEARNING=true。

上下文感知的分布式推理模型放置

分布式推理协调器防范一个已知的Ollama行为：在运行时更改 num_ctx 会触发完整的模型重新加载。对于89GB的模型，这会导致数分钟的卡顿。

- numctx ≤ 已加载上下文 → 从分布式推理请求中移除
numctx > 已加载上下文 → 在所有分布式推理节点上搜索具有足够上下文的已加载模型
可配置：FLEETCONTEXTPROTECTION=strip|warn|passthrough

分布式推理API

分布式推理协调器状态

bash

distributedinferencefleet_state — 完整分布式推理拓扑

curl -s http://localhost:11435/fleet/status | python3 -m json.tool

distributedinferencemodels — 所有分布式推理节点上的模型

curl -s http://localhost:11435/api/tags | python3 -m json.tool

distributedinferencehot_models — GPU内存中的模型

curl -s http://localhost:11435/api/ps | python3 -m json.tool

分布式推理 (兼容OpenAI)

bash

distributedinferencechat — 通过分布式推理评分路由

curl -s http://localhost:11435/v1/chat/completions \ -H Content-Type: application/json \ -d {model:llama3.3:70b,messages:[{role:user,content:通过分布式推理打个招呼}]}

分布式推理 (Ollama原生)

bash curl -s http://localhost:11435/api/chat \ -d {model:llama3.3:70b,messages:[{role:user,content:通过分布式推理打个招呼}]}

分布式推理模型回退链

bash curl -s http://localhost:11435/v1/chat/completions \ -H Content-Type: application/json \ -d {model:llama3.3:70b,fallback_models:[qwen2.5:32b,qwen2.5:7b],messages:[{role:user,content:使用分布式推理回退打个招呼}]}

分布式推理追踪分析

bash

distributedinferencetraces — 最近的路由决策

curl -s http://localhost:11435/dashboard/api/traces?limit=20 | python3 -m json.tool

distributedinferencescore_breakdown

sqlite3 ~/.fleet-manager/latency.db SELECT requestid, model, nodeid, score, scoresbreakdown FROM requesttraces ORDER BY timestamp DESC LIMIT 1

分布式推理节点性能

bash sqlite3 ~/.fleet-manager/latency.db SELECT nodeid, model, COUNT() as n, ROUND(AVG(latencyms)/1000.0, 1) as avgs, ROUND(AVG(COALESCE(completiontokens,0) 1000.0 / NULLIF(latencyms,0)), 1) as tokpers FROM requesttraces WHERE status=completed GROUP BY nodeid, model HAVING n > 10 ORDER BY tokper_s DESC

分布式推理健康与容量

bash curl -s http://localhost:11435/dashboard/api/health | python3 -m json.tool curl -s http://localhost:11435/dashboard/api/recommendations | python3 -m json.tool curl -s http://localhost:11435/dashboard/api/usage | python3 -m json.tool

分布式推理模型生命周期

bash

distributedinferencemodel_inventory

curl -s http://localhost:11435/dashboard/api/model-management | python3 -m json.tool

拉取模型到分布式推理节点

curl -s -X POST http://localhost:11435/dashboard/api/pull \ -H Content-Type: application/json \ -d {model: llama3.3:70b, node_id: mac-studio}

从分布式推理节点移除模型

curl -s -X POST http://localhost:11435/dashboard/api/delete \ -H Content-Type: application/json \ -d {model: old-model:7b, node_id: mac-studio}

分布式推理容错

机制	分布式推理行为
自动重试	如果分布式推理节点在第一个数据块之前失败，重新评分并在次优节点上重试
保持队列

分布式推理数据模型

所有分布式推理状态存储在SQLite中，位于 ~/.fleet-manager/latency.db：

sql
-- 分布式推理请求追踪（每个路由决策）
SELECT * FROM request_traces LIMIT 1;

结构化分布式推理日志位于 ~/.fleet-manager/logs/herd.jsonl — 每日轮转，保留30天。

分布式推理仪表板

http://localhost:11435/dashboard — 八个标签页，涵盖分布式推理集群概览、趋势、模型洞察、按应用分析、

distributed-inference分布式推理

distributed-inference

Distributed Inference

Install Distributed Inference

Distributed Inference Architecture

Distributed Inference Scoring Function

Adaptive Distributed Inference Capacity

Context-aware Distributed Inference Model Placement

Distributed Inference API

Distributed Inference Coordinator State

Distributed Inference (OpenAI-compatible)

Distributed Inference (Ollama-native)

Distributed Inference Model Fallback Chains

Distributed Inference Trace Analysis

Distributed Inference Node Performance

Distributed Inference Health and Capacity

Distributed Inference Model Lifecycle

Distributed Inference Fault Tolerance

Distributed Inference Data Model

Distributed Inference Dashboard

Distributed Inference Constraints

分布式推理

安装分布式推理

分布式推理架构

分布式推理评分函数

自适应分布式推理容量

上下文感知的分布式推理模型放置

分布式推理API

分布式推理协调器状态

distributedinferencefleet_state — 完整分布式推理拓扑

distributedinferencemodels — 所有分布式推理节点上的模型

distributedinferencehot_models — GPU内存中的模型

分布式推理 (兼容OpenAI)

distributedinferencechat — 通过分布式推理评分路由

分布式推理 (Ollama原生)

分布式推理模型回退链

分布式推理追踪分析

distributedinferencetraces — 最近的路由决策

distributedinferencescore_breakdown

分布式推理节点性能

分布式推理健康与容量

分布式推理模型生命周期

distributedinferencemodel_inventory

拉取模型到分布式推理节点

从分布式推理节点移除模型

分布式推理容错

分布式推理数据模型

分布式推理仪表板

标签

通过对话安装

方式一：安装 SkillHub 和技能

方式二：设置 SkillHub 为优先技能安装源

通过命令行安装

下载

相关推荐

self-improvement

self-improvement

self-improvement

self-improvement