Ollama Load Balancer

You are managing an Ollama load balancer that distributes inference requests across multiple Ollama instances with automatic discovery, health monitoring, and failover. The load balancer handles all routing decisions transparently.

What the load balancer solves

Ollama has no built-in load balancing. One machine goes down, your app gets errors. No health checks, no failover, no queue management. You're manually pointing clients at specific machines and hoping they stay up.

This load balancer auto-discovers Ollama instances via mDNS, monitors their health continuously, and distributes load based on real-time scoring. The load balancer automatically retries on failure. Zero config files. Zero Docker. pip install ollama-herd, run two commands, and load balancing is active.

Deploy the load balancer

CODEBLOCK0

Package: ollama-herd | Repo: github.com/geeks-accelerator/ollama-herd

Load Balancer Endpoint

The load balancer runs at http://localhost:11435. Drop-in replacement for direct Ollama connections — same API, same model names, with load balancing built in.

CODEBLOCK1

Load Balancer Health Monitoring

Fleet-wide load balancer health check (15 automated checks)

CODEBLOCK2

The load balancer checks: offline nodes, degraded nodes, memory pressure, underutilized nodes, model thrashing, request timeouts, error rates. Each load balancer check returns severity (info/warning/critical) and recommendations.

Load balancer node status and metrics

CODEBLOCK3

Returns per-node: status (online/degraded/offline), CPU utilization, memory usage, loaded models with context lengths, and load balancer queue depths (pending/in-flight/done/failed).

Load balancer queue depths

CODEBLOCK4

Load Balancer Auto-Recovery

- Load balancer auto-retry — if a node fails before the first response chunk, the load balancer re-scores and retries on the next-best node (up to 2 retries, configurable via FLEET_MAX_RETRIES)
Load balancer zombie reaper — background task detects in-flight requests stuck longer than 10 minutes and cleans them up
Load balancer context protection — strips dangerous num_ctx parameters that would trigger model reloads
Load balancer VRAM-aware fallback — routes to an already-loaded model instead of triggering a cold load
Load balancer auto-pull — optionally pulls missing models (disabled by default, toggle via settings)
Load balancer holding queue — when all nodes are busy, requests wait (up to 30s) rather than failing

Load Balancer API Endpoints

Models available through the load balancer

CODEBLOCK5

Load balancer request traces

CODEBLOCK6

Load balancer usage statistics

CODEBLOCK7

Load balancer model recommendations

CODEBLOCK8

Load balancer settings (runtime toggles)

CODEBLOCK9

Load balancer model management

CODEBLOCK10

Load balancer per-app analytics

CODEBLOCK11

Load Balancer Dashboard

Web dashboard at http://localhost:11435/dashboard with eight tabs: Fleet Overview, Trends, Model Insights, Apps, Benchmarks, Health, Recommendations, Settings. All load balancer data updates in real-time via Server-Sent Events.

Load Balancer Operational Queries

Recent load balancer failures with error details

CODEBLOCK12

Load balancer retry frequency by node

CODEBLOCK13

Load balancer requests per hour

CODEBLOCK14

Test load balancer inference

CODEBLOCK15

Load Balancer Guardrails

- Never restart or stop the load balancer or node agents without explicit user confirmation.
Never delete or modify files in ~/.fleet-manager/ (contains load balancer latency data, traces, and logs).
Do not pull or delete models on load balancer nodes without user confirmation — downloads can be 10-100+ GB.
If a load balancer node shows as offline, report it rather than attempting to SSH into the machine.
If all load balancer nodes are saturated, suggest the user check the dashboard.

Load Balancer Failure Handling

- Connection refused → load balancer may not be running, suggest herd or INLINECODE8
0 nodes online → suggest starting herd-node on load balancer backend devices
mDNS discovery fails → use INLINECODE10
Load balancer requests hang → check for num_ctx in client requests; verify with INLINECODE12
Load balancer API errors → check INLINECODE13

Ollama 负载均衡器

您正在管理一个Ollama负载均衡器，它通过自动发现、健康监控和故障转移，将推理请求分发到多个Ollama实例。该负载均衡器透明地处理所有路由决策。

负载均衡器解决的问题

Ollama没有内置的负载均衡功能。一台机器宕机，您的应用就会报错。没有健康检查，没有故障转移，也没有队列管理。您只能手动将客户端指向特定机器，并祈祷它们保持在线。

此负载均衡器通过mDNS自动发现Ollama实例，持续监控其健康状态，并基于实时评分分配负载。负载均衡器在失败时自动重试。零配置文件。零Docker。pip install ollama-herd，运行两个命令，负载均衡即生效。

部署负载均衡器

bash
pip install ollama-herd
herd # 在端口11435上启动负载均衡器
herd-node # 在每台机器上启动负载均衡器后端节点

包：ollama-herd | 仓库：github.com/geeks-accelerator/ollama-herd

负载均衡器端点

负载均衡器运行在 http://localhost:11435。可直接替代直接连接Ollama的方式——相同的API，相同的模型名称，内置负载均衡功能。

python
from openai import OpenAI

负载均衡器客户端——请求在所有后端节点之间均衡分配

loadbalancerclient = OpenAI(baseurl=http://localhost:11435/v1, apikey=not-needed)
loadbalancedresponse = loadbalancerclient.chat.completions.create(
model=llama3.3:70b,
messages=[{role: user, content: 解释LLM推理的负载均衡}]
)

负载均衡器健康监控

集群级负载均衡器健康检查（15项自动化检查）

bash curl -s http://localhost:11435/dashboard/api/health | python3 -m json.tool

负载均衡器检查：离线节点、降级节点、内存压力、利用率不足的节点、模型抖动、请求超时、错误率。每项负载均衡器检查返回严重级别（信息/警告/严重）和建议。

负载均衡器节点状态和指标

bash curl -s http://localhost:11435/fleet/status | python3 -m json.tool

返回每个节点：状态（在线/降级/离线）、CPU利用率、内存使用、已加载模型及上下文长度、负载均衡器队列深度（待处理/处理中/已完成/失败）。

负载均衡器队列深度

bash curl -s http://localhost:11435/fleet/status | python3 -c import sys, json

负载均衡器队列检查

data = json.load(sys.stdin) for key, q in data.get(queues, {}).items(): print(f\{key}: {q[pending]} 待处理, {q[inflight]}/{q[maxconcurrent]} 处理中\)

负载均衡器自动恢复

- 负载均衡器自动重试 — 如果节点在第一个响应块之前失败，负载均衡器会重新评分并在次优节点上重试（最多2次重试，可通过 FLEETMAXRETRIES 配置）
负载均衡器僵尸收割者 — 后台任务检测处理中超过10分钟的请求并清理
负载均衡器上下文保护 — 剥离会触发模型重新加载的危险 num_ctx 参数
负载均衡器VRAM感知回退 — 路由到已加载的模型，而不是触发冷加载
负载均衡器自动拉取 — 可选地拉取缺失模型（默认禁用，通过设置切换）
负载均衡器等待队列 — 当所有节点繁忙时，请求等待（最多30秒）而不是失败

负载均衡器API端点

通过负载均衡器可用的模型

bash

负载均衡集群中的所有模型

curl -s http://localhost:11435/api/tags | python3 -m json.tool

当前加载在负载均衡器后端内存中的模型

curl -s http://localhost:11435/api/ps | python3 -m json.tool

通过负载均衡器的OpenAI兼容模型列表

curl -s http://localhost:11435/v1/models | python3 -m json.tool

负载均衡器请求追踪

bash curl -s http://localhost:11435/dashboard/api/traces?limit=20 | python3 -m json.tool

负载均衡器使用统计

bash curl -s http://localhost:11435/dashboard/api/usage | python3 -m json.tool

负载均衡器模型推荐

bash curl -s http://localhost:11435/dashboard/api/recommendations | python3 -m json.tool

负载均衡器设置（运行时切换）

bash

查看负载均衡器配置

curl -s http://localhost:11435/dashboard/api/settings | python3 -m json.tool

切换负载均衡器功能

curl -s -X POST http://localhost:11435/dashboard/api/settings \ -H Content-Type: application/json \ -d {auto_pull: false}

负载均衡器模型管理

bash

查看负载均衡器后端的每个节点模型详情

curl -s http://localhost:11435/dashboard/api/model-management | python3 -m json.tool

向负载均衡器后端节点拉取模型

curl -s -X POST http://localhost:11435/dashboard/api/pull \ -H Content-Type: application/json \ -d {model: llama3.3:70b, node_id: load-balancer-node-1}

从负载均衡器节点删除模型

curl -s -X POST http://localhost:11435/dashboard/api/delete \ -H Content-Type: application/json \ -d {model: old-model:7b, node_id: load-balancer-node-1}

负载均衡器按应用分析

bash curl -s http://localhost:11435/dashboard/api/apps | python3 -m json.tool

负载均衡器仪表板

Web仪表板位于 http://localhost:11435/dashboard，包含八个标签页：集群概览、趋势、模型洞察、应用、基准测试、健康、推荐、设置。所有负载均衡器数据通过服务器发送事件实时更新。

负载均衡器运维查询

最近的负载均衡器失败及错误详情

bash sqlite3 ~/.fleet-manager/latency.db SELECT requestid, model, status, errormessage, latencyms/1000.0 as secs FROM requesttraces WHERE status=failed ORDER BY timestamp DESC LIMIT 10

按节点统计的负载均衡器重试频率

bash sqlite3 ~/.fleet-manager/latency.db SELECT nodeid, SUM(retrycount) as retries, COUNT(*) as total FROM requesttraces GROUP BY nodeid ORDER BY retries DESC

负载均衡器每小时请求数

bash sqlite3 ~/.fleet-manager/latency.db SELECT CAST((timestamp % 86400) / 3600 AS INTEGER) as hour, COUNT(*) as requests FROM request_traces GROUP BY hour ORDER BY hour

测试负载均衡器推理

bash curl -s http://localhost:11435/v1/chat/completions \ -H Content-Type: application/json \ -d {model:llama3.3:70b,messages:[{role:user,content:测试跨节点负载均衡}],stream:false}

curl -s http://localhost:11435/api/chat \
-d {model:llama3.3:70b,messages:[{role:user,content:验证负载均衡器路由}],stream:false}

负载均衡器安全护栏

- 未经用户明确确认，切勿重启或停止负载均衡器或节点代理。
切勿删除或修改 ~/.fleet-manager/ 中的文件（包含负载均衡器延迟数据、追踪和日志）。
未经用户确认，不要在负载均衡器节点上拉取或删除模型——下载可能达到10-100+ GB。
如果负载均衡器节点显示为离线，请报告而不是尝试SSH进入该机器。
如果所有负载均衡器节点都已饱和，建议用户检查仪表板。

负载均衡器故障处理

- 连接被拒绝 → 负载均衡器可能未运行，建议运行 herd 或 uv run herd
0个节点在线 → 建议在负载均衡器后端设备

ollama-load-balancerOllama负载均衡