Ollama Load Balancer
You are managing an Ollama load balancer that distributes inference requests across multiple Ollama instances with automatic discovery, health monitoring, and failover. The load balancer handles all routing decisions transparently.
What the load balancer solves
Ollama has no built-in load balancing. One machine goes down, your app gets errors. No health checks, no failover, no queue management. You're manually pointing clients at specific machines and hoping they stay up.
This load balancer auto-discovers Ollama instances via mDNS, monitors their health continuously, and distributes load based on real-time scoring. The load balancer automatically retries on failure. Zero config files. Zero Docker. pip install ollama-herd, run two commands, and load balancing is active.
Deploy the load balancer
CODEBLOCK0
Package: ollama-herd | Repo: github.com/geeks-accelerator/ollama-herd
Load Balancer Endpoint
The load balancer runs at http://localhost:11435. Drop-in replacement for direct Ollama connections — same API, same model names, with load balancing built in.
CODEBLOCK1
Load Balancer Health Monitoring
Fleet-wide load balancer health check (15 automated checks)
CODEBLOCK2
The load balancer checks: offline nodes, degraded nodes, memory pressure, underutilized nodes, model thrashing, request timeouts, error rates. Each load balancer check returns severity (info/warning/critical) and recommendations.
Load balancer node status and metrics
CODEBLOCK3
Returns per-node: status (online/degraded/offline), CPU utilization, memory usage, loaded models with context lengths, and load balancer queue depths (pending/in-flight/done/failed).
Load balancer queue depths
CODEBLOCK4
Load Balancer Auto-Recovery
- - Load balancer auto-retry — if a node fails before the first response chunk, the load balancer re-scores and retries on the next-best node (up to 2 retries, configurable via
FLEET_MAX_RETRIES) - Load balancer zombie reaper — background task detects in-flight requests stuck longer than 10 minutes and cleans them up
- Load balancer context protection — strips dangerous
num_ctx parameters that would trigger model reloads - Load balancer VRAM-aware fallback — routes to an already-loaded model instead of triggering a cold load
- Load balancer auto-pull — optionally pulls missing models (disabled by default, toggle via settings)
- Load balancer holding queue — when all nodes are busy, requests wait (up to 30s) rather than failing
Load Balancer API Endpoints
Models available through the load balancer
CODEBLOCK5
Load balancer request traces
CODEBLOCK6
Load balancer usage statistics
CODEBLOCK7
Load balancer model recommendations
CODEBLOCK8
Load balancer settings (runtime toggles)
CODEBLOCK9
Load balancer model management
CODEBLOCK10
Load balancer per-app analytics
CODEBLOCK11
Load Balancer Dashboard
Web dashboard at http://localhost:11435/dashboard with eight tabs: Fleet Overview, Trends, Model Insights, Apps, Benchmarks, Health, Recommendations, Settings. All load balancer data updates in real-time via Server-Sent Events.
Load Balancer Operational Queries
Recent load balancer failures with error details
CODEBLOCK12
Load balancer retry frequency by node
CODEBLOCK13
Load balancer requests per hour
CODEBLOCK14
Test load balancer inference
CODEBLOCK15
Load Balancer Guardrails
- - Never restart or stop the load balancer or node agents without explicit user confirmation.
- Never delete or modify files in
~/.fleet-manager/ (contains load balancer latency data, traces, and logs). - Do not pull or delete models on load balancer nodes without user confirmation — downloads can be 10-100+ GB.
- If a load balancer node shows as offline, report it rather than attempting to SSH into the machine.
- If all load balancer nodes are saturated, suggest the user check the dashboard.
Load Balancer Failure Handling
- - Connection refused → load balancer may not be running, suggest
herd or INLINECODE8 - 0 nodes online → suggest starting
herd-node on load balancer backend devices - mDNS discovery fails → use INLINECODE10
- Load balancer requests hang → check for
num_ctx in client requests; verify with INLINECODE12 - Load balancer API errors → check INLINECODE13
Ollama 负载均衡器
您正在管理一个Ollama负载均衡器,它通过自动发现、健康监控和故障转移,将推理请求分发到多个Ollama实例。该负载均衡器透明地处理所有路由决策。
负载均衡器解决的问题
Ollama没有内置的负载均衡功能。一台机器宕机,您的应用就会报错。没有健康检查,没有故障转移,也没有队列管理。您只能手动将客户端指向特定机器,并祈祷它们保持在线。
此负载均衡器通过mDNS自动发现Ollama实例,持续监控其健康状态,并基于实时评分分配负载。负载均衡器在失败时自动重试。零配置文件。零Docker。pip install ollama-herd,运行两个命令,负载均衡即生效。
部署负载均衡器
bash
pip install ollama-herd
herd # 在端口11435上启动负载均衡器
herd-node # 在每台机器上启动负载均衡器后端节点
包:ollama-herd | 仓库:github.com/geeks-accelerator/ollama-herd
负载均衡器端点
负载均衡器运行在 http://localhost:11435。可直接替代直接连接Ollama的方式——相同的API,相同的模型名称,内置负载均衡功能。
python
from openai import OpenAI
负载均衡器客户端——请求在所有后端节点之间均衡分配
load
balancerclient = OpenAI(base
url=http://localhost:11435/v1, apikey=not-needed)
load
balancedresponse = load
balancerclient.chat.completions.create(
model=llama3.3:70b,
messages=[{role: user, content: 解释LLM推理的负载均衡}]
)
负载均衡器健康监控
集群级负载均衡器健康检查(15项自动化检查)
bash
curl -s http://localhost:11435/dashboard/api/health | python3 -m json.tool
负载均衡器检查:离线节点、降级节点、内存压力、利用率不足的节点、模型抖动、请求超时、错误率。每项负载均衡器检查返回严重级别(信息/警告/严重)和建议。
负载均衡器节点状态和指标
bash
curl -s http://localhost:11435/fleet/status | python3 -m json.tool
返回每个节点:状态(在线/降级/离线)、CPU利用率、内存使用、已加载模型及上下文长度、负载均衡器队列深度(待处理/处理中/已完成/失败)。
负载均衡器队列深度
bash
curl -s http://localhost:11435/fleet/status | python3 -c
import sys, json
负载均衡器队列检查
data = json.load(sys.stdin)
for key, q in data.get(queues, {}).items():
print(f\{key}: {q[pending]} 待处理, {q[in
flight]}/{q[maxconcurrent]} 处理中\)
负载均衡器自动恢复
- - 负载均衡器自动重试 — 如果节点在第一个响应块之前失败,负载均衡器会重新评分并在次优节点上重试(最多2次重试,可通过 FLEETMAXRETRIES 配置)
- 负载均衡器僵尸收割者 — 后台任务检测处理中超过10分钟的请求并清理
- 负载均衡器上下文保护 — 剥离会触发模型重新加载的危险 num_ctx 参数
- 负载均衡器VRAM感知回退 — 路由到已加载的模型,而不是触发冷加载
- 负载均衡器自动拉取 — 可选地拉取缺失模型(默认禁用,通过设置切换)
- 负载均衡器等待队列 — 当所有节点繁忙时,请求等待(最多30秒)而不是失败
负载均衡器API端点
通过负载均衡器可用的模型
bash
负载均衡集群中的所有模型
curl -s http://localhost:11435/api/tags | python3 -m json.tool
当前加载在负载均衡器后端内存中的模型
curl -s http://localhost:11435/api/ps | python3 -m json.tool
通过负载均衡器的OpenAI兼容模型列表
curl -s http://localhost:11435/v1/models | python3 -m json.tool
负载均衡器请求追踪
bash
curl -s http://localhost:11435/dashboard/api/traces?limit=20 | python3 -m json.tool
负载均衡器使用统计
bash
curl -s http://localhost:11435/dashboard/api/usage | python3 -m json.tool
负载均衡器模型推荐
bash
curl -s http://localhost:11435/dashboard/api/recommendations | python3 -m json.tool
负载均衡器设置(运行时切换)
bash
查看负载均衡器配置
curl -s http://localhost:11435/dashboard/api/settings | python3 -m json.tool
切换负载均衡器功能
curl -s -X POST http://localhost:11435/dashboard/api/settings \
-H Content-Type: application/json \
-d {auto_pull: false}
负载均衡器模型管理
bash
查看负载均衡器后端的每个节点模型详情
curl -s http://localhost:11435/dashboard/api/model-management | python3 -m json.tool
向负载均衡器后端节点拉取模型
curl -s -X POST http://localhost:11435/dashboard/api/pull \
-H Content-Type: application/json \
-d {model: llama3.3:70b, node_id: load-balancer-node-1}
从负载均衡器节点删除模型
curl -s -X POST http://localhost:11435/dashboard/api/delete \
-H Content-Type: application/json \
-d {model: old-model:7b, node_id: load-balancer-node-1}
负载均衡器按应用分析
bash
curl -s http://localhost:11435/dashboard/api/apps | python3 -m json.tool
负载均衡器仪表板
Web仪表板位于 http://localhost:11435/dashboard,包含八个标签页:集群概览、趋势、模型洞察、应用、基准测试、健康、推荐、设置。所有负载均衡器数据通过服务器发送事件实时更新。
负载均衡器运维查询
最近的负载均衡器失败及错误详情
bash
sqlite3 ~/.fleet-manager/latency.db SELECT request
id, model, status, errormessage, latency
ms/1000.0 as secs FROM requesttraces WHERE status=failed ORDER BY timestamp DESC LIMIT 10
按节点统计的负载均衡器重试频率
bash
sqlite3 ~/.fleet-manager/latency.db SELECT node
id, SUM(retrycount) as retries, COUNT(*) as total FROM request
traces GROUP BY nodeid ORDER BY retries DESC
负载均衡器每小时请求数
bash
sqlite3 ~/.fleet-manager/latency.db SELECT CAST((timestamp % 86400) / 3600 AS INTEGER) as hour, COUNT(*) as requests FROM request_traces GROUP BY hour ORDER BY hour
测试负载均衡器推理
bash
curl -s http://localhost:11435/v1/chat/completions \
-H Content-Type: application/json \
-d {model:llama3.3:70b,messages:[{role:user,content:测试跨节点负载均衡}],stream:false}
curl -s http://localhost:11435/api/chat \
-d {model:llama3.3:70b,messages:[{role:user,content:验证负载均衡器路由}],stream:false}
负载均衡器安全护栏
- - 未经用户明确确认,切勿重启或停止负载均衡器或节点代理。
- 切勿删除或修改 ~/.fleet-manager/ 中的文件(包含负载均衡器延迟数据、追踪和日志)。
- 未经用户确认,不要在负载均衡器节点上拉取或删除模型——下载可能达到10-100+ GB。
- 如果负载均衡器节点显示为离线,请报告而不是尝试SSH进入该机器。
- 如果所有负载均衡器节点都已饱和,建议用户检查仪表板。
负载均衡器故障处理
- - 连接被拒绝 → 负载均衡器可能未运行,建议运行 herd 或 uv run herd
- 0个节点在线 → 建议在负载均衡器后端设备