Ollama Herd Fleet Manager
You are managing an Ollama Herd fleet — a smart Ollama multimodal router that distributes Ollama AI workloads across multiple devices. Ollama Herd handles 4 model types: Ollama LLM inference, image generation (mflux), speech-to-text (Qwen3-ASR), and Ollama embeddings. The Ollama scoring engine evaluates nodes on 7 signals (thermal state, memory fit, queue depth, latency history, role affinity, availability trend, context fit) and routes each Ollama request to the optimal device.
Install Ollama Herd
CODEBLOCK0
PyPI: ollama-herd | Source: github.com/geeks-accelerator/ollama-herd
Ollama Router endpoint
The Ollama Herd router runs at http://localhost:11435 by default. If the user has specified a different Ollama URL, use that instead.
Ollama API endpoints
Use curl to interact with the Ollama fleet:
Ollama fleet status — overview of all Ollama nodes and queues
CODEBLOCK1
Returns:
- -
fleet.nodes_total / fleet.nodes_online — how many Ollama devices are in the fleet - INLINECODE4 — total Ollama models currently loaded across all nodes
- INLINECODE5 — total in-flight Ollama requests
- INLINECODE6 — per-node details: Ollama status, hardware, memory, CPU, disk, loaded Ollama models with context lengths
- INLINECODE7 — per Ollama node:model queue depths (pending, in-flight, done, failed)
List all Ollama models available across the fleet
CODEBLOCK2
Pull an Ollama model onto the fleet
CODEBLOCK3
List Ollama models currently loaded in memory
CODEBLOCK4
OpenAI-compatible Ollama model list
CODEBLOCK5
Ollama usage statistics (per-node, per-model daily aggregates)
CODEBLOCK6
Recent Ollama request traces
CODEBLOCK7
Returns the last N Ollama routing decisions with: model requested, node selected, score, latency, tokens, retry/fallback status, tags.
Ollama fleet health analysis
CODEBLOCK8
Returns 15 automated Ollama health checks: offline/degraded nodes, memory pressure, underutilized nodes, VRAM fallbacks, KV cache bloat (OLLAMANUMPARALLEL too high), version mismatch, context protection, zombie reaper, Ollama model thrashing, request timeouts, error rates, retry rates, client disconnects, and incomplete streams.
Ollama model recommendations
CODEBLOCK9
Returns AI-powered Ollama model mix recommendations per node based on hardware capabilities, Ollama usage patterns, and curated benchmark data.
Ollama settings
CODEBLOCK10
Ollama model management
CODEBLOCK11
Ollama model insights (summary statistics)
CODEBLOCK12
Per-app Ollama analytics (requires request tagging)
CODEBLOCK13
Ollama Dashboard
The Ollama web dashboard is at http://localhost:11435/dashboard. It has eight tabs:
- - Fleet Overview — live Ollama node cards, queue depths, and request counts via SSE
- Trends — Ollama requests per hour, average latency, and token throughput charts (24h–7d)
- Model Insights — per-Ollama-model latency, tokens/sec, usage comparison
- Apps — per-tag Ollama analytics with request volume, latency, tokens, error rates
- Benchmarks — Ollama capacity growth over time with per-run throughput and latency percentiles
- Health — 15 automated Ollama fleet health checks with severity levels
- Recommendations — Ollama model mix recommendations per node with one-click pull
- Settings — Ollama runtime toggle switches, read-only config tables, and node version tracking
Direct the user to open this URL in their browser for visual Ollama monitoring.
Ollama Resilience features
- - Auto-retry — if an Ollama node fails before the first response chunk, re-scores and retries on the next-best Ollama node (up to 2 retries)
- Ollama model fallbacks — clients specify backup Ollama models; tries alternatives when the primary is unavailable
- Context protection — strips
num_ctx from Ollama requests when unnecessary to prevent Ollama model reload hangs; auto-upgrades to a larger loaded model - VRAM-aware fallback — routes to an already-loaded Ollama model in the same category instead of cold-loading
- Zombie reaper — background task detects and cleans up stuck in-flight Ollama requests
- Auto-pull — automatically pulls missing Ollama models onto the best available node
Common Ollama tasks
Check if the Ollama fleet is healthy
- 1. Hit
/fleet/status and verify INLINECODE11 - Hit
/dashboard/api/health for automated Ollama health checks with severity levels - Look at Ollama queue depths — deep queues may indicate a bottleneck
Find which Ollama node has a specific model
- 1. Hit
/fleet/status and inspect each Ollama node's ollama.models_loaded and INLINECODE15 - Or hit
/api/tags for a flat list of all available Ollama models with which nodes have them
Check if an Ollama model is loaded (hot) or cold
- 1. Hit
/api/ps — Ollama models listed here are currently loaded in memory (hot) - Models in
/api/tags but not in /api/ps are on disk but not loaded (cold)
View recent Ollama inference activity
- 1. Hit
/dashboard/api/traces?limit=10 to see the last 10 Ollama requests - Each trace shows: Ollama model, node, score, latency, tokens, retry/fallback status
Diagnose slow Ollama responses
- 1. Check
/dashboard/api/traces for high latency Ollama entries - Check
/fleet/status for Ollama nodes with high queue depths or memory pressure - Check if the Ollama model had to cold-load (look for low scores in trace)
- Check if
num_ctx is being sent — Ollama context protection logs show if requests triggered reloads
Query the Ollama trace database directly
CODEBLOCK14
Test Ollama inference through the fleet
CODEBLOCK15
Ollama Guardrails
- - Never restart or stop the Ollama Herd router or Ollama node agents without explicit user confirmation.
- Never delete or modify files in
~/.fleet-manager/ (contains Ollama latency data, traces, and logs). - Do not pull Ollama models onto nodes without user confirmation — Ollama model downloads can be large (10-100+ GB).
- Do not delete Ollama models without user confirmation.
- If an Ollama node shows as offline, report it to the user rather than attempting to SSH into the machine.
Ollama Failure handling
- - If curl to the Ollama router fails with connection refused, tell the user the Ollama Herd router may not be running and suggest
herd to start it. - If the Ollama fleet status shows 0 nodes online, suggest starting Ollama node agents with
herd-node on their devices. - If Ollama mDNS discovery fails, suggest using
--router-url http://router-ip:11435 for explicit connection. - If Ollama requests hang with 0 bytes returned, check if the client is sending
num_ctx — Ollama context protection should strip it. - If a specific Ollama API endpoint returns an error, show the user the full error response and suggest checking the Ollama JSONL logs at
~/.fleet-manager/logs/herd.jsonl.
Ollama Herd 舰队管理器
您正在管理一个Ollama Herd舰队——一个智能的Ollama多模态路由器,可将Ollama AI工作负载分发到多个设备。Ollama Herd处理4种模型类型:Ollama LLM推理、图像生成(mflux)、语音转文本(Qwen3-ASR)和Ollama嵌入。Ollama评分引擎根据7个信号(热状态、内存适配度、队列深度、延迟历史、角色亲和性、可用性趋势、上下文适配度)评估节点,并将每个Ollama请求路由到最佳设备。
安装Ollama Herd
bash
pip install ollama-herd # 从PyPI安装Ollama Herd
herd # 启动Ollama路由器
herd-node # 启动Ollama节点代理(在每个设备上运行)
PyPI:ollama-herd | 源码:github.com/geeks-accelerator/ollama-herd
Ollama路由器端点
Ollama Herd路由器默认运行在http://localhost:11435。如果用户指定了不同的Ollama URL,则使用该URL。
Ollama API端点
使用curl与Ollama舰队交互:
Ollama舰队状态——所有Ollama节点和队列概览
bash
ollamafleetstatus — 检查Ollama节点健康状态
curl -s http://localhost:11435/fleet/status | python3 -m json.tool
返回:
- - fleet.nodestotal / fleet.nodesonline — 舰队中Ollama设备数量
- fleet.modelsloaded — 当前所有节点上加载的Ollama模型总数
- fleet.requestsactive — 正在进行的Ollama请求总数
- nodes[] — 每个节点的详细信息:Ollama状态、硬件、内存、CPU、磁盘、已加载的Ollama模型及上下文长度
- queues — 每个Ollama节点:模型的队列深度(待处理、进行中、已完成、失败)
列出舰队中所有可用的Ollama模型
bash
ollamamodellist — 所有节点上的所有Ollama模型
curl -s http://localhost:11435/api/tags | python3 -m json.tool
将Ollama模型拉取到舰队
bash
ollamapullmodel — 拉取模型(自动选择最佳节点,流式传输进度)
curl -N http://localhost:11435/api/pull -d {name: codestral}
拉取到特定节点
curl -N http://localhost:11435/api/pull -d {name: llama3.3:70b, node_id: mac-studio}
非流式传输(阻塞直到完成)
curl http://localhost:11435/api/pull -d {name: phi4, stream: false}
列出当前加载到内存中的Ollama模型
bash
ollamaloadedmodels — GPU内存中的热Ollama模型
curl -s http://localhost:11435/api/ps | python3 -m json.tool
兼容OpenAI的Ollama模型列表
bash
curl -s http://localhost:11435/v1/models | python3 -m json.tool
Ollama使用统计(每个节点、每个模型的每日汇总)
bash
curl -s http://localhost:11435/dashboard/api/usage | python3 -m json.tool
最近的Ollama请求追踪
bash
ollama_traces — 最近的Ollama路由决策
curl -s http://localhost:11435/dashboard/api/traces?limit=20 | python3 -m json.tool
返回最近N个Ollama路由决策,包含:请求的模型、选择的节点、评分、延迟、令牌数、重试/回退状态、标签。
Ollama舰队健康分析
bash
curl -s http://localhost:11435/dashboard/api/health | python3 -m json.tool
返回15项自动化Ollama健康检查:离线/降级节点、内存压力、未充分利用的节点、VRAM回退、KV缓存膨胀(OLLAMANUMPARALLEL过高)、版本不匹配、上下文保护、僵尸清理、Ollama模型抖动、请求超时、错误率、重试率、客户端断开连接和不完整流。
Ollama模型推荐
bash
curl -s http://localhost:11435/dashboard/api/recommendations | python3 -m json.tool
返回基于硬件能力、Ollama使用模式和精选基准数据的每个节点AI驱动Ollama模型组合推荐。
Ollama设置
bash
查看当前Ollama配置和节点版本
curl -s http://localhost:11435/dashboard/api/settings | python3 -m json.tool
切换Ollama运行时设置(autopull, vramfallback)
curl -s -X POST http://localhost:11435/dashboard/api/settings \
-H Content-Type: application/json \
-d {auto_pull: false}
Ollama模型管理
bash
查看每个节点的Ollama模型详细信息,包含大小和使用情况
curl -s http://localhost:11435/dashboard/api/model-management | python3 -m json.tool
将Ollama模型拉取到特定节点
curl -s -X POST http://localhost:11435/dashboard/api/pull \
-H Content-Type: application/json \
-d {model: llama3.3:70b, node_id: mac-studio}
从特定节点删除Ollama模型
curl -s -X POST http://localhost:11435/dashboard/api/delete \
-H Content-Type: application/json \
-d {model: old-model:7b, node_id: mac-studio}
Ollama模型洞察(汇总统计)
bash
curl -s http://localhost:11435/dashboard/api/models | python3 -m json.tool
每个应用的Ollama分析(需要请求标签)
bash
curl -s http://localhost:11435/dashboard/api/apps | python3 -m json.tool
Ollama仪表板
Ollama Web仪表板位于http://localhost:11435/dashboard。它有八个标签页:
- - 舰队概览 — 通过SSE实时显示Ollama节点卡片、队列深度和请求计数
- 趋势 — Ollama每小时请求数、平均延迟和令牌吞吐量图表(24小时至7天)
- 模型洞察 — 每个Ollama模型的延迟、令牌/秒、使用比较
- 应用 — 每个标签的Ollama分析,包含请求量、延迟、令牌数、错误率
- 基准测试 — Ollama容量随时间增长,包含每次运行的吞吐量和延迟百分位数
- 健康 — 15项自动化Ollama舰队健康检查,包含严重级别
- 推荐 — 每个节点的Ollama模型组合推荐,支持一键拉取
- 设置 — Ollama运行时切换开关、只读配置表和节点版本跟踪
引导用户在浏览器中打开此URL以进行可视化Ollama监控。
Ollama弹性功能
- - 自动重试 — 如果Ollama节点在第一个响应块之前失败,重新评分并在次优Ollama节点上重试(最多2次重试)
- Ollama模型回退 — 客户端指定备用Ollama模型;当主模型不可用时尝试替代方案
- 上下文保护 — 在不需要时从Ollama请求中移除num_ctx以防止Ollama模型重新加载挂起;自动升级到更大的已加载模型
- VRAM感知回退 — 路由到同一类别中已加载的Ollama模型,而不是冷加载
- 僵尸清理 — 后台任务检测并清理卡住的进行中Ollama请求
- 自动拉取 — 自动将缺失的Ollama模型拉取到最佳可用节点
常见Ollama任务
检查Ollama舰队是否健康
- 1. 访问/fleet/status并验证nodes_online > 0
- 访问/dashboard/api/health获取带严重级别的自动化Ollama健康检查
- 查看Ollama队列深度——深度队列可能表示瓶颈
查找哪个Ollama节点拥有特定模型
- 1. 访问/fleet/status并检查每个Ollama节点的ollama.modelsloaded和ollama.modelsavailable
- 或访问/api/tags获取所有可用Ollama模型的平面列表