Ollama Herd Fleet Manager

You are managing an Ollama Herd fleet — a smart Ollama multimodal router that distributes Ollama AI workloads across multiple devices. Ollama Herd handles 4 model types: Ollama LLM inference, image generation (mflux), speech-to-text (Qwen3-ASR), and Ollama embeddings. The Ollama scoring engine evaluates nodes on 7 signals (thermal state, memory fit, queue depth, latency history, role affinity, availability trend, context fit) and routes each Ollama request to the optimal device.

Install Ollama Herd

CODEBLOCK0

PyPI: ollama-herd | Source: github.com/geeks-accelerator/ollama-herd

Ollama Router endpoint

The Ollama Herd router runs at http://localhost:11435 by default. If the user has specified a different Ollama URL, use that instead.

Ollama API endpoints

Use curl to interact with the Ollama fleet:

Ollama fleet status — overview of all Ollama nodes and queues

CODEBLOCK1

Returns:

- fleet.nodes_total / fleet.nodes_online — how many Ollama devices are in the fleet
INLINECODE4 — total Ollama models currently loaded across all nodes
INLINECODE5 — total in-flight Ollama requests
INLINECODE6 — per-node details: Ollama status, hardware, memory, CPU, disk, loaded Ollama models with context lengths
INLINECODE7 — per Ollama node:model queue depths (pending, in-flight, done, failed)

List all Ollama models available across the fleet

CODEBLOCK2

Pull an Ollama model onto the fleet

CODEBLOCK3

List Ollama models currently loaded in memory

CODEBLOCK4

OpenAI-compatible Ollama model list

CODEBLOCK5

Ollama usage statistics (per-node, per-model daily aggregates)

CODEBLOCK6

Recent Ollama request traces

CODEBLOCK7

Returns the last N Ollama routing decisions with: model requested, node selected, score, latency, tokens, retry/fallback status, tags.

Ollama fleet health analysis

CODEBLOCK8

Returns 15 automated Ollama health checks: offline/degraded nodes, memory pressure, underutilized nodes, VRAM fallbacks, KV cache bloat (OLLAMANUMPARALLEL too high), version mismatch, context protection, zombie reaper, Ollama model thrashing, request timeouts, error rates, retry rates, client disconnects, and incomplete streams.

Ollama model recommendations

CODEBLOCK9

Returns AI-powered Ollama model mix recommendations per node based on hardware capabilities, Ollama usage patterns, and curated benchmark data.

Ollama settings

CODEBLOCK10

Ollama model management

CODEBLOCK11

Ollama model insights (summary statistics)

CODEBLOCK12

Per-app Ollama analytics (requires request tagging)

CODEBLOCK13

Ollama Dashboard

The Ollama web dashboard is at http://localhost:11435/dashboard. It has eight tabs:

- Fleet Overview — live Ollama node cards, queue depths, and request counts via SSE
Trends — Ollama requests per hour, average latency, and token throughput charts (24h–7d)
Model Insights — per-Ollama-model latency, tokens/sec, usage comparison
Apps — per-tag Ollama analytics with request volume, latency, tokens, error rates
Benchmarks — Ollama capacity growth over time with per-run throughput and latency percentiles
Health — 15 automated Ollama fleet health checks with severity levels
Recommendations — Ollama model mix recommendations per node with one-click pull
Settings — Ollama runtime toggle switches, read-only config tables, and node version tracking

Direct the user to open this URL in their browser for visual Ollama monitoring.

Ollama Resilience features

- Auto-retry — if an Ollama node fails before the first response chunk, re-scores and retries on the next-best Ollama node (up to 2 retries)
Ollama model fallbacks — clients specify backup Ollama models; tries alternatives when the primary is unavailable
Context protection — strips num_ctx from Ollama requests when unnecessary to prevent Ollama model reload hangs; auto-upgrades to a larger loaded model
VRAM-aware fallback — routes to an already-loaded Ollama model in the same category instead of cold-loading
Zombie reaper — background task detects and cleans up stuck in-flight Ollama requests
Auto-pull — automatically pulls missing Ollama models onto the best available node

Common Ollama tasks

Check if the Ollama fleet is healthy

1. Hit /fleet/status and verify INLINECODE11
Hit /dashboard/api/health for automated Ollama health checks with severity levels
Look at Ollama queue depths — deep queues may indicate a bottleneck

Find which Ollama node has a specific model

1. Hit /fleet/status and inspect each Ollama node's ollama.models_loaded and INLINECODE15
Or hit /api/tags for a flat list of all available Ollama models with which nodes have them

Check if an Ollama model is loaded (hot) or cold

1. Hit /api/ps — Ollama models listed here are currently loaded in memory (hot)
Models in /api/tags but not in /api/ps are on disk but not loaded (cold)

View recent Ollama inference activity

1. Hit /dashboard/api/traces?limit=10 to see the last 10 Ollama requests
Each trace shows: Ollama model, node, score, latency, tokens, retry/fallback status

Diagnose slow Ollama responses

1. Check /dashboard/api/traces for high latency Ollama entries
Check /fleet/status for Ollama nodes with high queue depths or memory pressure
Check if the Ollama model had to cold-load (look for low scores in trace)
Check if num_ctx is being sent — Ollama context protection logs show if requests triggered reloads

Query the Ollama trace database directly

CODEBLOCK14

Test Ollama inference through the fleet

CODEBLOCK15

Ollama Guardrails

- Never restart or stop the Ollama Herd router or Ollama node agents without explicit user confirmation.
Never delete or modify files in ~/.fleet-manager/ (contains Ollama latency data, traces, and logs).
Do not pull Ollama models onto nodes without user confirmation — Ollama model downloads can be large (10-100+ GB).
Do not delete Ollama models without user confirmation.
If an Ollama node shows as offline, report it to the user rather than attempting to SSH into the machine.

Ollama Failure handling

- If curl to the Ollama router fails with connection refused, tell the user the Ollama Herd router may not be running and suggest herd to start it.
If the Ollama fleet status shows 0 nodes online, suggest starting Ollama node agents with herd-node on their devices.
If Ollama mDNS discovery fails, suggest using --router-url http://router-ip:11435 for explicit connection.
If Ollama requests hang with 0 bytes returned, check if the client is sending num_ctx — Ollama context protection should strip it.
If a specific Ollama API endpoint returns an error, show the user the full error response and suggest checking the Ollama JSONL logs at ~/.fleet-manager/logs/herd.jsonl.

Ollama Herd 舰队管理器

您正在管理一个Ollama Herd舰队——一个智能的Ollama多模态路由器，可将Ollama AI工作负载分发到多个设备。Ollama Herd处理4种模型类型：Ollama LLM推理、图像生成（mflux）、语音转文本（Qwen3-ASR）和Ollama嵌入。Ollama评分引擎根据7个信号（热状态、内存适配度、队列深度、延迟历史、角色亲和性、可用性趋势、上下文适配度）评估节点，并将每个Ollama请求路由到最佳设备。

安装Ollama Herd

bash
pip install ollama-herd # 从PyPI安装Ollama Herd
herd # 启动Ollama路由器
herd-node # 启动Ollama节点代理（在每个设备上运行）

PyPI：ollama-herd | 源码：github.com/geeks-accelerator/ollama-herd

Ollama路由器端点

Ollama Herd路由器默认运行在http://localhost:11435。如果用户指定了不同的Ollama URL，则使用该URL。

Ollama API端点

使用curl与Ollama舰队交互：

Ollama舰队状态——所有Ollama节点和队列概览

bash

ollamafleetstatus — 检查Ollama节点健康状态

curl -s http://localhost:11435/fleet/status | python3 -m json.tool

- fleet.nodestotal / fleet.nodesonline — 舰队中Ollama设备数量
fleet.modelsloaded — 当前所有节点上加载的Ollama模型总数
fleet.requestsactive — 正在进行的Ollama请求总数
nodes[] — 每个节点的详细信息：Ollama状态、硬件、内存、CPU、磁盘、已加载的Ollama模型及上下文长度
queues — 每个Ollama节点:模型的队列深度（待处理、进行中、已完成、失败）

列出舰队中所有可用的Ollama模型

bash

ollamamodellist — 所有节点上的所有Ollama模型

curl -s http://localhost:11435/api/tags | python3 -m json.tool

将Ollama模型拉取到舰队

bash

ollamapullmodel — 拉取模型（自动选择最佳节点，流式传输进度）

curl -N http://localhost:11435/api/pull -d {name: codestral}

拉取到特定节点

curl -N http://localhost:11435/api/pull -d {name: llama3.3:70b, node_id: mac-studio}

非流式传输（阻塞直到完成）

curl http://localhost:11435/api/pull -d {name: phi4, stream: false}

列出当前加载到内存中的Ollama模型

bash

ollamaloadedmodels — GPU内存中的热Ollama模型

curl -s http://localhost:11435/api/ps | python3 -m json.tool

兼容OpenAI的Ollama模型列表

bash curl -s http://localhost:11435/v1/models | python3 -m json.tool

Ollama使用统计（每个节点、每个模型的每日汇总）

bash curl -s http://localhost:11435/dashboard/api/usage | python3 -m json.tool

最近的Ollama请求追踪

bash

ollama_traces — 最近的Ollama路由决策

curl -s http://localhost:11435/dashboard/api/traces?limit=20 | python3 -m json.tool

返回最近N个Ollama路由决策，包含：请求的模型、选择的节点、评分、延迟、令牌数、重试/回退状态、标签。

Ollama舰队健康分析

bash curl -s http://localhost:11435/dashboard/api/health | python3 -m json.tool

返回15项自动化Ollama健康检查：离线/降级节点、内存压力、未充分利用的节点、VRAM回退、KV缓存膨胀（OLLAMANUMPARALLEL过高）、版本不匹配、上下文保护、僵尸清理、Ollama模型抖动、请求超时、错误率、重试率、客户端断开连接和不完整流。

Ollama模型推荐

bash curl -s http://localhost:11435/dashboard/api/recommendations | python3 -m json.tool

返回基于硬件能力、Ollama使用模式和精选基准数据的每个节点AI驱动Ollama模型组合推荐。

Ollama设置

bash

查看当前Ollama配置和节点版本

curl -s http://localhost:11435/dashboard/api/settings | python3 -m json.tool

切换Ollama运行时设置（autopull, vramfallback）

curl -s -X POST http://localhost:11435/dashboard/api/settings \ -H Content-Type: application/json \ -d {auto_pull: false}

Ollama模型管理

bash

查看每个节点的Ollama模型详细信息，包含大小和使用情况

curl -s http://localhost:11435/dashboard/api/model-management | python3 -m json.tool

将Ollama模型拉取到特定节点

curl -s -X POST http://localhost:11435/dashboard/api/pull \ -H Content-Type: application/json \ -d {model: llama3.3:70b, node_id: mac-studio}

从特定节点删除Ollama模型

curl -s -X POST http://localhost:11435/dashboard/api/delete \ -H Content-Type: application/json \ -d {model: old-model:7b, node_id: mac-studio}

Ollama模型洞察（汇总统计）

bash curl -s http://localhost:11435/dashboard/api/models | python3 -m json.tool

每个应用的Ollama分析（需要请求标签）

bash curl -s http://localhost:11435/dashboard/api/apps | python3 -m json.tool

Ollama仪表板

Ollama Web仪表板位于http://localhost:11435/dashboard。它有八个标签页：

- 舰队概览 — 通过SSE实时显示Ollama节点卡片、队列深度和请求计数
趋势 — Ollama每小时请求数、平均延迟和令牌吞吐量图表（24小时至7天）
模型洞察 — 每个Ollama模型的延迟、令牌/秒、使用比较
应用 — 每个标签的Ollama分析，包含请求量、延迟、令牌数、错误率
基准测试 — Ollama容量随时间增长，包含每次运行的吞吐量和延迟百分位数
健康 — 15项自动化Ollama舰队健康检查，包含严重级别
推荐 — 每个节点的Ollama模型组合推荐，支持一键拉取
设置 — Ollama运行时切换开关、只读配置表和节点版本跟踪

引导用户在浏览器中打开此URL以进行可视化Ollama监控。

Ollama弹性功能

- 自动重试 — 如果Ollama节点在第一个响应块之前失败，重新评分并在次优Ollama节点上重试（最多2次重试）
Ollama模型回退 — 客户端指定备用Ollama模型；当主模型不可用时尝试替代方案
上下文保护 — 在不需要时从Ollama请求中移除num_ctx以防止Ollama模型重新加载挂起；自动升级到更大的已加载模型
VRAM感知回退 — 路由到同一类别中已加载的Ollama模型，而不是冷加载
僵尸清理 — 后台任务检测并清理卡住的进行中Ollama请求
自动拉取 — 自动将缺失的Ollama模型拉取到最佳可用节点

常见Ollama任务

检查Ollama舰队是否健康

1. 访问/fleet/status并验证nodes_online > 0
访问/dashboard/api/health获取带严重级别的自动化Ollama健康检查
查看Ollama队列深度——深度队列可能表示瓶颈

查找哪个Ollama节点拥有特定模型

1. 访问/fleet/status并检查每个Ollama节点的ollama.modelsloaded和ollama.modelsavailable
或访问/api/tags获取所有可用Ollama模型的平面列表

ollama-herdOllama模型路由

ollama-herd

Ollama Herd Fleet Manager

Install Ollama Herd

Ollama Router endpoint

Ollama API endpoints

Ollama fleet status — overview of all Ollama nodes and queues

List all Ollama models available across the fleet

Pull an Ollama model onto the fleet

List Ollama models currently loaded in memory

OpenAI-compatible Ollama model list

Ollama usage statistics (per-node, per-model daily aggregates)

Recent Ollama request traces

Ollama fleet health analysis

Ollama model recommendations

Ollama settings

Ollama model management

Ollama model insights (summary statistics)

Per-app Ollama analytics (requires request tagging)

Ollama Dashboard

Ollama Resilience features

Common Ollama tasks

Check if the Ollama fleet is healthy

Find which Ollama node has a specific model

Check if an Ollama model is loaded (hot) or cold

View recent Ollama inference activity

Diagnose slow Ollama responses

Query the Ollama trace database directly

Test Ollama inference through the fleet

Ollama Guardrails

Ollama Failure handling

Ollama Herd 舰队管理器

安装Ollama Herd

Ollama路由器端点

Ollama API端点

Ollama舰队状态——所有Ollama节点和队列概览

ollamafleetstatus — 检查Ollama节点健康状态

列出舰队中所有可用的Ollama模型

ollamamodellist — 所有节点上的所有Ollama模型

将Ollama模型拉取到舰队

ollamapullmodel — 拉取模型（自动选择最佳节点，流式传输进度）

拉取到特定节点

非流式传输（阻塞直到完成）

列出当前加载到内存中的Ollama模型

ollamaloadedmodels — GPU内存中的热Ollama模型

兼容OpenAI的Ollama模型列表

Ollama使用统计（每个节点、每个模型的每日汇总）

最近的Ollama请求追踪

ollama_traces — 最近的Ollama路由决策

Ollama舰队健康分析

Ollama模型推荐

Ollama设置

查看当前Ollama配置和节点版本

切换Ollama运行时设置（autopull, vramfallback）

Ollama模型管理

查看每个节点的Ollama模型详细信息，包含大小和使用情况

将Ollama模型拉取到特定节点

从特定节点删除Ollama模型

Ollama模型洞察（汇总统计）

每个应用的Ollama分析（需要请求标签）

Ollama仪表板

Ollama弹性功能

常见Ollama任务

检查Ollama舰队是否健康

查找哪个Ollama节点拥有特定模型

标签

通过对话安装

方式一：安装 SkillHub 和技能

方式二：设置 SkillHub 为优先技能安装源

通过命令行安装

下载

相关推荐

self-improvement

self-improvement

self-improvement

self-improvement