Ollama

When to Use

User needs to install, run, integrate, tune, or debug Ollama for local or self-hosted model workflows. Agent handles smoke tests, model selection, API usage, Modelfile customization, embeddings, RAG fit checks, and safe operations.

Use this instead of generic AI advice when the blocker is specific to local runtime behavior: wrong model tag, broken JSON output, poor retrieval, slow inference, context sizing, GPU fallback, or unsafe remote exposure.

Architecture

Memory lives in ~/ollama/. If ~/ollama/ does not exist, run setup.md. See memory-template.md for structure.

CODEBLOCK0

Quick Reference

Load only the file needed for the current blocker.

Topic	File
Setup guide	INLINECODE4
Memory template

Requirements

- Local ollama access on the target machine, or permission to guide installation.
Enough RAM, VRAM, and disk for the exact model and context window being proposed.
Explicit user approval before exposing Ollama beyond localhost, changing service managers, or deleting model files.
Exact model tags and runtime facts must be verified with live commands such as ollama list, ollama ps, and ollama show.

Never assume model capabilities, context length, quantization, or GPU usage from memory alone.

Operating Coverage

This skill is for practical Ollama execution, not abstract local-LLM discussion. It covers:

- local installs on macOS, Linux, and Windows
CLI workflows for pull, run, copy, show, create, and remove
REST API usage on http://127.0.0.1:11434/api and OpenAI-compatible usage on INLINECODE17
hardware-aware model sizing, context tuning, and throughput tradeoffs
Modelfile-based customization for prompts, parameters, adapters, and reproducible model names
embeddings and local RAG pipelines where indexing, querying, and retrieval must stay consistent

Data Storage

Keep only durable operational context in ~/ollama/:

- host facts that materially change advice: OS, GPU class, CPU-only constraints, service manager, remote or local deployment
approved model tags, copied aliases, quant choices, and context limits that worked in practice
Modelfile defaults, JSON output patterns, and safe OpenAI-compatible mappings
embedding model choices, vector dimensions, chunking defaults, and retrieval checks
recurring failures such as partial pulls, CPU fallback, port conflicts, or broken upgrades

Core Rules

1. Verify the Runtime Before Giving Advice

- Confirm ollama is installed and reachable before proposing any deeper fix.
Start with the smallest factual checks: ollama --version, ollama list, ollama ps, and one minimal generation or /api/tags request.
Treat "it runs" and "it runs correctly" as different states.

2. Pin Exact Model Names and Inspect Them Live

- Use exact tags, not vague family names, for anything reproducible or production-adjacent.
Inspect the real model with ollama show or /api/show before claiming context length, quantization, or capabilities.
Avoid silent drift from floating tags when stability matters.

3. Separate Runtime, Modelfile, and App Prompt Responsibilities

- Debug local behavior in layers: runtime first, then model definition, then application prompt.
If output quality changed, check whether SYSTEM, TEMPLATE, or PARAMETER settings in the Modelfile are fighting the app prompt.
Put durable defaults in a named model, not in ad hoc copy-pasted prompts.

4. Choose Models by Hardware and Latency Budget

- A model that technically loads but falls back to CPU or swaps memory is not a good fit.
Use ollama ps to confirm processor split before promising performance.
Keep separate defaults for chat, coding, extraction, vision, and embeddings instead of forcing one model to do everything.

5. Make API and Structured Output Flows Deterministic

- Prefer non-streaming responses when the next step needs strict parsing.
Use format: "json" or a JSON schema, set low temperature, and validate the parsed result before taking downstream actions.
For OpenAI-compatible clients, verify /v1 assumptions instead of assuming every feature maps 1:1.

6. Treat Embeddings and RAG as a Single System

- Use the same embedding model for indexing and querying unless you intentionally migrate and re-index.
Inspect retrieved chunks before blaming the model for weak answers.
Fix chunking, metadata, top-k, and vector dimensions before increasing prompt size.

7. Treat Remote Access and Upgrades as Operational Changes

- Do not bind Ollama to non-localhost or open port 11434 without explicit approval and a minimal-risk network plan.
Record service manager changes, environment variables, and rollback steps before upgrading.
Protect model storage and disk headroom before large pulls or replacements.

Ollama Traps

- Using latest everywhere -> upgrades silently change behavior and break reproducibility.
Testing only with ollama run -> app integration still fails on /api or /v1.
Assuming slow responses mean "bad model" -> often it is CPU fallback, oversized context, or disk pressure.
Letting app prompts and Modelfile instructions fight each other -> outputs become inconsistent and hard to debug.
Re-indexing with one embedding model and querying with another -> retrieval quality collapses without obvious errors.
Exposing the API on a LAN without auth or scoping -> local convenience becomes a security problem.
Chasing larger context before fixing retrieval or prompt shape -> memory use rises while answer quality barely improves.

External Endpoints

Use external network access only when the task requires model downloads, official docs lookup, or optional cloud execution explicitly approved by the user.

Endpoint	Data Sent	Purpose
https://ollama.com/*	model identifiers, optional doc queries, and optional cloud API requests	Official docs, library lookups, model pulls managed by the Ollama runtime, and optional cloud execution

No other data is sent externally.

Security & Privacy

Data that leaves your machine:

- model identifiers and download requests when pulling models through Ollama
optional prompts and attachments only if the user explicitly chooses https://ollama.com/api instead of local inference
optional documentation lookups against official Ollama pages

Data that stays local:

- prompts and outputs served through the local Ollama runtime on the user machine
durable workflow notes under INLINECODE38
local Modelfiles, retrieval notes, and performance baselines unless the user exports them

This skill does NOT:

- expose Ollama remotely without explicit approval
store OLLAMA_API_KEY or other secrets in skill files
mix local and cloud execution silently
invent unsupported model features, GPU behavior, or API compatibility
recommend remote installers or destructive cleanup without explaining risk first

Trust

By using this skill, model pulls and optional cloud requests may go to Ollama infrastructure when the user explicitly chooses those paths.
Only install if you trust Ollama with that data.

Scope

This skill ONLY:

- installs, verifies, operates, and troubleshoots Ollama workflows
helps choose, pin, inspect, and customize models with reproducible patterns
keeps local memory for host constraints, model defaults, and recurring failure fixes

This skill NEVER:

- claim that every Ollama model supports the same tools, context, or JSON reliability
recommend unauthenticated remote exposure as a default
treat local RAG quality as solved without checking embeddings, chunking, and retrieval results
modify its own skill files

Related Skills

Install with clawhub install <slug> if user confirms:

- ai - Frame when local Ollama is the right fit versus cloud inference.
INLINECODE42 - Compare local model families, sizes, and capability tradeoffs before pinning defaults.
INLINECODE43 - Reuse robust HTTP request, retry, and parsing patterns around local services.
INLINECODE44 - Extend vector search and chunking strategy beyond the Ollama runtime itself.
INLINECODE45 - Integrate Ollama into multi-step chains, agents, and retrieval pipelines.

Feedback

- If useful: INLINECODE46
Stay updated: INLINECODE47

何时使用

用户需要安装、运行、集成、调优或调试Ollama，用于本地或自托管模型工作流。Agent负责处理冒烟测试、模型选择、API使用、Modelfile定制、嵌入向量、RAG适配检查以及安全操作。

当阻塞问题特定于本地运行时行为（如错误的模型标签、损坏的JSON输出、检索效果差、推理速度慢、上下文窗口大小、GPU回退或不安全的远程暴露）时，应使用此技能而非通用AI建议。

架构

内存数据存储在 ~/ollama/ 目录下。如果 ~/ollama/ 不存在，请运行 setup.md。目录结构参见 memory-template.md。

快速参考

仅加载当前阻塞问题所需的文件。

主题	文件
安装指南	setup.md
内存模板

要求

- 目标机器上可本地访问 ollama，或有权限指导安装。
有足够的RAM、VRAM和磁盘空间来运行所提议的特定模型和上下文窗口。
在将Ollama暴露到localhost之外、更改服务管理器或删除模型文件之前，需获得用户明确批准。
必须通过实时命令（如 ollama list、ollama ps 和 ollama show）验证确切的模型标签和运行时信息。

切勿仅凭内存数据假设模型能力、上下文长度、量化方式或GPU使用情况。

操作范围

此技能专注于实际的Ollama执行，而非抽象的本地LLM讨论。涵盖范围包括：

- 在macOS、Linux和Windows上的本地安装
pull、run、copy、show、create和remove的CLI工作流
在 http://127.0.0.1:11434/api 上的REST API使用以及在 /v1 上的OpenAI兼容使用
硬件感知的模型大小选择、上下文调优和吞吐量权衡
基于Modelfile的提示词、参数、适配器和可复现模型名称定制
嵌入向量和本地RAG管道，确保索引、查询和检索保持一致

数据存储

仅在 ~/ollama/ 中保留持久的操作上下文：

- 会实质改变建议的主机信息：操作系统、GPU类型、仅CPU限制、服务管理器、远程或本地部署
已批准的模型标签、复制的别名、量化选择以及实践中有效的上下文限制
Modelfile默认值、JSON输出模式和安全的OpenAI兼容映射
嵌入模型选择、向量维度、默认分块策略和检索检查
重复出现的故障，如部分拉取、CPU回退、端口冲突或损坏的升级

核心规则

1. 在给出建议前验证运行时

- 在提出任何深入修复方案前，确认 ollama 已安装并可访问。
从最小的事实检查开始：ollama --version、ollama list、ollama ps 以及一次最小生成或 /api/tags 请求。
将它能运行和它能正确运行视为两种不同的状态。

2. 锁定确切的模型名称并实时检查

- 对于任何需要可复现或接近生产环境的内容，使用确切的标签，而非模糊的系列名称。
在声称上下文长度、量化方式或能力之前，使用 ollama show 或 /api/show 检查实际模型。
当稳定性至关重要时，避免因浮动标签导致的静默漂移。

3. 分离运行时、Modelfile和应用提示词的责任

- 分层调试本地行为：先检查运行时，然后是模型定义，最后是应用提示词。
如果输出质量发生变化，检查Modelfile中的 SYSTEM、TEMPLATE 或 PARAMETER 设置是否与应用提示词冲突。
将持久的默认值放在命名模型中，而不是临时复制粘贴的提示词中。

4. 根据硬件和延迟预算选择模型

- 一个技术上能加载但回退到CPU或交换内存的模型并不是合适的选择。
在承诺性能之前，使用 ollama ps 确认处理器分配情况。
为聊天、编码、提取、视觉和嵌入向量分别设置默认值，而不是强制一个模型做所有事情。

5. 使API和结构化输出流程具有确定性

- 当下一步需要严格解析时，优先使用非流式响应。
使用 format: json 或JSON schema，设置低温度，并在执行下游操作前验证解析结果。
对于OpenAI兼容的客户端，验证 /v1 的假设，而不是假设每个功能都能一一对应。

6. 将嵌入向量和RAG视为一个整体系统

- 除非有意迁移并重新索引，否则对索引和查询使用相同的嵌入模型。
在将弱答案归咎于模型之前，先检查检索到的分块内容。
在增加提示词大小之前，先修复分块策略、元数据、top-k和向量维度。

7. 将远程访问和升级视为操作变更

- 未经明确批准和最小风险网络计划，不得将Ollama绑定到非localhost或开放端口 11434。
在升级之前，记录服务管理器变更、环境变量和回滚步骤。
在进行大量拉取或替换之前，保护模型存储和磁盘空间。

Ollama常见陷阱

- 到处使用 latest -> 升级会静默改变行为并破坏可复现性。
仅用 ollama run 测试 -> 应用集成在 /api 或 /v1 上仍然失败。
假设响应慢意味着模型差 -> 通常是CPU回退、上下文过大或磁盘压力导致。
让应用提示词和Modelfile指令相互冲突 -> 输出变得不一致且难以调试。
用一种嵌入模型重新索引，用另一种模型查询 -> 检索质量崩溃且无明显错误。
在没有认证或范围限制的情况下在LAN上暴露API -> 本地便利变成安全问题。
在修复检索或提示词结构之前追求更大的上下文 -> 内存使用增加而答案质量几乎没有改善。

外部端点

仅在任务需要模型下载、官方文档查询或用户明确批准的可选云执行时使用外部网络访问。

端点	发送的数据	目的
https://ollama.com/*	模型标识符、可选的文档查询和可选的云API请求	官方文档、库查询、由Ollama运行时管理的模型拉取以及可选的云执行

不会向外部发送其他数据。

安全与隐私

离开您机器的数据：

- 通过Ollama拉取模型时的模型标识符和下载请求
仅当用户明确选择 https://ollama.com/api 而非本地推理时的可选提示词和附件
针对官方Ollama页面的可选文档查询

保留在本地数据：

- 通过用户机器上的本地Ollama运行时提供的提示词和输出
~/ollama/ 下的持久化工作流笔记
本地Modelfile、检索笔记和性能基线，除非用户导出

此技能不会：

- 未经明确批准远程暴露Ollama
在技能文件中存储 OLLAMAAPIKEY 或其他密钥
静默混合本地和云执行
编造不支持的模型功能、GPU行为或API兼容性
在不事先解释风险的情况下推荐远程安装程序或破坏性清理

信任

使用此技能时，当用户明确选择这些路径时，模型拉取和可选的云请求可能会发送到Ollama基础设施。
仅当您信任Ollama处理这些数据时才安装。

范围

此技能仅：

- 安装、验证、操作和排查Ollama工作流
帮助选择、锁定、检查和定制模型，使用可复现的模式
为主机限制、模型默认值和重复故障修复保留本地内存

此技能绝不：

- 声称每个Ollama模型都支持相同的工具、上下文或JSON可靠性

OllamaOllama模型管理

When to Use

Architecture

Quick Reference

Requirements

Operating Coverage

Data Storage

Core Rules

1. Verify the Runtime Before Giving Advice

2. Pin Exact Model Names and Inspect Them Live

3. Separate Runtime, Modelfile, and App Prompt Responsibilities

4. Choose Models by Hardware and Latency Budget

5. Make API and Structured Output Flows Deterministic

6. Treat Embeddings and RAG as a Single System

7. Treat Remote Access and Upgrades as Operational Changes

Ollama Traps

External Endpoints

Security & Privacy

Trust

Scope

Related Skills

Feedback

何时使用

架构

快速参考

要求

操作范围

数据存储

核心规则

1. 在给出建议前验证运行时

2. 锁定确切的模型名称并实时检查

3. 分离运行时、Modelfile和应用提示词的责任

4. 根据硬件和延迟预算选择模型

5. 使API和结构化输出流程具有确定性

6. 将嵌入向量和RAG视为一个整体系统

7. 将远程访问和升级视为操作变更

Ollama常见陷阱

外部端点

安全与隐私

信任

范围

标签

通过对话安装

方式一：安装 SkillHub 和技能

方式二：设置 SkillHub 为优先技能安装源

通过命令行安装

下载

相关推荐

self-improvement

self-improvement

self-improvement

self-improvement