When to Use
User needs to install, run, integrate, tune, or debug Ollama for local or self-hosted model workflows. Agent handles smoke tests, model selection, API usage, Modelfile customization, embeddings, RAG fit checks, and safe operations.
Use this instead of generic AI advice when the blocker is specific to local runtime behavior: wrong model tag, broken JSON output, poor retrieval, slow inference, context sizing, GPU fallback, or unsafe remote exposure.
Architecture
Memory lives in ~/ollama/. If ~/ollama/ does not exist, run setup.md. See memory-template.md for structure.
CODEBLOCK0
Quick Reference
Load only the file needed for the current blocker.
| Topic | File |
|---|
| Setup guide | INLINECODE4 |
| Memory template |
memory-template.md |
| Install and smoke-test workflow |
install-and-smoke-test.md |
| Local API and OpenAI-compatible patterns |
api-patterns.md |
| Modelfile creation and context control |
modelfile-workflows.md |
| Embeddings and local RAG checks |
embeddings-and-rag.md |
| Runtime operations and performance tuning |
operations-and-performance.md |
| Failure recovery and incident triage |
troubleshooting.md |
Requirements
- - Local
ollama access on the target machine, or permission to guide installation. - Enough RAM, VRAM, and disk for the exact model and context window being proposed.
- Explicit user approval before exposing Ollama beyond localhost, changing service managers, or deleting model files.
- Exact model tags and runtime facts must be verified with live commands such as
ollama list, ollama ps, and ollama show.
Never assume model capabilities, context length, quantization, or GPU usage from memory alone.
Operating Coverage
This skill is for practical Ollama execution, not abstract local-LLM discussion. It covers:
- - local installs on macOS, Linux, and Windows
- CLI workflows for pull, run, copy, show, create, and remove
- REST API usage on
http://127.0.0.1:11434/api and OpenAI-compatible usage on INLINECODE17 - hardware-aware model sizing, context tuning, and throughput tradeoffs
- Modelfile-based customization for prompts, parameters, adapters, and reproducible model names
- embeddings and local RAG pipelines where indexing, querying, and retrieval must stay consistent
Data Storage
Keep only durable operational context in ~/ollama/:
- - host facts that materially change advice: OS, GPU class, CPU-only constraints, service manager, remote or local deployment
- approved model tags, copied aliases, quant choices, and context limits that worked in practice
- Modelfile defaults, JSON output patterns, and safe OpenAI-compatible mappings
- embedding model choices, vector dimensions, chunking defaults, and retrieval checks
- recurring failures such as partial pulls, CPU fallback, port conflicts, or broken upgrades
Core Rules
1. Verify the Runtime Before Giving Advice
- - Confirm
ollama is installed and reachable before proposing any deeper fix. - Start with the smallest factual checks:
ollama --version, ollama list, ollama ps, and one minimal generation or /api/tags request. - Treat "it runs" and "it runs correctly" as different states.
2. Pin Exact Model Names and Inspect Them Live
- - Use exact tags, not vague family names, for anything reproducible or production-adjacent.
- Inspect the real model with
ollama show or /api/show before claiming context length, quantization, or capabilities. - Avoid silent drift from floating tags when stability matters.
3. Separate Runtime, Modelfile, and App Prompt Responsibilities
- - Debug local behavior in layers: runtime first, then model definition, then application prompt.
- If output quality changed, check whether
SYSTEM, TEMPLATE, or PARAMETER settings in the Modelfile are fighting the app prompt. - Put durable defaults in a named model, not in ad hoc copy-pasted prompts.
4. Choose Models by Hardware and Latency Budget
- - A model that technically loads but falls back to CPU or swaps memory is not a good fit.
- Use
ollama ps to confirm processor split before promising performance. - Keep separate defaults for chat, coding, extraction, vision, and embeddings instead of forcing one model to do everything.
5. Make API and Structured Output Flows Deterministic
- - Prefer non-streaming responses when the next step needs strict parsing.
- Use
format: "json" or a JSON schema, set low temperature, and validate the parsed result before taking downstream actions. - For OpenAI-compatible clients, verify
/v1 assumptions instead of assuming every feature maps 1:1.
6. Treat Embeddings and RAG as a Single System
- - Use the same embedding model for indexing and querying unless you intentionally migrate and re-index.
- Inspect retrieved chunks before blaming the model for weak answers.
- Fix chunking, metadata, top-k, and vector dimensions before increasing prompt size.
7. Treat Remote Access and Upgrades as Operational Changes
- - Do not bind Ollama to non-localhost or open port
11434 without explicit approval and a minimal-risk network plan. - Record service manager changes, environment variables, and rollback steps before upgrading.
- Protect model storage and disk headroom before large pulls or replacements.
Ollama Traps
- - Using
latest everywhere -> upgrades silently change behavior and break reproducibility. - Testing only with
ollama run -> app integration still fails on /api or /v1. - Assuming slow responses mean "bad model" -> often it is CPU fallback, oversized context, or disk pressure.
- Letting app prompts and Modelfile instructions fight each other -> outputs become inconsistent and hard to debug.
- Re-indexing with one embedding model and querying with another -> retrieval quality collapses without obvious errors.
- Exposing the API on a LAN without auth or scoping -> local convenience becomes a security problem.
- Chasing larger context before fixing retrieval or prompt shape -> memory use rises while answer quality barely improves.
External Endpoints
Use external network access only when the task requires model downloads, official docs lookup, or optional cloud execution explicitly approved by the user.
| Endpoint | Data Sent | Purpose |
|---|
| https://ollama.com/* | model identifiers, optional doc queries, and optional cloud API requests | Official docs, library lookups, model pulls managed by the Ollama runtime, and optional cloud execution |
No other data is sent externally.
Security & Privacy
Data that leaves your machine:
- - model identifiers and download requests when pulling models through Ollama
- optional prompts and attachments only if the user explicitly chooses
https://ollama.com/api instead of local inference - optional documentation lookups against official Ollama pages
Data that stays local:
- - prompts and outputs served through the local Ollama runtime on the user machine
- durable workflow notes under INLINECODE38
- local Modelfiles, retrieval notes, and performance baselines unless the user exports them
This skill does NOT:
- - expose Ollama remotely without explicit approval
- store
OLLAMA_API_KEY or other secrets in skill files - mix local and cloud execution silently
- invent unsupported model features, GPU behavior, or API compatibility
- recommend remote installers or destructive cleanup without explaining risk first
Trust
By using this skill, model pulls and optional cloud requests may go to Ollama infrastructure when the user explicitly chooses those paths.
Only install if you trust Ollama with that data.
Scope
This skill ONLY:
- - installs, verifies, operates, and troubleshoots Ollama workflows
- helps choose, pin, inspect, and customize models with reproducible patterns
- keeps local memory for host constraints, model defaults, and recurring failure fixes
This skill NEVER:
- - claim that every Ollama model supports the same tools, context, or JSON reliability
- recommend unauthenticated remote exposure as a default
- treat local RAG quality as solved without checking embeddings, chunking, and retrieval results
- modify its own skill files
Related Skills
Install with
clawhub install <slug> if user confirms:
- -
ai - Frame when local Ollama is the right fit versus cloud inference. - INLINECODE42 - Compare local model families, sizes, and capability tradeoffs before pinning defaults.
- INLINECODE43 - Reuse robust HTTP request, retry, and parsing patterns around local services.
- INLINECODE44 - Extend vector search and chunking strategy beyond the Ollama runtime itself.
- INLINECODE45 - Integrate Ollama into multi-step chains, agents, and retrieval pipelines.
Feedback
- - If useful: INLINECODE46
- Stay updated: INLINECODE47
何时使用
用户需要安装、运行、集成、调优或调试Ollama,用于本地或自托管模型工作流。Agent负责处理冒烟测试、模型选择、API使用、Modelfile定制、嵌入向量、RAG适配检查以及安全操作。
当阻塞问题特定于本地运行时行为(如错误的模型标签、损坏的JSON输出、检索效果差、推理速度慢、上下文窗口大小、GPU回退或不安全的远程暴露)时,应使用此技能而非通用AI建议。
架构
内存数据存储在 ~/ollama/ 目录下。如果 ~/ollama/ 不存在,请运行 setup.md。目录结构参见 memory-template.md。
text
~/ollama/
|-- memory.md # 持久化上下文和激活边界
|-- environment.md # 主机、GPU、操作系统、运行时和服务说明
|-- model-registry.md # 已批准的模型、标签、量化版本和适配说明
|-- modelfiles.md # 可复用的Modelfile模式和参数决策
|-- rag-notes.md # 嵌入向量选择、分块策略、检索检查、向量维度
-- incident-log.md # 重复失败、修复方案和回滚记录
快速参考
仅加载当前阻塞问题所需的文件。
memory-template.md |
| 安装和冒烟测试工作流 | install-and-smoke-test.md |
| 本地API和OpenAI兼容模式 | api-patterns.md |
| Modelfile创建和上下文控制 | modelfile-workflows.md |
| 嵌入向量和本地RAG检查 | embeddings-and-rag.md |
| 运行时操作和性能调优 | operations-and-performance.md |
| 故障恢复和事件排查 | troubleshooting.md |
要求
- - 目标机器上可本地访问 ollama,或有权限指导安装。
- 有足够的RAM、VRAM和磁盘空间来运行所提议的特定模型和上下文窗口。
- 在将Ollama暴露到localhost之外、更改服务管理器或删除模型文件之前,需获得用户明确批准。
- 必须通过实时命令(如 ollama list、ollama ps 和 ollama show)验证确切的模型标签和运行时信息。
切勿仅凭内存数据假设模型能力、上下文长度、量化方式或GPU使用情况。
操作范围
此技能专注于实际的Ollama执行,而非抽象的本地LLM讨论。涵盖范围包括:
- - 在macOS、Linux和Windows上的本地安装
- pull、run、copy、show、create和remove的CLI工作流
- 在 http://127.0.0.1:11434/api 上的REST API使用以及在 /v1 上的OpenAI兼容使用
- 硬件感知的模型大小选择、上下文调优和吞吐量权衡
- 基于Modelfile的提示词、参数、适配器和可复现模型名称定制
- 嵌入向量和本地RAG管道,确保索引、查询和检索保持一致
数据存储
仅在 ~/ollama/ 中保留持久的操作上下文:
- - 会实质改变建议的主机信息:操作系统、GPU类型、仅CPU限制、服务管理器、远程或本地部署
- 已批准的模型标签、复制的别名、量化选择以及实践中有效的上下文限制
- Modelfile默认值、JSON输出模式和安全的OpenAI兼容映射
- 嵌入模型选择、向量维度、默认分块策略和检索检查
- 重复出现的故障,如部分拉取、CPU回退、端口冲突或损坏的升级
核心规则
1. 在给出建议前验证运行时
- - 在提出任何深入修复方案前,确认 ollama 已安装并可访问。
- 从最小的事实检查开始:ollama --version、ollama list、ollama ps 以及一次最小生成或 /api/tags 请求。
- 将它能运行和它能正确运行视为两种不同的状态。
2. 锁定确切的模型名称并实时检查
- - 对于任何需要可复现或接近生产环境的内容,使用确切的标签,而非模糊的系列名称。
- 在声称上下文长度、量化方式或能力之前,使用 ollama show 或 /api/show 检查实际模型。
- 当稳定性至关重要时,避免因浮动标签导致的静默漂移。
3. 分离运行时、Modelfile和应用提示词的责任
- - 分层调试本地行为:先检查运行时,然后是模型定义,最后是应用提示词。
- 如果输出质量发生变化,检查Modelfile中的 SYSTEM、TEMPLATE 或 PARAMETER 设置是否与应用提示词冲突。
- 将持久的默认值放在命名模型中,而不是临时复制粘贴的提示词中。
4. 根据硬件和延迟预算选择模型
- - 一个技术上能加载但回退到CPU或交换内存的模型并不是合适的选择。
- 在承诺性能之前,使用 ollama ps 确认处理器分配情况。
- 为聊天、编码、提取、视觉和嵌入向量分别设置默认值,而不是强制一个模型做所有事情。
5. 使API和结构化输出流程具有确定性
- - 当下一步需要严格解析时,优先使用非流式响应。
- 使用 format: json 或JSON schema,设置低温度,并在执行下游操作前验证解析结果。
- 对于OpenAI兼容的客户端,验证 /v1 的假设,而不是假设每个功能都能一一对应。
6. 将嵌入向量和RAG视为一个整体系统
- - 除非有意迁移并重新索引,否则对索引和查询使用相同的嵌入模型。
- 在将弱答案归咎于模型之前,先检查检索到的分块内容。
- 在增加提示词大小之前,先修复分块策略、元数据、top-k和向量维度。
7. 将远程访问和升级视为操作变更
- - 未经明确批准和最小风险网络计划,不得将Ollama绑定到非localhost或开放端口 11434。
- 在升级之前,记录服务管理器变更、环境变量和回滚步骤。
- 在进行大量拉取或替换之前,保护模型存储和磁盘空间。
Ollama常见陷阱
- - 到处使用 latest -> 升级会静默改变行为并破坏可复现性。
- 仅用 ollama run 测试 -> 应用集成在 /api 或 /v1 上仍然失败。
- 假设响应慢意味着模型差 -> 通常是CPU回退、上下文过大或磁盘压力导致。
- 让应用提示词和Modelfile指令相互冲突 -> 输出变得不一致且难以调试。
- 用一种嵌入模型重新索引,用另一种模型查询 -> 检索质量崩溃且无明显错误。
- 在没有认证或范围限制的情况下在LAN上暴露API -> 本地便利变成安全问题。
- 在修复检索或提示词结构之前追求更大的上下文 -> 内存使用增加而答案质量几乎没有改善。
外部端点
仅在任务需要模型下载、官方文档查询或用户明确批准的可选云执行时使用外部网络访问。
| 端点 | 发送的数据 | 目的 |
|---|
| https://ollama.com/* | 模型标识符、可选的文档查询和可选的云API请求 | 官方文档、库查询、由Ollama运行时管理的模型拉取以及可选的云执行 |
不会向外部发送其他数据。
安全与隐私
离开您机器的数据:
- - 通过Ollama拉取模型时的模型标识符和下载请求
- 仅当用户明确选择 https://ollama.com/api 而非本地推理时的可选提示词和附件
- 针对官方Ollama页面的可选文档查询
保留在本地数据:
- - 通过用户机器上的本地Ollama运行时提供的提示词和输出
- ~/ollama/ 下的持久化工作流笔记
- 本地Modelfile、检索笔记和性能基线,除非用户导出
此技能不会:
- - 未经明确批准远程暴露Ollama
- 在技能文件中存储 OLLAMAAPIKEY 或其他密钥
- 静默混合本地和云执行
- 编造不支持的模型功能、GPU行为或API兼容性
- 在不事先解释风险的情况下推荐远程安装程序或破坏性清理
信任
使用此技能时,当用户明确选择这些路径时,模型拉取和可选的云请求可能会发送到Ollama基础设施。
仅当您信任Ollama处理这些数据时才安装。
范围
此技能仅:
- - 安装、验证、操作和排查Ollama工作流
- 帮助选择、锁定、检查和定制模型,使用可复现的模式
- 为主机限制、模型默认值和重复故障修复保留本地内存
此技能绝不:
- - 声称每个Ollama模型都支持相同的工具、上下文或JSON可靠性