Setup
On first use, read setup.md and establish activation behavior, system scope, and data constraints before proposing implementation steps.
When to Use
User needs to create, redesign, or scale a search engine for applications, documentation, products, or internal knowledge bases. Agent handles architecture planning, indexing strategy, retrieval design, relevance controls, evaluation loops, and rollout safety.
Architecture
Memory lives in ~/search-engine/. See memory-template.md for baseline structure and status values.
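```text
~/search-engine/
|-- memory.md          # Persistent context, constraints, and current priorities
|-- requirements.md    # Retrieval goals, latency targets, and relevance expectations
|-- experiments.md     # Offline experiments and tuning decisions
`-- incidents.md       # Production issues, root causes, and fixes
```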
Quick Reference
Use the smallest relevant file for the task.
| Topic | File |
|---|---|
| Setup and activation behavior | setup.md |
| Memory template and status model | memory-template.md |
| Architecture options and component choices | architecture-blueprint.md |
| Retrieval and ranking strategy patterns | retrieval-patterns.md |
| Quality measurement and evaluation loops | evaluation-metrics.md |
| Delivery and rollout gates | implementation-checklist.md |
Data Storage
Local notes stay in ~/search-engine/:
- requirements and relevance objectives
- data source assumptions and indexing decisions
- experiment outcomes and deployment safeguards
Core Rules
1. Start with a Retrieval Contract, Not with Tools
Before selecting engines, define the contract:
- query types to support (keyword, phrase, semantic, hybrid)
- response format, latency budget, and freshness target
- error tolerance and fallback behavior
A search engine without a contract becomes an untestable collection of features.
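One way to make such a contract testable is to record it as data and check observed behavior against it. The sketch below is only illustrative: the `RetrievalContract` class, its field names, and the example thresholds are assumptions, not part of this skill's files.

```python
from dataclasses import dataclass
from enum import Enum


class QueryType(Enum):
    KEYWORD = "keyword"
    PHRASE = "phrase"
    SEMANTIC = "semantic"
    HYBRID = "hybrid"


@dataclass(frozen=True)
class RetrievalContract:
    """Hypothetical contract record; all field names are illustrative."""
    query_types: frozenset        # supported QueryType values
    response_fields: tuple        # fields every result must carry
    latency_budget_ms: int        # p95 latency the service must stay under
    freshness_target_s: int       # maximum acceptable indexing lag
    max_error_rate: float         # tolerated fraction of failed queries
    fallback: str                 # e.g. "lexical_only" if the semantic path fails

    def violations(self, observed_p95_ms, observed_index_lag_s, observed_error_rate):
        """Return the names of contract terms that the observations break."""
        problems = []
        if observed_p95_ms > self.latency_budget_ms:
            problems.append("latency")
        if observed_index_lag_s > self.freshness_target_s:
            problems.append("freshness")
        if observed_error_rate > self.max_error_rate:
            problems.append("error_rate")
        return problems


# Example values are placeholders, not recommendations.
contract = RetrievalContract(
    query_types=frozenset({QueryType.KEYWORD, QueryType.HYBRID}),
    response_fields=("id", "title", "snippet"),
    latency_budget_ms=200,
    freshness_target_s=300,
    max_error_rate=0.01,
    fallback="lexical_only",
)
```

With the contract written down this way, any engine choice can be judged by whether `contract.violations(...)` stays empty under realistic load.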
2. Design Ingestion and Indexing as a Deterministic Pipeline
Every document should pass explicit stages:
- ingestion source validation and deduplication
- normalization and field extraction
- chunking policy with stable identifiers
- indexing with repeatable transforms
Deterministic pipelines reduce drift between environments and simplify debugging.
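The stages above can be sketched as a small pure-function pipeline. This is a minimal illustration under stated assumptions: the function names, the fixed-size word-window chunking policy, and the 16-character id length are all hypothetical choices, not requirements of this skill.

```python
import hashlib
import unicodedata


def normalize(text):
    """Apply Unicode NFKC normalization and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFKC", text).split())


def chunk(doc_id, text, max_words=50):
    """Split text into fixed-size word windows with content-derived ids."""
    words = normalize(text).split()
    chunks = []
    for start in range(0, len(words), max_words):
        body = " ".join(words[start:start + max_words])
        # The id depends only on the document id, offset, and content,
        # so re-running the pipeline yields the same identifiers.
        digest = hashlib.sha256(f"{doc_id}:{start}:{body}".encode("utf-8")).hexdigest()[:16]
        chunks.append({"chunk_id": f"{doc_id}-{digest}", "text": body})
    return chunks


def ingest(docs, max_words=50):
    """Drop documents whose normalized content was already seen, then chunk."""
    seen = set()
    out = []
    for doc_id, text in docs:
        content_hash = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        if content_hash in seen:
            continue  # duplicate content under a different id
        seen.add(content_hash)
        out.extend(chunk(doc_id, text, max_words))
    return out
```

Because every stage is a pure function of its input, running `ingest` twice on the same documents produces identical chunks and ids, which is the property that keeps environments from drifting apart.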
3. Separate Recall Layers from Precision Layers
Treat retrieval as a staged system:
- broad candidate retrieval first (lexical, vector, or hybrid)
- reranking and business rules second
- formatting and explanation last
Mixing all concerns in one step hides failures and makes tuning unpredictable.
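A toy version of the staged layout might look like the following. The term-overlap scoring, the `boosts` mapping, and all function names are assumptions made for illustration; a real system would substitute its own recall and reranking logic at each stage.

```python
def retrieve_candidates(query, index, limit=100):
    """Recall stage: keep any document sharing at least one query term."""
    terms = set(query.lower().split())
    hits = []
    for doc in index:
        overlap = len(terms & set(doc["text"].lower().split()))
        if overlap:
            hits.append((doc, float(overlap)))
    # Sort by score, then by id so that ties resolve deterministically.
    hits.sort(key=lambda pair: (-pair[1], pair[0]["id"]))
    return hits[:limit]


def rerank(candidates, boosts):
    """Precision stage: apply per-document business boosts to recall scores."""
    scored = [(doc, score * boosts.get(doc["id"], 1.0)) for doc, score in candidates]
    scored.sort(key=lambda pair: (-pair[1], pair[0]["id"]))
    return scored


def format_results(ranked, top_k=3):
    """Presentation stage: shape the final payload without changing the order."""
    return [{"id": doc["id"], "score": round(score, 3)} for doc, score in ranked[:top_k]]


index = [
    {"id": "d1", "text": "fast search engine"},
    {"id": "d2", "text": "search tips"},
    {"id": "d3", "text": "cooking recipes"},
]
candidates = retrieve_candidates("search engine", index)
results = format_results(rerank(candidates, boosts={"d2": 3.0}))
```

Keeping the stages as separate functions means a bad result can be traced to a specific layer: a missing document is a recall problem, a misordered one is a reranking problem.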
4. Define Relevance Features as Versioned Policy
Relevance changes must be tracked as policy versions:
- feature weights and boosts
- typo tolerance and synonym policy
- filtering, faceting, and tie-break rules
Never ship silent relevance changes without versioned notes and measured deltas.
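As a sketch of what "versioned policy" could mean in practice, the settings can be held in one immutable record whose content is fingerprinted. The `RelevancePolicy` class and its fields are hypothetical; the point is only that a settings change under an unchanged version label becomes detectable.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RelevancePolicy:
    """Hypothetical policy record; field names are illustrative."""
    version: str
    field_weights: dict
    typo_tolerance: int
    synonyms: dict
    tie_break: tuple

    def fingerprint(self):
        """Hash every setting except the version label itself."""
        payload = asdict(self)
        payload.pop("version")
        encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
        return hashlib.sha256(encoded).hexdigest()[:12]


def diff(old, new):
    """Return {setting: (old_value, new_value)} for settings that changed."""
    a, b = asdict(old), asdict(new)
    return {k: (a[k], b[k]) for k in a if k != "version" and a[k] != b[k]}


def is_silent_change(old, new):
    """True when settings changed but the version label did not."""
    return old.version == new.version and old.fingerprint() != new.fingerprint()


v1 = RelevancePolicy("v1", {"title": 2.0, "body": 1.0}, 1, {}, ("score", "id"))
v2 = RelevancePolicy("v2", {"title": 3.0, "body": 1.0}, 1, {}, ("score", "id"))
```

The output of `diff(v1, v2)` is a natural thing to paste into the versioned notes next to the measured deltas.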
5. Evaluate Offline Before Production Writes
For each relevance or indexing change:
- run benchmark queries with labeled expectations
- measure hit quality, ordering quality, and coverage
- compare against current baseline and note regressions
If evaluation evidence is weak, keep the current configuration and iterate.
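A minimal offline comparison could look like the sketch below. It uses recall@k for hit quality, mean reciprocal rank for ordering quality, and the share of queries returning anything for coverage; these particular metrics and the function names are assumptions, and evaluation-metrics.md may prescribe different ones.

```python
def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant ids that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1 / position of the first relevant result, or 0.0 if none is found."""
    for position, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / position
    return 0.0


def evaluate(run, labels, k=5):
    """run maps query -> ranked ids; labels maps query -> set of relevant ids."""
    n = len(labels)
    return {
        "recall@k": sum(recall_at_k(run.get(q, []), rel, k) for q, rel in labels.items()) / n,
        "mrr": sum(reciprocal_rank(run.get(q, []), rel) for q, rel in labels.items()) / n,
        "coverage": sum(1 for q in labels if run.get(q)) / n,
    }


def regressions(baseline, candidate, tolerance=0.0):
    """Names of metrics where the candidate falls below baseline - tolerance."""
    return sorted(m for m in baseline if candidate[m] < baseline[m] - tolerance)


labels = {"q1": {"a"}, "q2": {"b", "c"}}
baseline = evaluate({"q1": ["a", "x"], "q2": ["b", "c"]}, labels, k=2)
candidate = evaluate({"q1": ["x", "a"], "q2": ["b", "c"]}, labels, k=2)
```

Here the candidate keeps recall but pushes the relevant result for `q1` down one position, so `regressions(baseline, candidate)` flags `mrr` and the current configuration would stay in place.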
6. Build Idempotent Index Operations and Safe Rollback
Index updates must be replay-safe:
- stable document ids and version checks
- resumable batch jobs with checkpoints
- alias-based or dual-index rollback plan
Without idempotency and rollback, incident recovery becomes guesswork.
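The three properties above can be illustrated with an in-memory stand-in for a search index. Everything here (the `Index` and `Aliases` classes, the checkpoint dictionary, the batch size) is a hypothetical sketch; a real engine would provide its own versioning and alias mechanisms.

```python
class Index:
    """In-memory stand-in for a search index, keyed by stable document id."""

    def __init__(self):
        self.docs = {}  # doc_id -> (version, body)

    def upsert(self, doc_id, version, body):
        """Write only if the incoming version is newer; return True if written."""
        current = self.docs.get(doc_id)
        if current is not None and current[0] >= version:
            return False  # replayed or stale write, safely ignored
        self.docs[doc_id] = (version, body)
        return True


def run_batch(index, records, checkpoint, batch_size=2):
    """Apply records in batches, resuming from checkpoint['offset'] if present."""
    applied = 0
    start = checkpoint.get("offset", 0)
    for begin in range(start, len(records), batch_size):
        for doc_id, version, body in records[begin:begin + batch_size]:
            if index.upsert(doc_id, version, body):
                applied += 1
        # Persisting this offset after each batch is what makes the job resumable.
        checkpoint["offset"] = min(begin + batch_size, len(records))
    return applied


class Aliases:
    """Tracks which physical index an alias points to, plus the previous target."""

    def __init__(self):
        self.current = {}
        self.previous = {}

    def point(self, alias, index_name):
        if alias in self.current:
            self.previous[alias] = self.current[alias]
        self.current[alias] = index_name

    def rollback(self, alias):
        """Swap the alias back to its previous target (KeyError if there is none)."""
        self.current[alias] = self.previous[alias]
```

Replaying the same records against the same index writes nothing the second time, and rolling back is a single alias swap rather than a reindex.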
7. Match Complexity to Workload Reality
Use the minimum architecture that meets requirements:
- avoid distributed complexity for small datasets
- avoid simplistic models for multilingual or high-noise corpora
- revisit design as scale and usage patterns change
Over-engineering and under-engineering both create expensive rework.
Common Traps
- Starting with vendor selection before defining retrieval requirements -> architecture lock-in with unclear success criteria
- Indexing raw data without field-level normalization -> poor filters, weak facets, and noisy matching
- Tuning relevance on one happy-path query set -> brittle results in real user traffic
- Applying business boosts without guardrails -> top results become commercially biased and less useful
- Shipping retrieval changes without offline baseline comparison -> regressions discovered only by users
- Running full reindex jobs without resumability -> long outages and partial data corruption
- Ignoring multilingual tokenization differences -> severe precision drop for non-English users
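
The multilingual tokenization trap is easy to reproduce. The sketch below compares whitespace splitting with character bigrams, which is one simple fallback for text written without spaces; both helper names are hypothetical, and a language-aware analyzer would usually be a better choice than either.

```python
def whitespace_tokens(text):
    """Tokenize by lowercasing and splitting on whitespace."""
    return text.lower().split()


def char_bigrams(text):
    """Tokenize into overlapping two-character windows, ignoring whitespace."""
    compact = "".join(text.split())
    if len(compact) < 2:
        return [compact] if compact else []
    return [compact[i:i + 2] for i in range(len(compact) - 1)]


english = whitespace_tokens("search engine design")
chinese_naive = whitespace_tokens("搜索引擎设计")
chinese_bigrams = char_bigrams("搜索引擎设计")
```

Whitespace splitting yields three tokens for the English phrase but a single token for the Chinese one, so a query for 搜索 would only match that document through its bigrams.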
Security & Privacy
Data that leaves your machine:
- none by default in this instruction set
- only user-approved integration traffic when the user explicitly connects external services
Data that stays local:
- planning notes and experiment logs under ~/search-engine/
- constraints, relevance decisions, and rollback records
This skill does NOT:
- collect unrelated files or credentials
- require hidden network calls
- bypass user-confirmed environment boundaries
Related Skills
Install with clawhub install <slug> if the user confirms:
- api - Define stable APIs for indexing, querying, and retrieval orchestration
- elasticsearch - Implement production indexing and query execution on Elasticsearch
- meilisearch - Ship lightweight retrieval stacks with fast iteration cycles
- engineering - Structure implementation workstreams and technical decision logs
- software-engineer - Improve delivery quality with testable architecture and rollout discipline
Feedback
- If useful: clawhub star search-engine
- Stay updated: clawhub sync