Search Engine

Setup

On first use, read setup.md and establish activation behavior, system scope, and data constraints before proposing implementation steps.

When to Use

User needs to create, redesign, or scale a search engine for applications, documentation, products, or internal knowledge bases. Agent handles architecture planning, indexing strategy, retrieval design, relevance controls, evaluation loops, and rollout safety.

Architecture

Memory lives in ~/search-engine/. See memory-template.md for baseline structure and status values.

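~/search-engine/
|-- memory.md         # durable context, constraints, and current priorities
|-- requirements.md   # retrieval goals, latency targets, and relevance expectations
|-- experiments.md    # offline experiments and tuning decisions
`-- incidents.md      # production issues, root causes, and fixes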

Quick Reference

Use the smallest relevant file for the task.

Topic	File
Setup and activation behavior	setup.md
Memory template and status model	memory-template.md

Data Storage

Local notes stay in ~/search-engine/:

- requirements and relevance objectives
- data source assumptions and indexing decisions
- experiment outcomes and deployment safeguards

Core Rules

1. Start with a Retrieval Contract, Not with Tools

Before selecting engines, define the contract:

- query types to support (keyword, phrase, semantic, hybrid)
- response format, latency budget, and freshness target
- error tolerance and fallback behavior

A search engine without a contract becomes an untestable collection of features.
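A minimal sketch of what a written-down contract could look like, so that it can be checked against observed requests. The class name, field names, and the numbers in the example are illustrative assumptions, not part of this skill.

```python
from dataclasses import dataclass
from enum import Enum


class QueryType(Enum):
    KEYWORD = "keyword"
    PHRASE = "phrase"
    SEMANTIC = "semantic"
    HYBRID = "hybrid"


@dataclass(frozen=True)
class RetrievalContract:
    """Hypothetical contract record; all field names are illustrative."""
    query_types: frozenset      # QueryType values the engine must support
    response_fields: tuple      # fields every hit must carry
    latency_budget_ms: int      # agreed latency ceiling per request
    freshness_target_s: int     # max delay before an update is searchable
    fallback: str               # behavior when the primary path fails

    def violations(self, query_type, latency_ms, index_lag_s):
        """Return the contract violations for one observed request."""
        problems = []
        if query_type not in self.query_types:
            problems.append(f"unsupported query type: {query_type.value}")
        if latency_ms > self.latency_budget_ms:
            problems.append(f"latency {latency_ms}ms exceeds budget")
        if index_lag_s > self.freshness_target_s:
            problems.append(f"index lag {index_lag_s}s exceeds freshness target")
        return problems


# Example values are placeholders chosen for the sketch.
contract = RetrievalContract(
    query_types=frozenset({QueryType.KEYWORD, QueryType.PHRASE}),
    response_fields=("id", "title", "snippet"),
    latency_budget_ms=200,
    freshness_target_s=300,
    fallback="return cached results",
)
```

Calling `contract.violations(...)` on sampled traffic turns each clause of the contract into something a test can fail on.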

2. Design Ingestion and Indexing as a Deterministic Pipeline

Every document should pass explicit stages:

- ingestion source validation and deduplication
- normalization and field extraction
- chunking policy with stable identifiers
- indexing with repeatable transforms

Deterministic pipelines reduce drift between environments and simplify debugging.
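One way to make the normalization and chunking stages repeatable is to derive chunk identifiers from content rather than from run-time state. The sketch below assumes fixed-size word windows and SHA-256 based ids; the function names and the window size are hypothetical choices for illustration.

```python
import hashlib
import unicodedata


def normalize(text):
    """Repeatable normalization: Unicode NFC plus whitespace collapse."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def chunk_document(source_id, text, max_words=50):
    """Split normalized text into word windows with content-derived ids."""
    words = normalize(text).split()
    chunks = []
    for position, start in enumerate(range(0, len(words), max_words)):
        body = " ".join(words[start:start + max_words])
        # The id depends only on source, position, and content, so the same
        # input yields the same id in every environment and on every rerun.
        key = f"{source_id}:{position}:{body}".encode("utf-8")
        chunks.append({
            "id": hashlib.sha256(key).hexdigest()[:16],
            "source": source_id,
            "position": position,
            "text": body,
        })
    return chunks
```

Because the ids are a pure function of the input, two runs over the same document can be compared chunk by chunk when debugging drift.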

3. Separate Recall Layers from Precision Layers

Treat retrieval as a staged system:

- broad candidate retrieval first (lexical, vector, or hybrid)
- reranking and business rules second
- formatting and explanation last

Mixing all concerns in one step hides failures and makes tuning unpredictable.
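The staging can be sketched as three small functions that each own one concern. Term overlap stands in for the recall stage and an `in_stock` boost stands in for a business rule; both, along with every name below, are simplifying assumptions rather than a recommended scoring scheme.

```python
def retrieve_candidates(query, documents, limit=100):
    """Recall stage: keep any document sharing at least one query term."""
    terms = set(query.lower().split())
    hits = []
    for doc in documents:
        overlap = len(terms & set(doc["text"].lower().split()))
        if overlap:
            hits.append({"doc": doc, "recall_score": overlap})
    hits.sort(key=lambda h: h["recall_score"], reverse=True)
    return hits[:limit]


def rerank(candidates, boost_field="in_stock", boost=0.5):
    """Precision stage: apply a business rule on top of the recall score."""
    for hit in candidates:
        bonus = boost if hit["doc"].get(boost_field) else 0.0
        hit["final_score"] = hit["recall_score"] + bonus
    return sorted(candidates, key=lambda h: h["final_score"], reverse=True)


def format_results(ranked, top_k=3):
    """Presentation stage: expose ids plus the scores that explain the order."""
    return [
        {
            "id": h["doc"]["id"],
            "score": h["final_score"],
            "recall_score": h["recall_score"],
        }
        for h in ranked[:top_k]
    ]


docs = [
    {"id": "a", "text": "red running shoes", "in_stock": False},
    {"id": "b", "text": "red shoes", "in_stock": True},
    {"id": "c", "text": "blue hat", "in_stock": True},
]
results = format_results(rerank(retrieve_candidates("red shoes", docs)))
```

Keeping `recall_score` next to the final score in the output makes it visible whether a ranking problem came from recall or from the reranking rule.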

4. Define Relevance Features as Versioned Policy

Relevance changes must be tracked as policy versions:

- feature weights and boosts
- typo tolerance and synonym policy
- filtering, faceting, and tie-break rules

Never ship a relevance change without versioned notes and measured deltas.
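A versioned policy can be as simple as an immutable record plus a fingerprint of its settings. In this sketch the field names, the version labels, and the helper functions are all hypothetical; the point is only that a settings change under an unchanged version label becomes detectable.

```python
import hashlib
import json
from dataclasses import asdict, dataclass

# Fields that describe the policy rather than configure it.
_METADATA = ("version", "notes")


@dataclass(frozen=True)
class RelevancePolicy:
    """Hypothetical policy record; field names are illustrative."""
    version: str
    field_weights: dict
    typo_tolerance: int
    synonyms: dict
    tie_break: tuple
    notes: str = ""

    def fingerprint(self):
        """Hash of the settings only, ignoring version label and notes."""
        settings = {k: v for k, v in asdict(self).items() if k not in _METADATA}
        encoded = json.dumps(settings, sort_keys=True).encode("utf-8")
        return hashlib.sha256(encoded).hexdigest()[:12]


def diff_policies(old, new):
    """Map each changed setting to its (old, new) values."""
    before, after = asdict(old), asdict(new)
    return {
        key: (before[key], after[key])
        for key in before
        if key not in _METADATA and before[key] != after[key]
    }


def is_silent_change(old, new):
    """True when settings changed but the version label did not."""
    return old.version == new.version and old.fingerprint() != new.fingerprint()
```

The output of `diff_policies` is a natural thing to paste into the versioned notes next to the measured deltas.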

5. Evaluate Offline Before Production Writes

For each relevance or indexing change:

- run benchmark queries with labeled expectations
- measure hit quality, ordering quality, and coverage
- compare against current baseline and note regressions

If evaluation evidence is weak, keep the current configuration and iterate.
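The comparison against a baseline can be sketched with three common measures: precision at k for hit quality, mean reciprocal rank for ordering quality, and the share of queries that return anything for coverage. The choice of these metrics and all function names are assumptions for illustration; a real benchmark may need graded labels or other measures.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top k results that are labeled relevant."""
    if k <= 0:
        return 0.0
    return sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids) / k


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant result, or 0.0 if none appears."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def evaluate(run, labels, k=3):
    """Average the per-query metrics over every labeled query."""
    queries = list(labels)
    n = len(queries)
    return {
        "precision_at_k": sum(precision_at_k(run.get(q, []), labels[q], k) for q in queries) / n,
        "mrr": sum(reciprocal_rank(run.get(q, []), labels[q]) for q in queries) / n,
        "coverage": sum(1 for q in queries if run.get(q)) / n,
    }


def regressions(baseline, candidate, tolerance=0.0):
    """Metrics where the candidate is worse than baseline beyond tolerance."""
    return {
        metric: (baseline[metric], candidate[metric])
        for metric in baseline
        if candidate[metric] < baseline[metric] - tolerance
    }


# Toy labeled queries and two runs, invented for the sketch.
labels = {"q1": {"a", "b"}, "q2": {"c"}}
baseline_run = {"q1": ["a", "x", "b"], "q2": ["y", "c"]}
candidate_run = {"q1": ["x", "a", "b"], "q2": ["y", "z", "c"]}

report = regressions(evaluate(baseline_run, labels), evaluate(candidate_run, labels))
```

A non-empty `report` is the signal to keep the current configuration and iterate.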

6. Build Idempotent Index Operations and Safe Rollback

Index updates must be replay-safe:

- stable document ids and version checks
- resumable batch jobs with checkpoints
- alias-based or dual-index rollback plan

Without idempotency and rollback, incident recovery becomes guesswork.
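The three properties above can be illustrated with an in-memory stand-in for an index. Everything here (`InMemoryIndex`, `run_batches`, `AliasTable`, the checkpoint dictionary) is hypothetical scaffolding for the sketch, not the API of any real search engine.

```python
class InMemoryIndex:
    """Stand-in for a real index, only here to make the sketch runnable."""

    def __init__(self):
        self.docs = {}

    def upsert(self, doc):
        """Apply a write only if it is newer than the stored version."""
        current = self.docs.get(doc["id"])
        if current is not None and current["version"] >= doc["version"]:
            return False  # replayed or stale write has no effect
        self.docs[doc["id"]] = doc
        return True


def run_batches(index, batches, checkpoint):
    """Process batches in order, skipping those the checkpoint marks done."""
    for number, batch in enumerate(batches):
        if number < checkpoint.get("next_batch", 0):
            continue
        for doc in batch:
            index.upsert(doc)
        # Record progress only after the whole batch has been applied.
        checkpoint["next_batch"] = number + 1
    return checkpoint


class AliasTable:
    """Maps a public alias to a concrete index name, keeping one step back."""

    def __init__(self):
        self.aliases = {}
        self.previous = {}

    def switch(self, alias, index_name):
        self.previous[alias] = self.aliases.get(alias)
        self.aliases[alias] = index_name

    def rollback(self, alias):
        target = self.previous.get(alias)
        if target is None:
            raise ValueError(f"no previous index recorded for alias {alias!r}")
        self.aliases[alias] = target
```

Replaying every batch from an empty checkpoint leaves the index unchanged because stale versions are rejected, and `rollback` restores the previous alias target without touching documents.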

7. Match Complexity to Workload Reality

Use the minimum architecture that meets requirements:

- avoid distributed complexity for small datasets
- avoid simplistic models for multilingual or high-noise corpora
- revisit design as scale and usage patterns change

Over-engineering and under-engineering both create expensive rework.

Common Traps

- Starting with vendor selection before defining retrieval requirements -> architecture lock-in with unclear success criteria
- Indexing raw data without field-level normalization -> poor filters, weak facets, and noisy matching
- Tuning relevance on one happy-path query set -> brittle results in real user traffic
- Applying business boosts without guardrails -> top results become commercially biased and less useful
- Shipping retrieval changes without offline baseline comparison -> regressions discovered only by users
- Running full reindex jobs without resumability -> long outages and partial data corruption
- Ignoring multilingual tokenization differences -> severe precision drop for non-English users

Security & Privacy

Data that leaves your machine:

- none by default in this instruction set
- only user-approved integration traffic when the user explicitly connects external services

Data that stays local:

- planning notes and experiment logs under ~/search-engine/
- constraints, relevance decisions, and rollback records

This skill does NOT:

- collect unrelated files or credentials
- require hidden network calls
- bypass user-confirmed environment boundaries

Related Skills

Install with clawhub install <slug> if user confirms:

- api - Define stable APIs for indexing, querying, and retrieval orchestration
- INLINECODE13 - Implement production indexing and query execution on Elasticsearch
- INLINECODE14 - Ship lightweight retrieval stacks with fast iteration cycles
- INLINECODE15 - Structure implementation workstreams and technical decision logs
- INLINECODE16 - Improve delivery quality with testable architecture and rollout discipline

Feedback

- If useful: clawhub star search-engine
- Stay updated: clawhub sync

