Zotero Vectorize
Build and maintain a local-first, cross-platform Zotero vector store for semantic search and RAG over bibliographic metadata and PDF full text.
Keep SKILL.md focused on workflow. Read the reference files only when needed:
- `references/config.md`: paths, environment variables, output layout
- `references/data-format.md`: JSON schemas and file naming
- `references/windows.md` / `macos.md` / `linux.md`: platform-specific path defaults and notes
- `references/troubleshooting.md`: common failures and recovery
Core rules
- Treat Zotero as read-only input. Never modify the user's Zotero database or attachment storage.
- Prefer creating a database snapshot before reading.
- For incremental updates: check first, report missing items, wait for user confirmation, then apply.
- Before any update that rewrites store files: back up first, then write.
- Backup retention for this skill is fixed: keep only the latest and previous backup per file.
- Default output filenames are:
  - `metadata_vectors.json`
  - `fulltext_vectors.json`
  - `vector_store_metadata.json`
Workflow decision tree
1) Detect or confirm paths
If the Zotero data directory, database path, or storage path is unknown:
1. Read `references/config.md`
2. Read the platform-specific reference (`windows.md`, `macos.md`, or `linux.md`)
3. Run:
```bash
python scripts/detect_zotero_paths.py
```
If the detected paths are wrong, ask the user to open Zotero and use Show Data Directory, then rerun with explicit --data-dir, --db, or --storage-dir.
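As a rough illustration of what path detection involves, the sketch below derives the usual default locations from a home directory. It assumes the standard Zotero 5+ layout (a `Zotero` folder in the home directory that holds `zotero.sqlite` and a `storage` folder). The function name `default_zotero_paths` is hypothetical and is not taken from `scripts/detect_zotero_paths.py`, which may check more locations.

```python
from pathlib import Path


def default_zotero_paths(home: Path) -> dict:
    """Return the assumed default Zotero locations under a home directory.

    Assumption: the data directory is <home>/Zotero, the usual default for
    Zotero 5 and later. Custom data directories are not detected here.
    """
    data_dir = home / "Zotero"
    return {
        "data_dir": data_dir,
        "db": data_dir / "zotero.sqlite",
        "storage_dir": data_dir / "storage",
    }


if __name__ == "__main__":
    # Print each candidate path and whether it exists on this machine.
    for name, path in default_zotero_paths(Path.home()).items():
        status = "found" if path.exists() else "missing"
        print(f"{name}: {path} ({status})")
```

If any path is reported as missing, the data directory has probably been moved, which is the case the explicit flags above are meant for.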
2) Create a database snapshot
Before full builds or incremental checks, snapshot the Zotero database:
```bash
python scripts/snapshot_zotero_db.py --output-dir <output-dir>
```
If snapshotting fails because SQLite is locked, ask the user to close Zotero and retry.
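One way to take a snapshot without touching the source file is Python's built-in `sqlite3` backup API on a read-only connection. This is a minimal sketch under that assumption, not the actual contents of the snapshot script. The function name `snapshot_sqlite` and the `_snapshot.sqlite` file suffix are hypothetical.

```python
import sqlite3
from pathlib import Path


def snapshot_sqlite(db_path: Path, output_dir: Path) -> Path:
    """Copy a SQLite database into output_dir without writing to the source."""
    output_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming scheme for the snapshot file.
    target = output_dir / f"{db_path.stem}_snapshot.sqlite"
    # mode=ro opens the source read-only, so the original is never modified.
    source = sqlite3.connect(f"{db_path.resolve().as_uri()}?mode=ro", uri=True)
    try:
        dest = sqlite3.connect(target)
        try:
            # May raise sqlite3.OperationalError if the source is locked.
            source.backup(dest)
        finally:
            dest.close()
    finally:
        source.close()
    return target
```

A caller would catch `sqlite3.OperationalError` here and fall back to the advice above: close Zotero and retry.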
3) Build the metadata vector store
Use this when the user asks to create or rebuild metadata embeddings for the Zotero library.
```bash
python scripts/build_metadata_vectors.py --output-dir <output-dir>
```
This writes `metadata_vectors.json` and refreshes `vector_store_metadata.json` and `README.md`.
4) Build the full-text vector store
Use this when the user asks to create or rebuild PDF full-text embeddings.
```bash
python scripts/build_fulltext_vectors.py --output-dir <output-dir>
```
This scans Zotero PDF attachments, extracts text, chunks it, embeds each chunk, and writes fulltext_vectors.json.
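The chunking step can be pictured as a sliding window over the extracted text. The sketch below uses overlapping character windows. The strategy, the function name `chunk_text`, and the default sizes are all assumptions made for illustration; the skill's own chunking rules are not stated here. Embedding is left out because the model is not specified in this document.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into overlapping character windows.

    chunk_size and overlap are illustrative defaults, not the skill's settings.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk.strip():  # skip windows that contain only whitespace
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks
```

The overlap keeps a sentence that straddles a boundary fully visible in at least one chunk, at the cost of some duplicated text in the store.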
5) Check incremental updates
Use this when the user asks whether Zotero contains new items not yet added to the vector store.
```bash
python scripts/check_incremental_updates.py --output-dir <output-dir>
```
Report:
- total top-level Zotero items
- total PDF-parent items
- current metadata/fulltext vector counts
- missing metadata items
- missing fulltext items
Do not update the store yet.
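At its core, the check is a set difference between the item IDs in Zotero and the item IDs already in the store. The sketch below shows that comparison on plain lists of IDs. The function names and report keys are hypothetical, and loading the IDs from the snapshot and the JSON files is omitted.

```python
def find_missing(zotero_ids, store_ids):
    """Return IDs present in Zotero but absent from the store, sorted."""
    return sorted(set(zotero_ids) - set(store_ids))


def incremental_report(top_level_ids, pdf_parent_ids, metadata_ids, fulltext_ids):
    """Build the read-only report; nothing is written to the store."""
    return {
        "total_top_level_items": len(set(top_level_ids)),
        "total_pdf_parent_items": len(set(pdf_parent_ids)),
        "metadata_item_count": len(set(metadata_ids)),
        "fulltext_item_count": len(set(fulltext_ids)),
        "missing_metadata": find_missing(top_level_ids, metadata_ids),
        "missing_fulltext": find_missing(pdf_parent_ids, fulltext_ids),
    }


if __name__ == "__main__":
    # Toy IDs for illustration only.
    print(incremental_report([1, 2, 3], [2, 3], [1, 2], [2]))
```

Because the function only returns a dictionary, it fits the check-first rule: the result can be shown to the user before anything is applied.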
6) Apply incremental updates
Only run this after the user confirms the update.
```bash
python scripts/apply_incremental_updates.py --output-dir <output-dir>
```
This script:
1. snapshots the DB
2. backs up store files
3. appends missing metadata/fulltext entries
4. keeps only the latest and previous backup per file
5. updates store metadata and README
Use --item-id to limit the update to specific items if the user wants a partial apply.
7) Verify the finished store
After any build or incremental update, verify counts and sizes:
```bash
python scripts/verify_vector_store.py --output-dir <output-dir>
```
Always report:
- metadata item count
- fulltext item count
- fulltext chunk count
- metadata file size
- fulltext file size
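The five figures above could be gathered along these lines. The sketch assumes each store file is a JSON array of records carrying an `itemID` key, with one record per item in the metadata file and one record per chunk in the full-text file. That layout is an assumption made for illustration; the real schema is defined in the data-format reference, and the function name `summarize_store` is hypothetical.

```python
import json
from pathlib import Path


def summarize_store(metadata_path: Path, fulltext_path: Path) -> dict:
    """Report counts and file sizes for the two vector files.

    Assumed layout: each file is a JSON array of dicts with an "itemID" key.
    """
    metadata = json.loads(metadata_path.read_text(encoding="utf-8"))
    fulltext = json.loads(fulltext_path.read_text(encoding="utf-8"))
    return {
        "metadata_item_count": len({rec["itemID"] for rec in metadata}),
        "fulltext_item_count": len({rec["itemID"] for rec in fulltext}),
        "fulltext_chunk_count": len(fulltext),  # one record per chunk (assumed)
        "metadata_file_bytes": metadata_path.stat().st_size,
        "fulltext_file_bytes": fulltext_path.stat().st_size,
    }
```

Counting distinct `itemID` values separately from records is what lets the full-text item count and chunk count differ.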
Scripts
- `scripts/detect_zotero_paths.py`: resolve default/current Zotero paths
- `scripts/snapshot_zotero_db.py`: create a safe SQLite snapshot
- `scripts/build_metadata_vectors.py`: full rebuild of metadata vectors
- `scripts/build_fulltext_vectors.py`: full rebuild of PDF full-text vectors
- `scripts/check_incremental_updates.py`: compare Zotero against the current vector store
- `scripts/apply_incremental_updates.py`: append missing items after user confirmation
- `scripts/backup_with_retention.py`: back up store files and retain only the latest two states
- `scripts/verify_vector_store.py`: report counts, sizes, and store metadata
Output expectations
When using this skill successfully, return concise operational summaries such as:
- detected paths
- snapshot path used
- number of items/chunks written
- current file sizes
- whether any items are missing
- which itemIDs were appended during incremental update
Escalation notes
Read references/troubleshooting.md when:
- SQLite snapshot fails
- HuggingFace/model download or local model loading fails
- PDFs are missing or unreadable
- full-text extraction is incomplete
- file paths differ from defaults on the current OS
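Zotero Vectorization
Build and maintain a local-first, cross-platform Zotero vector store for semantic search and RAG over bibliographic metadata and PDF full text.
Keep SKILL.md focused on workflow. Read the reference files only when needed:
- `references/config.md`: paths, environment variables, output layout
- `references/data-format.md`: JSON schemas and file naming rules
- `references/windows.md` / `macos.md` / `linux.md`: platform-specific path defaults and notes
- `references/troubleshooting.md`: common failures and recovery methods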
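Core rules
- Treat Zotero as read-only input. Never modify the user's Zotero database or attachment storage.
- Prefer creating a database snapshot before reading.
- For incremental updates: check first, report missing items, wait for user confirmation, then apply.
- Before any update that rewrites store files: back up first, then write.
- The backup retention policy for this skill is fixed: keep only the latest and previous backup per file.
- Default output filenames:
  - `metadata_vectors.json`
  - `fulltext_vectors.json`
  - `vector_store_metadata.json`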
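Workflow decision tree
1) Detect or confirm paths
If the Zotero data directory, database path, or storage path is unknown:
1. Read `references/config.md`
2. Read the platform-specific reference (`windows.md`, `macos.md`, or `linux.md`)
3. Run:
```bash
python scripts/detect_zotero_paths.py
```
If the detected paths are wrong, ask the user to open Zotero and use Show Data Directory, then rerun with an explicit `--data-dir`, `--db`, or `--storage-dir` argument.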
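2) Create a database snapshot
Before full builds or incremental checks, create a snapshot of the Zotero database:
```bash
python scripts/snapshot_zotero_db.py --output-dir <output-dir>
```
If the snapshot fails because SQLite is locked, ask the user to close Zotero and retry.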
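3) Build the metadata vector store
Use this step when the user asks to create or rebuild metadata embeddings for the Zotero library.
```bash
python scripts/build_metadata_vectors.py --output-dir <output-dir>
```
This writes `metadata_vectors.json` and refreshes `vector_store_metadata.json` and `README.md`.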
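4) Build the full-text vector store
Use this step when the user asks to create or rebuild PDF full-text embeddings.
```bash
python scripts/build_fulltext_vectors.py --output-dir <output-dir>
```
This scans Zotero PDF attachments, extracts text, chunks it, embeds each chunk, and writes `fulltext_vectors.json`.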
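5) Check incremental updates
Use this step when the user asks whether Zotero contains new items not yet added to the vector store.
```bash
python scripts/check_incremental_updates.py --output-dir <output-dir>
```
Report:
- total top-level Zotero items
- total parent items that have PDFs
- current metadata/fulltext vector counts
- missing metadata items
- missing fulltext items
Do not update the store.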
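6) Apply incremental updates
Run this step only after the user confirms the update.
```bash
python scripts/apply_incremental_updates.py --output-dir <output-dir>
```
This script:
1. creates a database snapshot
2. backs up store files
3. appends missing metadata/fulltext entries
4. keeps only the latest and previous backup per file
5. updates store metadata and README
If the user wants a partial apply, use `--item-id` to limit the update to specific items.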
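7) Verify the finished store
After any build or incremental update, verify counts and sizes:
```bash
python scripts/verify_vector_store.py --output-dir <output-dir>
```
Always report:
- metadata item count
- fulltext item count
- fulltext chunk count
- metadata file size
- fulltext file size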
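Scripts
- `scripts/detect_zotero_paths.py`: resolve default/current Zotero paths
- `scripts/snapshot_zotero_db.py`: create a safe SQLite snapshot
- `scripts/build_metadata_vectors.py`: full rebuild of metadata vectors
- `scripts/build_fulltext_vectors.py`: full rebuild of PDF full-text vectors
- `scripts/check_incremental_updates.py`: compare Zotero against the current vector store
- `scripts/apply_incremental_updates.py`: append missing items after user confirmation
- `scripts/backup_with_retention.py`: back up store files, keeping only the latest two states
- `scripts/verify_vector_store.py`: report counts, sizes, and store metadata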
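Output expectations
When this skill is used successfully, return concise operational summaries such as:
- detected paths
- snapshot path used
- number of items/chunks written
- current file sizes
- whether any items are missing
- which itemIDs were appended during the incremental update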
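Escalation notes
Read `references/troubleshooting.md` when:
- SQLite snapshot fails
- HuggingFace/model download or local model loading fails
- PDFs are missing or unreadable
- full-text extraction is incomplete
- file paths differ from the defaults on the current OS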