Douban Self Taste Skill

Collect the user's own Douban history, keep a local cache fresh, and analyze it carefully.

Scope

Use this skill only for the user's own Douban data, including:

- the user's own movie / book / music / game shelves
- the user's own ratings, tags, short comments, reviews, and dates
- local exports, saved HTML pages, or cached JSON derived from the user's own account
- fresh re-crawls of the user's own logged-in pages when cache is missing or stale

Do not use this skill for public-user scraping, whole-site crawling, hidden/private data claims, or MCP-server design.

Storage layout

Use these paths unless the user explicitly asks for a different layout:

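- cookies: .local/douban-self-taste/cookies/douban_cookies.json
- crawl cache: .local/douban-self-taste/cache/collections/
- analysis outputs: .local/douban-self-taste/analysis/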

Treat the cache as reusable local working data. Do not scatter generated files across the repo.

Read references/storage-layout.md for exact file naming conventions.

Required workflow

Follow this order.

1. Decide whether crawling is needed

Check whether local crawl cache already exists for the requested category.

- If no cache exists, crawling is needed.
- If cache exists but its fetched_at timestamp is older than 7 days, crawling is needed.
- Otherwise, reuse the local cache.
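
The freshness rule above can be sketched as a small check. This is a minimal illustration, not the bundled scripts' logic: it assumes each cache file is a JSON object with a top-level fetched_at field holding an ISO 8601 timestamp, and the function name needs_crawl is hypothetical. The actual layout is defined in references/output-schema.md.

```python
import json
from datetime import datetime, timedelta, timezone
from pathlib import Path
from typing import Optional

MAX_CACHE_AGE = timedelta(days=7)


def needs_crawl(cache_file: Path, now: Optional[datetime] = None) -> bool:
    """Return True when the cache is missing, unreadable, or older than 7 days."""
    now = now or datetime.now(timezone.utc)
    if not cache_file.is_file():
        return True  # no cache exists
    try:
        payload = json.loads(cache_file.read_text(encoding="utf-8"))
        fetched_at = datetime.fromisoformat(payload["fetched_at"])
    except (ValueError, KeyError, TypeError):
        # Assumption: an unreadable cache is treated like a missing one.
        return True
    if fetched_at.tzinfo is None:
        # Assumption: naive timestamps are interpreted as UTC.
        fetched_at = fetched_at.replace(tzinfo=timezone.utc)
    return now - fetched_at > MAX_CACHE_AGE
```

Passing now explicitly keeps the check deterministic, which helps when comparing several category caches in one run.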

Prefer the smallest sufficient refresh.

- If the user asks about books, prioritize book cache.
- If the user asks about movies, prioritize movie cache.
- You may use small amounts of other categories as weak supplementary context, but keep the requested category primary.

2. If crawling is needed, verify cookie availability

Check whether the cookie file exists and is plausibly usable.

Treat cookies as unavailable when:

- the cookie file is missing
- the cookie file is empty or malformed
- the crawl clearly redirects to login or otherwise fails due to authentication

If cookies are unavailable or expired, ask the user for fresh cookies before crawling.

Do not pretend a crawl succeeded when authentication failed.
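
The cookie checks in this step might look like the following sketch. It assumes the cookie file is JSON (an object or a list of entries), and both function names are hypothetical. The login-redirect test is only a heuristic on the final URL path; the marker strings are assumptions, not confirmed Douban URLs.

```python
import json
from pathlib import Path
from urllib.parse import urlparse


def cookies_look_usable(cookie_file: Path) -> bool:
    """Local check only: file exists, is non-empty, and parses to non-empty JSON."""
    if not cookie_file.is_file():
        return False  # missing
    try:
        raw = cookie_file.read_text(encoding="utf-8").strip()
        data = json.loads(raw) if raw else None
    except ValueError:
        return False  # malformed JSON or undecodable bytes
    # Assumption: cookies are stored as a JSON object or a JSON list.
    return isinstance(data, (dict, list)) and len(data) > 0


def redirected_to_login(final_url: str, markers=("login", "passport")) -> bool:
    """Heuristic: the final URL path contains a login-like marker (assumed values)."""
    path = urlparse(final_url).path.lower()
    return any(marker in path for marker in markers)
```

A passing local check does not prove the cookies are still valid. Only the crawl itself can confirm authentication, so a detected login redirect should stop the crawl and trigger a request for fresh cookies.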

3. Crawl and persist locally

When cookies are available, crawl the user's own Douban shelves and store the refreshed result in local JSON cache files.

Use scripts/crawl_douban_self_history.py for logged-in crawling.
Use scripts/extract_douban_self_history.py when the user already has saved HTML files.

After crawling:

- save normalized JSON to the cache directory
- include fetched_at
- keep category and status explicit
- preserve raw comments and rating information
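
As an illustration of the persistence requirements above, a cache write could be sketched like this. The function name save_cache, the file name pattern, the items key, and the example status value are all hypothetical; the real naming rules live in references/storage-layout.md and the real schema in references/output-schema.md.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

CACHE_DIR = Path(".local/douban-self-taste/cache/collections")


def save_cache(category: str, status: str, items: list, cache_dir: Path = CACHE_DIR) -> Path:
    """Write one normalized cache file with explicit fetched_at, category, and status."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    payload = {
        "fetched_at": datetime.now(timezone.utc).isoformat(),
        "category": category,  # e.g. "book"
        "status": status,      # e.g. "collect" (hypothetical value)
        "items": items,        # raw comments and ratings are stored unmodified
    }
    target = cache_dir / f"{category}_{status}.json"  # hypothetical file name
    tmp = target.with_suffix(".json.tmp")
    # ensure_ascii=False keeps Chinese comments readable in the cache file.
    tmp.write_text(json.dumps(payload, ensure_ascii=False, indent=2), encoding="utf-8")
    tmp.replace(target)  # swap in the finished file so a failed write leaves the old cache intact
    return target
```

Writing to a temporary file first is a design choice of this sketch: an interrupted crawl then cannot leave a half-written cache that later passes the freshness check.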

4. Analyze after data is ready

Only start analysis after confirming that either:

- fresh cache exists, or
- a successful new crawl has been saved locally

Use scripts/build_taste_profile.py to build an analysis-ready summary when helpful.
Write the summary into .local/douban-self-taste/analysis/ when the user wants a reusable analysis artifact.

Analysis priorities

Always pay extra attention to:

- items with comments
- high-rated items
- low-rated items
- recent items
- category boundaries

For scripts/build_taste_profile.py, use these summary rules:

- Do not include the full items array in the profile output; keep full records in the crawl cache.
- Keep the rest of the summary reasonably rich; avoid large deletions unless the user asks.
- Define recent_items as the newest dated items sorted by date descending, capped at 20 items.
- Define high_rated_items as all items tied at the user's highest observed rating within the focused dataset; if there are more than 20, keep only the most recent 20 by date.
- Define low_rated_items as all items tied at the user's lowest observed rating within the focused dataset; if there are more than 20, keep only the most recent 20 by date.
- Treat game tag analysis separately from creator analysis; games may have useful genre/platform-like tags but often do not have reliable creators.
- Filter noisy book creators when obvious publisher / bookstore / distribution-style strings appear.
- Prefer category-specific cleaning over one generic parser when extracting tags or creators.
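
The three selection rules above (recent_items, high_rated_items, low_rated_items) can be sketched as follows. This is not the code in scripts/build_taste_profile.py. It assumes each item is a dict with a date field holding an ISO-style string such as 2024-01-31 (so string order matches chronological order) and a numeric rating field that is None when unrated. Both field names are assumptions.

```python
CAP = 20


def _newest_first(items):
    # Undated items sort last because "" is smaller than any date string.
    return sorted(items, key=lambda it: it.get("date") or "", reverse=True)


def recent_items(items):
    """Newest dated items, sorted by date descending, capped at 20."""
    dated = [it for it in items if it.get("date")]
    return _newest_first(dated)[:CAP]


def _tied_at(items, pick):
    rated = [it for it in items if it.get("rating") is not None]
    if not rated:
        return []
    target = pick(it["rating"] for it in rated)
    tied = [it for it in rated if it["rating"] == target]
    # Only trim when more than 20 items share the extreme rating.
    return tied if len(tied) <= CAP else _newest_first(tied)[:CAP]


def high_rated_items(items):
    """All items tied at the highest observed rating (most recent 20 if more)."""
    return _tied_at(items, max)


def low_rated_items(items):
    """All items tied at the lowest observed rating (most recent 20 if more)."""
    return _tied_at(items, min)
```

The extremes are taken from the ratings actually observed in the focused dataset, so a user who never rates below 3 gets their 3-rated items as low_rated_items.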

When the user asks about one category, analyze that category first.

Examples:

- Book questions → use books as primary evidence; only lightly reference movies/music/games if they add meaningful support.
- Movie questions → use movies as primary evidence.

Separate:

- stable preferences
- weak signals
- aversions / anti-preferences
- recent shifts

Do not overfit from tiny samples.

Output expectations

Start with factual scope:

1. what data was used
2. whether it came from cache or a fresh crawl
3. cache age
4. category coverage
5. obvious data gaps

Then provide analysis.
Keep generated profile files compact enough for downstream LLM analysis; prefer concise summaries over repeating the entire dataset.
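
The factual-scope preamble listed above could be assembled from cache metadata along these lines. This is a rough sketch: the function name scope_preamble, its parameters, and the wording of each line are hypothetical, and fetched_at is assumed to be a timezone-aware datetime.

```python
from datetime import datetime, timezone


def scope_preamble(category, source, fetched_at, item_count, gaps, now=None):
    """Build the five factual-scope lines that precede the analysis."""
    now = now or datetime.now(timezone.utc)
    age_days = (now - fetched_at).days
    lines = [
        f"Data used: {item_count} {category} items",
        f"Source: {source}",  # "cache" or "fresh crawl"
        f"Cache age: {age_days} day(s)",
        f"Category coverage: {category}",
        "Data gaps: " + ("; ".join(gaps) if gaps else "none observed"),
    ]
    return "\n".join(lines)
```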

Bundled resources

- Read references/storage-layout.md for local file locations.
- Read references/data-sources.md for cache/cookie refresh logic.
- Read references/output-schema.md for normalized JSON structure.
- Read references/analysis-rubric.md before writing conclusions.
- Use scripts/crawl_douban_self_history.py to refresh local cache from logged-in pages.
- Use scripts/extract_douban_self_history.py to convert saved HTML files into normalized JSON.
- Use scripts/build_taste_profile.py to generate category-aware summaries.

