Xiaohongshu Search and Summarize
This skill automates the extraction of high-quality multi-modal content (text + images) from Xiaohongshu (小红书) and actively assists you in generating a deeply integrated, analytical final report for the user. Because Xiaohongshu uses aggressive anti-scraping mechanisms, direct HTTP requests or naive scraping often result in 404s or blocks. This skill avoids those failures by simulating a real user through playwright-cli in a headed browser window.
It operates in two distinct phases:
Phase 1: Subagent Data Collection
1. Simulate a search for the keyword on Xiaohongshu in a headed browser.
2. Advance through image sliders to fully load all lazy-loaded pictures from the top N posts.
3. Extract titles, descriptions, top comments, and all high-resolution images.
4. Download those images to a local directory and generate a raw data document ([keyword]_raw_data.md).
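To make the Phase 1 output concrete, the following is a minimal sketch of how a raw data document of this kind could be assembled. The Post structure, the function names, and the markdown layout are all assumptions made for illustration; the skill's actual parse.py may organize things differently.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Post:
    # Hypothetical container for one scraped post (not taken from parse.py).
    title: str
    description: str
    comments: list[str] = field(default_factory=list)
    image_paths: list[str] = field(default_factory=list)


def render_raw_data(keyword: str, posts: list[Post]) -> str:
    """Render scraped posts as one markdown document (assumed layout)."""
    lines = [f"# Raw data for: {keyword}", ""]
    for index, post in enumerate(posts, start=1):
        lines += [f"## Post {index}: {post.title}", "", post.description, ""]
        lines += [f"- Comment: {comment}" for comment in post.comments]
        lines += [f"![image]({path})" for path in post.image_paths]
        lines.append("")
    return "\n".join(lines)


def raw_data_path(output_dir: str, keyword: str) -> Path:
    # Follows the [keyword]_raw_data.md naming described above.
    return Path(output_dir) / f"{keyword}_raw_data.md"
```

Keeping image paths inside the document is what lets Phase 2 locate and open each image later.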
Phase 2: AI Multi-Modal Synthesis (Your Job)
5. You MUST use your file reading capabilities to read the [keyword]_raw_data.md file.
6. Inside the raw data markdown, you will find paths to image files. You MUST use your file reading / vision capabilities on these image file paths to actually ingest and "see" their visual content. If you skip this step, you are only reading file names, not the images themselves!
7. You analyze the texts, summarize the genuinely useful comments (discarding noise like "pm me"), and interpret the semantic content of the images you just viewed (e.g. diagrams, guidelines, step-by-step UI flows).
8. You compile everything into a beautifully synthesized, single comprehensive report rather than just a linear list of posts.
Dependencies
- playwright-cli (must be available on the PATH)
- python3 (required to download images and stitch the raw data markdown)
- requests Python package (pip install requests), used by parse.py to download images
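Before running the skill, it can help to confirm that these dependencies are reachable. The sketch below is one possible pre-flight check, not part of the skill itself; the function name missing_dependencies is hypothetical.

```python
import importlib.util
import shutil


def missing_dependencies(
    executables=("playwright-cli", "python3"),
    modules=("requests",),
) -> list[str]:
    """Return the names of required tools or packages that cannot be found."""
    # shutil.which returns None when an executable is not on the PATH.
    missing = [exe for exe in executables if shutil.which(exe) is None]
    # find_spec returns None when a top-level module is not importable.
    missing += [mod for mod in modules if importlib.util.find_spec(mod) is None]
    return missing
```

An empty list means everything was found; otherwise the returned names indicate what to install or add to the PATH.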
Usage Instructions
Step 1: Run the Extraction Script
Execute the wrapper script in scripts/run.sh. It accepts the following arguments:
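
    /bin/bash /scripts/run.sh YOUR KEYWORD <MAX_POSTS> <OUTPUT_DIRECTORY>

- YOUR KEYWORD: The search term to look up on Xiaohongshu.
- <MAX_POSTS>: (Optional, default = 10) The number of top posts to scan.
- <OUTPUT_DIRECTORY>: (Optional, default = ./) Directory where the raw data and images will be saved.
Example execution:

    /bin/bash ~/.claude/skills/xiaohongshu-search-summarizer/scripts/run.sh openclaw使用场景 10 ./xhsreportopenclaw_scenarios
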
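If the script is launched from Python rather than typed into a shell, the same three arguments and their documented defaults can be expressed as below. This is a sketch only: build_run_command and run_extraction are hypothetical helper names, and the script path is whatever location the skill is installed at.

```python
import subprocess


def build_run_command(script_path, keyword, max_posts=10, output_dir="./"):
    """Build the argument list for run.sh using the documented defaults."""
    # A list (rather than one shell string) keeps a keyword that contains
    # spaces as a single argument.
    return ["/bin/bash", script_path, keyword, str(max_posts), output_dir]


def run_extraction(script_path, keyword, max_posts=10, output_dir="./"):
    # check=True raises CalledProcessError if the script exits non-zero.
    cmd = build_run_command(script_path, keyword, max_posts, output_dir)
    return subprocess.run(cmd, check=True)
```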
Step 2: Read Raw Data & Images
Once the bash script finishes successfully, navigate to the OUTPUT_DIRECTORY and use your file reading capabilities to ingest the generated [keyword]_raw_data.md file.
Inside this file, you will find descriptions, comments, and file paths pointing to post_X_img_Y.webp or post_X_img_Y.jpg.
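As an illustration of locating those image references, the sketch below pulls every post_X_img_Y.webp or post_X_img_Y.jpg path out of the raw data text. The regular expression and the find_image_paths name are assumptions; the exact way paths are written inside the raw data file may differ.

```python
import re
from pathlib import Path

# Matches names such as images/post_3_img_2.webp or post_10_img_1.jpg,
# including an optional directory prefix.
IMAGE_PATH = re.compile(r"[\w./-]*post_\d+_img_\d+\.(?:webp|jpg)")


def find_image_paths(raw_markdown: str, base_dir: str = ".") -> list[Path]:
    """Return unique image paths referenced in the raw data, in order."""
    seen = []
    for match in IMAGE_PATH.findall(raw_markdown):
        if match not in seen:
            seen.append(match)
    return [Path(base_dir) / match for match in seen]
```

Each returned path still has to be opened with file reading / vision capabilities; finding the path is not the same as viewing the image.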
Step 3: Synthesis & Summarization
This is the most critical step. Do not just return the raw markdown file to the user. Instead, write a polished comprehensive markdown report that reorganizes the information logically, while retaining a high level of detail.
Follow these strict compilation rules:
- Do not list posts individually (e.g. avoid "Post 1: ... Post 2: ...").
- Read the Images: You MUST use your file reading and vision capabilities on the .webp or .jpg image files found in the raw data directory to interpret their contents.
- Detailed & Comprehensive Synthesis: Provide a highly detailed summary that includes diverse viewpoints, nuances, and specific examples found across different posts. Avoid over-summarizing or losing important context; preserve the richness and diversity of the information.
- Extract and merge themes: Group ideas by concepts, steps, recurring themes, or pros/cons.
- Evaluate comments: Merge insights from valuable comments directly into the core narrative. Skip useless or repetitive comments, but preserve diverse opinions or helpful counter-arguments from the comments section.
- Integrate images contextually: Embed the most relevant and high-quality images directly into the flow of your final report to support the analytical points being made. Describe their visual meaning based on what you saw with your vision capabilities.
- Save to OUTPUT_DIRECTORY: Save your beautifully compiled final Markdown report using your file writing capabilities directly into the same <OUTPUT_DIRECTORY> as the raw data (e.g., <OUTPUT_DIRECTORY>/[keyword]_synthesis.md), and give the user the path to it.
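Two of these rules are mechanical enough to sketch in code: discarding obvious comment noise and saving the report under the [keyword]_synthesis.md name. The helper names and the noise phrases below are assumptions for illustration. A filter like this is only a first pass; judging which comments are genuinely useful remains part of the synthesis.

```python
from pathlib import Path

# Assumed examples of low-value phrases; extend as needed.
NOISE_MARKERS = ("pm me", "dm me", "私信")


def useful_comments(comments):
    """Drop empty comments, duplicates, and ones containing noise markers."""
    kept, seen = [], set()
    for comment in comments:
        text = comment.strip()
        key = text.lower()
        if not text or key in seen:
            continue
        if any(marker in key for marker in NOISE_MARKERS):
            continue
        seen.add(key)
        kept.append(text)
    return kept


def save_report(output_dir, keyword, report_markdown):
    """Write the final report next to the raw data and return its path."""
    path = Path(output_dir) / f"{keyword}_synthesis.md"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(report_markdown, encoding="utf-8")
    return path
```

The path returned by save_report is the one to give back to the user.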
Error Handling
If you encounter 404 Not Found or "element not visible" errors during the browser invocation:
- Keep in mind that Xiaohongshu may present a login challenge. If the site pauses waiting for a login, instruct the user to check the playwright-cli browser window and complete the necessary authentication manually, then run the script again.
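The retry-after-login flow can be sketched as a small wrapper around the script invocation. This is an assumption-laden illustration: run_with_login_retry is a hypothetical helper, and it treats any non-zero exit code as a possible login block because the script's actual exit codes are not documented here.

```python
import subprocess


def run_with_login_retry(cmd, max_attempts=2, runner=subprocess.run, prompt=input):
    """Run cmd; on failure, ask the user to log in manually and try again."""
    for attempt in range(1, max_attempts + 1):
        result = runner(cmd)
        if result.returncode == 0:
            return True
        if attempt < max_attempts:
            # Pause until the user confirms they have handled the login.
            prompt(
                "The run failed. If the playwright-cli browser window shows "
                "a login page, log in manually, then press Enter to retry. "
            )
    return False
```

The runner and prompt parameters exist so the flow can be exercised without launching a browser or waiting for keyboard input.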
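小红书搜索与摘要 (Xiaohongshu Search and Summarize)
This skill automates the extraction of high-quality multi-modal content (text + images) from Xiaohongshu and actively assists you in generating a deeply integrated, analytical final report for the user. Because Xiaohongshu has aggressive anti-scraping mechanisms, direct HTTP requests or simple scraping usually lead to 404 errors or blocks. The skill gets around these restrictions by simulating a real user through playwright-cli in a headed browser window.
The skill runs in two distinct phases:
Phase 1: Subagent Data Collection
1. Simulate a search for the keyword on Xiaohongshu in a headed browser.
2. Advance through image sliders to fully load all lazy-loaded images from the top N posts.
3. Extract titles, descriptions, top comments, and all high-resolution images.
4. Download those images to a local directory and generate a raw data document ([keyword]_raw_data.md).
Phase 2: AI Multi-Modal Synthesis (Your Job)
5. You must use your file reading capabilities to read the [keyword]_raw_data.md file.
6. Inside the raw data markdown file, you will find paths to image files. You must use your file reading / vision capabilities on these image file paths to actually ingest and see their visual content. If you skip this step, you are only reading file names, not the images themselves!
7. You analyze the texts, summarize the genuinely useful comments (ignoring noise such as "pm me"), and interpret the semantic content of the images you just viewed (e.g. diagrams, guidelines, step-by-step UI flows).
8. You compile everything into a beautifully synthesized, single comprehensive report rather than just a linear list of posts.
Dependencies
- playwright-cli (must be available on the PATH)
- python3 (used to download images and stitch the raw data markdown)
- requests Python package (pip install requests), used by parse.py to download images
Usage Instructions
Step 1: Run the Extraction Script
Execute the wrapper script in scripts/run.sh. It accepts the following arguments:

    /bin/bash /scripts/run.sh YOUR KEYWORD <MAX_POSTS> <OUTPUT_DIRECTORY>

- YOUR KEYWORD: The query term to search for on Xiaohongshu.
- <MAX_POSTS>: (Optional, default = 10) The number of top posts to scan.
- <OUTPUT_DIRECTORY>: (Optional, default = ./) Directory where the raw data and images will be saved.
Example execution:

    /bin/bash ~/.claude/skills/xiaohongshu-search-summarizer/scripts/run.sh openclaw使用场景 10 ./xhsreportopenclaw_scenarios

Step 2: Read Raw Data & Images
Once the bash script finishes successfully, navigate to the output directory and use your file reading capabilities to ingest the generated [keyword]_raw_data.md file.
Inside this file, you will find descriptions, comments, and file paths pointing to post_X_img_Y.webp or post_X_img_Y.jpg.
Step 3: Synthesis & Summarization
This is the most critical step. Do not just return the raw markdown file to the user. Instead, write a polished comprehensive markdown report that reorganizes the information logically while retaining a high level of detail.
Follow these strict compilation rules:
- Do not list posts individually (e.g. avoid "Post 1: ... Post 2: ...").
- Read the images: You must use your file reading and vision capabilities on the .webp or .jpg image files found in the raw data directory to interpret their contents.
- Detailed and comprehensive synthesis: Provide a highly detailed summary that includes the differing viewpoints, nuances, and specific examples found across posts. Avoid over-summarizing or losing important context; preserve the richness and diversity of the information.
- Extract and merge themes: Group ideas by concepts, steps, recurring themes, or pros/cons.
- Evaluate comments: Merge insights from valuable comments directly into the core narrative. Skip useless or repetitive comments, but preserve differing opinions and helpful counter-arguments from the comments section.
- Integrate images contextually: Embed the most relevant and high-quality images directly into the flow of the final report to support the analytical points being made. Describe their visual meaning based on what you saw with your vision capabilities.
- Save to the output directory: Using your file writing capabilities, save your compiled final Markdown report directly into the same <OUTPUT_DIRECTORY> as the raw data (e.g. <OUTPUT_DIRECTORY>/[keyword]_synthesis.md), and give the user the path to it.
Error Handling
If you encounter 404 Not Found or "element not visible" errors during the browser invocation:
- Keep in mind that Xiaohongshu may require a login verification. If the site pauses waiting for a login, instruct the user to check the playwright-cli browser window and complete the necessary authentication manually, then rerun the script.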