DataPulse Skill (v0.8.1)
Use this skill when the user needs one or more of the following:
- Read or batch-read URLs across X, Reddit, YouTube, Bilibili, Telegram, WeChat, Xiaohongshu, RSS, arXiv, Hacker News, GitHub, and generic web pages
- Search the web, inspect trending topics, or collect cross-platform signals
- Create watch missions, alert routes, triage queues, or story evidence packs
- Run assistant-ready URL intake through datapulse_skill.run()
Python Entry Point
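    from datapulse_skill import run

    run("Please process these links: https://x.com/... https://www.reddit.com/...")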
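The entry point takes free-form text that contains links. How run() parses that text is not documented here. As a minimal sketch of one way URL intake could pull links out of a message, assuming a simple regular expression (the helper name extract_urls and the pattern are hypothetical, not part of DataPulse):

```python
import re

# Hypothetical helper for illustration; not part of the DataPulse API.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def extract_urls(message: str) -> list[str]:
    """Return unique URLs in order of first appearance, trimming trailing punctuation."""
    seen: set[str] = set()
    urls: list[str] = []
    for match in URL_PATTERN.findall(message):
        # Sentence punctuation often sticks to the end of a pasted link.
        url = match.rstrip(".,;:!?)")
        if url and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```

Deduplicating while preserving order keeps a batch read from fetching the same page twice when a link is pasted more than once.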
Core Capabilities
- URL ingestion with normalized DataPulseItem output
- Confidence scoring and ranking
- Web search and trending discovery
- Watch missions and alert routing
- Triage queue and story workspace workflows
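The fields of DataPulseItem and the scoring formula are not described in this document. Purely as an illustration of what "confidence scoring and ranking" over normalized items could look like, the sketch below uses a hypothetical Item dataclass and assumes a confidence value between 0 and 1:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """Hypothetical stand-in for a normalized item; real DataPulseItem fields may differ."""
    url: str
    source: str
    confidence: float  # assumed to lie in [0.0, 1.0]


def rank_items(items: list[Item], min_confidence: float = 0.0) -> list[Item]:
    """Drop items below the threshold and sort the rest by descending confidence."""
    kept = [item for item in items if item.confidence >= min_confidence]
    return sorted(kept, key=lambda item: item.confidence, reverse=True)
```

Because sorted() is stable, items with equal confidence keep their original relative order.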
Behavior Disclosure
Browser Automation (optional)
DataPulse uses Playwright for platforms that require authenticated browser sessions (WeChat, Xiaohongshu). Browser automation is opt-in only: it activates only when the user explicitly runs a login command and a valid session file exists. The playwright dependency is optional (pip install datapulse[browser]). No browser is launched during normal URL reading.
Subprocess Calls
- MCP transport: Story and triage modules invoke subprocess.run() to communicate with MCP tool servers via subprocess_json transport (stdin/stdout JSON-RPC). All calls have explicit timeouts (30s default).
- YouTube fallback: The YouTube collector may call yt-dlp as a subprocess for audio transcript extraction when the native API is unavailable.
- CLI update check: The CLI invokes pip install --upgrade only when the user explicitly runs --upgrade.
No subprocess call runs silently or without user-initiated action.
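The stdin/stdout JSON-RPC pattern with an explicit timeout can be sketched as follows. This is not DataPulse's actual transport code: the function name call_json_subprocess and the toy echo server are assumptions for illustration, and a real MCP tool server would typically expect more than a single request message.

```python
import json
import subprocess
import sys


def call_json_subprocess(command: list[str], method: str, params: dict,
                         timeout: float = 30.0) -> dict:
    """Send one JSON-RPC 2.0 request on stdin and parse one JSON response from stdout."""
    request = {"jsonrpc": "2.0", "id": 1, "method": method, "params": params}
    completed = subprocess.run(
        command,
        input=json.dumps(request) + "\n",
        capture_output=True,
        text=True,
        timeout=timeout,  # raises subprocess.TimeoutExpired if the child hangs
        check=True,       # raises subprocess.CalledProcessError on a non-zero exit
    )
    return json.loads(completed.stdout)


# Toy child process standing in for a tool server: it echoes the params back.
ECHO_SERVER = (
    "import json, sys; req = json.loads(sys.stdin.readline()); "
    "print(json.dumps({'jsonrpc': '2.0', 'id': req['id'], 'result': req['params']}))"
)

if __name__ == "__main__":
    reply = call_json_subprocess([sys.executable, "-c", ECHO_SERVER], "echo", {"q": "hi"})
    print(reply["result"])
```

Passing the command as a list (rather than a shell string) avoids shell interpretation of arguments, and the timeout ensures a stalled child cannot block the caller indefinitely.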
Local Persistence
- Session files: Playwright login sessions are saved to ~/.datapulse/sessions/ for reuse. Sessions are TTL-cached (12h) and can be invalidated via invalidate_session_cache().
- Data files: Watch missions, alert routes, triage queues, story workspaces, and entity stores persist as JSON files under the working directory (data/ folder). All writes use atomic save patterns.
No data is written outside the working directory or ~/.datapulse/ without explicit user action.
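An "atomic save pattern" usually means writing to a temporary file and renaming it over the target, so a crash never leaves a half-written JSON file. The sketch below shows that common technique together with a simple 12-hour freshness check. The helper names are hypothetical, and using the file modification time for the TTL is an assumption; DataPulse's own implementation may differ.

```python
import json
import os
import tempfile
import time
from pathlib import Path
from typing import Optional


def atomic_save_json(path: Path, payload: dict) -> None:
    """Write JSON to a temp file in the same directory, then rename it over the target."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle, ensure_ascii=False, indent=2)
            handle.flush()
            os.fsync(handle.fileno())  # push bytes to disk before the rename
        os.replace(tmp_name, path)     # atomic when both paths share a filesystem
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def is_fresh(path: Path, ttl_seconds: float = 12 * 3600,
             now: Optional[float] = None) -> bool:
    """Return True if the file exists and was modified within the TTL window."""
    if not path.exists():
        return False
    current = time.time() if now is None else now
    return (current - path.stat().st_mtime) < ttl_seconds
```

Creating the temporary file in the target's own directory matters: os.replace is only atomic when source and destination are on the same filesystem.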
Outbound HTTP (alert delivery)
When the user configures alert routes, DataPulse sends POST requests to user-specified endpoints:
- Webhook: arbitrary URL provided by the user
- Feishu: Feishu bot webhook URL provided by the user
- Telegram: Telegram Bot API (api.telegram.org) using a user-provided bot token
Alert delivery only fires when: (1) a watch mission matches new content, AND (2) the user has explicitly configured a route with a destination URL or token. No outbound POST occurs without user-configured routes.
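The two delivery conditions can be expressed as a small gating function. The sketch below only builds the POST requests and never sends them. The Route dataclass, its field names, and the generic webhook payload are hypothetical; the chat_id field is an assumption added because Telegram's sendMessage method requires a target chat.

```python
import json
import urllib.request
from dataclasses import dataclass
from typing import Optional


@dataclass
class Route:
    """Hypothetical route record; DataPulse's real route schema may differ."""
    kind: str                         # "webhook", "feishu", or "telegram"
    url: Optional[str] = None         # destination for webhook and Feishu routes
    bot_token: Optional[str] = None   # Telegram only
    chat_id: Optional[str] = None     # Telegram only (assumed field)


def build_alert_request(route: Route, text: str) -> Optional[urllib.request.Request]:
    """Build (but do not send) the POST for a route, or None if it is not fully configured."""
    if route.kind == "telegram":
        if not (route.bot_token and route.chat_id):
            return None
        url = f"https://api.telegram.org/bot{route.bot_token}/sendMessage"
        body = {"chat_id": route.chat_id, "text": text}
    elif route.kind == "feishu":
        if not route.url:
            return None
        url = route.url
        body = {"msg_type": "text", "content": {"text": text}}
    elif route.kind == "webhook":
        if not route.url:
            return None
        url = route.url
        body = {"text": text}  # illustrative payload shape
    else:
        return None
    data = json.dumps(body).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return urllib.request.Request(url, data=data, method="POST", headers=headers)


def plan_deliveries(has_new_match: bool, routes: list[Route],
                    text: str) -> list[urllib.request.Request]:
    """Apply both conditions: a new match AND at least one fully configured route."""
    if not has_new_match:
        return []
    candidates = [build_alert_request(route, text) for route in routes]
    return [req for req in candidates if req is not None]
```

With no routes configured, or no new match, plan_deliveries returns an empty list, which mirrors the guarantee that no outbound POST occurs.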
Local Server (optional)
datapulse-console starts a local FastAPI/Uvicorn HTTP server for the browser-based console UI. It binds to localhost by default and is never started automatically; it runs only when the user explicitly runs datapulse-console or python -m datapulse.console_server.
External API Calls (read-only)
Normal operation makes outbound GET/POST requests to:
- Jina AI (r.jina.ai, s.jina.ai): URL reading and web search (requires JINA_API_KEY)
- Tavily (api.tavily.com): web search (requires TAVILY_API_KEY)
- Groq (api.groq.com): YouTube audio transcription fallback (requires GROQ_API_KEY)
- Target URLs: the URLs the user asks to read
All API keys are read from environment variables; none are bundled or hard-coded.
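Reading keys from the environment can be sketched as below. The variable names are the ones listed above; the provider labels and the available_providers helper are illustrative assumptions rather than DataPulse's own configuration code.

```python
import os
from typing import Mapping

# Environment variable names from this document; dictionary keys are illustrative labels.
PROVIDER_KEYS = {
    "jina": "JINA_API_KEY",
    "tavily": "TAVILY_API_KEY",
    "groq": "GROQ_API_KEY",
}


def available_providers(env: Mapping[str, str] = os.environ) -> dict[str, bool]:
    """Report which optional providers have a non-empty API key set."""
    return {name: bool(env.get(var, "").strip()) for name, var in PROVIDER_KEYS.items()}


if __name__ == "__main__":
    for name, enabled in available_providers().items():
        # Only report presence; never print the key itself.
        print(f"{name}: {'configured' if enabled else 'not configured'}")
```

Accepting the environment as a parameter makes the check easy to test without modifying the real process environment.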
Environment Notes
- Python 3.10+
- Optional search enhancement: JINA_API_KEY, TAVILY_API_KEY
- Optional platform enhancement: TG_API_ID, TG_API_HASH, GROQ_API_KEY
- Optional browser sessions: pip install datapulse[browser] (Playwright)
- Optional console UI: pip install datapulse[console] (FastAPI + Uvicorn)