Tech News Digest
Automated tech news digest system with unified data source model, quality scoring pipeline, and template-based output generation.
Quick Start
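1. Configuration Setup: Default configs are in config/defaults/. Copy them to your workspace to customize:
mkdir -p workspace/config
cp config/defaults/sources.json workspace/config/tech-news-digest-sources.json
cp config/defaults/topics.json workspace/config/tech-news-digest-topics.json
2. Environment Variables:
- TWITTERAPI_IO_KEY - twitterapi.io API key (optional, preferred)
- X_BEARER_TOKEN - Twitter/X official API bearer token (optional, fallback)
- TAVILY_API_KEY - Tavily Search API key, alternative to Brave (optional)
- WEB_SEARCH_BACKEND - Web search backend: auto|brave|tavily (optional, default: auto)
- BRAVE_API_KEYS - Brave Search API keys, comma-separated for rotation (optional)
- BRAVE_API_KEY - Single Brave key fallback (optional)
- GITHUB_TOKEN - GitHub personal access token (optional, improves rate limits)
3. Generate Digest:
# Unified pipeline (recommended): runs all 6 sources in parallel, then merges
python3 scripts/run-pipeline.py \
--defaults config/defaults \
--config workspace/config \
--hours 48 --freshness pd \
--archive-dir workspace/archive/tech-news-digest/ \
--output /tmp/td-merged.json --verbose --force
4. Use Templates: Apply Discord, email, or PDF templates to the merged output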
Configuration Files
sources.json - Unified Data Sources
CODEBLOCK2
topics.json - Enhanced Topic Definitions
CODEBLOCK3
Scripts Pipeline
run-pipeline.py - Unified Pipeline (Recommended)
python3 scripts/run-pipeline.py \
--defaults config/defaults [--config CONFIG_DIR] \
--hours 48 --freshness pd \
--archive-dir workspace/archive/tech-news-digest/ \
--output /tmp/td-merged.json --verbose --force
- Features: Runs all 6 fetch steps in parallel, then merges, deduplicates, and scores
- Output: Final merged JSON ready for report generation (~30s total)
- Metadata: Saves per-step timing and counts to *.meta.json
- GitHub Auth: Auto-generates a GitHub App token if $GITHUB_TOKEN is not set
- Fallback: If this fails, run the individual scripts below
Individual Scripts (Fallback)
fetch-rss.py - RSS Feed Fetcher
python3 scripts/fetch-rss.py [--defaults DIR] [--config DIR] [--hours 48] [--output FILE] [--verbose]
- Parallel fetching (10 workers), retry with backoff, feedparser + regex fallback
- Timeout: 30s per feed, ETag/Last-Modified caching
fetch-twitter.py - Twitter/X KOL Monitor
python3 scripts/fetch-twitter.py [--defaults DIR] [--config DIR] [--hours 48] [--output FILE] [--backend auto|official|twitterapiio]
- Backend auto-detection: uses twitterapi.io if TWITTERAPI_IO_KEY is set, else the official X API v2 if X_BEARER_TOKEN is set
- Rate limit handling, engagement metrics, retry with backoff
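The backend auto-detection described above can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the function name `choose_twitter_backend` is hypothetical, and returning `None` when an explicitly requested backend has no key is an assumption.

```python
import os
from typing import Mapping, Optional


def choose_twitter_backend(requested: str = "auto",
                           env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return 'twitterapiio', 'official', or None when no usable key is set.

    Hypothetical helper: mirrors the documented --backend choices only.
    """
    env = os.environ if env is None else env
    has_io_key = bool(env.get("TWITTERAPI_IO_KEY"))
    has_bearer = bool(env.get("X_BEARER_TOKEN"))

    if requested == "twitterapiio":
        return "twitterapiio" if has_io_key else None  # assumed behavior
    if requested == "official":
        return "official" if has_bearer else None  # assumed behavior

    # auto: prefer twitterapi.io, then fall back to the official X API v2
    if has_io_key:
        return "twitterapiio"
    if has_bearer:
        return "official"
    return None
```

Passing the environment as a plain mapping keeps the selection logic easy to check without touching real credentials.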
fetch-web.py - Web Search Engine
python3 scripts/fetch-web.py [--defaults DIR] [--config DIR] [--freshness pd] [--output FILE]
- Auto-detects Brave API rate limit: paid plans → parallel queries, free → sequential
- Without an API key: generates a search interface for agents
fetch-github.py - GitHub Releases Monitor
python3 scripts/fetch-github.py [--defaults DIR] [--config DIR] [--hours 168] [--output FILE]
- Parallel fetching (10 workers), 30s timeout
- Auth priority: $GITHUB_TOKEN → GitHub App auto-generate → gh CLI → unauthenticated (60 req/hr)
fetch-github.py --trending - GitHub Trending Repos
python3 scripts/fetch-github.py --trending [--hours 48] [--output FILE] [--verbose]
- Searches the GitHub API for trending repos across 4 topics (LLM, AI Agent, Crypto, Frontier Tech)
- Quality scoring: base 5 + daily_stars_est / 10, max 15
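As a rough illustration of the trending score formula above, the sketch below assumes the star-rate field is spelled `daily_stars_est` and that "max 15" caps the total score rather than only the bonus. The function name is hypothetical.

```python
def trending_score(daily_stars_est: float) -> float:
    """Base 5 plus one point per 10 estimated daily stars, capped at 15.

    Assumption: the cap applies to the total score.
    """
    return min(15.0, 5.0 + daily_stars_est / 10.0)


if __name__ == "__main__":
    for stars in (0, 50, 500):
        print(stars, trending_score(stars))
```

Under these assumptions a repo needs about 100 estimated stars per day to reach the cap.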
fetch-reddit.py - Reddit Posts Fetcher
python3 scripts/fetch-reddit.py [--defaults DIR] [--config DIR] [--hours 48] [--output FILE]
- Parallel fetching (4 workers), public JSON API (no auth required)
- 13 subreddits with score filtering
enrich-articles.py - Article Full-Text Enrichment
python3 scripts/enrich-articles.py --input merged.json --output enriched.json [--min-score 10] [--max-articles 15] [--verbose]
- Fetches full article text for high-scoring articles
- Cloudflare Markdown for Agents (preferred) → HTML extraction (fallback) → Skip (paywalled/social)
- Blog domain whitelist with lower score threshold (≥3)
- Parallel fetching (5 workers, 10s timeout)
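One way to picture the selection rules above (default threshold 10, whitelisted blog domains at ≥3, at most 15 articles) is the sketch below. The function name, the `score`/`url` field names, and the exact-hostname whitelist match are assumptions for illustration only.

```python
from urllib.parse import urlsplit


def select_for_enrichment(articles, min_score=10, max_articles=15,
                          blog_domains=frozenset(), blog_min_score=3):
    """Pick the highest-scoring articles that clear their threshold.

    Hypothetical helper: whitelisted blog hosts use the lower threshold.
    """
    picked = []
    for art in sorted(articles, key=lambda a: a["score"], reverse=True):
        host = (urlsplit(art["url"]).hostname or "").lower()
        threshold = blog_min_score if host in blog_domains else min_score
        if art["score"] >= threshold:
            picked.append(art)
        if len(picked) == max_articles:
            break
    return picked
```

For example, with `blog_domains={"blog.example.org"}`, a score-4 post on that host is kept while a score-4 post on any other host is skipped.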
merge-sources.py - Quality Scoring & Deduplication
python3 scripts/merge-sources.py --rss FILE --twitter FILE --web FILE --github FILE --reddit FILE
- Quality scoring, title similarity dedup (85%), previous digest penalty
- Output: topic-grouped articles sorted by score
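A minimal sketch of title-similarity deduplication at an 85% threshold is shown below. It uses `difflib.SequenceMatcher` from the standard library as a stand-in; the similarity measure that merge-sources.py actually uses is not stated here, and the function names and the `title`/`score` fields are assumptions.

```python
from difflib import SequenceMatcher


def normalize(title: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not matter."""
    return " ".join(title.lower().split())


def dedup_by_title(articles, threshold=0.85):
    """Keep the highest-scoring article from each group of near-identical titles."""
    kept = []
    for art in sorted(articles, key=lambda a: a["score"], reverse=True):
        title = normalize(art["title"])
        is_duplicate = any(
            SequenceMatcher(None, title, normalize(k["title"])).ratio() >= threshold
            for k in kept
        )
        if not is_duplicate:
            kept.append(art)
    return kept
```

Sorting by score first means that, within a group of similar titles, the best-scored copy is the one that survives.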
validate-config.py - Configuration Validator
python3 scripts/validate-config.py [--defaults DIR] [--config DIR] [--verbose]
- JSON schema validation, topic reference checks, duplicate ID detection
generate-pdf.py - PDF Report Generator
python3 scripts/generate-pdf.py --input report.md --output digest.pdf [--verbose]
- Converts markdown digest to styled A4 PDF with Chinese typography (Noto Sans CJK SC)
- Emoji icons, page headers/footers, blue accent theme. Requires weasyprint.
sanitize-html.py - Safe HTML Email Converter
python3 scripts/sanitize-html.py --input report.md --output email.html [--verbose]
- Converts markdown to XSS-safe HTML email with inline CSS
- URL whitelist (http/https only), HTML-escaped text content
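The two safeguards above (an http/https-only URL whitelist and HTML-escaped text) can be sketched with the standard library as follows. The helper names `safe_url` and `render_link` are hypothetical and do not come from sanitize-html.py.

```python
import html
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https"}


def safe_url(url: str):
    """Return the URL if its scheme is http or https, otherwise None."""
    candidate = url.strip()
    try:
        scheme = urlsplit(candidate).scheme.lower()
    except ValueError:  # malformed URLs are rejected outright
        return None
    return candidate if scheme in ALLOWED_SCHEMES else None


def render_link(text: str, url: str) -> str:
    """Render an anchor tag, or plain escaped text if the URL is rejected."""
    escaped_text = html.escape(text, quote=True)
    checked = safe_url(url)
    if checked is None:
        return escaped_text
    return '<a href="{}">{}</a>'.format(html.escape(checked, quote=True), escaped_text)
```

Rejected links degrade to plain text instead of being dropped, so the article title still appears in the email.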
source-health.py - Source Health Monitor
python3 scripts/source-health.py --rss FILE --twitter FILE --github FILE --reddit FILE --web FILE [--verbose]
- Tracks per-source success/failure history over 7 days
- Reports unhealthy sources (>50% failure rate)
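The health rule above (more than 50% failures within a 7-day window) might look like the sketch below. The history layout, a mapping from source id to `(timestamp, ok)` pairs, and the function name are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def unhealthy_sources(history, now=None, window_days=7, max_failure_rate=0.5):
    """Return {source_id: failure_rate} for sources above the failure threshold.

    history: {source_id: [(timestamp, ok), ...]} (assumed layout).
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=window_days)
    report = {}
    for source_id, runs in history.items():
        recent = [ok for ts, ok in runs if ts >= cutoff]
        if not recent:
            continue  # no runs inside the window, nothing to judge
        failure_rate = recent.count(False) / len(recent)
        if failure_rate > max_failure_rate:
            report[source_id] = failure_rate
    return report
```

Note that a source failing exactly half the time is not flagged, since the documented rule is strictly greater than 50%.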
summarize-merged.py - Merged Data Summary
python3 scripts/summarize-merged.py --input merged.json [--top N] [--topic TOPIC]
- Human-readable summary of merged data for LLM consumption
- Shows top articles per topic with scores and metrics
User Customization
Workspace Configuration Override
Place custom configs in workspace/config/ to override defaults:
- Sources: Append new sources, disable defaults with INLINECODE32
- Topics: Override topic definitions, search queries, display settings
- Merge Logic:
  - Sources with the same id → user version takes precedence
  - Sources with a new id → appended to defaults
  - Topics with the same id → user version completely replaces default
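The merge rules above can be sketched as a small id-keyed merge. This is an illustrative sketch only: `merge_by_id` is a hypothetical name, and it treats both sources and topics as whole-entry replacement, which the text states explicitly only for topics.

```python
def merge_by_id(defaults, overrides):
    """Merge two lists of config entries keyed by their 'id' field.

    Same id: the user entry replaces the default in its original position.
    New id: the user entry is appended after the defaults.
    """
    merged = {entry["id"]: entry for entry in defaults}  # dicts keep insertion order
    for entry in overrides:
        merged[entry["id"]] = entry
    return list(merged.values())


if __name__ == "__main__":
    defaults = [{"id": "a", "enabled": True}, {"id": "b", "enabled": True}]
    overrides = [{"id": "b", "enabled": False}, {"id": "c", "enabled": True}]
    print(merge_by_id(defaults, overrides))
```

Because Python dictionaries preserve insertion order, an overridden entry keeps its default position while new entries land at the end.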
Example Workspace Override
CODEBLOCK18
Templates & Output
Discord Template (references/templates/discord.md)
- Bullet list format with link suppression (<link>)
- Mobile-optimized, emoji headers
- 2000 character limit awareness
Email Template (references/templates/email.md)
- Rich metadata, technical stats, archive links
- Executive summary, top articles section
- HTML-compatible formatting
PDF Template (references/templates/pdf.md)
- A4 layout with Noto Sans CJK SC font for Chinese support
- Emoji icons, page headers/footers with page numbers
- Generated via scripts/generate-pdf.py (requires weasyprint)
Default Sources (151 total)
- RSS Feeds (62): AI labs, tech blogs, crypto news, Chinese tech media
- Twitter/X KOLs (48): AI researchers, crypto leaders, tech executives
- GitHub Repos (28): Major open-source projects (LangChain, vLLM, DeepSeek, Llama, etc.)
- Reddit (13): r/MachineLearning, r/LocalLLaMA, r/CryptoCurrency, r/ChatGPT, r/OpenAI, etc.
- Web Search (4 topics): LLM, AI Agent, Crypto, Frontier Tech
All sources pre-configured with appropriate topic tags and priority levels.
Dependencies
CODEBLOCK19
Optional but Recommended:
- feedparser>=6.0.0 - Better RSS parsing (fallback to regex if unavailable)
- jsonschema - Configuration validation
All scripts work with Python 3.8+ standard library only.
Monitoring & Operations
Health Checks
CODEBLOCK20
Archive Management
- Digests automatically archived to INLINECODE44
- Previous digest titles used for duplicate detection
- Old archives cleaned automatically (90+ days)
Error Handling
- Network Failures: Retry with exponential backoff
- Rate Limits: Automatic retry with appropriate delays
- Invalid Content: Graceful degradation, detailed logging
- Configuration Errors: Schema validation with helpful messages
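Retry with exponential backoff, as listed above, generally follows the pattern sketched below. The helper name, the default attempt count, the base delay, and the choice of `OSError` as the retryable error are all assumptions; the scripts' real values are not documented here.

```python
import time


def retry_with_backoff(func, max_attempts=3, base_delay=1.0,
                       sleep=time.sleep, retry_on=(OSError,)):
    """Call func(); on a retryable error wait base_delay * 2**attempt and retry.

    The final failure is re-raised so callers can log it or degrade gracefully.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))
```

Injecting `sleep` makes the delay schedule observable without actually waiting, which is convenient when checking the behavior.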
API Keys & Environment
Set in ~/.zshenv or similar:
CODEBLOCK21
- Twitter: TWITTERAPI_IO_KEY preferred ($3-5/mo); X_BEARER_TOKEN as fallback; auto mode tries twitterapiio first
- Web Search: Tavily (preferred in auto mode) or Brave; optional, falls back to agent web_search if unavailable
- GitHub: Auto-generates token from GitHub App if PAT not set; unauthenticated fallback (60 req/hr)
- Reddit: No API key needed (uses public JSON API)
Cron / Scheduled Task Integration
OpenClaw Cron (Recommended)
The cron prompt should NOT hardcode the pipeline steps. Instead, reference references/digest-prompt.md and only pass configuration parameters. This ensures the pipeline logic stays in the skill repo and is consistent across all installations.
Daily Digest Cron Prompt
CODEBLOCK22
Weekly Digest Cron Prompt
CODEBLOCK23
Why This Pattern?
- Single source of truth: Pipeline logic lives in digest-prompt.md, not scattered across cron configs
- Portable: Same skill on different OpenClaw instances, just change paths and channel IDs
- Maintainable: Update the skill → all cron jobs pick up changes automatically
- Anti-pattern: Do NOT copy pipeline steps into the cron prompt — it will drift out of sync
Multi-Channel Delivery Limitation
OpenClaw enforces cross-provider isolation: a single session can only send messages to one provider (e.g., Discord OR Telegram, not both). If you need to deliver digests to multiple platforms, create separate cron jobs for each provider:
CODEBLOCK24
Replace DISCORD_CHANNEL_ID delivery with the target platform's delivery in the second job's prompt.
This is a security feature, not a bug — it prevents accidental cross-context data leakage.
Security Notes
Execution Model
This skill uses a prompt template pattern: the agent reads digest-prompt.md and follows its instructions. This is the standard OpenClaw skill execution model, in which the agent interprets structured instructions from skill-provided files. All instructions are shipped with the skill bundle and can be audited before installation.
Network Access
The Python scripts make outbound requests to:
- RSS feed URLs (configured in tech-news-digest-sources.json)
- Twitter/X API (api.x.com or api.twitterapi.io)
- Brave Search API (api.search.brave.com)
- Tavily Search API (api.tavily.com)
- GitHub API (api.github.com)
- Reddit JSON API (reddit.com)
No data is sent to any other endpoints. All API keys are read from environment variables declared in the skill metadata.
Shell Safety
Email delivery uses send-email.py, which constructs proper MIME multipart messages with an HTML body plus an optional PDF attachment. Subject formats are hardcoded (Daily Tech Digest - YYYY-MM-DD). PDF generation uses generate-pdf.py via weasyprint. The prompt template explicitly prohibits interpolating untrusted content (article titles, tweet text, etc.) into shell arguments. Email addresses and subjects must be static placeholder values only.
File Access
Scripts read from config/ and write to workspace/archive/. No files outside the workspace are accessed.
Support & Troubleshooting
Common Issues
1. RSS feeds failing: Check network connectivity, use --verbose for details
2. Twitter rate limits: Reduce sources or increase interval
3. Configuration errors: Run validate-config.py for specific issues
4. No articles found: Check time window (--hours) and source enablement
Debug Mode
All scripts support the --verbose flag for detailed logging and troubleshooting.
Performance Tuning
- Parallel Workers: Adjust MAX_WORKERS in scripts for your system
- Timeout Settings: Increase TIMEOUT for slow networks
- Article Limits: Adjust MAX_ARTICLES_PER_FEED based on needs
Security Considerations
Shell Execution
The digest prompt instructs agents to run Python scripts via shell commands. All script paths and arguments are skill-defined constants; no user input is interpolated into commands. Two scripts use subprocess:
- run-pipeline.py orchestrates child fetch scripts (all within the scripts/ directory)
- fetch-github.py has two subprocess calls:
  1. openssl dgst -sha256 -sign for JWT signing (only if GH_APP_* env vars are set; signs a self-constructed JWT payload, no user content involved)
  2. gh auth token CLI fallback (only if gh is installed; reads from gh's own credential store)
No user-supplied or fetched content is ever interpolated into subprocess arguments. Email delivery uses send-email.py which builds MIME messages programmatically — no shell interpolation. PDF generation uses generate-pdf.py via weasyprint. Email subjects are static format strings only — never constructed from fetched data.
Credential & File Access
Scripts do not directly read ~/.config/, ~/.ssh/, or any credential files. All API tokens are read from environment variables declared in the skill metadata. The GitHub auth cascade is:
1. $GITHUB_TOKEN env var (you control what to provide)
2. GitHub App token generation (only if you set GH_APP_ID, GH_APP_INSTALL_ID, and GH_APP_KEY_FILE; uses inline JWT signing via openssl CLI, no external scripts involved)
3. gh CLI (delegates to gh's own secure credential store)
4. Unauthenticated (60 req/hr, safe fallback)
If you prefer no automatic credential discovery, simply set $GITHUB_TOKEN and the script will use it directly without attempting steps 2-3.
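The cascade above can be sketched as a single resolver function. This is a hedged illustration rather than the code in fetch-github.py: `resolve_github_token`, the `app_token_factory` callback (standing in for the openssl-based JWT flow), and the `gh_path` parameter are all hypothetical names.

```python
import os
import shutil
import subprocess


def resolve_github_token(env=None, app_token_factory=None, gh_path=None):
    """Return (token, source), trying each documented step in order."""
    env = os.environ if env is None else env

    # 1. Explicit token from the environment wins; later steps are never tried.
    token = env.get("GITHUB_TOKEN")
    if token:
        return token, "env"

    # 2. GitHub App token, only when all three GH_APP_* variables are set.
    app_vars = ("GH_APP_ID", "GH_APP_INSTALL_ID", "GH_APP_KEY_FILE")
    if app_token_factory and all(env.get(v) for v in app_vars):
        token = app_token_factory(env)  # hypothetical stand-in for JWT signing
        if token:
            return token, "app"

    # 3. gh CLI, only if the binary is available (empty gh_path disables it).
    gh = gh_path if gh_path is not None else shutil.which("gh")
    if gh:
        try:
            result = subprocess.run([gh, "auth", "token"], capture_output=True,
                                    text=True, timeout=10, check=False)
            if result.returncode == 0 and result.stdout.strip():
                return result.stdout.strip(), "gh"
        except (OSError, subprocess.SubprocessError):
            pass

    # 4. Unauthenticated fallback (60 req/hr).
    return None, "unauthenticated"
```

The early return in step 1 reflects the note above: setting $GITHUB_TOKEN skips steps 2 and 3 entirely.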
Dependency Installation
This skill does not install any packages. requirements.txt lists optional dependencies (feedparser, jsonschema) for reference only. All scripts work with the Python 3.8+ standard library. Users should install optional deps in a virtualenv if desired; the skill never runs pip install.
Input Sanitization
- URL resolution rejects non-HTTP(S) schemes (javascript:, data:, etc.)
- RSS fallback parsing uses simple, non-backtracking regex patterns (no ReDoS risk)
- All fetched content is treated as untrusted data for display only
Network Access
Scripts make outbound HTTP requests to configured RSS feeds, Twitter API, GitHub API, Reddit JSON API, Brave Search API, and Tavily Search API. No inbound connections or listeners are created.
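Tech News Digest
Automated tech news digest system with a unified data source model, quality scoring pipeline, and template-based output generation.
Quick Start
1. Configuration Setup: Default configs are in config/defaults/. Copy them to your workspace to customize: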
mkdir -p workspace/config
cp config/defaults/sources.json workspace/config/tech-news-digest-sources.json
cp config/defaults/topics.json workspace/config/tech-news-digest-topics.json
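2. Environment Variables:
- TWITTERAPI_IO_KEY - twitterapi.io API key (optional, preferred)
- X_BEARER_TOKEN - Twitter/X official API bearer token (optional, fallback)
- TAVILY_API_KEY - Tavily Search API key, alternative to Brave (optional)
- WEB_SEARCH_BACKEND - Web search backend: auto|brave|tavily (optional, default: auto)
- BRAVE_API_KEYS - Brave Search API keys, comma-separated for rotation (optional)
- BRAVE_API_KEY - Single Brave key fallback (optional)
- GITHUB_TOKEN - GitHub personal access token (optional, improves rate limits)
3. Generate Digest:
# Unified pipeline (recommended): runs all 6 sources in parallel, then merges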
python3 scripts/run-pipeline.py \
--defaults config/defaults \
--config workspace/config \
--hours 48 --freshness pd \
--archive-dir workspace/archive/tech-news-digest/ \
--output /tmp/td-merged.json --verbose --force
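4. Use Templates: Apply Discord, email, or PDF templates to the merged output
Configuration Files
sources.json - Unified Data Sources
{
  "sources": [
    {
      "id": "openai-rss",
      "type": "rss",
      "name": "OpenAI Blog",
      "url": "https://openai.com/blog/rss.xml",
      "enabled": true,
      "priority": true,
      "topics": ["llm", "ai-agent"],
      "note": "Official OpenAI updates"
    },
    {
      "id": "sama-twitter",
      "type": "twitter",
      "name": "Sam Altman",
      "handle": "sama",
      "enabled": true,
      "priority": true,
      "topics": ["llm", "frontier-tech"],
      "note": "OpenAI CEO"
    }
  ]
}
topics.json - Enhanced Topic Definitions
{
  "topics": [
    {
      "id": "llm",
      "emoji": "🧠",
      "label": "LLM / Large Models",
      "description": "Large language models, foundation models, breakthroughs",
      "search": {
        "queries": ["latest LLM news", "large language model breakthroughs"],
        "must_include": ["LLM", "large language model", "foundation model"],
        "exclude": ["tutorial", "beginner guide"]
      },
      "display": {
        "max_items": 8,
        "style": "detailed"
      }
    }
  ]
}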
Scripts Pipeline
run-pipeline.py - Unified Pipeline (Recommended)
python3 scripts/run-pipeline.py \
--defaults config/defaults [--config CONFIG_DIR] \
--hours 48 --freshness pd \
--archive-dir workspace/archive/tech-news-digest/ \
--output /tmp/td-merged.json --verbose --force
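- Features: Runs all 6 fetch steps in parallel, then merges, deduplicates, and scores
- Output: Final merged JSON ready for report generation (~30s total)
- Metadata: Saves per-step timing and counts to *.meta.json
- GitHub Auth: Auto-generates a GitHub App token if $GITHUB_TOKEN is not set
- Fallback: If this fails, run the individual scripts below
Individual Scripts (Fallback)
fetch-rss.py - RSS Feed Fetcher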
python3 scripts/fetch-rss.py [--defaults DIR] [--config DIR] [--hours 48] [--output FILE] [--verbose]
- Parallel fetching (10 workers), retry with backoff, feedparser + regex fallback
- Timeout: 30s per feed, ETag/Last-Modified caching
fetch-twitter.py - Twitter/X KOL Monitor
python3 scripts/fetch-twitter.py [--defaults DIR] [--config DIR] [--hours 48] [--output FILE] [--backend auto|official|twitterapiio]
- Backend auto-detection: uses twitterapi.io if TWITTERAPI_IO_KEY is set, else the official X API v2 if X_BEARER_TOKEN is set
- Rate limit handling, engagement metrics, retry with backoff
fetch-web.py - Web Search Engine
python3 scripts/fetch-web.py [--defaults DIR] [--config DIR] [--freshness pd] [--output FILE]
- Auto-detects Brave API rate limit: paid plans → parallel queries, free → sequential
- Without an API key: generates a search interface for agents
fetch-github.py - GitHub Releases Monitor
python3 scripts/fetch-github.py [--defaults DIR] [--config DIR] [--hours 168] [--output FILE]
- Parallel fetching (10 workers), 30s timeout
- Auth priority: $GITHUB_TOKEN → GitHub App auto-generate → gh CLI → unauthenticated (60 req/hr)
fetch-github.py --trending - GitHub Trending Repos
python3 scripts/fetch-github.py --trending [--hours 48] [--output FILE] [--verbose]
- Searches the GitHub API for trending repos across 4 topics (LLM, AI Agent, Crypto, Frontier Tech)
- Quality scoring: base 5 + daily_stars_est / 10, max 15
fetch-reddit.py - Reddit Posts Fetcher
python3 scripts/fetch-reddit.py [--defaults DIR] [--config DIR] [--hours 48] [--output FILE]
- Parallel fetching (4 workers), public JSON API (no auth required)
- 13 subreddits with score filtering
enrich-articles.py - Article Full-Text Enrichment
python3 scripts/enrich-articles.py --input merged.json --output enriched.json [--min-score 10] [--max-articles 15] [--verbose]
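- Fetches full article text for high-scoring articles
- Cloudflare Markdown for Agents (preferred) → HTML extraction (fallback) → Skip (paywalled/social)
- Blog domain whitelist with lower score threshold (≥3)
- Parallel fetching (5 workers, 10s timeout)
merge-sources.py - Quality Scoring & Deduplication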
python3 scripts/merge-sources.py --rss FILE --twitter FILE --web FILE --github FILE --reddit FILE
- Quality scoring, title similarity dedup (85%), previous digest penalty
- Output: topic-grouped articles sorted by score
validate-config.py - Configuration Validator
python3 scripts/validate-config.py [--defaults DIR] [--config DIR] [--verbose]
generate-pdf.py - PDF Report Generator
python3 scripts/generate-pdf.py --input report.md --output digest.pdf [--verbose]
- Converts markdown digest to styled A4 PDF with Chinese typography (Noto Sans CJK SC)
- Emoji icons, page headers/footers, blue accent theme. Requires weasyprint.
sanitize-html.py - Safe HTML Email Converter
python3 scripts/sanitize-html.py --input report.md --output email.html [--verbose]
- Converts markdown to XSS-safe HTML email with inline CSS
- URL whitelist (http/https only), HTML-escaped text content
source-health.py - Source Health Monitor
python3 scripts/source-health.py --rss FILE --twitter FILE --github FILE --reddit FILE --web FILE [--verbose]
- Tracks per-source success/failure history over 7 days
- Reports unhealthy sources (>50% failure rate)
summarize-merged.py - Merged Data Summary
python3 scripts/summarize-merged.py --input merged.json [--top N] [--topic TOPIC]