# Skill Veille - RSS Aggregator
RSS feed aggregator with URL and topic-level deduplication for OpenClaw agents.
Fetches articles from 20+ configured sources, filters already-seen URLs (14-day TTL),
and deduplicates articles covering the same story using Jaccard similarity plus named entities.
No external dependencies: stdlib Python only (urllib, xml.etree, email.utils).
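The topic deduplication can be pictured as a Jaccard comparison over title token sets. A minimal sketch (the helper names below are illustrative, not the skill's actual internals):

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: |A intersect B| / |A union B| over two token sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def title_tokens(title: str) -> set:
    # Lowercased word tokens; the real pipeline also weighs named entities.
    return {w for w in title.lower().split() if len(w) > 2}

# Two headlines covering the same story score well above a typical threshold:
a = title_tokens("Critical OpenSSL vulnerability patched in 3.0.8")
b = title_tokens("OpenSSL 3.0.8 fixes critical vulnerability")
similar = jaccard(a, b) > 0.5
```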
---
## Trigger phrases
- "fais une veille"
- "quoi de neuf en securite / tech / crypto / IA ?"
- "donne-moi les news du jour"
- "articles recents sur [sujet]"
- "veille RSS"
- "digest du matin"
- "nouvelles non vues"
---
## Quick Start
```bash
# 1. Setup
python3 scripts/setup.py
# 2. Validate
python3 scripts/init.py
# 3. Fetch + Score + Send (full pipeline)
python3 scripts/veille.py fetch --filter-seen --filter-topic \
| python3 scripts/veille.py score \
| python3 scripts/veille.py send
```
---
## Setup
### Requirements
- Python 3.9+
- Network access to RSS feeds (public, no auth required)
- No pip installs needed
### Installation
```bash
# From the skill directory
python3 scripts/setup.py
# Validate
python3 scripts/init.py
```
The wizard creates:
- `~/.openclaw/config/veille/config.json` (from `config.example.json`)
- `~/.openclaw/data/veille/` (data directory)
### Customizing sources
Edit `~/.openclaw/config/veille/config.json` and add/remove entries in the `"sources"` dict:
```json
{
"sources": {
"My Blog": "https://example.com/feed.xml",
"BleepingComputer": "https://www.bleepingcomputer.com/feed/"
}
}
```
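For reference, a stdlib-only fetch along the lines the intro describes (`urllib`, `xml.etree`, `email.utils`) might look like this sketch; `veille.py`'s actual parsing is more robust:

```python
import urllib.request
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

def parse_items(xml_text: str):
    """Yield (title, link, published_datetime) from an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    for item in root.iter("item"):
        pub = item.findtext("pubDate")
        yield (
            item.findtext("title", default=""),
            item.findtext("link", default=""),
            parsedate_to_datetime(pub) if pub else None,
        )

def fetch_feed(url: str, timeout: int = 10):
    """Fetch one feed over HTTP and return its parsed items."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return list(parse_items(resp.read().decode("utf-8", "replace")))
```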
---
## Storage and credentials
### Files written by this skill
| Path | Written by | Purpose | Contains secrets |
|------|-----------|---------|-----------------|
| `~/.openclaw/config/veille/config.json` | `setup.py` | Sources, outputs, options | NO |
| `~/.openclaw/data/veille/seen_urls.json` | `veille.py` | URL dedup store (TTL 14d) | NO |
| `~/.openclaw/data/veille/topic_seen.json` | `veille.py` | Topic dedup store (TTL 5d) | NO |
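Both stores are plain JSON maps keyed by URL or topic signature with a first-seen timestamp; pruning expired entries on load can be sketched like this (the field layout here is illustrative):

```python
import json
import time
from pathlib import Path

URL_TTL = 14 * 24 * 3600  # 14-day TTL for seen URLs (topic store uses 5 days)

def load_and_prune(path: Path, ttl: int = URL_TTL) -> dict:
    """Load a {key: first_seen_epoch} store, dropping expired entries."""
    try:
        store = json.loads(path.read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        store = {}
    cutoff = time.time() - ttl
    return {k: ts for k, ts in store.items() if ts >= cutoff}

def mark_seen(store: dict, key: str) -> None:
    # Record the first-seen time only once per key.
    store.setdefault(key, time.time())
```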
### Files read from outside the skill
| Path | Read by | Key accessed | When |
|------|---------|-------------|------|
| `~/.openclaw/openclaw.json` | `dispatch.py` | `channels.telegram.botToken` (read-only) | Only when `telegram_bot` output is enabled and no `bot_token` is set in the output config |
This is the only cross-config read. To avoid it entirely, set `bot_token` explicitly in your output config:
```json
{ "type": "telegram_bot", "bot_token": "YOUR_BOT_TOKEN", "chat_id": "...", "enabled": true }
```
### Output credentials (optional)
Credentials are only used if you enable the corresponding output. None are required for core functionality (RSS fetch + dedup).
| Output | Credential source | What is used |
|--------|-----------------|-------------|
| `telegram_bot` | `~/.openclaw/openclaw.json` or `bot_token` in output config | Bot token (read-only) |
| `mail-client` | Delegated to mail-client skill (its own creds) | Nothing read directly |
| `mail-client` (SMTP fallback) | `smtp_user` / `smtp_pass` in output config | SMTP login |
| `nextcloud` | Delegated to nextcloud-files skill (its own creds) | Nothing read directly |
### Cleanup on uninstall
```bash
python3 scripts/setup.py --cleanup
```
---
## Security model
### Credential isolation
- API keys are read from dedicated files (default `~/.openclaw/secrets/`), never from config.json. The scorer warns at runtime if a key file has overly permissive filesystem permissions.
- SMTP credentials (fallback only) are stored in the output config block — use the mail-client skill delegation to avoid storing SMTP passwords.
### Subprocess boundaries
- Dispatch delegates to other OpenClaw skills via `subprocess.run()` (never `shell=True`). Script paths are validated to reside under `~/.openclaw/workspace/skills/` before execution, preventing path traversal.
- No credentials are passed as subprocess arguments — each skill manages its own authentication.
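A sketch of that delegation pattern under stated assumptions (`dispatch.py`'s real argument handling will differ):

```python
import subprocess
from pathlib import Path

SKILLS_DIR = (Path.home() / ".openclaw" / "workspace" / "skills").resolve()

def run_skill(script: str, *args: str) -> subprocess.CompletedProcess:
    """Run a skill script only if it resolves under the skills directory."""
    path = (SKILLS_DIR / script).resolve()
    if not path.is_relative_to(SKILLS_DIR):  # Python 3.9+; blocks ../ traversal
        raise ValueError(f"refused: {path} escapes {SKILLS_DIR}")
    # argv list, never shell=True: arguments cannot be shell-interpreted
    return subprocess.run(
        ["python3", str(path), *args],
        capture_output=True, text=True, timeout=60,
    )
```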
### File output safety
- The `file` output type validates the target path before writing: only `~/.openclaw/` is allowed by default. Additional directories can be whitelisted via `config.security.allowed_output_dirs`. Sensitive paths (`.ssh`, `.gnupg`, `/etc/`, `.bashrc`, etc.) are always blocked regardless of allowlist.
- Written content is checked for suspicious patterns (shell shebangs, SSH keys, PGP blocks, code injection) and size-limited to 1 MB.
### Cross-config reads
- The only cross-config file read is `~/.openclaw/openclaw.json` for the Telegram bot token, and only when `telegram_bot` output is enabled without an explicit `bot_token`. This read is logged to stderr. Set `bot_token` in the output config to eliminate this read entirely.
### Autonomous dispatch
- When scheduled (cron), the skill can send messages/files to configured outputs without user interaction. All dispatch actions are logged to stderr with an audit summary. Use `enabled: false` on any output to disable it without removing its config.
---
## CLI reference
### `fetch`
```
python3 veille.py fetch [--hours N] [--filter-seen] [--filter-topic] [--sources FILE]
```
Options:
- `--hours N`: lookback window in hours (default: from config, usually 24)
- `--filter-seen`: filter already-seen URLs (uses the `seen_urls.json` TTL store)
- `--filter-topic`: deduplicate by topic (uses `topic_seen.json` + Jaccard similarity)
- `--sources FILE`: path to a custom JSON sources file
Output (JSON on stdout):
```json
{
"hours": 24,
"count": 42,
"skipped_url": 5,
"skipped_topic": 3,
"articles": [...],
"wrapped_listing": "=== UNTRUSTED EXTERNAL CONTENT ..."
}
```
### `seen-stats`
```
python3 veille.py seen-stats
```
Shows URL seen store statistics (count, TTL, file path).
### `topic-stats`
```
python3 veille.py topic-stats
```
Shows topic deduplication store statistics.
### `mark-seen`
```
python3 veille.py mark-seen URL [URL ...]
```
Marks one or more URLs as already seen (prevents them from appearing in future fetches with `--filter-seen`).
### `score`
```
python3 veille.py score [--dry-run]
```
Reads a digest JSON from stdin (output of `fetch`) and scores articles using an OpenAI-compatible LLM.
Returns enriched JSON with `scored`, `ghost_picks`, and per-article `score`/`reason` fields.
Options:
- `--dry-run`: print a summary on stderr without calling the LLM API
When `llm.enabled` is `false` (default), articles pass through unchanged (`"scored": false`).
Pipeline usage:
```bash
python3 veille.py fetch --filter-seen --filter-topic | python3 veille.py score | python3 veille.py send
```
### `send`
```
python3 veille.py send [--profile NAME]
```
Reads a digest JSON from stdin and dispatches to all enabled outputs configured in `config.json`.
Accepts both raw fetch output (`articles` key) and LLM-processed digests (`categories` key).
Output types: `telegram_bot`, `mail-client`, `nextcloud`, `file`.
- `telegram_bot`: bot token is auto-read from the OpenClaw config; no extra setup if Telegram is already configured.
- `mail-client`: delegates to mail-client skill if installed, falls back to raw SMTP config.
- `nextcloud`: delegates to nextcloud-files skill if installed (append mode by default with date separator).
- `file`: writes digest to a local file. Path must be under `~/.openclaw/` (default) or a directory listed in `config.security.allowed_output_dirs`. Sensitive paths and suspicious content are blocked (see Security model).
Configure outputs interactively:
```bash
python3 scripts/setup.py --manage-outputs
```
### `config`
```
python3 veille.py config
```
Prints the active configuration (no secrets).
---
## LLM scoring configuration
The `llm` key in `config.json` controls the optional LLM-based article scoring:
```json
{
"llm": {
"enabled": false,
"base_url": "https://api.openai.com/v1",
"api_key_file": "~/.openclaw/secrets/openai_api_key",
"model": "gpt-4o-mini",
"top_n": 10,
"ghost_threshold": 5
}
}
```
| Key | Default | Description |
|-----|---------|-------------|
| `enabled` | `false` | Enable LLM scoring (requires API key) |
| `base_url` | `https://api.openai.com/v1` | OpenAI-compatible API endpoint |
| `api_key_file` | `~/.openclaw/secrets/openai_api_key` | Path to file containing the API key |
| `model` | `gpt-4o-mini` | Model to use for scoring |
| `top_n` | `10` | Max articles to send to LLM per batch |
| `ghost_threshold` | `5` | Score threshold for `ghost_picks` (blog-worthy articles) |
Scoring rules:
- Only the first `top_n` articles are sent to the LLM. Articles beyond `top_n`
are excluded from the digest entirely. `fetch` returns articles sorted by date
desc, so `top_n` selects the most recent ones. Increase `top_n` to evaluate
more articles per run (higher token cost).
- Score >= `ghost_threshold`: added to the `ghost_picks` list
- Score >= 3: kept in the `articles` list
- Score <= 2: excluded from output
- Articles are sorted by score (descending)
When disabled, the `score` subcommand passes data through unchanged.
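The threshold rules above, sketched as plain Python (assuming the per-article `score` field the scorer adds):

```python
def apply_thresholds(articles: list, ghost_threshold: int = 5):
    """Keep score >= 3, drop score <= 2, sort descending;
    anything at or above ghost_threshold also lands in ghost_picks."""
    kept = sorted(
        (a for a in articles if a.get("score", 0) >= 3),
        key=lambda a: a["score"],
        reverse=True,
    )
    ghost_picks = [a for a in kept if a["score"] >= ghost_threshold]
    return kept, ghost_picks
```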
## Nextcloud output mode
The nextcloud output now defaults to **append mode** with a date separator. Each dispatch adds content below a `## YYYY-MM-DD HH:MM` header, preserving previous entries.
Set `"mode": "overwrite"` in the output config to restore the old behavior:
```json
{ "type": "nextcloud", "path": "/Veille/digest.md", "mode": "overwrite" }
```
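The dated separator that append mode inserts can be produced along these lines (a sketch only; the upload itself is delegated to the nextcloud-files skill):

```python
from datetime import datetime
from typing import Optional

def dated_entry(digest_md: str, now: Optional[datetime] = None) -> str:
    """Wrap digest content under the '## YYYY-MM-DD HH:MM' separator header."""
    ts = (now or datetime.now()).strftime("%Y-%m-%d %H:%M")
    return f"\n## {ts}\n\n{digest_md.rstrip()}\n"
```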
## File output configuration
The `file` output writes digests to the local filesystem. By default, only paths under `~/.openclaw/` are allowed. To authorize additional directories, use `config.security.allowed_output_dirs`:
```json
{
"security": {
"allowed_output_dirs": [
"~/Documents/veille",
"/srv/digests"
]
}
}
```
**Blocked paths** (always rejected, even if inside an allowed directory):
`.ssh`, `.gnupg`, `.config/systemd`, `crontab`, `/etc/`, `.bashrc`, `.profile`, `.bash_profile`, `.zshrc`, `.env`
**Content validation** — written content is rejected if it:
- Exceeds 1 MB
- Contains shell shebangs (`#!/`), SSH keys, PGP blocks, or code injection patterns (`eval(`, `exec(`, `__import__(`, `import os`, `import subprocess`)
All blocked attempts are logged to stderr with the reason.
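A simplified version of those content checks, under stated assumptions (the skill's actual pattern list is longer):

```python
import re
from typing import Optional

MAX_BYTES = 1_000_000  # 1 MB size limit

# Illustrative subset of the documented suspicious patterns
SUSPICIOUS = [
    r"^#!/",                                # shell shebang
    r"-----BEGIN [A-Z ]*PRIVATE KEY-----",  # SSH/TLS private keys
    r"-----BEGIN PGP",                      # PGP blocks
    r"\beval\(|\bexec\(|__import__\(",      # code injection patterns
]

def rejection_reason(text: str) -> Optional[str]:
    """Return why the content is rejected, or None if it is acceptable."""
    if len(text.encode("utf-8")) > MAX_BYTES:
        return "content exceeds 1 MB"
    for pat in SUSPICIOUS:
        if re.search(pat, text, flags=re.MULTILINE):
            return f"suspicious pattern: {pat}"
    return None
```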
---
## Templates (agent usage)
### Basic digest
```python
# In agent tool call:
result = exec("python3 scripts/veille.py fetch --hours 24 --filter-seen --filter-topic")
data = json.loads(result.stdout)
# data["wrapped_listing"] is ready for LLM prompt injection
# data["count"] = number of new articles
# data["articles"] = list of article dicts
```
### Prompt template
```
You are a news analyst. Here are today's articles:
{data["wrapped_listing"]}
Please summarize the 5 most important stories, focusing on security and tech.
```
### Agent workflow example
```
1. Call veille fetch --filter-seen --filter-topic
2. Pipe through veille score (LLM scoring, if enabled)
3. If count > 0: pass wrapped_listing to LLM for analysis
4. LLM produces digest summary
5. Pipe through veille send (dispatches to configured outputs)
```
### Pipeline (CLI)
```bash
python3 scripts/veille.py fetch --filter-seen --filter-topic \
| python3 scripts/veille.py score \
| python3 scripts/veille.py send
```
### Filtering by keyword (post-fetch)
```python
data = json.loads(fetch_output)
security_articles = [
a for a in data["articles"]
if any(kw in a["title"].lower() for kw in ["cve", "vuln", "patch", "breach"])
]
```
---
## Ideas
- Add keyword-based filtering (`--keywords security,cve,linux`)
- Add per-source TTL override in config
- Export digest as HTML or Markdown
- Schedule with cron: `0 8 * * * python3 veille.py fetch --filter-seen --filter-topic`
- Weight articles by source tier for LLM prioritization
- Add OPML import/export for source list management
- Integrate with ntfy or Telegram for real-time alerts on high-priority articles
---
## Combine with
- **mail-client**: send the digest by email after fetching
```
veille fetch --filter-seen | ... | mail-client send
```
- **nextcloud-files**: archive the daily digest as a Markdown file
```
veille fetch --filter-seen | jq -r .wrapped_listing > /tmp/digest.md
nextcloud-files upload /tmp/digest.md /Digests/$(date +%Y-%m-%d).md
```
---
## Troubleshooting
See `references/troubleshooting.md` for detailed troubleshooting steps.
Common issues:
- **No articles returned**: check `--hours` value, verify feed URLs in config
- **XML parse error on a feed**: some feeds use non-standard XML; the skill skips broken items silently
- **All articles filtered as seen**: run `seen-stats` to check store size; reset with `rm ~/.openclaw/data/veille/seen_urls.json`
- **Import error**: ensure you run `veille.py` from its directory or via full path
- **File output blocked**: path is outside `~/.openclaw/` — add the target directory to `config.security.allowed_output_dirs` (see File output configuration)
---
Tags: skill, ai