Scraper
Turn messy public pages into clean, reusable data.
Core Purpose
Scraper is a safe extraction skill for public or user-authorized pages.
It helps the agent:
- fetch page content from a URL
- extract readable text
- strip boilerplate where possible
- save clean output locally
- prepare content for later summarization or analysis
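The fetch, extract, and strip steps above can be sketched with the standard library alone. This is a minimal illustration, not the skill's actual implementation: the `User-Agent` string, the `SKIP_TAGS` set, and the function names `fetch_page` and `html_to_text` are assumptions made for the example.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen

# Hypothetical User-Agent; the source does not say which headers are sent.
USER_AGENT = "scraper-skill-example/0.1"

# Tags treated as boilerplate in this sketch (an assumption, not a spec).
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside", "noscript"}


def fetch_page(url, timeout=10):
    """Download a page and decode it using the declared charset, if any."""
    request = Request(url, headers={"User-Agent": USER_AGENT})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class TextExtractor(HTMLParser):
    """Collect visible text while ignoring content inside SKIP_TAGS."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            text = " ".join(data.split())  # collapse runs of whitespace
            if text:
                self._chunks.append(text)

    def text(self):
        return "\n".join(self._chunks)


def html_to_text(html):
    """Return the readable text of an HTML document, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Calling `html_to_text(fetch_page("https://example.com"))` would chain the two steps. Each text node becomes its own line here, so inline markup such as `<b>` splits a sentence across lines; a fuller implementation would likely track block-level tags instead.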
Safety Boundaries
- Only use on public or user-authorized pages
- Do not bypass logins, paywalls, captchas, robots restrictions, or rate limits
- Do not request or store credentials
- Do not perform stealth scraping, account creation, or identity evasion
- Save outputs locally only
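One way to honor the robots restriction before fetching is to check the site's `robots.txt` rules with the standard-library `urllib.robotparser`. The sketch below is only an illustration under assumptions: the `is_allowed` helper and the `USER_AGENT` value are hypothetical, and the source does not say how the skill enforces this boundary.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical agent name used when matching robots.txt rules.
USER_AGENT = "scraper-skill-example"


def is_allowed(robots_txt, url, user_agent=USER_AGENT):
    """Return True if the given robots.txt text permits fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = "User-agent: *\nDisallow: /private/\n"
    print(is_allowed(rules, "https://example.com/public/page"))
    print(is_allowed(rules, "https://example.com/private/data"))
```

Passing the rules in as text keeps the check testable offline; in practice the text would come from the site's own `/robots.txt`, and a disallowed URL should simply be skipped.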
Runtime Requirements
- Python 3 must be available as `python3`
- No external packages required
Local Storage
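All outputs are stored locally under:
- `~/.openclaw/workspace/memory/scraper/jobs.json`
- `~/.openclaw/workspace/memory/scraper/output/`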
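As a rough sketch of how this storage could be initialized, the snippet below creates the `output/` directory and an empty `jobs.json` under `~/.openclaw/workspace/memory/scraper`. The `init_storage` function name and the `{"jobs": []}` registry layout are assumptions for illustration; the source does not document the file's schema.

```python
import json
from pathlib import Path

# Storage root as described in this document.
DEFAULT_ROOT = Path.home() / ".openclaw" / "workspace" / "memory" / "scraper"


def init_storage(root=DEFAULT_ROOT):
    """Create the output directory and an empty job registry if missing."""
    root = Path(root)
    output_dir = root / "output"
    jobs_file = root / "jobs.json"
    output_dir.mkdir(parents=True, exist_ok=True)
    if not jobs_file.exists():
        # Assumed registry layout: a single "jobs" list.
        jobs_file.write_text(json.dumps({"jobs": []}, indent=2), encoding="utf-8")
    return jobs_file, output_dir
```

The function is idempotent: running it again leaves an existing `jobs.json` untouched.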
Key Workflows
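- Capture a page: `fetch_page.py --url https://example.com`
- Extract readable text: `extract_text.py --url https://example.com`
- Save cleaned content: `save_output.py --url https://example.com --title Example`
- List prior jobs: `list_jobs.py`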
Scripts
| Script | Purpose |
|---|---|
| `init_storage.py` | Initialize scraper storage |
| `fetch_page.py` | Download a page with standard headers |
| `extract_text.py` | Convert HTML into cleaned plain text |
| `save_output.py` | Save extracted output and register a job |
| `list_jobs.py` | Show past scraping jobs |
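To make the "save output and register a job" and "show past jobs" steps concrete, here is a minimal sketch of what such scripts might do. Everything specific in it is an assumption: the function names, the slug-based file naming, the job fields (`url`, `title`, `path`, `saved_at`), and the `{"jobs": []}` layout of `jobs.json` are illustrative and not taken from the actual scripts.

```python
import json
import re
from datetime import datetime, timezone
from pathlib import Path


def slugify(title):
    """Turn a title into a filesystem-friendly name (hypothetical scheme)."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return slug or "untitled"


def save_output(root, url, title, text):
    """Write text under root/output and append a job record to jobs.json."""
    root = Path(root)
    output_dir = root / "output"
    jobs_file = root / "jobs.json"
    output_dir.mkdir(parents=True, exist_ok=True)

    out_path = output_dir / f"{slugify(title)}.txt"
    out_path.write_text(text, encoding="utf-8")

    if jobs_file.exists():
        registry = json.loads(jobs_file.read_text(encoding="utf-8"))
    else:
        registry = {"jobs": []}  # assumed registry layout
    job = {
        "url": url,
        "title": title,
        "path": str(out_path),
        "saved_at": datetime.now(timezone.utc).isoformat(),
    }
    registry["jobs"].append(job)
    jobs_file.write_text(json.dumps(registry, indent=2), encoding="utf-8")
    return job


def list_jobs(root):
    """Return all registered jobs, or an empty list if none exist yet."""
    jobs_file = Path(root) / "jobs.json"
    if not jobs_file.exists():
        return []
    return json.loads(jobs_file.read_text(encoding="utf-8"))["jobs"]
```

In this sketch, saving twice with the same title overwrites the earlier text file while still recording both jobs; a real implementation might add a timestamp or counter to the file name.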