scraper: Web Page Data Extraction

Structured extraction and cleanup for public, user-authorized web pages. Use when the user wants to collect, clean, summarize, or transform content from accessible pages into reusable text or data. Do not use to bypass logins, paywalls, captchas, robots restrictions, or access controls. Local-only output.

Author: admin | Source: ClawHub

Scraper

Turn messy public pages into clean, reusable data.

Core Purpose

Scraper is a safe extraction skill for public, user-authorized pages. It helps the agent:

- fetch page content from a URL
- extract readable text
- strip boilerplate where possible
- save clean output locally
- prepare content for later summarization or analysis
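
The extraction and boilerplate-stripping steps above can be sketched with the standard library alone, in keeping with the no-external-packages requirement. This is a minimal illustration, not the skill's actual code: the `ReadableTextParser` class, the `extract_readable_text` function, and the `SKIP_TAGS` set are hypothetical names and choices.

```python
from html.parser import HTMLParser

# Assumption: these containers are treated as boilerplate in this sketch.
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "noscript"}


class ReadableTextParser(HTMLParser):
    """Collect visible text while skipping boilerplate containers."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside any SKIP_TAGS element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            # Collapse runs of whitespace and drop empty fragments.
            text = " ".join(data.split())
            if text:
                self._chunks.append(text)

    def text(self):
        return "\n".join(self._chunks)


def extract_readable_text(html):
    """Return the readable text of an HTML string, one fragment per line."""
    parser = ReadableTextParser()
    parser.feed(html)
    parser.close()
    return parser.text()
```

A tag-based filter like this is deliberately crude. It removes obvious navigation and script content but will not recognize boilerplate that lives in ordinary `div` elements.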

Safety Boundaries

- Only use on public or user-authorized pages
- Do not bypass logins, paywalls, captchas, robots restrictions, or rate limits
- Do not request or store credentials
- Do not perform stealth scraping, account creation, or identity evasion
- Save outputs locally only
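
One way to honor the robots restriction above is to consult a site's `robots.txt` before fetching. The sketch below uses the standard library's `urllib.robotparser` and works on already-downloaded `robots.txt` text. The `is_allowed` function is a hypothetical helper, and whether the skill performs this check itself is not stated in the source.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if robots_txt permits user_agent to fetch url.

    Assumption: the caller has already retrieved the site's robots.txt
    and passes its contents as a string.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

For example, with a `robots.txt` containing `User-agent: *` followed by `Disallow: /private/`, a URL under `/private/` is rejected while other paths on the same host are allowed.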

Runtime Requirements

- Python 3 must be available as `python3`
- No external packages required

Local Storage

All outputs are stored locally under:

- `~/.openclaw/workspace/memory/scraper/jobs.json`
- `~/.openclaw/workspace/memory/scraper/output/`
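
The storage layout (a `jobs.json` file and an `output/` directory under `~/.openclaw/workspace/memory/scraper/`) could be initialized along the following lines. This is a sketch under assumptions: the `init_storage` function is hypothetical, and the format of `jobs.json` is not documented in the source, so an empty JSON list is used as a placeholder.

```python
import json
from pathlib import Path

# Base directory taken from the storage paths listed for this skill.
DEFAULT_BASE = Path.home() / ".openclaw" / "workspace" / "memory" / "scraper"


def init_storage(base=DEFAULT_BASE):
    """Create the output directory and an empty jobs file if missing."""
    base = Path(base)
    output_dir = base / "output"
    output_dir.mkdir(parents=True, exist_ok=True)

    jobs_file = base / "jobs.json"
    if not jobs_file.exists():
        # Assumption: jobs.json holds a JSON list of job records.
        jobs_file.write_text("[]", encoding="utf-8")
    return jobs_file, output_dir
```

Calling the function a second time leaves an existing `jobs.json` untouched, so initialization is safe to repeat.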

Key Workflows

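- Capture a page: `fetchpage.py --url https://example.com`
- Extract readable text: `extracttext.py --url https://example.com`
- Save cleaned content: `saveoutput.py --url https://example.com --title Example`
- List prior jobs: `listjobs.py`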
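
The save and list workflows above can be pictured as writing a text file and appending a record to a job log. The sketch below is illustrative only: `save_output` and `list_jobs` are hypothetical functions, and the record fields (`url`, `title`, `path`, `saved_at`) and the file-naming scheme are assumptions, since the source does not describe the scripts' internals.

```python
import json
import re
from datetime import datetime, timezone
from pathlib import Path


def save_output(base, url, title, text):
    """Write cleaned text under base/output and log the job in base/jobs.json."""
    base = Path(base)
    output_dir = base / "output"
    output_dir.mkdir(parents=True, exist_ok=True)

    # Assumption: file names are derived from the title; a repeated
    # title overwrites the earlier file in this sketch.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-") or "untitled"
    out_path = output_dir / f"{slug}.txt"
    out_path.write_text(text, encoding="utf-8")

    jobs = list_jobs(base)
    jobs.append({
        "url": url,
        "title": title,
        "path": str(out_path),
        "saved_at": datetime.now(timezone.utc).isoformat(),
    })
    (base / "jobs.json").write_text(json.dumps(jobs, indent=2), encoding="utf-8")
    return out_path


def list_jobs(base):
    """Return the recorded jobs, or an empty list if none exist yet."""
    jobs_file = Path(base) / "jobs.json"
    if not jobs_file.exists():
        return []
    return json.loads(jobs_file.read_text(encoding="utf-8"))
```

Keeping the log as a single JSON file is simple to inspect by hand, at the cost of rewriting the whole file on every save.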

Scripts

| Script | Purpose |
| --- | --- |
| `initstorage.py` | Initialize scraper storage |
| `fetchpage.py` | Download a page with standard headers |
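
As a rough picture of what downloading a page with standard headers might involve, the sketch below builds a request with `urllib.request`. The header values, the `STANDARD_HEADERS` constant, and the `build_request` and `fetch_page` functions are assumptions for illustration. The source does not say which headers `fetchpage.py` sends.

```python
from urllib.request import Request, urlopen

# Assumption: a small, honest header set that identifies the client
# rather than imitating a browser.
STANDARD_HEADERS = {
    "User-Agent": "scraper-skill/0.1 (local use)",
    "Accept": "text/html,application/xhtml+xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en",
}


def build_request(url):
    """Return a GET request for url carrying the standard headers."""
    if not url.startswith(("http://", "https://")):
        raise ValueError(f"unsupported URL scheme: {url!r}")
    return Request(url, headers=STANDARD_HEADERS, method="GET")


def fetch_page(url, timeout=15):
    """Download url and return its body decoded to text."""
    with urlopen(build_request(url), timeout=timeout) as response:
        # Fall back to UTF-8 when the server declares no charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Separating `build_request` from `fetch_page` keeps the header logic checkable without touching the network.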


