
Crawlee Crawler Framework


Author: admin | Source: ClawHub | Version: 1.0.0 | Security check: passed | Downloads: 399 | Price: free | Favorites: 0

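Crawlee Skill

Crawlee is a production-grade web scraping and browser automation library for JavaScript/TypeScript (Node.js 16+) and Python (3.10+). Out of the box it handles anti-blocking, proxies, session management, storage, and concurrency.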

Docs: https://crawlee.dev/js/docs | https://crawlee.dev/python/docs
GitHub: https://github.com/apify/crawlee


1. Choose your crawler

JavaScript / TypeScript

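| Crawler | Use case | Needs JS |
|---|---|---|
| CheerioCrawler | Fast HTML parsing, no JS rendering | ❌ |
| HttpCrawler | Raw HTTP responses, custom parsing | ❌ |
| JSDOMCrawler | DOM manipulation without a full browser | ❌ |
| PlaywrightCrawler | Modern headless browsers (Chromium/Firefox/WebKit) | ✅ |
| PuppeteerCrawler | Headless Chromium/Chrome automation | ✅ |
| AdaptivePlaywrightCrawler | Automatically detects whether JS rendering is needed | Auto |
| BasicCrawler | Custom HTTP logic built from scratch | ❌ |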

Rule of thumb: start with CheerioCrawler. Upgrade to PlaywrightCrawler only when the page needs JS rendering.
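One way to apply this rule of thumb is to check whether the data you need already appears in the raw HTML returned by a plain HTTP request. The sketch below is a stand-alone heuristic that uses only the Python standard library. It is not part of Crawlee, and `TextCollector` and `content_in_static_html` are hypothetical names invented for illustration.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects visible text, skipping the bodies of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def content_in_static_html(html: str, expected_text: str) -> bool:
    """Return True if expected_text is visible without running any JavaScript."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return expected_text in " ".join(parser.chunks)


# Hypothetical pages: one server-rendered, one rendered client-side.
static_page = "<html><body><h1>Widget</h1><span class='price'>$9.99</span></body></html>"
js_page = "<html><body><div id='app'></div><script>render({price: '$9.99'})</script></body></html>"
```

If the expected text is missing from the static HTML (as in `js_page`), that is a hint that a browser-based crawler may be needed. If it is present, an HTTP-only crawler is probably enough.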

Python

| Crawler | Use case |
|---|---|
| BeautifulSoupCrawler | HTML parsing with BeautifulSoup (fast, no JS) |
| ParselCrawler | CSS/XPath selectors, Scrapy-style (fast, no JS) |
| PlaywrightCrawler | Full browser automation (Chromium/Firefox/WebKit) |
| AdaptivePlaywrightCrawler | Automatically decides between HTTP and browser |

2. Installation

JavaScript

bash

    # Recommended: use the CLI
    npx crawlee create my-crawler
    cd my-crawler && npm install

    # Or install manually:
    npm install crawlee

    # For Playwright:
    npm install crawlee playwright
    npx playwright install

    # For Puppeteer:
    npm install crawlee puppeteer

Add to package.json:
json

    { "type": "module" }

Python

bash

    pip install crawlee

    # With BeautifulSoup:
    pip install 'crawlee[beautifulsoup]'

    # With Playwright:
    pip install 'crawlee[playwright]'
    playwright install

3. Core concepts

Two questions every crawler answers

  1. Where to go? → Request objects in the RequestQueue
  2. What to do there? → the requestHandler function (JS) / a decorated handler function (Python)
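The two questions above can be pictured as a loop that pulls URLs from a queue and passes each one to a handler. The following is a conceptual sketch only, not Crawlee's implementation: `FAKE_SITE` is a hypothetical in-memory stand-in for real HTTP fetches, and `run_crawl` and `handler` are illustrative names.

```python
from collections import deque
from typing import Callable

# Hypothetical in-memory "site": URL -> links found on that page.
FAKE_SITE: dict[str, list[str]] = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/b"],
    "https://example.com/b": [],
}


def run_crawl(start_urls: list[str], handler: Callable, max_requests: int = 100) -> list[dict]:
    """Process queued URLs until the queue is empty or the safety limit is hit."""
    queue = deque(start_urls)      # question 1: where to go
    seen = set(start_urls)
    results: list[dict] = []

    def enqueue(url: str) -> None:
        # Skip URLs that were already queued once.
        if url not in seen:
            seen.add(url)
            queue.append(url)

    handled = 0
    while queue and handled < max_requests:
        url = queue.popleft()
        links = FAKE_SITE.get(url, [])                  # stands in for fetching the page
        results.append(handler(url, links, enqueue))    # question 2: what to do there
        handled += 1
    return results


def handler(url: str, links: list[str], enqueue: Callable[[str], None]) -> dict:
    """Extract a small record and enqueue every link found on the page."""
    for link in links:
        enqueue(link)
    return {"url": url, "link_count": len(links)}


results = run_crawl(["https://example.com/"], handler)
```

The `max_requests` argument plays the same role as the safety limit used in the quick-start examples: it caps the crawl even if the handler keeps discovering new links.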

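Key classes (JS)

  • Request: a single URL plus metadata about what to fetch
  • RequestQueue: a dynamic, deduplicated queue of URLs
  • Dataset: append-only storage for structured results (table-like)
  • KeyValueStore: blob storage for screenshots, PDFs, and state
  • ProxyConfiguration: manages proxy rotation
  • SessionPool: manages browser sessions and cookies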
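To make "deduplicated queue" and "append-only storage" concrete, here is a toy model in plain Python. It is an assumption-laden sketch, not Crawlee's code: `unique_key`, `SimpleRequestQueue`, and `SimpleDataset` are hypothetical names, and the normalization (lowercased host, dropped fragment) is only an illustration. Crawlee's actual rules may differ.

```python
from typing import Optional
from urllib.parse import urlsplit, urlunsplit


def unique_key(url: str) -> str:
    """Illustrative normalization: lowercase scheme and host, drop the #fragment."""
    parts = urlsplit(url)
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", parts.query, "")
    )


class SimpleRequestQueue:
    """FIFO queue that silently ignores URLs whose unique key was already added."""

    def __init__(self) -> None:
        self._pending: list[str] = []
        self._seen: set[str] = set()

    def add(self, url: str) -> bool:
        key = unique_key(url)
        if key in self._seen:
            return False  # duplicate, not queued again
        self._seen.add(key)
        self._pending.append(url)
        return True

    def fetch_next(self) -> Optional[str]:
        return self._pending.pop(0) if self._pending else None


class SimpleDataset:
    """Append-only list of result rows; rows are copied in and out."""

    def __init__(self) -> None:
        self._rows: list[dict] = []

    def push(self, row: dict) -> None:
        self._rows.append(dict(row))

    def rows(self) -> list[dict]:
        return list(self._rows)


queue = SimpleRequestQueue()
dataset = SimpleDataset()
```

Adding `https://Example.com/page#top` and then `https://example.com/page` queues only the first, because both reduce to the same key.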

4. Quick-start examples

JavaScript: CheerioCrawler (recommended starting point)

javascript

    import { CheerioCrawler, Dataset } from 'crawlee';

    const crawler = new CheerioCrawler({
        async requestHandler({ $, request, enqueueLinks, log }) {
            const title = $('title').text();
            log.info(`Title of ${request.loadedUrl}: ${title}`);

            // Store the result in the default dataset
            await Dataset.pushData({ url: request.loadedUrl, title });

            // Enqueue all links found on this page
            await enqueueLinks();
        },
        maxRequestsPerCrawl: 100, // safety limit
    });

    await crawler.run(['https://example.com']);

JavaScript: PlaywrightCrawler

javascript

    import { PlaywrightCrawler, Dataset } from 'crawlee';

    const crawler = new PlaywrightCrawler({
        // headless: false, // uncomment to watch the browser
        async requestHandler({ page, request, enqueueLinks, log }) {
            const title = await page.title();
            log.info(`${request.loadedUrl}: ${title}`);
            await Dataset.pushData({ url: request.loadedUrl, title });
            await enqueueLinks();
        },
    });

    await crawler.run(['https://example.com']);

Python: BeautifulSoupCrawler

python

    import asyncio

    from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


    async def main() -> None:
        # Safety limit: stop after 50 requests
        crawler = BeautifulSoupCrawler(max_requests_per_crawl=50)

        @crawler.router.default_handler
        async def handler(context: BeautifulSoupCrawlingContext) -> None:
            title = context.soup.title.string if context.soup.title else None
            context.log.info(f'Processing {context.request.url}: {title}')
            await context.push_data({'url': context.request.url, 'title': title})
            await context.enqueue_links()

        await crawler.run(['https://example.com'])


    if __name__ == '__main__':
        asyncio.run(main())

Python: PlaywrightCrawler

python

    import asyncio

    from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext


    async def main() -> None:
        crawler = PlaywrightCrawler(headless=True, browser_type='chromium')

        @crawler.router.default_handler
        async def handler(context: PlaywrightCrawlingContext) -> None:
            title = await context.page.title()
            await context.push_data({'url': context.request.url, 'title': title})
            await context.enqueue_links()

        await crawler.run(['https://example.com'])


    if __name__ == '__main__':
        asyncio.run(main())



5. Routing: handling multiple page types

Use labels plus a router to handle different kinds of pages (listing pages, detail pages, and so on).

JavaScript

javascript

    // main.js
    import { PlaywrightCrawler } from 'crawlee';
    import { router } from './routes.js';

    const crawler = new PlaywrightCrawler({ requestHandler: router });

    await crawler.run([{ url: 'https://shop.example.com', label: 'START' }]);

javascript

    // routes.js
    import { createPlaywrightRouter } from 'crawlee';

    export const router = createPlaywrightRouter();

    router.addHandler('START', async ({ page, enqueueLinks }) => {
        await enqueueLinks({ selector: 'a.category', label: 'CATEGORY' });
    });

    router.addHandler('CATEGORY', async ({ page, enqueueLinks }) => {
        await enqueueLinks({ selector: 'a.product', label: 'DETAIL' });
        // Enqueue the next page, if there is one
        const next = await page.$('a.next-page');
        if (next) await enqueueLinks({ selector: 'a.next-page', label: 'CATEGORY' });
    });

    router.addDefaultHandler(async ({ page, request, pushData }) => {
        // DETAIL pages: no handler is registered for this label, so they land here
        const title = await page.title();
        const price = await page.$eval('.price', (el) => el.textContent);
        await pushData({ url: request.url, title, price });
    });

Python

python

    from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext

    crawler = BeautifulSoupCrawler()


    @crawler.router.handler('CATEGORY')
    async def category_handler(context: BeautifulSoupCrawlingContext) -> None:
        await context.enqueue_links(selector='a.product', label='DETAIL')


    @crawler.router.default_handler
    async def detail_handler(context: BeautifulSoupCrawlingContext) -> None:
        # Guard against pages that have no <title> element
        title = context.soup.title.string if context.soup.title else None
        await context.push_data({'url': context.request.url, 'title': title})
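The label-based dispatch used in the routing examples can be modeled in a few lines of plain Python. This is a toy illustration, not Crawlee's router: `MiniRouter` and its methods are hypothetical names. It assumes that a request whose label has no registered handler falls back to the default handler, which is the behavior the examples rely on for DETAIL pages.

```python
from typing import Callable, Optional


class MiniRouter:
    """Maps request labels to handler functions, with a default fallback."""

    def __init__(self) -> None:
        self._handlers: dict[str, Callable[[str], str]] = {}
        self._default: Optional[Callable[[str], str]] = None

    def handler(self, label: str) -> Callable:
        """Decorator that registers a handler for one label."""
        def decorator(func: Callable[[str], str]) -> Callable[[str], str]:
            self._handlers[label] = func
            return func
        return decorator

    def default_handler(self, func: Callable[[str], str]) -> Callable[[str], str]:
        """Decorator that registers the fallback handler."""
        self._default = func
        return func

    def dispatch(self, url: str, label: Optional[str] = None) -> str:
        # Unlabeled requests and unknown labels both use the default handler.
        func = self._handlers.get(label, self._default)
        if func is None:
            raise LookupError(f"No handler for label {label!r} and no default handler")
        return func(url)


router = MiniRouter()


@router.handler("CATEGORY")
def category_page(url: str) -> str:
    return f"category:{url}"


@router.default_handler
def detail_page(url: str) -> str:
    return f"detail:{url}"
```

Calling `router.dispatch(url, "DETAIL")` reaches `detail_page` here because no handler is registered under that label.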



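6. Enqueueing links

JavaScript: enqueueLinks()

javascript

    // Enqueue every link on the page
    await enqueueLinks();

    // Filter by glob pattern
    await enqueueLinks({ globs: ['https://example.com/products/**'] });

    // Filter by regular expression
    await enqueueLinks({ regexps: [/\/product\/\d+/] });

    // Enqueue only links matching a specific selector
    await enqueueLinks({ selector: 'a.pagination', label: 'LIST' });

    // Enqueue with a custom label and a request transform
    await enqueueLinks({
        selector: 'a.item',
        label: 'DETAIL',
        transformRequestFunction: (req) => {
            // Merge into userData so the assignment also works when it is unset
            req.userData = { ...req.userData, scrapedAt: new Date().toISOString() };
            return req;
        },
    });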
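The effect of glob and regex filters can be approximated with the Python standard library. The sketch below is not how Crawlee matches URLs: `filter_links` is a hypothetical helper, `fnmatch` treats `*` as matching across `/` (unlike stricter glob dialects), and combining globs and regexps as "match any" is an assumption made for this illustration.

```python
import re
from fnmatch import fnmatchcase
from typing import Iterable, Optional


def filter_links(
    links: Iterable[str],
    globs: Optional[list[str]] = None,
    regexps: Optional[list[str]] = None,
) -> list[str]:
    """Keep links matching at least one pattern; keep everything if no patterns are given."""
    if not globs and not regexps:
        return list(links)
    kept = []
    for link in links:
        glob_hit = any(fnmatchcase(link, pattern) for pattern in (globs or []))
        regex_hit = any(re.search(pattern, link) for pattern in (regexps or []))
        if glob_hit or regex_hit:
            kept.append(link)
    return kept


# Hypothetical links discovered on a page.
links = [
    "https://example.com/products/shoes",
    "https://example.com/product/42",
    "https://example.com/about",
]

by_glob = filter_links(links, globs=["https://example.com/products/*"])
by_regex = filter_links(links, regexps=[r"/product/\d+"])
```

With these inputs the glob keeps only the `/products/shoes` link and the regex keeps only the `/product/42` link.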

Python

python

    await context.enqueue_links()
    await context.enqueue_links(selector='a.product',

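Tags

skill ai

Install via chat

This skill can be installed via chat on the following platforms:

OpenClaw WorkBuddy QClaw Kimi Claude

Option 1: install SkillHub and the skill

Help me install SkillHub and the crawlee-1776152222 skill

Option 2: set SkillHub as the preferred skill installation source

Set SkillHub as my preferred skill installation source, then help me install the crawlee-1776152222 skill

Install via the command line

skillhub install crawlee-1776152222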


File size: 11.7 KB | Published: 2026-4-15 13:31

v1.0.0 (latest) 2026-4-15 13:31
Initial release: Expert guide for building web scrapers and crawlers using Crawlee (JS/TS and Python)
