Crawlee Skill

Crawlee is a production-grade web scraping and browser automation library for JavaScript/TypeScript (Node.js 16+)
and Python (3.10+). It handles anti-blocking, proxies, session management, storage, and concurrency out of the box.

Docs: https://crawlee.dev/js/docs | https://crawlee.dev/python/docs
GitHub: https://github.com/apify/crawlee

1. Choose Your Crawler

JavaScript / TypeScript

Crawler	When to Use	JS Required
CheerioCrawler	Fast HTML parsing, no JS rendering needed	❌
HttpCrawler

Rule of thumb: Start with CheerioCrawler. Upgrade to PlaywrightCrawler only when JS rendering is required.

Python

Crawler	When to Use
BeautifulSoupCrawler	HTML parsing with BeautifulSoup (fast, no JS)
ParselCrawler

2. Installation

JavaScript

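# Recommended: use the CLI
npx crawlee create my-crawler
cd my-crawler && npm install

# Or install manually:
npm install crawlee

# For Playwright:
npm install crawlee playwright
npx playwright install

# For Puppeteer:
npm install crawlee puppeteer

Add to package.json:
{ "type": "module" }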

Python

pip install crawlee

# With BeautifulSoup:
pip install 'crawlee[beautifulsoup]'

# With Playwright:
pip install 'crawlee[playwright]'
playwright install

3. Core Concepts

The Two Questions Every Crawler Answers

1. Where to go? → Request objects in a RequestQueue
2. What to do there? → requestHandler function (JS) / decorated handler (Python)

Key Classes (JS)

- Request: a single URL plus metadata to crawl
- RequestQueue: dynamic, deduplicated queue of URLs
- Dataset: append-only structured result storage (like a table)
- KeyValueStore: blob storage for screenshots, PDFs, state
- ProxyConfiguration: manages proxy rotation
- SessionPool: manages browser sessions and cookies
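
To make the "dynamic, deduplicated queue" idea concrete, here is a minimal sketch that uses only the Python standard library. It is not Crawlee's RequestQueue implementation: the class name `TinyRequestQueue` and its methods are hypothetical and exist only to illustrate the concept.

```python
from collections import deque
from typing import Optional


class TinyRequestQueue:
    """Toy FIFO queue that ignores URLs it has already seen (hypothetical, not Crawlee's API)."""

    def __init__(self) -> None:
        self._pending = deque()  # URLs waiting to be processed
        self._seen = set()       # every URL ever added, used for deduplication

    def add(self, url: str) -> bool:
        """Enqueue url unless it was added before; return True if it was enqueued."""
        if url in self._seen:
            return False
        self._seen.add(url)
        self._pending.append(url)
        return True

    def fetch_next(self) -> Optional[str]:
        """Return the next pending URL, or None when the queue is empty."""
        return self._pending.popleft() if self._pending else None


queue = TinyRequestQueue()
queue.add('https://example.com/')
queue.add('https://example.com/about')
queue.add('https://example.com/')  # duplicate, ignored
```

The queue is "dynamic" in the sense that handlers can keep calling `add()` while the crawl is running, and the seen-set keeps pages that link to each other from being visited twice.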

4. Quick Start Examples

JavaScript — CheerioCrawler (Recommended Start)

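import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
  async requestHandler({ $, request, enqueueLinks, log }) {
    const title = $('title').text();
    log.info(`Title of ${request.loadedUrl}: ${title}`);

    await Dataset.pushData({ url: request.loadedUrl, title });

    // Enqueue all links found on this page
    await enqueueLinks();
  },
  maxRequestsPerCrawl: 100, // safety limit
});

await crawler.run(['https://example.com']);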

JavaScript — PlaywrightCrawler

CODEBLOCK4

Python — BeautifulSoupCrawler

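import asyncio
from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext

async def main() -> None:
    crawler = BeautifulSoupCrawler(max_requests_per_crawl=50)

    @crawler.router.default_handler
    async def handler(context: BeautifulSoupCrawlingContext) -> None:
        title = context.soup.title.string if context.soup.title else None
        context.log.info(f'Processing {context.request.url}: {title}')
        await context.push_data({'url': context.request.url, 'title': title})
        await context.enqueue_links()

    await crawler.run(['https://example.com'])

if __name__ == '__main__':
    asyncio.run(main())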

Python — PlaywrightCrawler

import asyncio
from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext

async def main() -> None:
    crawler = PlaywrightCrawler(headless=True, browser_type='chromium')

    @crawler.router.default_handler
    async def handler(context: PlaywrightCrawlingContext) -> None:
        title = await context.page.title()
        await context.push_data({'url': context.request.url, 'title': title})
        await context.enqueue_links()

    await crawler.run(['https://example.com'])

if __name__ == '__main__':
    asyncio.run(main())

5. Routing — Handling Multiple Page Types

Use labels + router to handle different kinds of pages (list pages, detail pages, etc.).

JavaScript

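import { PlaywrightCrawler, Dataset } from 'crawlee';
import { router } from './routes.js';

const crawler = new PlaywrightCrawler({ requestHandler: router });

await crawler.run([{ url: 'https://shop.example.com', label: 'START' }]);

// routes.js
import { createPlaywrightRouter } from 'crawlee';

export const router = createPlaywrightRouter();

router.addHandler('START', async ({ page, enqueueLinks }) => {
  await enqueueLinks({ selector: 'a.category', label: 'CATEGORY' });
});

router.addHandler('CATEGORY', async ({ page, enqueueLinks }) => {
  await enqueueLinks({ selector: 'a.product', label: 'DETAIL' });
  // Enqueue the next page
  const next = await page.$('a.next-page');
  if (next) await enqueueLinks({ selector: 'a.next-page', label: 'CATEGORY' });
});

router.addDefaultHandler(async ({ page, request, pushData }) => {
  // DETAIL pages
  const title = await page.title();
  const price = await page.$eval('.price', el => el.textContent);
  await pushData({ url: request.url, title, price });
});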

Python

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext

crawler = BeautifulSoupCrawler()

@crawler.router.handler('CATEGORY')
async def category_handler(context: BeautifulSoupCrawlingContext) -> None:
    await context.enqueue_links(selector='a.product', label='DETAIL')

@crawler.router.default_handler
async def detail_handler(context: BeautifulSoupCrawlingContext) -> None:
    # Guard against pages without a <title> element
    title = context.soup.title.string if context.soup.title else None
    await context.push_data({'url': context.request.url, 'title': title})
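
The label-based dispatch can be pictured with a small standard-library sketch. This is not Crawlee's Router: `TinyRouter`, `dispatch`, and the handler functions are hypothetical names used for illustration. It assumes that a request whose label has no dedicated handler falls through to the default handler, which appears to be what the DETAIL label in the example relies on.

```python
from typing import Callable, Dict, Optional

Handler = Callable[[str], str]


class TinyRouter:
    """Toy label-to-handler dispatcher (hypothetical, not Crawlee's Router)."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Handler] = {}
        self._default: Optional[Handler] = None

    def handler(self, label: str) -> Callable[[Handler], Handler]:
        """Decorator that registers a handler for one label."""
        def register(func: Handler) -> Handler:
            self._handlers[label] = func
            return func
        return register

    def default_handler(self, func: Handler) -> Handler:
        """Decorator that registers the fallback handler."""
        self._default = func
        return func

    def dispatch(self, url: str, label: Optional[str] = None) -> str:
        """Call the handler registered for label, else the default handler."""
        func = self._handlers.get(label) if label is not None else None
        func = func or self._default
        if func is None:
            raise LookupError(f'No handler for label {label!r}')
        return func(url)


router = TinyRouter()


@router.handler('CATEGORY')
def category(url: str) -> str:
    return f'category page: {url}'


@router.default_handler
def detail(url: str) -> str:
    return f'detail page: {url}'
```

In this sketch, `router.dispatch(url, 'CATEGORY')` reaches `category`, while a request labeled DETAIL or carrying no label at all reaches `detail`.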

6. Enqueuing Links

JavaScript — `enqueueLinks()`

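// Enqueue all links on the page
await enqueueLinks();

// Filter by glob pattern
await enqueueLinks({ globs: ['https://example.com/products/**'] });

// Filter by regular expression
await enqueueLinks({ regexps: [/\/product\/\d+/] });

// Enqueue only links matching a specific selector
await enqueueLinks({ selector: 'a.pagination', label: 'LIST' });

// Enqueue with a custom label and a transform
await enqueueLinks({
  selector: 'a.item',
  label: 'DETAIL',
  transformRequestFunction: (req) => {
    req.userData.scrapedAt = new Date().toISOString();
    return req;
  },
});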

Python

import re

# Inside a request handler:
await context.enqueue_links()
await context.enqueue_links(selector='a.product', label='DETAIL')
await context.enqueue_links(include=[re.compile(r'/products/\d+')])
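
To see what an include pattern does to a set of candidate links, here is a standard-library sketch. The helper `filter_links` is hypothetical and only approximates the filtering step. It assumes links are resolved to absolute URLs first and that a pattern matches anywhere in the URL (search semantics); Crawlee's own matching may differ in detail.

```python
import re
from typing import Iterable, List, Pattern
from urllib.parse import urljoin


def filter_links(base_url: str, hrefs: Iterable[str], include: List[Pattern]) -> List[str]:
    """Resolve hrefs against base_url and keep those matching any include pattern."""
    kept = []
    for href in hrefs:
        absolute = urljoin(base_url, href)  # turn relative links into absolute URLs
        if any(pattern.search(absolute) for pattern in include):
            kept.append(absolute)
    return kept


links = filter_links(
    'https://example.com/shop/',
    ['/products/42', '/about', 'https://example.com/products/7?ref=home', '/products/new'],
    include=[re.compile(r'/products/\d+')],
)
```

Here only the two URLs with a numeric product ID survive; `/about` and `/products/new` are dropped because `\d+` needs at least one digit.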

7. Storage

Dataset (structured results)

CODEBLOCK12

CODEBLOCK13

Data is saved to ./storage/datasets/default/*.json by default.

KeyValueStore (blobs, screenshots, state)

CODEBLOCK14

CODEBLOCK15

Storage location

CODEBLOCK16

Override with env var: CRAWLEE_STORAGE_DIR=/path/to/storage
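
After a crawl, the saved items can be read back with plain file operations. The sketch below uses only the standard library and assumes the layout described above (one JSON file per item under datasets/default inside the storage root). The helper names are hypothetical, and skipping files whose names start with a double underscore is an assumption about how metadata files are named.

```python
import json
import os
from pathlib import Path
from typing import List


def storage_dir() -> Path:
    """Return the storage root: CRAWLEE_STORAGE_DIR if set, else ./storage."""
    return Path(os.environ.get('CRAWLEE_STORAGE_DIR', './storage'))


def load_default_dataset(root: Path) -> List[dict]:
    """Load every JSON item under <root>/datasets/default, ordered by file name."""
    dataset_dir = root / 'datasets' / 'default'
    items = []
    for path in sorted(dataset_dir.glob('*.json')):
        if path.name.startswith('__'):
            continue  # assumed metadata file, not a scraped item
        items.append(json.loads(path.read_text(encoding='utf-8')))
    return items
```

Calling `load_default_dataset(storage_dir())` from the project directory should then return the items pushed during the last run, as long as the assumptions above hold.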

8. Proxy Management

CODEBLOCK17

CODEBLOCK18

CODEBLOCK19

9. Session Management

Sessions tie together cookies, proxy IPs, and headers to simulate a consistent user identity.
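
As a rough mental model, a session can be thought of as a small record that keeps one proxy and its cookies together, and a pool as something that hands out healthy sessions and retires blocked ones. The sketch below is standard-library Python and is not Crawlee's SessionPool: `ToySession`, `ToySessionPool`, `mark_bad`, and the proxy URLs are hypothetical.

```python
import random
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ToySession:
    """One simulated identity: a proxy plus the cookies collected through it."""
    proxy_url: str
    cookies: Dict[str, str] = field(default_factory=dict)
    error_count: int = 0
    retired: bool = False


class ToySessionPool:
    """Hands out non-retired sessions and retires those that fail too often."""

    def __init__(self, proxy_urls: List[str], max_errors: int = 3,
                 seed: Optional[int] = None) -> None:
        self._sessions = [ToySession(proxy_url=p) for p in proxy_urls]
        self._max_errors = max_errors
        self._rng = random.Random(seed)

    def get_session(self) -> ToySession:
        """Pick a random session that has not been retired."""
        usable = [s for s in self._sessions if not s.retired]
        if not usable:
            raise RuntimeError('All sessions are retired')
        return self._rng.choice(usable)

    def mark_bad(self, session: ToySession) -> None:
        """Record a block or error; retire the session once the limit is reached."""
        session.error_count += 1
        if session.error_count >= self._max_errors:
            session.retired = True


pool = ToySessionPool(
    ['http://proxy-a.example:8000', 'http://proxy-b.example:8000'],  # placeholder proxies
    max_errors=1,
)
```

Because cookies live on the session rather than on the crawler, every request made through one session presents the same proxy and cookie jar, which is what makes it look like a single consistent user.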

CODEBLOCK20

CODEBLOCK21

10. Avoiding Blocks

CODEBLOCK22

Anti-blocking checklist:

- ✅ Use CheerioCrawler: it uses got-scraping, which mimics real browser HTTP requests
- ✅ Enable useSessionPool: true with a INLINECODE29
- ✅ Use tiered proxies for automatic failover
- ✅ Set maxRequestsPerMinute to avoid rate limits
- ✅ For browser crawlers, fingerprints are rotated automatically
- ✅ Use INLINECODE31
- ✅ Retire sessions on blocks: INLINECODE32

11. Concurrency & Scaling

CODEBLOCK23

CODEBLOCK24

Scaling notes:

- Crawlee auto-scales concurrency based on CPU/memory
- Don't set minConcurrency too high; it can crash the crawler under load
- INLINECODE34 is smoother than raw concurrency throttling

12. Configuration & Environment Variables

Env Variable	Default	Purpose
CRAWLEE_STORAGE_DIR	./storage	Storage root directory
INLINECODE37

CODEBLOCK25

13. Docker Deployment

CODEBLOCK26

For Cheerio (smaller image):

FROM apify/actor-node:20

14. Common Patterns

Pagination

CODEBLOCK28

Downloading Files

CODEBLOCK29

Taking Screenshots

CODEBLOCK30

Shared State Across Handlers

CODEBLOCK31

Error Handling & Retries

CODEBLOCK32

CODEBLOCK33

Sitemap Crawling

CODEBLOCK34

Run as Web Server

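import { CheerioCrawler } from 'crawlee';
import { createServer } from 'http';

const server = createServer(async (req, res) => {
  // The page to scrape is passed as ?url=...
  const url = new URL(req.url, 'http://localhost').searchParams.get('url');
  if (!url) {
    res.statusCode = 400;
    res.end(JSON.stringify({ error: 'Missing "url" query parameter' }));
    return;
  }
  const crawler = new CheerioCrawler({
    maxRequestsPerCrawl: 1,
    async requestHandler({ $ }) {
      res.end(JSON.stringify({ title: $('title').text() }));
    },
    // Answer with an error if the request still fails after all retries
    async failedRequestHandler({ request }) {
      res.statusCode = 502;
      res.end(JSON.stringify({ error: `Failed to load ${request.url}` }));
    },
  });
  await crawler.run([url]);
});
server.listen(3000);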

15. TypeScript Support

CODEBLOCK36

16. Cloud Deployment (Apify Platform)

CODEBLOCK37

Deploy with: apify push

17. Debugging Tips

CODEBLOCK38

18. Reference Files

For advanced topics, see:

- references/js-api.md: Full JS API quick reference
- INLINECODE47: Full Python API quick reference

Both language docs: https://crawlee.dev
