Crawlee Skill
Crawlee is a production-grade web scraping and browser automation library for JavaScript/TypeScript (Node.js 16+)
and Python (3.10+). It handles anti-blocking, proxies, session management, storage, and concurrency out of the box.
Docs: https://crawlee.dev/js/docs | https://crawlee.dev/python/docs
GitHub: https://github.com/apify/crawlee
1. Choose Your Crawler
JavaScript / TypeScript
| Crawler | When to Use | JS Required |
|---|---|---|
| `CheerioCrawler` | Fast HTML parsing, no JS rendering needed | ❌ |
| `HttpCrawler` | Raw HTTP responses, custom parsing | ❌ |
| `JSDOMCrawler` | DOM manipulation without full browser | ❌ |
| `PlaywrightCrawler` | Modern headless browser (Chromium/Firefox/WebKit) | ✅ |
| `PuppeteerCrawler` | Chromium/Chrome headless automation | ✅ |
| `AdaptivePlaywrightCrawler` | Auto-detects if JS rendering is needed | Auto |
| `BasicCrawler` | Custom HTTP logic from scratch | ❌ |
Rule of thumb: Start with CheerioCrawler. Upgrade to PlaywrightCrawler only when JS rendering is required.
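One way to apply this rule of thumb is to fetch a page over plain HTTP and check whether the data you need is already present in the static HTML. The sketch below is a stdlib-only illustration and not part of Crawlee; `TextCollector`, `needs_js_rendering`, and the two sample HTML strings are hypothetical.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects visible text and skips the contents of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def needs_js_rendering(static_html: str, expected_text: str) -> bool:
    """Return True when the expected text is missing from the static HTML."""
    collector = TextCollector()
    collector.feed(static_html)
    return expected_text not in " ".join(collector.chunks)


# Hypothetical sample pages: one server-rendered, one rendered by a script.
static_page = "<html><body><h1>Widget</h1><p>Price: $10</p></body></html>"
spa_shell = '<html><body><div id="root"></div><script>render("Price: $10")</script></body></html>'
```

If `needs_js_rendering` returns `False` for the pages you care about, an HTTP-based crawler is likely enough; if it returns `True`, a browser crawler is the safer choice.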
Python
| Crawler | When to Use |
|---|---|
| `BeautifulSoupCrawler` | HTML parsing with BeautifulSoup (fast, no JS) |
| `ParselCrawler` | CSS/XPath selectors, Scrapy-style (fast, no JS) |
| `PlaywrightCrawler` | Full browser automation (Chromium/Firefox/WebKit) |
| `AdaptivePlaywrightCrawler` | Auto HTTP vs browser decision |
2. Installation
JavaScript
CODEBLOCK0
Add to package.json:
CODEBLOCK1
Python
```bash
pip install crawlee

# With BeautifulSoup:
pip install 'crawlee[beautifulsoup]'

# With Playwright:
pip install 'crawlee[playwright]'
playwright install
```
3. Core Concepts
The Two Questions Every Crawler Answers
1. Where to go? → `Request` objects in a `RequestQueue`
2. What to do there? → `requestHandler` function (JS) / decorated handler (Python)
Key Classes (JS)
- `Request`: a single URL plus metadata to crawl
- `RequestQueue`: dynamic, deduplicated queue of URLs
- `Dataset`: append-only structured result storage (like a table)
- `KeyValueStore`: blob storage for screenshots, PDFs, state
- `ProxyConfiguration`: manages proxy rotation
- `SessionPool`: manages browser sessions and cookies
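To make the "deduplicated queue" idea concrete, here is a minimal stdlib-only sketch. It is not Crawlee's `RequestQueue`; `SimpleRequest`, `DedupQueue`, and the choice of "URL without fragment" as the uniqueness key are assumptions made for illustration.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SimpleRequest:
    """Hypothetical stand-in for a request: a URL plus metadata."""
    url: str
    label: Optional[str] = None
    user_data: dict = field(default_factory=dict)


class DedupQueue:
    """FIFO queue that silently ignores URLs it has already seen."""

    def __init__(self) -> None:
        self._pending = deque()
        self._seen = set()

    def add(self, request: SimpleRequest) -> bool:
        # Assumed uniqueness key: the URL with any #fragment removed.
        key = request.url.split("#", 1)[0]
        if key in self._seen:
            return False
        self._seen.add(key)
        self._pending.append(request)
        return True

    def fetch_next(self) -> Optional[SimpleRequest]:
        return self._pending.popleft() if self._pending else None
```

Because duplicates are rejected at `add` time, a handler can enqueue every link it finds without tracking which pages were already visited.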
4. Quick Start Examples
JavaScript — CheerioCrawler (Recommended Start)
CODEBLOCK3
JavaScript — PlaywrightCrawler
CODEBLOCK4
Python — BeautifulSoupCrawler
CODEBLOCK5
Python — PlaywrightCrawler
```python
import asyncio

from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    crawler = PlaywrightCrawler(headless=True, browser_type='chromium')

    @crawler.router.default_handler
    async def handler(context: PlaywrightCrawlingContext) -> None:
        title = await context.page.title()
        await context.push_data({'url': context.request.url, 'title': title})
        await context.enqueue_links()

    await crawler.run(['https://example.com'])


if __name__ == '__main__':
    asyncio.run(main())
```
5. Routing — Handling Multiple Page Types
Use labels + router to handle different kinds of pages (list pages, detail pages, etc.).
JavaScript
CODEBLOCK7
CODEBLOCK8
Python
```python
from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext

crawler = BeautifulSoupCrawler()


@crawler.router.handler('CATEGORY')
async def category_handler(context: BeautifulSoupCrawlingContext) -> None:
    # Product links found on a category page are routed by their label.
    await context.enqueue_links(selector='a.product', label='DETAIL')


@crawler.router.default_handler
async def detail_handler(context: BeautifulSoupCrawlingContext) -> None:
    # Guard against pages without a <title> element.
    title = context.soup.title.string if context.soup.title else None
    await context.push_data({'url': context.request.url, 'title': title})
```
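The label-plus-router pattern boils down to a lookup table from label to handler with a fallback. The following stdlib-only sketch illustrates that dispatch logic; `LabelRouter` and the dict-based request shape are hypothetical and do not reflect how Crawlee implements its router.

```python
import asyncio


class LabelRouter:
    """Maps request labels to async handlers, with an optional default."""

    def __init__(self) -> None:
        self._handlers = {}
        self._default = None

    def handler(self, label: str):
        def register(func):
            self._handlers[label] = func
            return func
        return register

    def default_handler(self, func):
        self._default = func
        return func

    async def dispatch(self, request: dict) -> None:
        # Unlabeled requests and unknown labels fall back to the default handler.
        func = self._handlers.get(request.get("label"), self._default)
        if func is None:
            raise LookupError(f"No handler for label {request.get('label')!r}")
        await func(request)


router = LabelRouter()
visited = []


@router.handler("CATEGORY")
async def category(request: dict) -> None:
    visited.append(f"category:{request['url']}")


@router.default_handler
async def detail(request: dict) -> None:
    visited.append(f"detail:{request['url']}")


async def demo() -> list:
    await router.dispatch({"url": "https://shop.example.com/c/1", "label": "CATEGORY"})
    await router.dispatch({"url": "https://shop.example.com/p/9"})
    return visited
```

Running `asyncio.run(demo())` sends the first request to the `CATEGORY` handler and the unlabeled one to the default handler.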
6. Enqueuing Links
JavaScript — enqueueLinks()
CODEBLOCK10
Python
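```python
import re

# Inside a request handler:

# Enqueue every link on the page
await context.enqueue_links()

# Enqueue only links matching a selector, and label them
await context.enqueue_links(selector='a.product', label='DETAIL')

# Enqueue only links whose URL matches a pattern
await context.enqueue_links(include=[re.compile(r'/products/\d+')])
```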
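Conceptually, include filters keep a link when it matches at least one pattern. The sketch below shows that logic with the standard library only; `filter_links` is a hypothetical helper, and note that `fnmatchcase` lets `*` match across `/`, which may differ from the glob semantics Crawlee uses.

```python
import re
from fnmatch import fnmatchcase


def filter_links(links, include_globs=(), include_regexes=()):
    """Keep links matching any glob or regex; keep all links if no filter is given."""
    if not include_globs and not include_regexes:
        return list(links)
    kept = []
    for link in links:
        matches_glob = any(fnmatchcase(link, glob) for glob in include_globs)
        matches_regex = any(regex.search(link) for regex in include_regexes)
        if matches_glob or matches_regex:
            kept.append(link)
    return kept


# Hypothetical links extracted from a page.
sample_links = [
    "https://example.com/products/42",
    "https://example.com/about",
    "https://example.com/products/new",
]
```

With `include_regexes=[re.compile(r'/products/\d+')]` only the numeric product URL survives, while the glob `https://example.com/products/*` keeps both product URLs.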
7. Storage
Dataset (structured results)
CODEBLOCK12
CODEBLOCK13
Data is saved to ./storage/datasets/default/*.json by default.
KeyValueStore (blobs, screenshots, state)
CODEBLOCK14
CODEBLOCK15
Storage location
CODEBLOCK16
Override with env var: CRAWLEE_STORAGE_DIR=/path/to/storage
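For post-processing outside a crawler, the stored results can be read straight from disk. This is a minimal sketch that assumes the layout described above (one JSON file per item under `datasets/default`) and honors `CRAWLEE_STORAGE_DIR`; `storage_dir` and `load_default_dataset` are hypothetical helpers, and skipping files that start with `__` is an assumption about metadata files.

```python
import json
import os
from pathlib import Path


def storage_dir(env=None) -> Path:
    """Resolve the storage root, falling back to ./storage."""
    env = os.environ if env is None else env
    return Path(env.get("CRAWLEE_STORAGE_DIR", "./storage"))


def load_default_dataset(env=None) -> list:
    """Load every item of the default dataset, in file-name order."""
    dataset_dir = storage_dir(env) / "datasets" / "default"
    items = []
    for path in sorted(dataset_dir.glob("*.json")):
        # Assumption: files prefixed with "__" hold metadata, not items.
        if path.name.startswith("__"):
            continue
        items.append(json.loads(path.read_text(encoding="utf-8")))
    return items
```

Passing an explicit `env` mapping keeps the helpers easy to test; in normal use, call them without arguments so they read the real environment.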
8. Proxy Management
CODEBLOCK17
CODEBLOCK18
CODEBLOCK19
9. Session Management
Sessions tie together cookies, proxy IPs, and headers to simulate a consistent user identity.
CODEBLOCK20
CODEBLOCK21
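The idea of a session as a bundle of identity state that gets retired once it looks blocked can be sketched without any library. The classes below are a simplified illustration, not Crawlee's `SessionPool`; the names, the error-score threshold of 3, and the round-robin proxy assignment are all assumptions.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Session:
    """Hypothetical session: one identity made of a proxy and its cookies."""
    id: str
    proxy_url: str
    cookies: dict = field(default_factory=dict)
    error_score: int = 0
    retired: bool = False

    def mark_bad(self, max_error_score: int = 3) -> None:
        # Soft failure: retire only after repeated errors.
        self.error_score += 1
        if self.error_score >= max_error_score:
            self.retired = True

    def retire(self) -> None:
        # Hard failure (for example, a clear block page): retire immediately.
        self.retired = True


class SimpleSessionPool:
    """Hands out a usable session, creating a new one when none is left."""

    def __init__(self, proxy_urls: list) -> None:
        self._proxy_urls = proxy_urls
        self._sessions = []
        self._counter = 0

    def _create(self) -> Session:
        self._counter += 1
        proxy = self._proxy_urls[(self._counter - 1) % len(self._proxy_urls)]
        session = Session(id=f"session_{self._counter}", proxy_url=proxy)
        self._sessions.append(session)
        return session

    def get_session(self) -> Session:
        usable = [s for s in self._sessions if not s.retired]
        return random.choice(usable) if usable else self._create()
```

Keeping the proxy and cookies on the same object is what makes consecutive requests look like one consistent visitor.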
10. Avoiding Blocks
CODEBLOCK22
Anti-blocking checklist:
- ✅ Use `CheerioCrawler`: it uses got-scraping, which mimics a real browser's HTTP requests
- ✅ Enable `useSessionPool: true` with a INLINECODE29
- ✅ Use tiered proxies for automatic failover
- ✅ Set `maxRequestsPerMinute` to avoid rate limits
- ✅ For browser crawlers, fingerprints are rotated automatically
- ✅ Use INLINECODE31
- ✅ Retire sessions on blocks: INLINECODE32
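The "tiered proxies" item can be pictured as an escalation ladder kept per target host: start on the cheapest tier and move up only when that host blocks you. The sketch below is a simplified, stdlib-only illustration of that idea; `TieredProxySelector` is hypothetical and the escalation policy is an assumption, not Crawlee's actual algorithm.

```python
from urllib.parse import urlparse


class TieredProxySelector:
    """Chooses a proxy from the current tier of the request's host."""

    def __init__(self, tiers: list) -> None:
        if not tiers or any(not tier for tier in tiers):
            raise ValueError("every tier needs at least one proxy URL")
        self._tiers = tiers
        self._tier_by_host = {}
        self._cursor = 0

    def proxy_for(self, url: str) -> str:
        host = urlparse(url).hostname
        tier = self._tiers[self._tier_by_host.get(host, 0)]
        # Rotate through the proxies inside the selected tier.
        self._cursor += 1
        return tier[self._cursor % len(tier)]

    def report_blocked(self, url: str) -> None:
        # Escalate this host by one tier, stopping at the last tier.
        host = urlparse(url).hostname
        current = self._tier_by_host.get(host, 0)
        self._tier_by_host[host] = min(current + 1, len(self._tiers) - 1)
```

Tracking the tier per host means a block on one site does not push traffic for every other site onto the more expensive proxies.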
11. Concurrency & Scaling
CODEBLOCK23
CODEBLOCK24
Scaling notes:
- Crawlee auto-scales concurrency based on CPU/memory
- Don't set `minConcurrency` high: the crawler can crash under load
- INLINECODE34 is smoother than raw concurrency throttling
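A per-minute cap such as the `maxRequestsPerMinute` option from the anti-blocking checklist can be thought of as spacing request start times evenly instead of limiting how many run at once. The sketch below illustrates that spacing with an injectable clock; `PerMinuteLimiter` is a hypothetical class and does not describe how Crawlee enforces its limit.

```python
import time


class PerMinuteLimiter:
    """Spaces request starts evenly so at most N begin per minute."""

    def __init__(self, max_per_minute: int, clock=time.monotonic) -> None:
        if max_per_minute <= 0:
            raise ValueError("max_per_minute must be positive")
        self._interval = 60.0 / max_per_minute
        self._clock = clock
        self._next_slot = clock()

    def reserve(self) -> float:
        """Reserve the next start slot and return the seconds to wait for it."""
        now = self._clock()
        slot = max(now, self._next_slot)
        self._next_slot = slot + self._interval
        return slot - now
```

A caller would sleep for the returned delay before sending each request; with `max_per_minute=120`, back-to-back reservations are spaced 0.5 seconds apart.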
12. Configuration & Environment Variables
| Env Variable | Default | Purpose |
|---|---|---|
| `CRAWLEE_STORAGE_DIR` | `./storage` | Storage root directory |
| `CRAWLEE_DEFAULT_DATASET_ID` | `default` | Override default dataset ID |
| `CRAWLEE_DEFAULT_KEY_VALUE_STORE_ID` | `default` | Override default key-value store ID |
| `CRAWLEE_DEFAULT_REQUEST_QUEUE_ID` | `default` | Override default request queue ID |
| `CRAWLEE_PURGE_ON_START` | `true` | Clear storage before each run |
CODEBLOCK25
13. Docker Deployment
CODEBLOCK26
For Cheerio (smaller image):
FROM apify/actor-node:20
14. Common Patterns
Pagination
CODEBLOCK28
Downloading Files
CODEBLOCK29
Taking Screenshots
CODEBLOCK30
Shared State Across Handlers
CODEBLOCK31
Error Handling & Retries
CODEBLOCK32
CODEBLOCK33
Sitemap Crawling
CODEBLOCK34
Run as Web Server
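```javascript
import { CheerioCrawler } from 'crawlee';
import { createServer } from 'http';

const server = createServer(async (req, res) => {
    const url = new URL(req.url, 'http://localhost').searchParams.get('url');
    if (!url) {
        // Reject requests that do not pass ?url=...
        res.writeHead(400).end('Missing "url" query parameter');
        return;
    }
    const crawler = new CheerioCrawler({
        maxRequestsPerCrawl: 1,
        async requestHandler({ $ }) {
            res.end(JSON.stringify({ title: $('title').text() }));
        },
    });
    await crawler.run([url]);
    // If the handler never ran (for example, the request failed), close the response.
    if (!res.writableEnded) res.end();
});

server.listen(3000);
```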
15. TypeScript Support
CODEBLOCK36
16. Cloud Deployment (Apify Platform)
CODEBLOCK37
Deploy with: apify push
17. Debugging Tips
CODEBLOCK38
18. Reference Files
For advanced topics, see:
- `references/js-api.md`: Full JS API quick reference
- INLINECODE47: Full Python API quick reference
Docs for both languages: https://crawlee.dev
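Crawlee Skill
Crawlee is a production-grade web scraping and browser automation library for JavaScript/TypeScript (Node.js 16+) and Python (3.10+). It handles anti-blocking, proxies, session management, storage, and concurrency out of the box.
Docs: https://crawlee.dev/js/docs | https://crawlee.dev/python/docs
GitHub: https://github.com/apify/crawlee
1. Choose Your Crawler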
JavaScript / TypeScript
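| Crawler | When to Use | JS Required |
|---|---|---|
| `CheerioCrawler` | Fast HTML parsing, no JS rendering needed | ❌ |
| `HttpCrawler` | Raw HTTP responses, custom parsing | ❌ |
| `JSDOMCrawler` | DOM manipulation without full browser | ❌ |
| `PlaywrightCrawler` | Modern headless browser (Chromium/Firefox/WebKit) | ✅ |
| `PuppeteerCrawler` | Chromium/Chrome headless automation | ✅ |
| `AdaptivePlaywrightCrawler` | Auto-detects if JS rendering is needed | Auto |
| `BasicCrawler` | Custom HTTP logic from scratch | ❌ |
Rule of thumb: Start with CheerioCrawler. Upgrade to PlaywrightCrawler only when JS rendering is required.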
Python
| Crawler | When to Use |
|---|---|
| `BeautifulSoupCrawler` | HTML parsing with BeautifulSoup (fast, no JS) |
| `ParselCrawler` | CSS/XPath selectors, Scrapy-style (fast, no JS) |
| `PlaywrightCrawler` | Full browser automation (Chromium/Firefox/WebKit) |
| `AdaptivePlaywrightCrawler` | Auto HTTP vs browser decision |
2. Installation
JavaScript
```bash
# Recommended: use the CLI
npx crawlee create my-crawler
cd my-crawler && npm install

# Or install manually:
npm install crawlee

# For Playwright:
npm install crawlee playwright
npx playwright install

# For Puppeteer:
npm install crawlee puppeteer
```
Add to package.json:
```json
{ "type": "module" }
```
Python
```bash
pip install crawlee

# With BeautifulSoup:
pip install 'crawlee[beautifulsoup]'

# With Playwright:
pip install 'crawlee[playwright]'
playwright install
```
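3. Core Concepts
The Two Questions Every Crawler Answers
1. Where to go? → `Request` objects in a `RequestQueue`
2. What to do there? → `requestHandler` function (JS) / decorated handler (Python)
Key Classes (JS)
- `Request`: a single URL plus metadata to crawl
- `RequestQueue`: dynamic, deduplicated queue of URLs
- `Dataset`: append-only structured result storage (like a table)
- `KeyValueStore`: blob storage for screenshots, PDFs, state
- `ProxyConfiguration`: manages proxy rotation
- `SessionPool`: manages browser sessions and cookies
4. Quick Start Examples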
JavaScript: CheerioCrawler (Recommended Start)
```javascript
import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ $, request, enqueueLinks, log }) {
        const title = $('title').text();
        log.info(`Title of ${request.loadedUrl}: ${title}`);
        await Dataset.pushData({ url: request.loadedUrl, title });
        // Enqueue all links found on this page
        await enqueueLinks();
    },
    maxRequestsPerCrawl: 100, // Safety limit
});

await crawler.run(['https://example.com']);
```
JavaScript — PlaywrightCrawler
```javascript
import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
    // headless: false, // Uncomment to watch the browser
    async requestHandler({ page, request, enqueueLinks, log }) {
        const title = await page.title();
        log.info(`${request.loadedUrl}: ${title}`);
        await Dataset.pushData({ url: request.loadedUrl, title });
        await enqueueLinks();
    },
});

await crawler.run(['https://example.com']);
```
Python — BeautifulSoupCrawler
```python
import asyncio

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    crawler = BeautifulSoupCrawler(max_requests_per_crawl=50)

    @crawler.router.default_handler
    async def handler(context: BeautifulSoupCrawlingContext) -> None:
        title = context.soup.title.string if context.soup.title else None
        context.log.info(f'Processing {context.request.url}: {title}')
        await context.push_data({'url': context.request.url, 'title': title})
        await context.enqueue_links()

    await crawler.run(['https://example.com'])


if __name__ == '__main__':
    asyncio.run(main())
```
Python — PlaywrightCrawler
```python
import asyncio

from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    crawler = PlaywrightCrawler(headless=True, browser_type='chromium')

    @crawler.router.default_handler
    async def handler(context: PlaywrightCrawlingContext) -> None:
        title = await context.page.title()
        await context.push_data({'url': context.request.url, 'title': title})
        await context.enqueue_links()

    await crawler.run(['https://example.com'])


if __name__ == '__main__':
    asyncio.run(main())
```
5. Routing: Handling Multiple Page Types
Use labels + router to handle different kinds of pages (list pages, detail pages, etc.).
JavaScript
```javascript
import { PlaywrightCrawler, Dataset } from 'crawlee';
import { router } from './routes.js';

const crawler = new PlaywrightCrawler({ requestHandler: router });
await crawler.run([{ url: 'https://shop.example.com', label: 'START' }]);
```
```javascript
// routes.js
import { createPlaywrightRouter } from 'crawlee';

export const router = createPlaywrightRouter();

router.addHandler('START', async ({ page, enqueueLinks }) => {
    await enqueueLinks({ selector: 'a.category', label: 'CATEGORY' });
});

router.addHandler('CATEGORY', async ({ page, enqueueLinks }) => {
    await enqueueLinks({ selector: 'a.product', label: 'DETAIL' });
    // Enqueue the next page
    const next = await page.$('a.next-page');
    if (next) await enqueueLinks({ selector: 'a.next-page', label: 'CATEGORY' });
});

router.addDefaultHandler(async ({ page, request, pushData }) => {
    // DETAIL pages
    const title = await page.title();
    const price = await page.$eval('.price', (el) => el.textContent);
    await pushData({ url: request.url, title, price });
});
```
Python
```python
from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext

crawler = BeautifulSoupCrawler()


@crawler.router.handler('CATEGORY')
async def category_handler(context: BeautifulSoupCrawlingContext) -> None:
    await context.enqueue_links(selector='a.product', label='DETAIL')


@crawler.router.default_handler
async def detail_handler(context: BeautifulSoupCrawlingContext) -> None:
    # Guard against pages without a <title> element.
    title = context.soup.title.string if context.soup.title else None
    await context.push_data({'url': context.request.url, 'title': title})
```
6. Enqueuing Links
JavaScript — enqueueLinks()
```javascript
// Inside a requestHandler:

// Enqueue all links on the page
await enqueueLinks();

// Filter by glob pattern
await enqueueLinks({ globs: ['https://example.com/products/**'] });

// Filter by regular expression
await enqueueLinks({ regexps: [/\/product\/\d+/] });

// Enqueue only links matching a specific selector
await enqueueLinks({ selector: 'a.pagination', label: 'LIST' });

// Enqueue with a custom label and a transform
await enqueueLinks({
    selector: 'a.item',
    label: 'DETAIL',
    transformRequestFunction: (req) => {
        req.userData.scrapedAt = new Date().toISOString();
        return req;
    },
});
```
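A transform hook like the one above is simply a function applied to each candidate request before it is queued. The stdlib-only Python sketch below shows that pattern in isolation; `build_requests` and `stamp_and_skip_pdfs` are hypothetical, and dropping a link when the transform returns a falsy value is a behavior of this sketch, stated here as an assumption rather than a description of Crawlee.

```python
from datetime import datetime, timezone
from typing import Callable, Optional

Transform = Callable[[dict], Optional[dict]]


def build_requests(urls, label: str, transform: Transform) -> list:
    """Build request dicts for the URLs, passing each through the transform."""
    requests = []
    for url in urls:
        options = {"url": url, "label": label, "userData": {}}
        result = transform(options)
        if result:  # in this sketch, a falsy result drops the link
            requests.append(result)
    return requests


def stamp_and_skip_pdfs(options: dict) -> Optional[dict]:
    """Example transform: skip PDF links and timestamp everything else."""
    if options["url"].lower().endswith(".pdf"):
        return None
    options["userData"]["scrapedAt"] = datetime.now(timezone.utc).isoformat()
    return options
```

Keeping the transform as a plain function makes it easy to unit-test filtering and metadata rules without running a crawl.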
Python
python
await context.enqueue_links()
await context.enqueue_links(selector='a.product',