Scrapling - Adaptive Web Scraping

"Effortless web scraping for the modern web."

Credits

Core Library

- Repository: https://github.com/D4Vinci/Scrapling
- Author: D4Vinci (Karim Shoair)
- License: BSD-3-Clause
- Documentation: https://scrapling.readthedocs.io

API Reverse Engineering Methodology

- GitHub: https://github.com/paoloanzn/free-solscan-api
- X Post: https://x.com/paoloanzn/status/2026361234032046319
- Author: @paoloanzn
- Insight: "Web scraping is 80% reverse engineering"

Installation

CODEBLOCK0

Agent Instructions

When to Use Scrapling

Use Scrapling when:

- Research topics from websites
- Extract data from blogs, news sites, docs
- Crawl multiple pages with Spider
- Gather content for summaries
- Extract brand data from any website
- Reverse engineer APIs from websites

Do NOT use for:

- X/Twitter (use x-tweet-fetcher skill)
- Login-protected sites (unless credentials provided)
- Paywalled content (respect robots.txt)
- Sites that prohibit scraping in their TOS
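
The "respect robots.txt" rule above can be checked in code before any fetch. The following is a minimal sketch using only Python's standard library; the helper name `is_allowed` and the sample robots.txt body are assumptions made for illustration and are not part of Scrapling.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt body permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() works on the file's lines, so no network request happens here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content, used only for this example.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(SAMPLE_ROBOTS, "https://example.com/blog/post"))     # True
print(is_allowed(SAMPLE_ROBOTS, "https://example.com/private/data"))  # False
```

In practice the robots.txt body would be downloaded once per site and the check applied to every URL before it is queued.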

Quick Commands

1. Basic Fetch (Most Common)

CODEBLOCK1

2. Stealthy Fetch (Anti-Bot/Cloudflare)

CODEBLOCK2

3. Dynamic Fetch (Full Browser Automation)

CODEBLOCK3

4. Adaptive Parsing (Survives Design Changes)

CODEBLOCK4

5. Spider (Multiple Pages)

CODEBLOCK5

6. CLI Usage

CODEBLOCK6

Common Patterns

Extract Article Content

CODEBLOCK7

Research Multiple Pages

CODEBLOCK8

Crawl Entire Site (Easy Mode)

Auto-crawl all pages on a domain by following internal links:

CODEBLOCK9
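
Following only internal links means resolving each href against the current page and discarding anything that leaves the host. The sketch below shows that filtering step on its own, using the standard library; `internal_links` is a hypothetical helper name, not a Scrapling function, and treating "internal" as "same host" is an assumption.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def internal_links(page_url: str, hrefs: list[str]) -> list[str]:
    """Resolve hrefs against page_url and keep unique same-host http(s) links."""
    host = urlparse(page_url).netloc
    seen: set[str] = set()
    result: list[str] = []
    for href in hrefs:
        # Resolve relative links and drop any "#fragment" part.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skips mailto:, javascript:, tel:, ...
        if parsed.netloc != host:
            continue  # link points to another site
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


links = internal_links(
    "https://example.com/docs/",
    ["/about", "intro.html#top", "https://other.org/x", "mailto:a@b.c", "/about"],
)
print(links)
```

A spider's parse method could pass the hrefs it extracts through this filter before following them.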

Sitemap Crawl

Crawl pages from sitemap.xml (with fallback to link discovery):

CODEBLOCK10
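
The parsing half of a sitemap crawl can be sketched without any network access: read `Sitemap:` lines from a robots.txt body, then read `<loc>` entries from a sitemap document. The function names and sample inputs below are assumptions made for illustration; only the standard library is used.

```python
import re
import xml.etree.ElementTree as ET


def sitemaps_from_robots(robots_txt: str) -> list[str]:
    """Return the URLs listed on 'Sitemap:' lines of a robots.txt body."""
    return re.findall(
        r"^\s*sitemap:\s*(\S+)", robots_txt, re.IGNORECASE | re.MULTILINE
    )


def urls_from_sitemap(xml_text: str) -> list[str]:
    """Return the text of every <loc> element, ignoring XML namespaces."""
    root = ET.fromstring(xml_text)
    return [
        el.text.strip()
        for el in root.iter()
        # Tags look like "{namespace}loc"; compare only the local name.
        if el.tag.rsplit("}", 1)[-1] == "loc" and el.text
    ]


# Hypothetical inputs, used only for this example.
ROBOTS = "User-agent: *\nSitemap: https://example.com/sitemap.xml\n"
SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""

print(sitemaps_from_robots(ROBOTS))
print(urls_from_sitemap(SITEMAP))
```

Because `urls_from_sitemap` looks at every `<loc>` element, it also returns the child sitemap URLs when given a sitemap index file.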

Firecrawl-Style Crawl (Best of Both Worlds)

Inspired by Firecrawl's behavior - combines sitemap discovery with link following:

CODEBLOCK11
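
One way to picture the combination of sitemap discovery and link following is as a merge of two URL sources into a single, de-duplicated crawl queue. The sketch below is an illustration under assumptions: `plan_crawl` is a hypothetical helper, and the `use_sitemap` values "include" and "skip" follow the names mentioned in the changelog rather than any confirmed API.

```python
from itertools import chain, islice


def plan_crawl(
    start_url: str,
    sitemap_urls: list[str],
    linked_urls: list[str],
    use_sitemap: str = "include",
    limit: int = 50,
) -> list[str]:
    """Merge URL sources into an ordered, de-duplicated queue of at most limit URLs."""
    if use_sitemap not in ("include", "skip"):
        raise ValueError("use_sitemap must be 'include' or 'skip'")
    sources = [[start_url]]
    if use_sitemap == "include":
        sources.append(sitemap_urls)  # sitemap entries come before discovered links
    sources.append(linked_urls)
    # dict.fromkeys removes duplicates while keeping first-seen order.
    unique = dict.fromkeys(chain.from_iterable(sources))
    return list(islice(unique, limit))


sitemap = ["https://example.com/a", "https://example.com/"]
linked = ["https://example.com/b", "https://example.com/a"]
print(plan_crawl("https://example.com/", sitemap, linked))
print(plan_crawl("https://example.com/", sitemap, linked, use_sitemap="skip"))
```

Putting sitemap entries first is a design choice for this sketch: pages the site explicitly advertises are fetched before pages found only through links.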

Handle Errors

CODEBLOCK12

Session Management

CODEBLOCK13

Multiple Session Types in Spider

CODEBLOCK14

Advanced Parsing & Navigation

CODEBLOCK15

Advanced: API Reverse Engineering

"Web scraping is 80% reverse engineering."

This section covers advanced techniques for discovering and replicating APIs directly from websites, often exposing data that is otherwise "hidden" behind paid APIs.

1. API Endpoint Discovery

Many websites load data via client-side requests. Use browser DevTools to find them:

Steps:

1. Open browser DevTools (F12)
2. Go to the Network tab
3. Reload the page
4. Look for XHR or Fetch requests
5. Check whether the endpoints return JSON data

What to look for:

- Requests to /api/* endpoints
- Responses containing structured data (JSON)
- Same endpoints used on both free and paid sections

Example pattern:
CODEBLOCK16
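
The manual steps above can be partly automated. Chrome and Firefox DevTools can export the Network tab as a HAR file, and the sketch below filters such a capture for requests whose responses were JSON. It assumes the standard HAR layout (`log.entries[].request` and `response.content.mimeType`); the function name `json_endpoints` and the sample capture are hypothetical.

```python
def json_endpoints(har: dict) -> list[tuple[str, str]]:
    """Return unique (method, url) pairs from a HAR capture whose responses were JSON."""
    found: list[tuple[str, str]] = []
    for entry in har.get("log", {}).get("entries", []):
        mime = entry.get("response", {}).get("content", {}).get("mimeType", "")
        if "json" not in mime.lower():
            continue  # HTML, images, scripts, ...
        request = entry.get("request", {})
        pair = (request.get("method", "GET"), request.get("url", ""))
        if pair not in found:
            found.append(pair)
    return found


# Hypothetical capture, shaped like a file saved from the Network tab.
SAMPLE_HAR = {
    "log": {
        "entries": [
            {
                "request": {"method": "GET", "url": "https://example.com/"},
                "response": {"content": {"mimeType": "text/html"}},
            },
            {
                "request": {"method": "GET", "url": "https://example.com/api/items"},
                "response": {"content": {"mimeType": "application/json; charset=utf-8"}},
            },
        ]
    }
}

print(json_endpoints(SAMPLE_HAR))
```

A real capture would be loaded with `json.load()` from the exported `.har` file and passed to the same function.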

2. JavaScript Analysis

Auth tokens are often generated client-side. Find the code that builds them in the site's .js files:

Steps:

1. In the Network tab, look at the Initiator column
2. Click the .js file making the request
3. Search for the auth header name (e.g., sol-aut, Authorization, X-API-Key)
4. Find the function that generates the token

Common patterns:

- Plain-text function names: generateToken(), INLINECODE8
- Obfuscated code: search for the header name directly
- Random string generation: Math.random(), INLINECODE10
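
Searching a downloaded script for the header name can also be done offline. The sketch below returns a short window of code around each occurrence so the token-building logic can be read in context. The helper name `find_header_usages` and the sample JavaScript are invented for illustration; only the header name `sol-aut` comes from the text above.

```python
import re


def find_header_usages(js_source: str, header_name: str, context: int = 40) -> list[str]:
    """Return the code surrounding each occurrence of header_name in js_source."""
    snippets = []
    for match in re.finditer(re.escape(header_name), js_source):
        start = max(match.start() - context, 0)
        end = min(match.end() + context, len(js_source))
        snippets.append(js_source[start:end])
    return snippets


# Hypothetical minified script, used only for this example.
SAMPLE_JS = 'function t(){return Math.random().toString(36)};h["sol-aut"]=t();send(h);'

for snippet in find_header_usages(SAMPLE_JS, "sol-aut"):
    # Flag snippets that also mention a random-string primitive.
    print(snippet, "| uses Math.random:", "Math.random" in snippet)
```

Widening `context` helps with heavily minified files, where the generating function may sit far from the line that sets the header.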

3. Replicating Discovered APIs

Once you've found the endpoint and auth pattern:

CODEBLOCK17

4. Cloudscraper Bypass (Cloudflare)

For Cloudflare-protected endpoints, use cloudscraper:

CODEBLOCK18

CODEBLOCK19

5. Complete API Replication Pattern

CODEBLOCK20

6. Discovery Checklist

When approaching a new site:

Step	Action	Tool
1	Open DevTools Network tab	F12
2

Brand Data Extraction (Firecrawl Alternative)

Extract brand data, colors, logos, and copy from any website:

CODEBLOCK21
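
For the color part of brand extraction, one simple heuristic is to count the hex colors that appear in a page's CSS and report the most frequent ones. The sketch below assumes the stylesheet text has already been fetched; `top_colors` is a hypothetical helper, not part of Scrapling, and the regex can misread short id selectors such as `#add` as colors.

```python
import re
from collections import Counter

# Matches #rgb and #rrggbb; longer forms such as #rrggbbaa are ignored.
HEX_COLOR = re.compile(r"#(?:[0-9a-fA-F]{6}|[0-9a-fA-F]{3})\b")


def top_colors(css_text: str, n: int = 3) -> list[str]:
    """Return the n most frequent hex colors in css_text, normalized to #rrggbb."""
    counts: Counter[str] = Counter()
    for raw in HEX_COLOR.findall(css_text):
        digits = raw[1:].lower()
        if len(digits) == 3:
            # Expand shorthand: #f00 becomes #ff0000.
            digits = "".join(ch * 2 for ch in digits)
        counts["#" + digits] += 1
    return [color for color, _ in counts.most_common(n)]


# Hypothetical stylesheet, used only for this example.
SAMPLE_CSS = (
    "a{color:#FF0000} b{color:#f00} c{background:#00ff00} "
    "d{color:#123456;border-color:#00FF00;outline-color:#f00}"
)

print(top_colors(SAMPLE_CSS, 2))  # ['#ff0000', '#00ff00']
```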

Brand Data CLI

CODEBLOCK22

Feature Comparison

Feature	Status	Notes
Basic fetch	✅ Working	Fetcher.get()
Stealthy fetch

Examples Tested

IEEE Spectrum

page = Fetcher.get('https://spectrum.ieee.org/...')
title = page.css('h1::text').get()
content = page.css('article p::text').getall()

✅ Works

Hacker News

page = Fetcher.get('https://news.ycombinator.com')
stories = page.css('.titleline a::text').getall()

✅ Works

Example Domain

page = Fetcher.get('https://example.com')
title = page.css('h1::text').get()

✅ Works

🔧 Quick Troubleshooting

Issue	Solution
403/429 Blocked	Use StealthyFetcher or cloudscraper
Cloudflare

Skill Graph

Related skills:

- [[content-research]] - Research workflow
- [[blogwatcher]] - RSS/feed monitoring
- [[youtube-watcher]] - Video content
- [[chirp]] - Twitter/X interactions
- [[newsletter-digest]] - Content summarization
- [[x-tweet-fetcher]] - X/Twitter (use instead of Scrapling)

Changelog

v1.0.8 (2026-02-25)

- Added: Firecrawl-Style Crawl - Combines sitemap discovery + link following
- Added: use_sitemap parameter - Matches Firecrawl's sitemap:"include"/"skip" behavior
- Verified: cloudflare.com returns 2,447 URLs from sitemap!

v1.0.7 (2026-02-25)

- Fixed: EasyCrawl Spider syntax - Updated to work with scrapling's actual Spider API
- Verified: Spider crawling works - Tested and crawled 20+ pages from example.com

v1.0.6 (2026-02-25)

- Added: Easy Site Crawl - Auto-crawl all pages on a domain with EasyCrawl spider
- Added: Sitemap Crawl - Extract URLs from sitemap.xml and crawl them
- Feature parity with Firecrawl for site crawling capabilities

v1.0.5 (2026-02-25)

- Enhanced: API Reverse Engineering methodology
  - Detailed step-by-step process from @paoloanzn's work
  - Real Solscan case study with exact timeline
  - Added: Step-by-step methodology section
  - Added: Real example documentation (Solscan March 2025 vs Feb 2026)
  - Added: Discovery checklist with 10 steps
  - Documented: How to find auth headers in JS files
  - Documented: Token generation pattern extraction
  - Updated: Cloudscraper integration with multi-attempt pattern
  - Verified: Solscan now patched (Cloudflare on both endpoints)

v1.0.4 (2026-02-25)

- Fixed: Brand Data Extraction API - Corrected selectors for scrapling's Response object
- Fixed .html → .text / INLINECODE14
- Fixed .title() → INLINECODE16
- Fixed .logo img::src → INLINECODE18
- Tested and verified working

v1.0.3 (2026-02-25)

- Added: API Reverse Engineering section
  - API Endpoint Discovery (Network tab analysis)
  - JavaScript Analysis (finding auth logic)
  - Cloudscraper integration for Cloudflare bypass
  - Complete APIReplicator class
  - Discovery checklist
- Added cloudscraper to installation

v1.0.2 (2026-02-25)

- Synced with upstream GitHub README exactly
- Added Brand Data Extraction section
- Clean, core-only version

v1.0.1 (2026-02-25)

- Synced with original Scrapling GitHub README

Last updated: 2026-02-25

Scrapling - Adaptive Web Scraping

Effortless web scraping for the modern web.

Credits

Core Library

- Repository: https://github.com/D4Vinci/Scrapling
- Author: D4Vinci (Karim Shoair)
- License: BSD-3-Clause
- Documentation: https://scrapling.readthedocs.io

API Reverse Engineering Methodology

- GitHub: https://github.com/paoloanzn/free-solscan-api
- X Post: https://x.com/paoloanzn/status/2026361234032046319
- Author: @paoloanzn
- Insight: Web scraping is 80% reverse engineering

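Installation

```bash
# Core library (parser only)
pip install scrapling

# With fetchers (HTTP + browser automation) - recommended
pip install "scrapling[fetchers]"
scrapling install

# With shell (CLI tools) - recommended
pip install "scrapling[shell]"

# With AI (MCP server) - optional
pip install "scrapling[ai]"

# Install everything
pip install "scrapling[all]"

# Browser for stealth/dynamic modes
playwright install chromium

# For Cloudflare bypass (advanced)
pip install cloudscraper
```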

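Agent Instructions

When to Use Scrapling

Use Scrapling when you need to:

- Research topics from websites
- Extract data from blogs, news sites, docs
- Crawl multiple pages with Spider
- Gather content for summaries
- Extract brand data from any website
- Reverse engineer APIs from websites

Do NOT use for:

- X/Twitter (use the x-tweet-fetcher skill)
- Login-protected sites (unless credentials are provided)
- Paywalled content (respect robots.txt)
- Sites that prohibit scraping in their terms of service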

Quick Commands

1. Basic Fetch (Most Common)

```python
from scrapling.fetchers import Fetcher

page = Fetcher.get("https://example.com")

# Extract content
title = page.css("h1::text").get()
paragraphs = page.css("p::text").getall()
```

2. Stealthy Fetch (Anti-Bot/Cloudflare)

```python
from scrapling.fetchers import StealthyFetcher

StealthyFetcher.adaptive = True
page = StealthyFetcher.fetch("https://example.com", headless=True, solve_cloudflare=True)
```

3. Dynamic Fetch (Full Browser Automation)

```python
from scrapling.fetchers import DynamicFetcher

page = DynamicFetcher.fetch("https://example.com", headless=True, network_idle=True)
```

4. Adaptive Parsing (Survives Design Changes)

```python
from scrapling.fetchers import Fetcher

page = Fetcher.get("https://example.com")

# First scrape - save the selector
items = page.css(".product", auto_save=True)

# Later - if the site changes, relocate the elements with adaptive=True
items = page.css(".product", adaptive=True)
```

5. Spider (Multiple Pages)

```python
from scrapling.spiders import Spider, Response


class MySpider(Spider):
    name = "demo"
    start_urls = ["https://example.com"]
    concurrent_requests = 3

    async def parse(self, response: Response):
        for item in response.css(".item"):
            yield {"item": item.css("h2::text").get()}

        # Follow links
        next_page = response.css(".next a")
        if next_page:
            yield response.follow(next_page[0].attrib["href"])


MySpider().start()
```

6. CLI Usage

```bash
# Simple fetch to a file
scrapling extract get https://example.com content.html

# Stealthy fetch (bypasses anti-bot protection)
scrapling extract stealthy-fetch https://example.com content.html

# Interactive shell
scrapling shell https://example.com
```

Common Patterns

Extract Article Content

```python
from scrapling.fetchers import Fetcher

page = Fetcher.get("https://example.com/article")

# Try several selectors for the title
title = (
    page.css('[itemprop="headline"]::text').get()
    or page.css("article h1::text").get()
    or page.css("h1::text").get()
)

# Get the paragraphs
content = page.css("article p::text, .article-body p::text").getall()

print(f"Title: {title}")
print(f"Paragraphs: {len(content)}")
```

Research Multiple Pages

```python
from scrapling.spiders import Spider, Response


class ResearchSpider(Spider):
    name = "research"
    start_urls = ["https://news.ycombinator.com"]
    concurrent_requests = 5

    async def parse(self, response: Response):
        # Take the first 10 story titles on the page
        for item in response.css(".titleline a::text").getall()[:10]:
            yield {"title": item, "source": "HN"}

        # Follow the "More" link to the next page, if present
        more = response.css(".morelink::attr(href)").get()
        if more:
            yield response.follow(more)


ResearchSpider().start()
```

Crawl Entire Site (Easy Mode)

Auto-crawl all pages on a domain by following internal links:

```python
from scrapling.spiders import Spider, Response
from urllib.parse import urljoin, urlparse


class EasyCrawl(Spider):
    """Auto-crawl all pages on a domain."""

    name = "easy_crawl"
    start_urls = ["https://example.com"]
    concurrent_requests = 3

    def __init__(self):
        super().__init__()
        self.visited = set()

    async def parse(self, response: Response):
        # Extract content
        yield {
            "url": response.url,
            "title": response.css("title::text").get(),
            "h1": response.css("h1::text").get(),
        }

        # Follow internal links (capped at 50 pages)
        if len(self.visited) >= 50:
            return

        self.visited.add(response.url)

        host = urlparse(response.url).netloc
        links = response.css("a::attr(href)").getall()[:20]
        for link in links:
            full_url = urljoin(response.url, link)
            # Skip external hosts so the crawl stays on this domain
            if urlparse(full_url).netloc != host:
                continue
            if full_url not in self.visited:
                yield response.follow(full_url)


# Usage
result = EasyCrawl()
result.start()
```

Sitemap Crawl

Crawl pages from sitemap.xml (with fallback to link discovery):

```python
from scrapling.fetchers import Fetcher
from scrapling.spiders import Spider, Response
from urllib.parse import urljoin, urlparse
import re


def get_sitemap_urls(url: str, max_urls: int = 100) -> list:
    """Extract URLs from sitemap.xml - also checks robots.txt."""

    parsed = urlparse(url)
    base_url = f"{parsed.scheme}://{parsed.netloc}"

    sitemap_urls = [
        f"{base_url}/sitemap.xml",
        f"{base_url}/sitemap-index.xml",
        f"{base_url}/sitemap_index.xml",
        f"{base_url}/sitemap-news.xml",
    ]

    all_urls = []

    # First check robots.txt for sitemap URLs
    try:
        robots = Fetcher.get(f"{base_url}/robots.txt")
        if robots.status == 200:
            sitemap_in_robots = re.findall(r"Sitemap:\s*(\S+)", robots.text, re.IGNORECASE)
            for sm in sitemap_in_robots:
                sitemap_urls.insert(0, sm)
    except Exception:
        pass

    # Try each sitemap location
    for sitemap_url in sitemap_urls:
        try:
            page = Fetcher.get(sitemap_url, timeout=10)
            if page.status != 200:
                continue

            text = page.text

            # Check whether it is XML (assumed test: look for <loc> entries)
            if "<loc>" in text:
                urls = re.findall(r"<loc>([^<]+)</loc>", text)
                all_urls.extend(urls[:max_urls])
                print(f"Found {len(urls)} URLs in {sitemap_url}")
        except Exception:
            continue

    return list(set(all_urls))[:max_urls]


def crawl_from_sitemap(domain_url: str, max_pages: int = 50):
    """Crawl pages from the sitemap."""

    print(f"Fetching sitemap for {domain_url}...")
    urls = get_sitemap_urls(domain_url
```
