Scrapling - Adaptive Web Scraping
"Effortless web scraping for the modern web."
Credits
Core Library
- Repository: https://github.com/D4Vinci/Scrapling
- Author: D4Vinci (Karim Shoair)
- License: BSD-3-Clause
- Documentation: https://scrapling.readthedocs.io
API Reverse Engineering Methodology
- GitHub: https://github.com/paoloanzn/free-solscan-api
- X Post: https://x.com/paoloanzn/status/2026361234032046319
- Author: @paoloanzn
- Insight: "Web scraping is 80% reverse engineering"
Installation
CODEBLOCK0
Agent Instructions
When to Use Scrapling
Use Scrapling to:
- Research topics from websites
- Extract data from blogs, news sites, docs
- Crawl multiple pages with Spider
- Gather content for summaries
- Extract brand data from any website
- Reverse engineer APIs from websites
Do NOT use for:
- X/Twitter (use x-tweet-fetcher skill)
- Login-protected sites (unless credentials provided)
- Paywalled content (respect robots.txt)
- Sites that prohibit scraping in their TOS
Quick Commands
1. Basic Fetch (Most Common)
CODEBLOCK1
2. Stealthy Fetch (Anti-Bot/Cloudflare)
CODEBLOCK2
3. Dynamic Fetch (Full Browser Automation)
CODEBLOCK3
4. Adaptive Parsing (Survives Design Changes)
CODEBLOCK4
5. Spider (Multiple Pages)
CODEBLOCK5
6. CLI Usage
CODEBLOCK6
Common Patterns
Extract Article Content
CODEBLOCK7
Research Multiple Pages
CODEBLOCK8
Crawl Entire Site (Easy Mode)
Auto-crawl all pages on a domain by following internal links:
CODEBLOCK9
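What counts as an "internal link" is worth pinning down before crawling. The following is a minimal standard-library sketch, assuming "internal" means "same host after resolving relative URLs"; the helper name `internal_links` is hypothetical and not part of Scrapling.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def internal_links(page_url: str, hrefs: list[str]) -> list[str]:
    """Resolve hrefs against page_url and keep only same-host http(s) links."""
    host = urlparse(page_url).netloc.lower()
    seen: set[str] = set()
    result: list[str] = []
    for href in hrefs:
        # Resolve relative links and drop the #fragment so that anchors
        # do not produce duplicate URLs for the same page.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skips mailto:, javascript:, tel:, ...
        if parsed.netloc.lower() != host:
            continue  # external site
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result
```

Inside a spider's `parse()` method, the list returned by `response.css('a::attr(href)').getall()` could be passed through a helper like this before following each link.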
Sitemap Crawl
Crawl pages from sitemap.xml (with fallback to link discovery):
CODEBLOCK10
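Sitemaps come in two shapes: a `urlset` listing pages and a `sitemapindex` listing further sitemaps. As an illustration, here is a small standard-library sketch that separates the two, assuming the XML text has already been downloaded; `parse_sitemap` is a hypothetical helper name, not a Scrapling function.

```python
import xml.etree.ElementTree as ET


def _local_name(tag: str) -> str:
    """Strip the XML namespace, e.g. '{http://...}loc' -> 'loc'."""
    return tag.rsplit("}", 1)[-1]


def parse_sitemap(xml_text: str) -> tuple[list[str], list[str]]:
    """Return (page_urls, child_sitemap_urls) found in sitemap XML text."""
    root = ET.fromstring(xml_text)
    locs: list[str] = []
    # Only look at <loc> elements directly under each <url>/<sitemap> entry,
    # so that extension tags nested deeper are not picked up.
    for entry in root:
        for child in entry:
            if _local_name(child.tag) == "loc" and child.text:
                locs.append(child.text.strip())
    if _local_name(root.tag) == "sitemapindex":
        return [], locs  # these point at further sitemaps, not pages
    return locs, []
```

When the second list is non-empty, each child sitemap would need to be fetched and parsed in turn before any pages are crawled.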
Firecrawl-Style Crawl (Best of Both Worlds)
Inspired by Firecrawl's behavior - combines sitemap discovery with link following:
CODEBLOCK11
Handle Errors
CODEBLOCK12
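One common way to handle transient failures is to retry with exponential backoff. The sketch below is generic and deliberately independent of Scrapling: the fetch is passed in as a callable, and `fetch_with_retry` and its parameters are hypothetical names chosen for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def fetch_with_retry(
    fetch: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch() up to `attempts` times, doubling the wait after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error
            # Wait base_delay, then 2x, 4x, ... before the next try.
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

A call such as `fetch_with_retry(lambda: Fetcher.get('https://example.com'))` would wrap the basic fetch; the `sleep` parameter exists only so the delay can be stubbed out in tests.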
Session Management
CODEBLOCK13
Multiple Session Types in Spider
CODEBLOCK14
Advanced Parsing & Navigation
CODEBLOCK15
Advanced: API Reverse Engineering
"Web scraping is 80% reverse engineering."
This section covers advanced techniques to discover and replicate APIs directly from websites — often revealing data that's "hidden" behind paid APIs.
1. API Endpoint Discovery
Many websites load data via client-side requests. Use browser DevTools to find them:
Steps:
1. Open browser DevTools (F12)
2. Go to the Network tab
3. Reload the page
4. Look for XHR or Fetch requests
5. Check if endpoints return JSON data
What to look for:
- Requests to /api/* endpoints
- Responses containing structured data (JSON)
- Same endpoints used on both free and paid sections
Example pattern:
CODEBLOCK16
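The manual Network-tab review can also be done offline: browser DevTools can export the captured traffic as a HAR file, which is plain JSON. The following sketch lists the distinct endpoints that returned JSON, assuming a HAR export has already been loaded into a dict (for example with `json.load`); `find_json_endpoints` is a hypothetical helper name.

```python
from urllib.parse import urlparse


def find_json_endpoints(har: dict) -> list[tuple[str, str]]:
    """Return unique (method, url-without-query) pairs whose response was JSON."""
    found: list[tuple[str, str]] = []
    for entry in har.get("log", {}).get("entries", []):
        mime = entry.get("response", {}).get("content", {}).get("mimeType", "")
        if "json" not in mime.lower():
            continue  # skip HTML, scripts, images, ...
        request = entry.get("request", {})
        parsed = urlparse(request.get("url", ""))
        # Drop the query string so paginated calls collapse into one endpoint.
        endpoint = (
            request.get("method", "GET"),
            f"{parsed.scheme}://{parsed.netloc}{parsed.path}",
        )
        if endpoint not in found:
            found.append(endpoint)
    return found
```

Endpoints that show up here are the candidates to inspect more closely for required headers and parameters.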
2. JavaScript Analysis
Auth tokens are often generated client-side. Find the logic that produces them in the site's .js files:
Steps:
1. In the Network tab, look at the Initiator column
2. Click the .js file making the request
3. Search for the auth header name (e.g., sol-aut, Authorization, X-API-Key)
4. Find the function generating the token
Common patterns:
- Plain text function names: generateToken(), INLINECODE8
- Obfuscated: search for the header name directly
- Random string generation: Math.random(), INLINECODE10
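Searching a downloaded bundle for the header name can be scripted instead of done by hand in the DevTools search box. This is a minimal sketch, assuming the JavaScript source is already available as a string; `find_header_usages` is a hypothetical helper, and the sample header name in the usage line is the one mentioned in the steps above.

```python
import re


def find_header_usages(js_source: str, header_name: str, context: int = 60) -> list[str]:
    """Return a snippet of surrounding code for each occurrence of header_name."""
    snippets: list[str] = []
    pattern = re.escape(header_name)  # header names may contain regex characters
    for match in re.finditer(pattern, js_source, flags=re.IGNORECASE):
        start = max(0, match.start() - context)
        end = min(len(js_source), match.end() + context)
        snippets.append(js_source[start:end])
    return snippets
```

For example, `find_header_usages(bundle_text, "sol-aut")` returns the code around each place the header is set, which is usually where the token-generating function is called. Minified bundles put everything on one line, which is why the sketch slices by character offset rather than by line.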
3. Replicating Discovered APIs
Once you've found the endpoint and auth pattern:
CODEBLOCK17
4. Cloudscraper Bypass (Cloudflare)
For Cloudflare-protected endpoints, use cloudscraper:
CODEBLOCK18
CODEBLOCK19
5. Complete API Replication Pattern
CODEBLOCK20
6. Discovery Checklist
When approaching a new site:
| Step | Action | Tool |
|---|---|---|
| 1 | Open DevTools Network tab | F12 |
| 2 | Reload page, filter by XHR/Fetch | Network filter |
| 3 | Look for JSON responses | Response tab |
| 4 | Check if same endpoint used for "premium" data | Compare requests |
| 5 | Find auth header in JS files | Initiator column |
| 6 | Extract token generation logic | JS debugger |
| 7 | Replicate in Python | Replicator class |
| 8 | Test against API | Run script |
Brand Data Extraction (Firecrawl Alternative)
Extract brand data, colors, logos, and copy from any website:
CODEBLOCK21
Brand Data CLI
CODEBLOCK22
Feature Comparison
| Feature | Status | Notes |
|---|---|---|
| Basic fetch | ✅ Working | Fetcher.get() |
| Stealthy fetch | ✅ Working | StealthyFetcher.fetch() |
| Dynamic fetch | ✅ Working | DynamicFetcher.fetch() |
| Adaptive parsing | ✅ Working | auto_save + adaptive |
| Spider crawling | ✅ Working | async def parse() |
| CSS selectors | ✅ Working | .css() |
| XPath | ✅ Working | .xpath() |
| Session management | ✅ Working | FetcherSession, StealthySession |
| Proxy rotation | ✅ Working | ProxyRotator class |
| CLI tools | ✅ Working | scrapling extract |
| Brand data extraction | ✅ Working | extract_brand_data() |
| API reverse engineering | ✅ Working | APIReplicator class |
| Cloudscraper bypass | ✅ Working | cloudscraper integration |
| Easy site crawl | ✅ Working | EasyCrawl class |
| Sitemap crawl | ✅ Working | get_sitemap_urls() |
| MCP server | ❌ Excluded | Not needed |
Examples Tested
IEEE Spectrum
page = Fetcher.get('https://spectrum.ieee.org/...')
title = page.css('h1::text').get()
content = page.css('article p::text').getall()
✅ Works
Hacker News
page = Fetcher.get('https://news.ycombinator.com')
stories = page.css('.titleline a::text').getall()
✅ Works
Example Domain
page = Fetcher.get('https://example.com')
title = page.css('h1::text').get()
✅ Works
🔧 Quick Troubleshooting
| Issue | Solution |
|---|---|
| 403/429 Blocked | Use StealthyFetcher or cloudscraper |
| Cloudflare | Use StealthyFetcher or cloudscraper |
| JavaScript required | Use DynamicFetcher |
| Site changed | Use adaptive=True |
| Paid API exposed | Use API reverse engineering |
| Captcha | Cannot bypass - skip or use official API |
| Auth required | Do NOT bypass - use official API |
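The first rows of the table amount to an escalation ladder: start with the cheapest fetcher and move to a heavier one only when the response looks blocked. A minimal sketch of that idea follows, assuming each strategy is a callable that takes a URL and returns an object with a `status` attribute; `fetch_with_escalation` is a hypothetical helper, not part of Scrapling.

```python
from typing import Any, Callable, Iterable


def fetch_with_escalation(
    url: str,
    strategies: Iterable[tuple[str, Callable[[str], Any]]],
    blocked_statuses: tuple[int, ...] = (403, 429),
) -> tuple[str, Any]:
    """Try each (name, fetch) pair in order until one is not blocked.

    Returns the name of the last strategy tried and the page it produced.
    """
    ordered = list(strategies)
    if not ordered:
        raise ValueError("at least one strategy is required")
    for name, fetch in ordered:
        page = fetch(url)
        if getattr(page, "status", None) not in blocked_statuses:
            break  # good enough: stop escalating
    return name, page
```

The strategies could be, in order, `Fetcher.get`, a lambda around `StealthyFetcher.fetch`, and a lambda around `DynamicFetcher.fetch`. Captcha and auth-required cases are intentionally not part of the ladder, in line with the last two rows of the table.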
Skill Graph
Related skills:
- [[content-research]] - Research workflow
- [[blogwatcher]] - RSS/feed monitoring
- [[youtube-watcher]] - Video content
- [[chirp]] - Twitter/X interactions
- [[newsletter-digest]] - Content summarization
- [[x-tweet-fetcher]] - X/Twitter (use instead of Scrapling)
Changelog
v1.0.8 (2026-02-25)
- Added: Firecrawl-Style Crawl - Combines sitemap discovery + link following
- Added: use_sitemap parameter - Matches Firecrawl's sitemap:"include"/"skip" behavior
- Verified: cloudflare.com returns 2,447 URLs from sitemap!
v1.0.7 (2026-02-25)
- Fixed: EasyCrawl Spider syntax - Updated to work with scrapling's actual Spider API
- Verified: Spider crawling works - Tested and crawled 20+ pages from example.com
v1.0.6 (2026-02-25)
- Added: Easy Site Crawl - Auto-crawl all pages on a domain with EasyCrawl spider
- Added: Sitemap Crawl - Extract URLs from sitemap.xml and crawl them
- Feature parity with Firecrawl for site crawling capabilities
v1.0.5 (2026-02-25)
- Enhanced: API Reverse Engineering methodology
- Detailed step-by-step process from @paoloanzn's work
- Real Solscan case study with exact timeline
- Added: Step-by-step methodology section
- Added: Real example documentation (Solscan March 2025 vs Feb 2026)
- Added: Discovery checklist with 10 steps
- Documented: How to find auth headers in JS files
- Documented: Token generation pattern extraction
- Updated: Cloudscraper integration with multi-attempt pattern
- Verified: Solscan now patched (Cloudflare on both endpoints)
v1.0.4 (2026-02-25)
- Fixed: Brand Data Extraction API - Corrected selectors for scrapling's Response object
- Fixed .html → .text / INLINECODE14
- Fixed .title() → INLINECODE16
- Fixed .logo img::src → INLINECODE18
- Tested and verified working
v1.0.3 (2026-02-25)
- Added: API Reverse Engineering section
- API Endpoint Discovery (Network tab analysis)
- JavaScript Analysis (finding auth logic)
- Cloudscraper integration for Cloudflare bypass
- Complete APIReplicator class
- Discovery checklist
- Added cloudscraper to installation
v1.0.2 (2026-02-25)
- Synced with upstream GitHub README exactly
- Added Brand Data Extraction section
- Clean, core-only version
v1.0.1 (2026-02-25)
- Synced with original Scrapling GitHub README
Last updated: 2026-02-25
Scrapling - Adaptive Web Scraping
"Effortless web scraping for the modern web."
Credits
Core Library
- Repository: https://github.com/D4Vinci/Scrapling
- Author: D4Vinci (Karim Shoair)
- License: BSD-3-Clause
- Documentation: https://scrapling.readthedocs.io
API Reverse Engineering Methodology
- GitHub: https://github.com/paoloanzn/free-solscan-api
- X Post: https://x.com/paoloanzn/status/2026361234032046319
- Author: @paoloanzn
- Insight: "Web scraping is 80% reverse engineering"
Installation
```bash
# Core library (parser only)
pip install scrapling

# With fetchers (HTTP + browser automation) - recommended
pip install "scrapling[fetchers]"
scrapling install

# With shell (CLI tools) - recommended
pip install "scrapling[shell]"

# With AI (MCP server) - optional
pip install "scrapling[ai]"

# Everything
pip install "scrapling[all]"

# Browser for stealthy/dynamic modes
playwright install chromium

# For Cloudflare bypass (advanced)
pip install cloudscraper
```
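Agent Instructions
When to Use Scrapling
Use Scrapling to:
- Research topics from websites
- Extract data from blogs, news sites, docs
- Crawl multiple pages with Spider
- Gather content for summaries
- Extract brand data from any website
- Reverse engineer APIs from websites
Do NOT use for:
- X/Twitter (use x-tweet-fetcher skill)
- Login-protected sites (unless credentials provided)
- Paywalled content (respect robots.txt)
- Sites that prohibit scraping in their TOS
Quick Commands
1. Basic Fetch (Most Common)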
```python
from scrapling.fetchers import Fetcher

page = Fetcher.get('https://example.com')

# Extract content
title = page.css('h1::text').get()
paragraphs = page.css('p::text').getall()
```
2. Stealthy Fetch (Anti-Bot/Cloudflare)
```python
from scrapling.fetchers import StealthyFetcher

StealthyFetcher.adaptive = True
page = StealthyFetcher.fetch('https://example.com', headless=True, solve_cloudflare=True)
```
3. Dynamic Fetch (Full Browser Automation)
```python
from scrapling.fetchers import DynamicFetcher

page = DynamicFetcher.fetch('https://example.com', headless=True, network_idle=True)
```
4. Adaptive Parsing (Survives Design Changes)
```python
from scrapling.fetchers import Fetcher

page = Fetcher.get('https://example.com')

# First scrape - save the selector
items = page.css('.product', auto_save=True)

# Later - if the site changes, relocate the elements with adaptive=True
items = page.css('.product', adaptive=True)
```
5. Spider (Multiple Pages)
```python
from scrapling.spiders import Spider, Response


class MySpider(Spider):
    name = 'demo'
    start_urls = ['https://example.com']
    concurrent_requests = 3

    async def parse(self, response: Response):
        for item in response.css('.item'):
            yield {'item': item.css('h2::text').get()}
        # Follow links
        next_page = response.css('.next a')
        if next_page:
            yield response.follow(next_page[0].attrib['href'])


MySpider().start()
```
6. CLI Usage
```bash
# Simple fetch to a file
scrapling extract get https://example.com content.html

# Stealthy fetch (bypass anti-bot)
scrapling extract stealthy-fetch https://example.com content.html

# Interactive shell
scrapling shell https://example.com
```
Common Patterns
Extract Article Content
```python
from scrapling.fetchers import Fetcher

page = Fetcher.get('https://example.com/article')

# Try several selectors for the title
title = (
    page.css('[itemprop="headline"]::text').get() or
    page.css('article h1::text').get() or
    page.css('h1::text').get()
)

# Get the paragraphs
content = page.css('article p::text, .article-body p::text').getall()

print(f'Title: {title}')
print(f'Paragraphs: {len(content)}')
```
Research Multiple Pages
```python
from scrapling.spiders import Spider, Response


class ResearchSpider(Spider):
    name = 'research'
    start_urls = ['https://news.ycombinator.com']
    concurrent_requests = 5

    async def parse(self, response: Response):
        for item in response.css('.titleline a::text').getall()[:10]:
            yield {'title': item, 'source': 'HN'}
        more = response.css('.morelink::attr(href)').get()
        if more:
            yield response.follow(more)


ResearchSpider().start()
```
Crawl Entire Site (Easy Mode)
Auto-crawl all pages on a domain by following internal links:
```python
from scrapling.spiders import Spider, Response
from urllib.parse import urljoin, urlparse


class EasyCrawl(Spider):
    """Auto-crawl all pages on a domain."""
    name = 'easy_crawl'
    start_urls = ['https://example.com']
    concurrent_requests = 3

    def __init__(self):
        super().__init__()
        self.visited = set()

    async def parse(self, response: Response):
        # Extract content
        yield {
            'url': response.url,
            'title': response.css('title::text').get(),
            'h1': response.css('h1::text').get(),
        }
        # Follow internal links (limited to 50 pages)
        if len(self.visited) >= 50:
            return
        self.visited.add(response.url)
        domain = urlparse(response.url).netloc
        links = response.css('a::attr(href)').getall()[:20]
        for link in links:
            full_url = urljoin(response.url, link)
            # Skip external hosts and pages that were already visited
            if urlparse(full_url).netloc == domain and full_url not in self.visited:
                yield response.follow(full_url)


# Usage
result = EasyCrawl()
result.start()
```
Sitemap Crawl
Crawl pages from sitemap.xml (with fallback to link discovery):
```python
from scrapling.fetchers import Fetcher
from scrapling.spiders import Spider, Response
from urllib.parse import urljoin, urlparse
import re


def get_sitemap_urls(url: str, max_urls: int = 100) -> list:
    """Extract URLs from sitemap.xml, also checking robots.txt."""
    parsed = urlparse(url)
    base_url = f'{parsed.scheme}://{parsed.netloc}'
    sitemap_urls = [
        f'{base_url}/sitemap.xml',
        f'{base_url}/sitemap-index.xml',
        f'{base_url}/sitemap_index.xml',
        f'{base_url}/sitemap-news.xml',
    ]
    all_urls = []
    # Check robots.txt for sitemap URLs first
    try:
        robots = Fetcher.get(f'{base_url}/robots.txt')
        if robots.status == 200:
            sitemap_in_robots = re.findall(r'Sitemap:\s*(\S+)', robots.text, re.IGNORECASE)
            for sm in sitemap_in_robots:
                sitemap_urls.insert(0, sm)
    except Exception:
        pass
    # Try each sitemap location
    for sitemap_url in sitemap_urls:
        try:
            page = Fetcher.get(sitemap_url, timeout=10)
            if page.status != 200:
                continue
            text = page.text
            # Check whether it is XML (the original condition was lost in
            # extraction; testing for a <loc> tag is an assumption)
            if '<loc>' in text:
                urls = re.findall(r'<loc>([^<]+)</loc>', text)
                all_urls.extend(urls[:max_urls])
                print(f'Found {len(urls)} URLs in {sitemap_url}')
        except Exception:
            continue
    return list(set(all_urls))[:max_urls]
```
```python
def crawl_from_sitemap(domain_url: str, max_pages: int = 50):
    """Crawl pages from the sitemap."""
    print(f'Fetching sitemap for {domain_url}...')
    urls = get_sitemap_urls(domain_url
```