Crawler
Web crawling and scraping reference — robots.txt protocol, Scrapy framework, anti-bot detection, headless browsers, and legal considerations. No API keys or credentials required — outputs reference documentation only.
Commands
| Command | Description |
|---|---|
| `intro` | Crawling vs scraping, robots.txt, sitemap |
| `standards` | HTTP caching, structured data, meta tags |
| `troubleshooting` | Anti-bot detection, JS rendering, encoding |
| `performance` | Concurrency, dedup, incremental, distributed |
| `security` | Legal landscape, ethical guidelines, proxies |
| `migration` | BeautifulSoup to Scrapy, requests to Playwright |
| `cheatsheet` | Scrapy commands, CSS/XPath, curl, user-agents |
| `faq` | Legality, JS pages, blocking, storage |
Output Format
All commands output plain-text reference documentation via heredoc. No external API calls, no credentials needed, no network access.
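As a rough illustration of the heredoc approach described above, the sketch below shows how a command name could be mapped to a block of static text. The function name `crawler_doc` and the documentation text inside each heredoc are hypothetical placeholders, not the skill's actual script or output; only the command names `intro` and `faq` come from the table.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: the function name and the doc text are illustrative
# assumptions, not the script shipped with this skill.

crawler_doc() {
  case "$1" in
    intro)
      # A quoted delimiter ('EOF') disables variable and command expansion,
      # so the text is printed exactly as written. The terminator must sit
      # at the start of its line.
      cat <<'EOF'
Crawling discovers URLs; scraping extracts data from pages.
Check robots.txt before fetching.
EOF
      ;;
    faq)
      cat <<'EOF'
Q: How do I handle JS-rendered pages?
A: Use a headless browser.
EOF
      ;;
    *)
      # Unknown command: print usage to stderr and signal failure.
      echo "Usage: crawler_doc {intro|faq}" >&2
      return 1
      ;;
  esac
}
```

Calling `crawler_doc intro` prints the two placeholder lines for that topic. Because every branch only runs `cat` on inline text, the function makes no network calls and reads no credentials, which matches the behavior stated above.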
Powered by BytesAgain | bytesagain.com | hello@bytesagain.com