XCrawl Scrape
Overview
This skill handles single-page extraction with XCrawl Scrape APIs.
Default behavior is raw passthrough: return upstream API response bodies as-is.
Required Local Config
Before using this skill, the user must create a local config file and write XCRAWL_API_KEY into it.
Path: `~/.xcrawl/config.json`
```json
{
  "XCRAWL_API_KEY": "<your_api_key>"
}
```
Read the API key from the local config file only. Do not require global environment variables.
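As a minimal sketch of the config lookup, the helper below reads `XCRAWL_API_KEY` from `~/.xcrawl/config.json` and fails with a readable message when the file or the key is missing. The function name `readApiKey` and its error messages are hypothetical; only the path and the key name come from this document.

```javascript
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");

// Config location named in this document: ~/.xcrawl/config.json
const DEFAULT_CONFIG_PATH = path.join(os.homedir(), ".xcrawl", "config.json");

// Hypothetical helper: returns the trimmed key or throws a descriptive error.
// It never falls back to environment variables, matching the rule above.
function readApiKey(configPath = DEFAULT_CONFIG_PATH) {
  let raw;
  try {
    raw = fs.readFileSync(configPath, "utf8");
  } catch (err) {
    throw new Error(`Cannot read XCrawl config at ${configPath}: ${err.message}`);
  }
  const key = JSON.parse(raw).XCRAWL_API_KEY;
  if (typeof key !== "string" || key.trim() === "") {
    throw new Error(`XCRAWL_API_KEY is missing or empty in ${configPath}`);
  }
  return key.trim();
}
```

Calling `readApiKey()` with no argument uses the default path; passing a path makes the helper easy to exercise against a temporary file.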
Credits and Account Setup
Using XCrawl APIs consumes credits.
If the user does not have an account or available credits, guide them to register at https://dash.xcrawl.com/.
After registration, they can activate the free 1,000-credit plan before running requests.
Tool Permission Policy
Request runtime permissions for curl and node only.
Do not request Python, shell helper scripts, or other runtime permissions.
API Surface
- Start scrape: `POST /v1/scrape`
- Read async result: `GET /v1/scrape/{scrape_id}`
- Base URL: `https://run.xcrawl.com`
- Required header: `Authorization: Bearer <XCRAWL_API_KEY>`
Usage Examples
cURL (sync)
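```bash
# Read the API key from the local config file (no environment variable needed).
API_KEY=$(node -e 'const fs=require("fs");const p=process.env.HOME+"/.xcrawl/config.json";const k=JSON.parse(fs.readFileSync(p,"utf8")).XCRAWL_API_KEY||"";process.stdout.write(k)')

curl -sS -X POST "https://run.xcrawl.com/v1/scrape" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${API_KEY}" \
  -d '{"url":"https://example.com","mode":"sync","output":{"formats":["markdown","links"]}}'
```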
cURL (async create + result)
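```bash
# Read the API key from the local config file.
API_KEY=$(node -e 'const fs=require("fs");const p=process.env.HOME+"/.xcrawl/config.json";const k=JSON.parse(fs.readFileSync(p,"utf8")).XCRAWL_API_KEY||"";process.stdout.write(k)')

# Create the async task and keep the raw response.
CREATE_RESP=$(curl -sS -X POST "https://run.xcrawl.com/v1/scrape" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${API_KEY}" \
  -d '{"url":"https://example.com/product/1","mode":"async","output":{"formats":["json"]},"json":{"prompt":"Extract the title and price."}}')

echo "$CREATE_RESP"

# Pull scrape_id out of the create response, then read the result.
SCRAPE_ID=$(node -e 'const s=process.argv[1];const j=JSON.parse(s);process.stdout.write(j.scrape_id||"")' "$CREATE_RESP")

curl -sS -X GET "https://run.xcrawl.com/v1/scrape/${SCRAPE_ID}" \
  -H "Authorization: Bearer ${API_KEY}"
```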
Node
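```bash
node -e '
const fs=require("fs");
// Read the API key from the local config file.
const apiKey=JSON.parse(fs.readFileSync(process.env.HOME+"/.xcrawl/config.json","utf8")).XCRAWL_API_KEY;
const body={url:"https://example.com",mode:"sync",output:{formats:["markdown","json"]},json:{prompt:"Extract the title and publish date."}};
fetch("https://run.xcrawl.com/v1/scrape",{
  method:"POST",
  headers:{"Content-Type":"application/json",Authorization:`Bearer ${apiKey}`},
  body:JSON.stringify(body)
}).then(async r=>{console.log(await r.text());});
'
```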
Request Parameters
Request endpoint and headers
- Endpoint: `POST https://run.xcrawl.com/v1/scrape`
- Headers:
  - `Content-Type: application/json`
  - `Authorization: Bearer <XCRAWL_API_KEY>`
Request body: top-level fields
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `url` | string | Yes | - | Target URL |
| `mode` | string | No | `sync` | `sync` or `async` |
| `proxy` | object | No | - | Proxy config |
| `request` | object | No | - | Request config |
| `js_render` | object | No | - | JS rendering config |
| `output` | object | No | - | Output config |
| `webhook` | object | No | - | Async webhook config (`mode=async`) |
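To illustrate how a request body might be assembled from these top-level fields, here is a minimal sketch. The helper name `buildScrapeBody` is hypothetical, and dropping empty sections locally is an assumption made for tidiness, not documented API behavior. Only `url` being required and `mode` accepting `sync` or `async` come from the table.

```javascript
// Hypothetical helper: validates url and mode, then copies over every
// section the caller actually filled in, skipping empty placeholders.
function buildScrapeBody({ url, mode = "sync", ...sections }) {
  if (typeof url !== "string" || url.trim() === "") {
    throw new TypeError("url is required");
  }
  if (mode !== "sync" && mode !== "async") {
    throw new RangeError(`mode must be "sync" or "async", got "${mode}"`);
  }
  const body = { url, mode };
  for (const [name, value] of Object.entries(sections)) {
    const isEmptyObject =
      value !== null &&
      typeof value === "object" &&
      !Array.isArray(value) &&
      Object.keys(value).length === 0;
    if (value === undefined || value === null || isEmptyObject) continue;
    body[name] = value;
  }
  return body;
}

console.log(
  JSON.stringify(
    buildScrapeBody({
      url: "https://example.com",
      proxy: {}, // empty section, dropped from the body
      output: { formats: ["markdown", "links"] },
    })
  )
);
// prints {"url":"https://example.com","mode":"sync","output":{"formats":["markdown","links"]}}
```

Making `mode` explicit in the body, even when it equals the default, keeps the logged request self-describing.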
proxy
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `location` | string | No | `US` | ISO-3166-1 alpha-2 country code, e.g. `US` / `JP` / `SG` |
| `sticky_session` | string | No | Auto-generated | Sticky session ID; requests with the same ID attempt to reuse the same proxy exit |
request
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `locale` | string | No | `en-US,en;q=0.9` | Affects `Accept-Language` |
| `device` | string | No | `desktop` | `desktop` / `mobile`; affects UA and viewport |
| `cookies` | object map | No | - | Cookie key/value pairs |
| `headers` | object map | No | - | Header key/value pairs |
| `only_main_content` | boolean | No | `true` | Return main content only |
| `block_ads` | boolean | No | `true` | Attempt to block ad resources |
| `skip_tls_verification` | boolean | No | `true` | Skip TLS verification |
js_render
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `enabled` | boolean | No | `true` | Enable browser rendering |
| `wait_until` | string | No | `load` | `load` / `domcontentloaded` / `networkidle` |
| `viewport.width` | integer | No | - | Viewport width (desktop `1920`, mobile `402`) |
| `viewport.height` | integer | No | - | Viewport height (desktop `1080`, mobile `874`) |
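The viewport sizes in the table can be mirrored locally when a caller wants to override only one dimension. The sketch below assumes the desktop and mobile values in the table act as per-device defaults. The helper name `resolveViewport` is hypothetical.

```javascript
// Per-device viewport sizes as listed in the js_render table.
const DEFAULT_VIEWPORTS = {
  desktop: { width: 1920, height: 1080 },
  mobile: { width: 402, height: 874 },
};

// Hypothetical helper: merges a partial override onto the device's size.
function resolveViewport(device = "desktop", override = {}) {
  const base = DEFAULT_VIEWPORTS[device];
  if (!base) {
    throw new RangeError(`device must be "desktop" or "mobile", got "${device}"`);
  }
  return { ...base, ...override };
}

console.log(resolveViewport("mobile")); // { width: 402, height: 874 }
console.log(resolveViewport("desktop", { width: 1280 })); // { width: 1280, height: 1080 }
```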
output
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `formats` | string[] | No | `["markdown"]` | Output formats |
| `screenshot` | string | No | `viewport` | `full_page` / `viewport` (only if `formats` includes `screenshot`) |
| `json.prompt` | string | No | - | Extraction prompt |
| `json.json_schema` | object | No | - | JSON Schema |
`output.formats` enum:
- `html`
- `raw_html`
- `markdown`
- `links`
- `summary`
- `screenshot`
- `json`
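A local pre-flight check can catch unsupported formats before a request spends credits. The sketch below is an assumption about how such a check might look: the helper name `checkOutput` and its messages are hypothetical, while the format list (`html`, `raw_html`, `markdown`, `links`, `summary`, `screenshot`, `json`), the `["markdown"]` default, and the screenshot rule are taken from this document.

```javascript
const OUTPUT_FORMATS = ["html", "raw_html", "markdown", "links", "summary", "screenshot", "json"];
const SCREENSHOT_MODES = ["full_page", "viewport"];

// Hypothetical helper: returns a list of problems; an empty list means
// the output section is consistent with the tables in this document.
function checkOutput(output = {}) {
  const problems = [];
  const formats = output.formats ?? ["markdown"]; // documented default
  for (const format of formats) {
    if (!OUTPUT_FORMATS.includes(format)) {
      problems.push(`unsupported format: ${format}`);
    }
  }
  if (output.screenshot !== undefined) {
    if (!SCREENSHOT_MODES.includes(output.screenshot)) {
      problems.push(`unsupported screenshot mode: ${output.screenshot}`);
    }
    if (!formats.includes("screenshot")) {
      problems.push('screenshot is set but formats does not include "screenshot"');
    }
  }
  return problems;
}

console.log(checkOutput({ formats: ["markdown", "pdf"] })); // [ 'unsupported format: pdf' ]
```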
webhook
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `url` | string | No | - | Callback URL |
| `headers` | object map | No | - | Custom callback headers |
| `events` | string[] | No | `["started","completed","failed"]` | Events: `started` / `completed` / `failed` |
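For async requests, the webhook section might be assembled as in the sketch below. The helper name `buildWebhook`, the http(s) check, and the callback host `hooks.example.org` are illustrative assumptions. The field names and the three event names come from the table.

```javascript
const WEBHOOK_EVENTS = ["started", "completed", "failed"];

// Hypothetical helper: validates the callback URL and event names,
// then returns the webhook section of the request body.
function buildWebhook(url, { events = WEBHOOK_EVENTS, headers } = {}) {
  const parsed = new URL(url); // throws TypeError on a malformed URL
  if (parsed.protocol !== "https:" && parsed.protocol !== "http:") {
    throw new RangeError(`webhook url must be http(s), got ${parsed.protocol}`);
  }
  const unknown = events.filter((event) => !WEBHOOK_EVENTS.includes(event));
  if (unknown.length > 0) {
    throw new RangeError(`unsupported webhook events: ${unknown.join(", ")}`);
  }
  const webhook = { url, events: [...events] };
  if (headers && Object.keys(headers).length > 0) {
    webhook.headers = { ...headers };
  }
  return webhook;
}

// Example async request body; the webhook only applies when mode is "async".
const asyncBody = {
  url: "https://example.com/product/1",
  mode: "async",
  output: { formats: ["markdown"] },
  webhook: buildWebhook("https://hooks.example.org/xcrawl", {
    events: ["completed", "failed"],
  }),
};
console.log(JSON.stringify(asyncBody, null, 2));
```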
Response Parameters
Sync create response (mode=sync)
| Field | Type | Description |
|---|---|---|
| `scrape_id` | string | Task ID |
| `endpoint` | string | Always `scrape` |
| `version` | string | Version |
| `status` | string | `completed` / `failed` |
| `url` | string | Target URL |
| `data` | object | Result data |
| `started_at` | string | Start time (ISO 8601) |
| `ended_at` | string | End time (ISO 8601) |
| `total_credits_used` | integer | Total credits used |
`data` fields (based on `output.formats`):
- `html`, `raw_html`, `markdown`, `links`, `summary`, `screenshot`, `json`
- `metadata` (page metadata)
- INLINECODE111
- INLINECODE112
- INLINECODE113
INLINECODE114 fields:
| Field | Type | Description |
|---|---|---|
| INLINECODE115 | integer | Base scrape cost |
| INLINECODE116 | integer | Traffic cost |
| `json_extract_cost` | integer | JSON extraction cost |
Async create response (mode=async)
| Field | Type | Description |
|---|---|---|
| `scrape_id` | string | Task ID |
| `endpoint` | string | Always `scrape` |
| `version` | string | Version |
| `status` | string | Always `pending` |
Async result response (GET /v1/scrape/{scrape_id})
| Field | Type | Description |
|---|---|---|
| `scrape_id` | string | Task ID |
| `endpoint` | string | Always `scrape` |
| `version` | string | Version |
| `status` | string | `pending` / `crawling` / `completed` / `failed` |
| `url` | string | Target URL |
| `data` | object | Same shape as sync `data` |
| `started_at` | string | Start time (ISO 8601) |
| `ended_at` | string | End time (ISO 8601) |
Workflow
1. Restate the user goal as an extraction contract.
   - URL scope, required fields, accepted nulls, and precision expectations.
2. Build the scrape request body.
   - Keep only necessary options.
   - Prefer explicit `output.formats`.
3. Execute the scrape and capture task metadata.
   - Track `scrape_id`, `status`, and timestamps.
   - If async, poll until `completed` or `failed`.
4. Return raw API responses directly.
   - Do not synthesize or compress fields by default.
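The polling step in the workflow above could be sketched as follows. The names `pollScrape`, `isTerminal`, and `getScrape`, the 2-second interval, and the 30-attempt cap are all assumptions, since this document does not prescribe a polling cadence. The terminal statuses `completed` and `failed` and the result endpoint come from the tables above. The sketch assumes Node 18+ for the global `fetch`.

```javascript
const TERMINAL_STATUSES = new Set(["completed", "failed"]);

// True once the task no longer needs polling.
function isTerminal(status) {
  return TERMINAL_STATUSES.has(status);
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: calls getResult() until the status is terminal.
// getResult is injected so the loop can be tested without network access.
async function pollScrape(getResult, { intervalMs = 2000, maxAttempts = 30 } = {}) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const result = await getResult();
    if (isTerminal(result.status)) return result;
    if (attempt < maxAttempts) await sleep(intervalMs);
  }
  throw new Error(`scrape still not finished after ${maxAttempts} polls`);
}

// Hypothetical fetcher for GET /v1/scrape/{scrape_id}; defined but not called here.
async function getScrape(apiKey, scrapeId) {
  const res = await fetch(
    `https://run.xcrawl.com/v1/scrape/${encodeURIComponent(scrapeId)}`,
    { headers: { Authorization: `Bearer ${apiKey}` } }
  );
  return res.json();
}
```

A caller would combine the two as `pollScrape(() => getScrape(apiKey, scrapeId))`. Because the skill returns raw responses, a real implementation would probably keep the response text alongside the parsed status rather than only the parsed object.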
Output Contract
Return:
- Endpoint(s) used and mode (`sync` or `async`)
- INLINECODE147 used for the request
- Raw response body from each API call
- Error details when a request fails
Do not generate summaries unless the user explicitly requests a summary.
Guardrails
- Do not invent unsupported output fields.
- Do not hardcode provider-specific tool schemas in core logic.
- Call out uncertainty when page structure is unstable.
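XCrawl Scrape
Overview
This skill handles single-page extraction with XCrawl Scrape APIs.
Default behavior is raw passthrough: return upstream API response bodies as-is.
Required Local Config
Before using this skill, the user must create a local config file and write `XCRAWL_API_KEY` into it.
Path: `~/.xcrawl/config.json`
```json
{
  "XCRAWL_API_KEY": "<your_api_key>"
}
```
Read the API key from the local config file only. Do not require global environment variables.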
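Credits and Account Setup
Using XCrawl APIs consumes credits.
If the user does not have an account or available credits, guide them to register at https://dash.xcrawl.com/.
After registration, they can activate the free 1,000-credit plan before running requests.
Tool Permission Policy
Request runtime permissions for curl and node only.
Do not request Python, shell helper scripts, or other runtime permissions.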
API Surface
- Start scrape: `POST /v1/scrape`
- Read async result: `GET /v1/scrape/{scrape_id}`
- Base URL: `https://run.xcrawl.com`
- Required header: `Authorization: Bearer <XCRAWL_API_KEY>`
Usage Examples
cURL (sync)
```bash
# Read the API key from the local config file (no environment variable needed).
API_KEY=$(node -e 'const fs=require("fs");const p=process.env.HOME+"/.xcrawl/config.json";const k=JSON.parse(fs.readFileSync(p,"utf8")).XCRAWL_API_KEY||"";process.stdout.write(k)')

curl -sS -X POST "https://run.xcrawl.com/v1/scrape" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${API_KEY}" \
  -d '{"url":"https://example.com","mode":"sync","output":{"formats":["markdown","links"]}}'
```
cURL (async create + result)
```bash
# Read the API key from the local config file.
API_KEY=$(node -e 'const fs=require("fs");const p=process.env.HOME+"/.xcrawl/config.json";const k=JSON.parse(fs.readFileSync(p,"utf8")).XCRAWL_API_KEY||"";process.stdout.write(k)')

# Create the async task and keep the raw response.
CREATE_RESP=$(curl -sS -X POST "https://run.xcrawl.com/v1/scrape" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${API_KEY}" \
  -d '{"url":"https://example.com/product/1","mode":"async","output":{"formats":["json"]},"json":{"prompt":"Extract the title and price."}}')

echo "$CREATE_RESP"

# Pull scrape_id out of the create response, then read the result.
SCRAPE_ID=$(node -e 'const s=process.argv[1];const j=JSON.parse(s);process.stdout.write(j.scrape_id||"")' "$CREATE_RESP")

curl -sS -X GET "https://run.xcrawl.com/v1/scrape/${SCRAPE_ID}" \
  -H "Authorization: Bearer ${API_KEY}"
```
Node
```bash
node -e '
const fs=require("fs");
// Read the API key from the local config file.
const apiKey=JSON.parse(fs.readFileSync(process.env.HOME+"/.xcrawl/config.json","utf8")).XCRAWL_API_KEY;
const body={url:"https://example.com",mode:"sync",output:{formats:["markdown","json"]},json:{prompt:"Extract the title and publish date."}};
fetch("https://run.xcrawl.com/v1/scrape",{
  method:"POST",
  headers:{"Content-Type":"application/json",Authorization:`Bearer ${apiKey}`},
  body:JSON.stringify(body)
}).then(async r=>{console.log(await r.text());});
'
```
Request Parameters
Request endpoint and headers
- Endpoint: `POST https://run.xcrawl.com/v1/scrape`
- Headers:
  - `Content-Type: application/json`
  - `Authorization: Bearer <XCRAWL_API_KEY>`
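Request body: top-level fields
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `url` | string | Yes | - | Target URL |
| `mode` | string | No | `sync` | `sync` or `async` |
| `proxy` | object | No | - | Proxy config |
| `request` | object | No | - | Request config |
| `js_render` | object | No | - | JS rendering config |
| `output` | object | No | - | Output config |
| `webhook` | object | No | - | Async webhook config (`mode=async`) |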
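proxy
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `location` | string | No | `US` | ISO-3166-1 alpha-2 country code, e.g. `US` / `JP` / `SG` |
| `sticky_session` | string | No | Auto-generated | Sticky session ID; requests with the same ID attempt to reuse the same proxy exit |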
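request
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `locale` | string | No | `en-US,en;q=0.9` | Affects `Accept-Language` |
| `device` | string | No | `desktop` | `desktop` / `mobile`; affects UA and viewport |
| `cookies` | object map | No | - | Cookie key/value pairs |
| `headers` | object map | No | - | Header key/value pairs |
| `only_main_content` | boolean | No | `true` | Return main content only |
| `block_ads` | boolean | No | `true` | Attempt to block ad resources |
| `skip_tls_verification` | boolean | No | `true` | Skip TLS verification |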
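js_render
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `enabled` | boolean | No | `true` | Enable browser rendering |
| `wait_until` | string | No | `load` | `load` / `domcontentloaded` / `networkidle` |
| `viewport.width` | integer | No | - | Viewport width (desktop `1920`, mobile `402`) |
| `viewport.height` | integer | No | - | Viewport height (desktop `1080`, mobile `874`) |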
output
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `formats` | string[] | No | `["markdown"]` | Output formats |
| `screenshot` | string | No | `viewport` | `full_page` / `viewport` (only if `formats` includes `screenshot`) |
| `json.prompt` | string | No | - | Extraction prompt |
| `json.json_schema` | object | No | - | JSON Schema |
`output.formats` enum:
- `html`
- `raw_html`
- `markdown`
- `links`
- `summary`
- `screenshot`
- `json`
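webhook
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `url` | string | No | - | Callback URL |
| `headers` | object map | No | - | Custom callback headers |
| `events` | string[] | No | `["started","completed","failed"]` | Events: `started` / `completed` / `failed` |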
Response Parameters
Sync create response (mode=sync)
| Field | Type | Description |
|---|---|---|
| `scrape_id` | string | Task ID |
| `endpoint` | string | Always `scrape` |
| `version` | string | Version |
| `status` | string | `completed` / `failed` |
| `url` | string | Target URL |
| `data` | object | Result data |
| `started_at` | string | Start time (ISO 8601) |
| `ended_at` | string | End time (ISO 8601) |
| `total_credits_used` | integer | Total credits used |
`data` fields (based on `output.formats`):
- `html`, `raw_html`, `markdown`, `links`, `summary`, `screenshot`, `json`
- `metadata` (page metadata)
-