Nutrition Provider R2
This skill is a provider-specific orchestration wrapper around scrapling-official.
Its job is to take the payload that scrapling-official fetched, split each canonical page into individual provider records, and upload those records to Cloudflare R2.
It does not replace scrapling-official as the crawler.
Use it when the target is one of the Vietnam nutritional portal lookup listings and the job is to:
- crawl page by page
- preserve each provider record in raw form
- upload the records from each page to Cloudflare R2 immediately after that page is fetched
Do not normalize provider records into a custom nutrition schema. Preserve provider fields, response bodies, pagination clues, and raw linked payloads exactly as obtained whenever possible.
Required skill
This skill depends on scrapling-official for crawling.
- If scrapling-official is not installed or not set up yet, stop and tell the user to install and configure that skill first.
- Let scrapling-official own crawl execution, endpoint discovery, rendering mode, and fetch escalation.
- Follow scrapling-official's fetch escalation strategy exactly: start with get, then move to fetch if needed, then stealthy-fetch only when the earlier modes fail or protection requires it.
- Do not fall back to a different crawler or browser stack when scrapling-official is missing.
Workflow
1. Read {baseDir}/references/source-notes.md for the default source URL, pagination clues, and stop conditions.
2. Confirm the R2 credentials are present:
   - R2_ACCOUNT_ID
   - R2_ACCESS_KEY_ID
   - R2_SECRET_ACCESS_KEY
   - R2_BUCKET
3. Ask scrapling-official to inspect the provider page and determine which payload actually contains the canonical records for the current request.
4. For this provider, prefer the canonical JSON payload when scrapling-official discovers it, instead of the outer HTML shell.
5. Current observed provider behavior:
   - food lookup page gia-tri-dinh-duong-thuc-pham exposes records from GET /api/fe/foodNatunal/getPageFoodData
   - prepared-dish lookup page gia-tri-dinh-duong-mon-an exposes records from GET /api/fe/tool/getPageFoodData
   - default params observed on page load:
     - foods: page=1&pageSize=15&energy=0
     - prepared dishes: page=1&pageSize=15
   - observed filter params:
     - foods: name, category, energy
     - prepared dishes: at least name and energy, with additional filters visible in the UI such as group and region; let scrapling-official discover the exact live request params
6. Treat the start of each page fetch as the start time for that page's pacing window.
7. Save the raw payload that scrapling-official fetched for that page without normalizing item fields.
8. If scrapling-official can fetch the canonical JSON payload, treat raw.data as the list of provider records for that page.
9. Split that page payload into one record object per item in data.
10. Upload each record object as its own R2 object.
11. Record uploads from the same page may run in parallel, but every record object must use a stable object key so reruns do not create duplicates.
12. Prefer a provider-stable identifier for the key:
    - _id first
    - then code
    - only use another deterministic identifier if neither exists
13. Prefer letting the helper split and upload records directly from the page payload:
    - uv run {baseDir}/scripts/upload_page_to_r2.py --extract-foods --page-index <n> --skip-existing
14. The helper flag name --extract-foods is retained for compatibility, but it may also be used for prepared-dish page payloads because both current source types return data arrays.
15. If the agent already split records outside the helper, it may still upload one item at a time with --food-id.
16. Only capture the outer HTML page as a fallback debugging artifact when scrapling-official cannot reach the canonical payload directly. Do not upload the HTML shell as the primary dataset.
17. Wait for all record uploads from the current page to finish.
18. Measure the total time for the page, starting at page fetch start and covering:
    - the page fetch
    - record extraction
    - all record uploads
19. If the total time for the current page is less than 60 seconds, wait the remaining time before starting the next page.
20. Let scrapling-official handle the actual pagination requests.
21. Use the provider payload itself to decide when to stop:
    - keep paginating while the canonical payload remains non-empty
    - stop when the provider indicates no more rows
    - stop if a next request repeats data already seen
22. Never start the next page before the current page has both:
    - finished all uploads
    - satisfied the 60 second minimum page window
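The pacing and stop rules in the workflow can be sketched roughly as follows. This is a minimal illustration, not the skill's implementation: fetch_page and upload_records are hypothetical callables standing in for the scrapling-official fetch and the upload helper, and the repeat check assumes records carry _id or code.

```python
import time

MIN_PAGE_WINDOW_SECONDS = 60.0  # minimum fetch-plus-upload window per page


def remaining_wait(page_started_at, now, min_window=MIN_PAGE_WINDOW_SECONDS):
    """Seconds still to wait before the next page may start."""
    elapsed = now - page_started_at
    return max(0.0, min_window - elapsed)


def should_stop(records, seen_ids):
    """Stop on an empty page or on a page that only repeats records already seen."""
    if not records:
        return True
    page_ids = {r.get("_id") or r.get("code") for r in records}
    return page_ids <= seen_ids


def crawl(fetch_page, upload_records, sleep=time.sleep, clock=time.monotonic):
    """Page-sequential loop; fetch_page and upload_records are hypothetical callables."""
    seen_ids = set()
    page_index = 1
    while True:
        started = clock()                      # the pacing window starts at fetch start
        payload = fetch_page(page_index)       # delegated to scrapling-official
        records = payload.get("data") or []
        if should_stop(records, seen_ids):
            break
        upload_records(page_index, records)    # must return only after all uploads finish
        seen_ids |= {r.get("_id") or r.get("code") for r in records}
        sleep(remaining_wait(started, clock()))
        page_index += 1
    return page_index - 1                      # number of pages processed
```

Injecting sleep and clock keeps the loop testable without real waiting; a real run would use the defaults.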
Operating Rules
- Preserve provider data as-is. Do not rewrite field names, flatten structures, or infer a nutrition schema.
- Allow lightweight wrapper metadata only outside the raw payload, such as source_url, fetched_at, page_index, content_type, and storage_key.
- Upload one object per provider record, not one object for the whole page payload.
- Stop naturally when pagination ends. Do not invent more pages.
- scrapling-official is responsible for extracting or fetching the correct provider payload.
- Prefer the provider JSON API response over rendered HTML whenever scrapling-official can access both.
- Do not store the page shell HTML as the primary page payload when the JSON payload already contains the canonical rows and nutrition arrays.
- Record uploads from the same page may be concurrent.
- Use stable R2 object keys so duplicate runs overwrite or skip the same object instead of creating duplicates.
- Finish all record uploads for the current page before page N+1 begins.
- Enforce a minimum 60 second crawl-plus-upload window per page to avoid overloading the provider.
- If scrapling-official fetches JSON from an XHR endpoint, store that JSON body unchanged.
- If HTML is captured for debugging, store it separately from the canonical payload and do not treat it as the canonical dataset.
- If a page fails, retry briefly. If it still fails, upload a failure record only when the caller explicitly wants failure capture.
Concurrency
Use page-sequential crawling with record-level upload concurrency.
- Exactly one page in flight at a time.
- Records from the same page may upload in parallel.
- Do not start page N+1 until page N has finished all uploads.
- Enforce a minimum total duration of 60 seconds for each page, measured from the start of fetch to the completion of all uploads and any required remaining wait.
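One way to get record-level concurrency while keeping pages strictly sequential is a thread pool that is fully drained before the caller moves on. The sketch below assumes a hypothetical upload_one callable that uploads a single record and returns its object key; the worker count of 8 is an arbitrary choice, not a value from this skill.

```python
from concurrent.futures import ThreadPoolExecutor


def upload_page_records(records, upload_one, max_workers=8):
    """Upload every record of one page in parallel; return only when all are done.

    upload_one is a hypothetical callable: one record in, its object key out.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list(...) drains the result iterator, so any upload error is raised
        # here, before the caller can start the next page.
        return list(pool.map(upload_one, records))
```

Because the function blocks until the pool is drained, calling it once per page is enough to guarantee that page N+1 never starts while page N still has uploads in flight.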
R2 Settings
Required environment variables:
- R2_ACCOUNT_ID
- R2_ACCESS_KEY_ID
- R2_SECRET_ACCESS_KEY
- R2_BUCKET
Optional environment variables used by the helper when --key is not passed:
- R2_PREFIX default raw
- SOURCE_NAME default nutrition-provider
- RUN_ID default current UTC timestamp in YYYY-MM-DDTHH-MM-SSZ
When supporting both provider sources, do not reuse the same storage namespace for both in the same crawl run.
- prefer SOURCE_NAME=viendinhduong-foods for gia-tri-dinh-duong-thuc-pham
- prefer SOURCE_NAME=viendinhduong-dishes for gia-tri-dinh-duong-mon-an
- or pass --source-name explicitly per crawl job
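A pre-flight check for these settings might look like the sketch below. It is an illustration only: the function names are hypothetical, treating an empty value as missing is an assumption, and the defaults used are raw for R2_PREFIX, nutrition-provider for SOURCE_NAME, and a UTC timestamp in YYYY-MM-DDTHH-MM-SSZ form for RUN_ID.

```python
import os
from datetime import datetime, timezone

REQUIRED_VARS = ("R2_ACCOUNT_ID", "R2_ACCESS_KEY_ID", "R2_SECRET_ACCESS_KEY", "R2_BUCKET")


def missing_credentials(env=os.environ):
    """Return the required variable names that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


def resolve_namespace(env=os.environ, now=None):
    """Resolve the optional storage settings, applying defaults when unset."""
    now = now or datetime.now(timezone.utc)
    return {
        "prefix": env.get("R2_PREFIX", "raw"),
        "source_name": env.get("SOURCE_NAME", "nutrition-provider"),
        "run_id": env.get("RUN_ID", now.strftime("%Y-%m-%dT%H-%M-%SZ")),
    }
```

Running a foods crawl and a dishes crawl with different SOURCE_NAME values gives each its own namespace, which is the separation the rule above asks for.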
Recommended Output Shape
Wrap the provider payload with minimal crawl metadata only when needed for storage traceability:
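```json
{
  "source_url": "https://viendinhduong.vn/api/fe/foodNatunal/getPageFoodData?page=1&pageSize=15&energy=0",
  "page_index": 1,
  "fetched_at": "2026-03-15T10:00:00Z",
  "content_type": "application/json",
  "raw": {
    "data": [],
    "current_page": 1,
    "per_page": 15,
    "total": 853
  }
}
```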
The foods endpoint currently returns page-level JSON with top-level keys data, current_page, per_page, and total. Each food item currently includes _id, code, name_vi, name_en, category, categoryEn, nutrition, and energy.
The prepared-dish endpoint currently returns page-level JSON with top-level keys current_page, data, first_page_url, from, last_page, last_page_url, links, next_page_url, path, per_page, prev_page_url, to, and total. Each dish item currently includes _id, category_id, code, description, dish_components, food_area_id, image, name_vi, name_en, nutritional_components, total_energy, category_name, category_name_en, and category_description.
Use those richer raw objects only as the source page payloads to split into per-record uploads.
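The top-level keys listed above can also serve as the "provider indicates no more rows" signal. The following is a hedged sketch: has_more_rows is a hypothetical helper, and it assumes that next_page_url is null on the last page of the prepared-dish endpoint and that current_page is 1-based with total counting rows on the foods endpoint. Neither assumption is confirmed by the provider.

```python
def has_more_rows(payload):
    """Guess from one raw page payload whether another page is worth requesting."""
    records = payload.get("data") or []
    if not records:
        return False  # an empty page always ends the crawl
    # Prepared-dish payloads carry next_page_url; assumed to be None on the last page.
    if "next_page_url" in payload:
        return payload["next_page_url"] is not None
    # Foods payloads carry current_page, per_page, and total.
    keys = ("current_page", "per_page", "total")
    if all(isinstance(payload.get(k), int) for k in keys):
        return payload["current_page"] * payload["per_page"] < payload["total"]
    # No usable pagination hints: continue and rely on the empty-page check.
    return True
```

With per_page=15 and total=853, for example, this returns False on page 57, since 57 × 15 = 855 already covers every row.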
Recommended per-record upload shape:
CODEBLOCK1
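Splitting a wrapped page payload into per-record objects could be sketched as below. The wrapper fields reuse the metadata names from Operating Rules; record_id is a hypothetical field name, and the SHA-256 fallback is just one possible "other deterministic identifier", not something this skill prescribes.

```python
import hashlib
import json


def record_id(record):
    """Pick a stable identifier: _id first, then code, then a content hash."""
    for field in ("_id", "code"):
        value = record.get(field)
        if value not in (None, ""):
            return str(value)
    # Deterministic fallback (an assumption): hash of the canonical JSON form.
    canonical = json.dumps(record, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def split_page(page):
    """Turn one wrapped page payload into one wrapper per provider record."""
    wrappers = []
    for record in page["raw"].get("data") or []:
        wrappers.append({
            "source_url": page["source_url"],
            "page_index": page["page_index"],
            "fetched_at": page["fetched_at"],
            "content_type": page["content_type"],
            "record_id": record_id(record),  # hypothetical field name
            "raw": record,                   # provider record, left untouched
        })
    return wrappers
```

The provider record is stored under raw without modification, so field names and nested nutrition arrays survive exactly as fetched.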
Upload Helper
Use uv run {baseDir}/scripts/upload_page_to_r2.py.
The helper supports two modes:
- explicit key mode with --key
- generated key mode from R2_PREFIX, SOURCE_NAME, RUN_ID, --page-index, and optional INLINECODE114
Generated keys follow this layout:
- record payload success: INLINECODE115
- debug or failure artifact: INLINECODE116
For this skill, per-record upload is the default and expected mode.
- prefer --extract-foods when the input is a full canonical page JSON payload from either supported source
- pass a stable --food-id only when uploading a single already-split record object
- prefer --skip-existing when reruns are possible
- do not upload a whole canonical page as one object unless you are intentionally storing a debug or failure artifact
Examples:
CODEBLOCK2
CODEBLOCK3
CODEBLOCK4
CODEBLOCK5
Only for debug or failure capture:
CODEBLOCK6
Source Notes
For this provider target, use {baseDir}/references/source-notes.md.