A Share Site Crawl

Use this skill to collect public A-share information from the five target sites and to convert raw site access into repeatable summary-ready records.

Read Order

Always read these first:

- references/sites.md
- references/workflow.md

Read these in addition when the task involves formal collection, normalization, or recurring jobs:

- references/entrypoints.md
- references/fields.md
- references/risks.md

Use references/entrypoints.md for fixed site entry pages, verification status, cron priorities, and default crawl mode.

Use references/fields.md for the normalized schema, source tiering, credibility, opinion-risk handling, content typing, cron retention, time normalization, ticker normalization, and dedup rules.

Use references/risks.md for P0/P1/P2 risks, recognition signals, and downgrade or mitigation decisions.

Core Rule

Prefer browser for page truth and web_fetch for cheap probing.

- Use web_fetch first when the site is known to have stable public text pages
- Use browser first when the site is dynamic, disclosure-driven, or clearly stronger in rendered form
- If both fail, report the site as restricted or missing instead of pretending it was covered
- Do not treat anti-bot code, disclaimers, shells, or login walls as usable content
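
The tool-selection rule above can be sketched as a small fallback routine. This is a minimal illustration under stated assumptions, not the skill's actual implementation: collect, STABLE_TEXT_SITES, and the callable signatures are hypothetical, and each callable is assumed to return None when its result is unusable (anti-bot page, shell, disclaimer, or login wall).

```python
from typing import Callable, Optional

# Hypothetical set of sites known to serve stable public text pages.
STABLE_TEXT_SITES = {"东方财富"}

def collect(site: str, url: str,
            web_fetch: Callable[[str], Optional[str]],
            browser: Callable[[str], Optional[str]]) -> dict:
    """Try the preferred tool first, then the other; never fake coverage."""
    order = [("web_fetch", web_fetch), ("browser", browser)]
    if site not in STABLE_TEXT_SITES:
        order.reverse()  # dynamic or disclosure-driven sites go browser first
    for name, tool in order:
        text = tool(url)  # assumed to return None for unusable payloads
        if text:
            return {"site": site, "tool": name, "status": "covered", "text": text}
    # Both tools failed: report the gap instead of pretending coverage.
    return {"site": site, "tool": None, "status": "restricted or missing", "text": None}
```

Passing the two tools in as callables keeps the sketch independent of how web_fetch and browser are actually invoked in a given environment.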

Working Workflow

1. Start from the correct page type

- Prefer fixed entrypoints, list pages, search pages, disclosure pages, telegraph streams, and stock-detail pages
- Do not judge 巨潮资讯 from homepage-only text
- Do not rely on noisy portal homepages when a better inner page exists

2. Probe and classify access

Judge each probe into one of these buckets:

- usable: readable and materially sufficient
- partial: some content is real, but clearly incomplete
- shell-only: mainly navigation, scripts, disclaimers, or boilerplate
- blocked: anti-bot, login wall, or meaningless payload
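
A rough sketch of how a probe result might be sorted into these buckets follows. The bucket names other than usable, the marker lists, and the length threshold are all assumptions made for illustration; real recognition signals belong in references/risks.md.

```python
# Marker lists are illustrative assumptions, not verified site signatures.
BLOCK_MARKERS = ("captcha", "验证码", "请登录", "access denied")
SHELL_MARKERS = ("免责声明", "disclaimer", "<script")

def classify_probe(text: str, min_chars: int = 200, truncated: bool = False) -> str:
    """Return one of: usable, partial, shell-only, blocked (assumed names)."""
    body = (text or "").strip()
    lowered = body.lower()
    if not body or any(marker in lowered for marker in BLOCK_MARKERS):
        return "blocked"
    lines = [ln for ln in body.splitlines() if ln.strip()]
    shell_lines = [ln for ln in lines
                   if any(marker in ln.lower() for marker in SHELL_MARKERS)]
    if len(shell_lines) * 2 >= len(lines):
        return "shell-only"  # at least half of the lines are boilerplate
    if truncated or len(body) < min_chars:
        return "partial"  # real content, but clearly incomplete
    return "usable"
```

The checks run from worst to best outcome, so a page that trips a block marker is never promoted to usable just because it is long.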

3. Choose extraction mode

Use one of these verdicts per site or page:

- fetch-first
- browser-first
- restricted
- unavailable

4. Keep site roles distinct

- 巨潮资讯: official confirmation and disclosure verification
- 东方财富: public aggregation, data-center navigation, and quasi-structured market pages
- 财联社: fast market events and telegraph flow
- 韭研公社: topic logic, timeline, and community clue discovery
- 雪球: sentiment, heat, stock-detail snapshots, and community discussion

5. Normalize before summarizing

When the task is more than a one-off crawl check, convert findings into normalized records using references/fields.md.

Minimum normalization discipline:

- assign source_tier, credibility, content_type, and opinion_risk
- normalize time to Asia/Shanghai when possible
- normalize A-share tickers conservatively
- deduplicate repeated event coverage
- separate confirmed facts from market claims and sentiment
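
The time, ticker, and dedup steps above might look like the following sketch. The authoritative rules live in references/fields.md; everything here is an assumption for illustration, including the ISO-8601 input format, the narrow prefix table, and the choice of (ticker, title) as the dedup key.

```python
import re
from datetime import datetime
from typing import Optional
from zoneinfo import ZoneInfo

SHANGHAI = ZoneInfo("Asia/Shanghai")

# Assumed, deliberately narrow prefix table; unknown prefixes are rejected.
PREFIX_TO_EXCHANGE = {"60": "SH", "68": "SH", "00": "SZ", "30": "SZ"}

def normalize_time(raw: str) -> Optional[str]:
    """Parse an ISO-8601 string; naive values are assumed to be Shanghai time."""
    try:
        dt = datetime.fromisoformat(raw.strip())
    except ValueError:
        return None  # leave unparseable times unset rather than guessing
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=SHANGHAI)
    return dt.astimezone(SHANGHAI).isoformat()

def normalize_ticker(raw: str) -> Optional[str]:
    """Return e.g. 600519.SH, or None when the code is ambiguous."""
    m = re.fullmatch(r"(SH|SZ)?(\d{6})(?:\.(SH|SZ))?", raw.strip().upper())
    if not m:
        return None
    code = m.group(2)
    inferred = PREFIX_TO_EXCHANGE.get(code[:2])
    stated = {g for g in (m.group(1), m.group(3)) if g}
    if inferred is None or (stated - {inferred}):
        return None  # conservative: reject unknown or conflicting exchanges
    return f"{code}.{inferred}"

def dedup(records: list) -> list:
    """Keep the first record per (ticker, title) pair."""
    seen, unique = set(), []
    for rec in records:
        key = (rec.get("ticker"), rec.get("title", "").strip())
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique
```

Returning None instead of a best guess is what makes the ticker step conservative: an unrecognized code stays unresolved until a person or a richer rule confirms it.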

6. Apply downgrade rules early

Use references/risks.md when deciding whether to downgrade, defer, or replace a source.

Default downgrade behavior:

- login-gated or anti-bot content -> restricted
- shell-only or disclaimer-heavy result -> switch entrypoint or switch tool
- 财联社 telegraph: keep the list body text by default; open the detail page only when the list is truncated, a canonical URL is needed, or a jump to the original source matters
- 巨潮 announcements: keep the list metadata by default; fetch the PDF only when the title is high-value enough to justify body extraction; otherwise keep a title-derived summary and mark that the PDF body was not extracted
- community-only claim without confirmation -> keep as clue, not fact
- unavailable priority site -> disclose it and use approved fallback public sources
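
The default downgrade behavior can be sketched as a lookup table plus two small decision helpers. The signal names, function names, and record fields are hypothetical labels chosen for this example; the actions paraphrase the defaults listed above, with login-gated and anti-bot content read as restricted.

```python
from typing import Callable, Optional

# Signal names are hypothetical; actions paraphrase the defaults above.
DOWNGRADE_ACTIONS = {
    "login_gated": "mark as restricted",
    "anti_bot": "mark as restricted",
    "shell_only": "switch entrypoint or switch tool",
    "disclaimer_heavy": "switch entrypoint or switch tool",
    "community_only_claim": "keep as clue, not fact",
    "priority_site_unavailable": "disclose and use approved fallback public sources",
}

def downgrade_action(signal: str) -> Optional[str]:
    """Look up the default action; None means no default applies."""
    return DOWNGRADE_ACTIONS.get(signal)

def needs_telegraph_detail(list_truncated: bool, needs_canonical_url: bool,
                           needs_source_jump: bool) -> bool:
    """财联社: stay on the list body unless one of the three triggers holds."""
    return list_truncated or needs_canonical_url or needs_source_jump

def announcement_record(title: str, high_value: bool,
                        fetch_pdf_body: Optional[Callable[[str], str]] = None) -> dict:
    """巨潮: chase the PDF only for high-value titles, and flag the rest."""
    if high_value and fetch_pdf_body is not None:
        return {"title": title, "body": fetch_pdf_body(title), "pdf_body_extracted": True}
    # Title-derived summary only; the flag records that the body was skipped.
    return {"title": title, "summary": title, "pdf_body_extracted": False}
```

Keeping the pdf_body_extracted flag on every record lets a later summary state plainly which announcements were read in full and which were summarized from the title.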

Default Site Priority

Use this order for stable public collection when the task does not specify a scenario:

1. 东方财富
2. 财联社
3. 巨潮资讯
4. 韭研公社
5. 雪球

This order reflects public accessibility and extraction stability, not market importance.
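
As a small illustration, the default order can be kept as a list and used to sort whatever subset of sites a round covers. The helper name order_sites is hypothetical, and placing unknown sites last is an assumption.

```python
DEFAULT_PRIORITY = ["东方财富", "财联社", "巨潮资讯", "韭研公社", "雪球"]

def order_sites(sites: list) -> list:
    """Sort sites by default priority; unknown sites go last (assumption)."""
    rank = {name: i for i, name in enumerate(DEFAULT_PRIORITY)}
    return sorted(sites, key=lambda s: rank.get(s, len(DEFAULT_PRIORITY)))
```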

When to Ask for Stronger Access

Ask for stronger access only when the user explicitly wants better extraction from a restricted site, especially 雪球.

Examples:

- attached Chrome relay tab
- logged-in browser profile
- cookies or authenticated environment
- a dedicated crawler or site-specific script

Scenario Call Contract

When a cron or caller specifies one of these scenario ids, treat it as a compact instruction bundle and do not ask for a longer prompt:

- pre-open: read references/entrypoints.md, references/fields.md, and references/risks.md; use the pre-open priority order; focus on overnight macro or overseas linkage, policy or industry catalysts, key announcements, expected hot sectors, and today's watchlist
- midday: read references/entrypoints.md, references/fields.md, and references/risks.md; use the intraday priority order; focus on morning index and turnover snapshot, leading or lagging themes, style or sentiment shifts, active stocks with catalysts, and deviation from the pre-open setup
- late-session: read references/entrypoints.md, references/fields.md, and references/risks.md; use the intraday priority order; focus on whether the afternoon main line strengthens or rotates, late-session anomalies, money-flow return direction, hot-stock persistence, and signals that may affect post-close review or next-day expectations
- post-close: read references/entrypoints.md, references/fields.md, and references/risks.md; use the post-close priority order; focus on index and turnover recap, main-line review, key stocks and drivers, important announcements plus exchange or regulator dynamics, and next-day clues with risks

For every scenario:

- keep the output in Chinese and lead with conclusions before detail
- keep 已确认事实, 市场观点与情绪, and 待核实线索 clearly separated
- keep 本轮缺失站点 and 来源层级说明 in the final output
- bind every round to the entrypoint, field-normalization, and risk-downgrade rules instead of freehand summarizing
- do not output buy or sell recommendations
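
One way to picture the compact instruction bundle is a lookup from scenario id to its reads, priority order, and focus list. This is only a sketch: the ids other than pre-open are assumed English renderings, the focus lists are abbreviated, and resolve_scenario and the field names are hypothetical.

```python
REQUIRED_READS = (
    "references/entrypoints.md",
    "references/fields.md",
    "references/risks.md",
)

# Ids other than "pre-open" are assumptions; focus lists are abbreviated.
SCENARIOS = {
    "pre-open": {"priority_order": "pre-open",
                 "focus": ["overnight macro", "catalysts", "key announcements"]},
    "midday": {"priority_order": "intraday",
               "focus": ["morning index and turnover", "leading or lagging themes"]},
    "late-session": {"priority_order": "intraday",
                     "focus": ["main-line rotation", "late-session anomalies"]},
    "post-close": {"priority_order": "post-close",
                   "focus": ["index and turnover recap", "next-day clues"]},
}

def resolve_scenario(scenario_id: str) -> dict:
    """Expand a scenario id into a full bundle without asking for more prompt."""
    if scenario_id not in SCENARIOS:
        raise KeyError(f"unknown scenario id: {scenario_id}")
    bundle = dict(SCENARIOS[scenario_id])
    bundle["reads"] = list(REQUIRED_READS)
    bundle["output_language"] = "zh"      # output stays in Chinese
    bundle["allow_trade_advice"] = False  # never output buy or sell calls
    return bundle
```

Because the shared rules are attached in resolve_scenario rather than repeated per entry, a new scenario only needs its priority order and focus list.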

Standard Output

When producing a formal round output, always structure it with at least these sections:

- 已确认事实
- 市场观点与情绪
- 待核实线索
- 本轮缺失站点
- 来源层级说明

Use the sections as follows:

- 已确认事实: only T1 or well-supported T2 items, or items clearly marked as partially confirmed
- 市场观点与情绪: T3 discussion, heat, consensus drift, and sentiment signals
- 待核实线索: rumors, single-source community claims, partial clues, or conflicting statements
- 本轮缺失站点: blocked, unstable, login-gated, or otherwise uncovered priority sites and what fallback was used
- 来源层级说明: explain T1/T2/T3 usage and remind the reader that community sources are not equal to formal disclosure
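
The section routing above might be sketched as follows. The record fields (tier, well_supported, unverified_claim, text), the function names, and the rule that weakly supported T2 items fall into 待核实线索 are assumptions for this example; the tier note is passed in rather than hard-coded because its wording belongs to references/fields.md.

```python
SECTIONS = ["已确认事实", "市场观点与情绪", "待核实线索", "本轮缺失站点", "来源层级说明"]

def route(record: dict) -> str:
    """Pick the section for one record; field names are hypothetical."""
    tier = record.get("tier")
    if tier == "T1" or (tier == "T2" and record.get("well_supported")):
        return "已确认事实"
    if tier == "T3" and not record.get("unverified_claim"):
        return "市场观点与情绪"
    return "待核实线索"  # rumors, weak T2 items, conflicting statements

def build_round_output(records: list, missing_sites: list, tier_note: str) -> str:
    """Render all five sections in fixed order, even when one is empty."""
    buckets = {name: [] for name in SECTIONS}
    for rec in records:
        buckets[route(rec)].append(rec["text"])
    buckets["本轮缺失站点"] = list(missing_sites)
    buckets["来源层级说明"] = [tier_note]
    lines = []
    for name in SECTIONS:
        lines.append(name)
        lines.extend(f"- {item}" for item in (buckets[name] or ["无"]))
    return "\n".join(lines)
```

Rendering every section, with 无 as a placeholder for empty ones, keeps a missing-site disclosure from silently disappearing when a round happens to cover everything.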

Per-Site Quick Output for Crawlability Tasks

When the task is specifically about site feasibility rather than a market summary, return:

- Site
- Status
- Recommended mode
- Best entry page
- What works
- Main limitation
- Next step
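
A minimal sketch of this per-site report as a record type follows. The class name SiteReport and its render method are hypothetical, and the field names are simply the seven labels above in snake case.

```python
from dataclasses import dataclass, fields

@dataclass
class SiteReport:
    site: str
    status: str
    recommended_mode: str
    best_entry_page: str
    what_works: str
    main_limitation: str
    next_step: str

    def render(self) -> str:
        """Emit one 'Label: value' line per field, in the order listed above."""
        return "\n".join(
            f"{f.name.replace('_', ' ').capitalize()}: {getattr(self, f.name)}"
            for f in fields(self)
        )
```

Making all seven fields required means an incomplete report fails at construction time rather than producing a quick output with silent gaps.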

Non-Negotiables

- Distinguish confirmed facts from community opinion
- Prefer official disclosure and high-confidence public reporting over discussion boards
- Do not output buy/sell recommendations
- Do not imply full coverage when a priority site failed or was inaccessible
