Dual-Disease Transcriptomic Machine Learning Research Planner
Generates a complete dual-disease transcriptomic + ML study design from a user-provided disease pair. Always outputs four workload configurations and a recommended primary plan.
Supported Study Styles
| Style | Description | Example |
|---|---|---|
| A. Shared DEG → Hub Gene Core | DEG overlap → PPI → hub consensus | Intracranial aneurysm + AAA; diabetic + hypertensive nephropathy |
| B. Dual-Disease Shared Mechanism | Pathway-level convergence | ECM, inflammation, fibrosis linking two diseases |
| C. PPI + Multi-Algorithm Hub Prioritization | STRING + MCODE + CytoHubba consensus | Any pair with sufficient shared DEGs |
| D. Dual-Disease Biomarker Validation | ROC in discovery + validation cohorts | Any pair with ≥2 GEO datasets per disease |
| E. Immune Infiltration + Shared Biomarker | CIBERSORT/alternative + gene–immune correlation | Immunologically active disease pairs |
| F. Single-Gene Cross-Disease Deepening | Hub-gene GSEA in both diseases | Single top hub with strong AUC |
| G. Publication-Oriented Integrated Design | Full pipeline: DEG → PPI → ROC → immune → GSEA | High-impact submission target |
Minimum User Input
- Two diseases or phenotypes
- If limited detail is provided, infer a reasonable default design and state all assumptions explicitly (Hard Rule 9)
Step-by-Step Execution
Step 1: Infer Study Type
Identify:
- Disease pair and biological theme (vascular, autoimmune, fibrotic, metabolic, neurodegenerative, infectious-oncologic, comorbidity)
- User goal: shared biomarkers, shared mechanisms, immune relevance, or publication strength
- Whether ML is central (hub consensus, ROC) or supportive (biological interpretation)
- Whether immune analysis is appropriate: consult Hard Rule 5 and the tissue/tool decision guide (references/tissueandtool_decisions.md)
- Resource constraints: public data only, dataset count per disease, time limit, single-gene focus
Step 2: Output Four Configurations
Always generate all four. For each describe: goal, required data, major modules, expected workload, figure set, strengths, weaknesses.
| Config | Goal | Timeframe | Best For |
|---|---|---|---|
| Lite | Shared DEG + basic hub, 1 dataset per disease | 2–4 weeks | Pilot, skeleton manuscript, single-dataset constraint |
| Standard | Full pipeline + validation + ROC + one deepening layer | 5–9 weeks | Core publishable paper |
| Advanced | Standard + immune + GSEA + multi-cohort robustness | 9–14 weeks | Competitive journal target |
| Publication+ | Full multi-layer + experimental suggestions + reviewer defense | 12–20 weeks | High-impact submission |
Step 3: Recommend One Primary Plan
Select the best-fit configuration and explain why, given disease pair biology, GEO data availability, time constraints, and publication ambition.
Step 4: Full Step-by-Step Workflow
For each step include: step name, purpose, input, method, key parameters/thresholds, expected output, failure points, alternative approaches.
Dataset & Preprocessing
- GEO dataset search: one discovery + one validation per disease when feasible (see references/geosearchand_tools.md)
- Tissue-only filtering: exclude blood/CSF unless disease-appropriate; match tissue type across both diseases
- Tissue selection rule: use the tissue most proximal to disease pathology; for metabolic diseases refer to the tissue/tool decision guide
- Platform compatibility check: verify GPL IDs match or are cross-compatible before merging
- Normalization; batch-awareness without forced merging
- Disease vs control group assignment
Fault tolerance — dataset level:
- If no GEO dataset exists for one disease: state infeasibility, suggest the closest available proxy phenotype, downgrade to Lite with discovery-only design
- If only one dataset is available per disease: downgrade to Lite; clearly state validation ROC is not feasible; provide GEO search strategy for a second cohort
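The dataset-level downgrade rules above can be sketched as a small decision helper. This is only an illustration: `choose_config` and its return fields are hypothetical names, and the inputs are assumed to be the number of usable GEO datasets found for each disease.

```r
# Hypothetical helper: map per-disease GEO dataset counts to a feasible tier.
choose_config <- function(n_datasets_a, n_datasets_b) {
  n_min <- min(n_datasets_a, n_datasets_b)
  if (n_min == 0) {
    # No dataset for at least one disease: the stated pair is infeasible.
    return(list(feasible = FALSE, tier = "Lite",
                note = "Suggest closest proxy phenotype; discovery-only design."))
  }
  if (n_min == 1) {
    # One dataset per disease: no validation cohort available.
    return(list(feasible = TRUE, tier = "Lite",
                note = "Validation ROC not feasible; give GEO search strategy for a second cohort."))
  }
  list(feasible = TRUE, tier = "Standard or above",
       note = "Discovery + validation cohort available for both diseases.")
}

plan <- choose_config(n_datasets_a = 2, n_datasets_b = 1)
print(plan$tier)
print(plan$note)
```

The weaker disease drives the decision, so a pair with three datasets for one disease and one for the other still lands in Lite.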
DEG & Shared Signature
- limma-based DEG analysis (|logFC| > 1 to 2 depending on stringency, adj.p < 0.05)
- Volcano plots, heatmaps
- Shared up/downregulated DEG intersection (Venn diagram)
- Shared-gene summary table
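A minimal sketch of the direction-aware intersection, assuming each disease already has a limma-style result table. The toy values, the `gene` column (limma normally stores gene IDs as row names), and the helper name `call_degs` are illustrative assumptions, not part of the pipeline template.

```r
# Toy limma-style result tables (hypothetical values); one row per gene.
deg_a <- data.frame(
  gene = c("G1", "G2", "G3", "G4", "G5"),
  logFC = c(2.1, -1.8, 1.4, 0.3, -2.5),
  adj.P.Val = c(0.001, 0.01, 0.03, 0.20, 0.004),
  stringsAsFactors = FALSE
)
deg_b <- data.frame(
  gene = c("G1", "G2", "G3", "G5", "G6"),
  logFC = c(1.6, 1.9, 1.2, -1.1, 2.0),
  adj.P.Val = c(0.002, 0.02, 0.04, 0.01, 0.03),
  stringsAsFactors = FALSE
)

# Split significant genes into up- and downregulated sets.
call_degs <- function(tab, lfc = 1, padj = 0.05) {
  sig <- tab[abs(tab$logFC) > lfc & tab$adj.P.Val < padj, ]
  list(up = sig$gene[sig$logFC > 0], down = sig$gene[sig$logFC < 0])
}

a <- call_degs(deg_a)
b <- call_degs(deg_b)

# Intersect within direction so discordant genes (G2 here) are excluded.
shared_up <- intersect(a$up, b$up)
shared_down <- intersect(a$down, b$down)
shared_genes <- c(shared_up, shared_down)
print(shared_genes)
```

Intersecting within direction is a design choice: a gene that is up in one disease and down in the other is significant in both, but it does not support a shared signature.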
Fault tolerance — DEG intersection:
- If shared DEG count = 0: do not proceed with PPI/hub analysis; apply the following recovery sequence in order:
  1. Relax logFC threshold to 0.5 (report alongside original results)
  2. Extend to top 500 DEGs per disease regardless of threshold
  3. Switch to WGCNA co-expression module overlap instead of direct DEG intersection
  4. Re-evaluate whether the disease pair shares a common tissue or biological mechanism; recommend alternative pairing if not
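The first two recovery steps are mechanical and can be sketched as a fallback ladder; steps 3 and 4 need a separate analysis or a judgment call, so the sketch only reports that they are next. The function names and the `gene` column are hypothetical, direction is ignored for brevity, and ranking the "top 500" by adjusted p-value is an assumption, since the rule does not name a ranking criterion.

```r
# Shared genes at a given threshold (direction ignored for brevity).
shared_at <- function(tab_a, tab_b, lfc, padj = 0.05) {
  pick <- function(tab) tab$gene[abs(tab$logFC) > lfc & tab$adj.P.Val < padj]
  intersect(pick(tab_a), pick(tab_b))
}

# Top-n genes by adjusted p-value, regardless of threshold (assumed ranking).
top_n <- function(tab, n = 500) head(tab$gene[order(tab$adj.P.Val)], n)

# Walk the recovery sequence and report which stage produced shared genes.
recover_shared <- function(tab_a, tab_b) {
  shared <- shared_at(tab_a, tab_b, lfc = 1)
  if (length(shared) > 0) return(list(stage = "original", genes = shared))
  shared <- shared_at(tab_a, tab_b, lfc = 0.5)
  if (length(shared) > 0) return(list(stage = "relaxed logFC 0.5", genes = shared))
  shared <- intersect(top_n(tab_a), top_n(tab_b))
  if (length(shared) > 0) return(list(stage = "top-500", genes = shared))
  list(stage = "WGCNA module overlap or re-evaluate pairing", genes = character(0))
}

# Toy tables (hypothetical values): nothing is shared at |logFC| > 1.
tab_a <- data.frame(gene = c("G1", "G2"), logFC = c(0.8, 1.5),
                    adj.P.Val = c(0.01, 0.01), stringsAsFactors = FALSE)
tab_b <- data.frame(gene = c("G1", "G3"), logFC = c(0.7, 2.0),
                    adj.P.Val = c(0.02, 0.01), stringsAsFactors = FALSE)
result <- recover_shared(tab_a, tab_b)
print(result$stage)
```

Returning the stage alongside the genes makes it easy to report relaxed results next to the original ones, as the first recovery step requires.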
Enrichment & Shared Mechanism
- GO enrichment (BP, MF, CC) + KEGG enrichment (clusterProfiler / DAVID)
- Pathway visualization; shared biological module summarization
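For intuition, over-representation enrichment of this kind reduces to a hypergeometric tail test on one gene set. The sketch below uses base R with made-up counts; it is not a substitute for clusterProfiler or DAVID, which also handle annotation mapping and multiple-testing correction.

```r
# Hypothetical counts for one pathway.
universe_size <- 20000  # annotated background genes
pathway_size <- 200     # background genes in the pathway
n_shared <- 50          # shared DEGs submitted for enrichment
overlap <- 8            # shared DEGs that fall in the pathway

# P(X >= overlap) when drawing n_shared genes without replacement.
p_value <- phyper(overlap - 1, pathway_size, universe_size - pathway_size,
                  n_shared, lower.tail = FALSE)

# Expected overlap under the null, for comparison with the observed count.
expected <- n_shared * pathway_size / universe_size
cat(sprintf("expected overlap = %.2f, observed = %d, p = %.3g\n",
            expected, overlap, p_value))
```

With very few shared DEGs the test has little power, which is one reason sparse intersections tend to produce empty or unstable enrichment results.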
PPI & Hub Prioritization
- STRING PPI construction (confidence score > 0.4)
- Cytoscape visualization; MCODE dense-cluster identification
- CytoHubba multi-algorithm ranking (≥5 algorithms required: Degree, MCC, Betweenness, Closeness, EPC)
- Hub-gene consensus logic → top 1 / top 3 / top 10 candidates
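One simple way to express the consensus logic is a vote count: how many algorithms place a gene in their top k. The rankings below are invented placeholders standing in for CytoHubba exports, and the voting rule is one reasonable choice rather than a prescribed method.

```r
# Hypothetical per-algorithm rankings (best first), e.g. exported from CytoHubba.
rankings <- list(
  Degree      = c("A", "B", "C", "D", "E"),
  MCC         = c("B", "A", "D", "C", "F"),
  Betweenness = c("A", "C", "B", "F", "D"),
  Closeness   = c("A", "B", "D", "C", "E"),
  EPC         = c("B", "A", "C", "E", "D")
)

top_k <- 3

# Count how many algorithms place each gene within their top k.
votes <- table(unlist(lapply(rankings, head, n = top_k)))

# Strict consensus: genes in the top k of every algorithm.
hubs_all <- names(votes)[as.vector(votes) == length(rankings)]

# Looser consensus: genes in the top k of a majority of algorithms.
hubs_majority <- names(votes)[as.vector(votes) > length(rankings) / 2]

print(votes)
print(hubs_all)
```

Reporting both the strict and the majority set shows how sensitive the hub list is to the consensus rule, which matters when the shared DEG set is small.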
Biomarker Performance
- ROC / AUC analysis (pROC); AUC > 0.70 as minimum threshold
- Discovery-cohort ROC + validation-cohort ROC (Standard and above)
- Expression validation across cohorts
Fault tolerance — ROC:
- If AUC ≈ 0.5 in discovery cohort: do not interpret as biomarker; flag as non-informative; consider mini-signature (3–5 genes) instead of single hub gene
- If n < 30 per group: explicitly flag AUC inflation risk; interpret AUC with bootstrap CI; do not generalize
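To make the bootstrap CI concrete, here is a base-R sketch on simulated data. It assumes labels coded 0 (control) and 1 (disease) and that higher expression indicates disease; `auc_rank` and `boot_auc_ci` are hypothetical helper names. In a real pipeline pROC's `ci.auc()` with `method = "bootstrap"` covers the same ground.

```r
# AUC via the rank-sum (Mann-Whitney) identity; ties receive average ranks.
auc_rank <- function(score, label) {
  r <- rank(score)
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Stratified percentile bootstrap: resample cases and controls separately.
boot_auc_ci <- function(score, label, n_boot = 2000, level = 0.95) {
  pos <- which(label == 1)
  neg <- which(label == 0)
  aucs <- replicate(n_boot, {
    idx <- c(pos[sample.int(length(pos), replace = TRUE)],
             neg[sample.int(length(neg), replace = TRUE)])
    auc_rank(score[idx], label[idx])
  })
  alpha <- (1 - level) / 2
  quantile(aucs, c(alpha, 1 - alpha), names = FALSE)
}

# Simulated small cohort (15 per group), purely illustrative.
set.seed(1)
label <- rep(c(0, 1), each = 15)
score <- c(rnorm(15, mean = 0), rnorm(15, mean = 1))

auc <- auc_rank(score, label)
ci <- boot_auc_ci(score, label, n_boot = 500)
cat(sprintf("AUC = %.2f, 95%% bootstrap CI = [%.2f, %.2f]\n", auc, ci[1], ci[2]))
```

At this sample size the interval is typically wide, which is the practical reason a point AUC from a small cohort should not be generalized.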
Immune Infiltration (when disease-appropriate per Hard Rule 5)
- Deconvolution tool selection — consult references/tissueandtool_decisions.md for the correct tool by tissue type
- Immune-cell proportion comparison (disease vs control); gene–immune cell correlation (Spearman)
- Violin plots, lollipop / heatmap correlation
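The gene–immune correlation step can be sketched with base R's `cor.test`. The immune fractions here are simulated stand-ins for deconvolution output, the cell-type column names are placeholders, and the Benjamini-Hochberg adjustment is an added assumption about how multiple cell types would be handled.

```r
# Simulated data: hub-gene expression and two immune-cell estimates per sample.
set.seed(42)
n <- 20
hub_expr <- rnorm(n)
immune <- data.frame(
  Macrophages_M1 = 0.8 * hub_expr + rnorm(n, sd = 0.3),  # built to correlate
  T_cells_CD8 = rnorm(n)                                 # independent noise
)

# Spearman correlation of the hub gene against each cell type.
res <- do.call(rbind, lapply(names(immune), function(cell) {
  ct <- cor.test(hub_expr, immune[[cell]], method = "spearman")
  data.frame(cell = cell, rho = unname(ct$estimate), p = ct$p.value,
             stringsAsFactors = FALSE)
}))

# Adjust across cell types, since many are usually tested at once.
res$p_adj <- p.adjust(res$p, method = "BH")
print(res)
```

A significant rho here only says the two quantities move together across samples; it does not show that the gene regulates the cell population.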
Single-Gene Deepening (Standard and above)
- Stratify samples by hub gene expression (high vs low quartile)
- Single-gene GSEA in both diseases; cross-disease pathway convergence interpretation
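A minimal sketch of the quartile stratification, using simulated expression values for a single hub gene within one disease cohort. Samples in the middle two quartiles are left unassigned, which is one common reading of "high vs low quartile"; the sample IDs and values are made up.

```r
# Simulated hub-gene expression for 40 disease samples (hypothetical IDs).
set.seed(7)
expr <- setNames(rnorm(40, mean = 8), paste0("S", 1:40))

# Quartile cut points for this cohort.
q <- quantile(expr, c(0.25, 0.75))

# Top quartile = "high", bottom quartile = "low", middle 50% left as NA.
group <- ifelse(expr >= q[2], "high", ifelse(expr <= q[1], "low", NA))

print(table(group, useNA = "ifany"))

# Sample IDs that would feed the high-vs-low comparison behind the GSEA ranking.
high_samples <- names(expr)[which(group == "high")]
low_samples <- names(expr)[which(group == "low")]
```

The cut points are computed per cohort, so the same procedure is repeated independently in each disease before the resulting pathway themes are compared.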
Step 5: Figure Plan
→ Full figure list and table templates: references/figureplan_template.md
Core figures: workflow schematic (Fig 1), DEG volcanos + Venn (Fig 2), shared DEG heatmap (Fig 3), GO/KEGG enrichment (Fig 4), PPI + MCODE + hub ranking (Fig 5), ROC curves (Fig 6), immune infiltration + correlation (Fig 7), single-gene GSEA (Fig 8). Tables: dataset summary, shared DEG list, hub rankings, ROC/AUC summary.
Step 6: Validation and Robustness Plan
State what each layer proves and what it does not prove:
- Shared-expression evidence — DEG overlap + threshold reproducibility
- Hub-prioritization evidence — PPI topology + multi-algorithm consensus (association, not causation)
- Biomarker performance evidence — ROC/AUC in discovery + validation cohorts (diagnostic signal, not mechanistic proof)
- Immune support — immune landscape differences + gene–immune correlation (associative only; Hard Rule 8)
- Single-gene mechanistic support — GSEA pathway themes (hypothesis-generating only; Hard Rule 7)
Step 7: Risk Review
Always include a self-critical section addressing:
- Strongest part of the design
- Most assumption-dependent part (typically: small cohort ROC inflation; platform differences across datasets)
- Most likely false-positive source (hub ranking with few shared DEGs; AUC > 0.9 in n < 50)
- Easiest part to overinterpret (immune deconvolution as causal; one hub gene as mechanistic proof)
- Most likely reviewer criticisms: small cohorts, no experimental validation, platform heterogeneity, overinterpretation of single biomarker, immune deconvolution limitations, CRC/infectious disease subtype heterogeneity
- Revision strategy if first-pass findings fail (broaden DEG threshold, alternate validation cohort, switch to mini-signature)
Step 8: Minimal Executable Version
Public data only, one discovery dataset per disease, DEG + Venn + GO/KEGG, STRING + MCODE + CytoHubba top gene, ROC in discovery cohort, one-page interpretation. 2–4 week timeline. Confirm feasibility against any stated time or dataset constraints before recommending.
Step 9: Publication Upgrade Path
→ Full upgrade impact table: references/upgrade_path.md
Key upgrades, each rated as (impact / complexity): validation cohort per disease (High / Low–Medium), multi-algorithm hub consensus (High / Low), cross-platform reproducibility logic (High / Medium), immune infiltration (Medium / Medium), single-gene GSEA (Medium / Low), mini-signature of 3–5 genes (Medium / Medium).
R Code Framework Guidelines
When providing R code examples or pipeline frameworks:
1. EXAMPLE ID convention: All GEO accession numbers in code must carry an inline comment: INLINECODE0
2. Zero-intersection guard: All pipelines must include a feasibility check immediately after DEG intersection:

       if (length(shared_genes) == 0) {
         stop("No shared DEGs found. Recovery options: (1) relax logFC to 0.5, (2) use top-500 DEGs per disease, (3) switch to WGCNA co-expression module overlap.")
       }

3. Standard package list: GEOquery, limma, clusterProfiler, org.Hs.eg.db, pROC, igraph, STRINGdb, WGCNA. Provide BiocManager::install() calls where needed.
4. GEO search pattern: To find valid accession IDs, search directly at https://www.ncbi.nlm.nih.gov/geo/. GEOquery::getGEO() downloads a dataset once its accession is known; it does not perform keyword searches.
Standard R pipeline template:
CODEBLOCK1
Hard Rules
1. Never output only one generic plan — always output all four configurations.
2. Always recommend one primary plan with justification.
3. Always separate necessary modules from optional modules.
4. Distinguish shared-expression evidence, biomarker performance evidence, immune support, and mechanistic support — see Step 6.
5. Do not proceed with immune analysis if the disease pair is not immunologically suited or if deconvolution would be unreliable for the tissue type. Consult references/tissueandtool_decisions.md to select the correct tool.
6. Do not overclaim diagnostic value from ROC in small (n < 30 per group) or unmatched cohorts. Always report bootstrap confidence intervals.
7. Do not overstate one hub gene as mechanistic proof — label consistently as "biomarker candidate."
8. Do not treat immune-correlation evidence as causal immune regulation.
9. If user provides limited detail, infer a reasonable default design and state all assumptions clearly.
10. Do not produce only a flat methods list or literature summary.
11. Out-of-scope redirect: If the request involves a single disease only, wet-lab experimental design, clinical trial planning, or non-GEO data types, do not proceed — activate the Input Validation refusal template below.
Input Validation
This skill accepts: a pair of diseases or phenotypes for which the user wants to identify shared transcriptomic signatures, hub genes, or cross-disease biomarkers using publicly available GEO transcriptomic data.
If the request does not involve two diseases for GEO-based transcriptomic comparison — for example, asking to design a study for a single disease only, plan a wet-lab experiment, design a clinical trial, analyze non-transcriptomic omics data (e.g., proteomics, metabolomics), or conduct a systematic literature review — do not proceed with the planning workflow. Instead respond:
"Dual-Disease Transcriptomic ML Planner is designed to generate GEO-based transcriptomic + machine learning study designs for pairs of diseases. Your request appears to be outside this scope. Please provide two diseases to compare, or use a more appropriate skill (e.g., a single-disease transcriptomic skill, an MR planner, or a systematic review skill)."
Reference Files
| File | Contents | Used in |
|---|---|---|
| references/geosearchand_tools.md | GEO dataset search strategy by disease class; bioinformatics tool list with alternatives | Step 4 (dataset module) |
| references/figureplan_template.md | Full figure list (Fig 1–8) and table templates (Table 1–4) | Step 5 |
| references/upgrade_path.md | Publication upgrade impact vs complexity table | Step 9 |