FASTQC Report Interpreter
Analyze FASTQC quality control reports for Next-Generation Sequencing (NGS) data to assess data quality and identify issues.
Quick Start
CODEBLOCK0
Core Capabilities
1. Quality Metrics Analysis
CODEBLOCK1
Key Metrics:
| Metric | Good | Warning | Fail |
|---|---|---|---|
| Per base sequence quality | Q > 28 | Q 20-28 | Q < 20 |
| Per sequence quality scores | Peak at Q30 | Peak Q20-30 | Peak < Q20 |
| Per base N content | < 5% | 5-20% | > 20% |
| Sequence duplication | < 20% | 20-50% | > 50% |
| Adapter content | < 5% | 5-10% | > 10% |
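The thresholds in the table can be expressed as a small classifier. The sketch below is illustrative only: the metric keys and function names are hypothetical and are not part of `FASTQCInterpreter`, and the cut-offs are the ones listed in this table rather than FastQC's own internal limits. Boundary values (for example exactly 5% or exactly Q28) are assigned to the Warning band, following the ranges as written.

```python
from typing import Dict, Tuple

# (upper bound of "good", upper bound of "warning"), in percent.
# Keys are hypothetical labels for the percentage-based rows of the table.
PERCENT_THRESHOLDS: Dict[str, Tuple[float, float]] = {
    "per_base_n_content": (5.0, 20.0),
    "sequence_duplication": (20.0, 50.0),
    "adapter_content": (5.0, 10.0),
}


def classify_percent(metric: str, value: float) -> str:
    """Return 'good', 'warning' or 'fail' for a percentage-based metric."""
    good_below, fail_above = PERCENT_THRESHOLDS[metric]
    if value < good_below:
        return "good"
    if value <= fail_above:
        return "warning"
    return "fail"


def classify_base_quality(phred: float) -> str:
    """Classify a per-base Phred score using the Q > 28 / Q 20-28 / Q < 20 bands."""
    if phred > 28:
        return "good"
    if phred >= 20:
        return "warning"
    return "fail"


if __name__ == "__main__":
    print(classify_percent("adapter_content", 7.5))
    print(classify_base_quality(31))
```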
2. Issue Diagnosis
CODEBLOCK2
Common Issues:
Low Quality at Read Ends
- Cause: Phasing effects, reagent depletion
- Solution: Trim last 10-20 bases
Adapter Contamination
- Cause: Incomplete adapter removal
- Solution: Re-run cutadapt/Trimmomatic with stricter parameters
High Duplication
- Cause: PCR over-amplification, low input
- Solution: Use deduplication; consider library prep optimization
Per Base Sequence Content Bias
- Cause: Adapter dimers, non-random priming
- Solution: Check for adapter contamination; randomize primers
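To make the "trim last 10-20 bases" fix concrete, here is a minimal sketch of fixed-length 3' trimming in plain Python. It assumes uncompressed FASTQ with exactly four lines per record; `read_fastq` and `trim_tail` are hypothetical helper names. For real data, a dedicated trimmer such as cutadapt or Trimmomatic is the usual choice.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (header, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of input
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual


def trim_tail(seq: str, qual: str, n: int) -> Tuple[str, str]:
    """Remove the last n bases and their quality characters."""
    if n <= 0:
        return seq, qual
    keep = max(len(seq) - n, 0)
    return seq[:keep], qual[:keep]


if __name__ == "__main__":
    example = io.StringIO("@read1\nACGTACGTAC\n+\nIIIIIIII##\n")
    for header, seq, qual in read_fastq(example):
        trimmed_seq, trimmed_qual = trim_tail(seq, qual, 2)
        print(header, trimmed_seq, trimmed_qual, sep="\n")
```

The sequence and quality strings are cut to the same length, so the record stays valid FASTQ after trimming.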
3. Batch Analysis
CODEBLOCK3
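One way a batch summary could be laid out is a CSV with one row per sample and one column per FastQC module. The sketch below uses only the standard library. The function name `write_batch_summary` and the layout are assumptions for illustration, not the interpreter's actual output format.

```python
import csv
import io
from typing import Dict, TextIO


def write_batch_summary(results: Dict[str, Dict[str, str]], handle: TextIO) -> None:
    """Write a sample-by-module status table as CSV.

    `results` maps sample name -> {module name: status}.
    Modules missing for a sample are written as "NA".
    """
    modules = sorted({m for statuses in results.values() for m in statuses})
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["sample", *modules])
    for sample in sorted(results):
        row = [results[sample].get(module, "NA") for module in modules]
        writer.writerow([sample, *row])


if __name__ == "__main__":
    demo = {
        "sample1": {"Adapter Content": "PASS", "Per base N content": "WARN"},
        "sample2": {"Adapter Content": "FAIL"},
    }
    buffer = io.StringIO()
    write_batch_summary(demo, buffer)
    print(buffer.getvalue())
```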
4. Recommendation Generation
CODEBLOCK4
Application-Specific Thresholds:
- RNA-seq: Duplication up to 40% is acceptable, because highly expressed transcripts produce genuine duplicate reads
- DNA-seq: Strict quality requirements (variant calling)
- ChIP-seq: Moderate quality, focus on enrichment metrics
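The one numeric application-specific threshold above (RNA-seq duplication up to 40%) could be encoded as an override on top of the default 20% limit from the Key Metrics table. This is a sketch under assumptions: the application keys and the function name are hypothetical, and treating the limit as inclusive is a choice made here, not something the skill specifies.

```python
from typing import Dict, Optional

# Default limit taken from the "Good" band of the Key Metrics table.
DEFAULT_DUPLICATION_LIMIT = 20.0

# Per-application overrides; only RNA-seq has a stated numeric value.
# The key "rna_seq" is an assumed label for illustration.
DUPLICATION_LIMITS: Dict[str, float] = {
    "rna_seq": 40.0,
}


def duplication_acceptable(percent: float, application: Optional[str] = None) -> bool:
    """Return True if the duplication level is within the limit for the application."""
    limit = DUPLICATION_LIMITS.get(application, DEFAULT_DUPLICATION_LIMIT)
    return percent <= limit  # inclusive boundary is an assumption


if __name__ == "__main__":
    print(duplication_acceptable(35.0, "rna_seq"))
    print(duplication_acceptable(35.0))
```

The same 35% duplication level is accepted for RNA-seq and flagged for the default case, which is the behaviour the list above describes.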
CLI Usage
CODEBLOCK5
Output Interpretation
PASS (Green): Proceed with analysis
WARNING (Yellow): Review but likely acceptable
FAIL (Red): Requires action before downstream analysis
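FastQC records a per-module status in `fastqc_data.txt` on lines of the form `>>Module name<TAB>pass|warn|fail`, each block closed by `>>END_MODULE`. The sketch below reads those lines and reduces them to one overall label. The function names are hypothetical, and taking the worst module status as the overall result is an assumption about how a summary might be derived, not a description of the interpreter's logic.

```python
from typing import Dict


def parse_module_statuses(text: str) -> Dict[str, str]:
    """Extract {module name: STATUS} from the contents of fastqc_data.txt."""
    statuses: Dict[str, str] = {}
    for line in text.splitlines():
        if line.startswith(">>") and not line.startswith(">>END_MODULE"):
            name, _, status = line[2:].partition("\t")
            statuses[name] = status.strip().upper()
    return statuses


def overall_status(statuses: Dict[str, str]) -> str:
    """Reduce module statuses to PASS / WARNING / FAIL (worst status wins)."""
    values = set(statuses.values())
    if "FAIL" in values:
        return "FAIL"
    if "WARN" in values:
        return "WARNING"
    return "PASS"


if __name__ == "__main__":
    sample = (
        "##FastQC\t0.11.9\n"
        ">>Basic Statistics\tpass\n"
        ">>END_MODULE\n"
        ">>Per base sequence quality\twarn\n"
        ">>END_MODULE\n"
    )
    statuses = parse_module_statuses(sample)
    print(statuses)
    print(overall_status(statuses))
```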
Troubleshooting Guide
See references/troubleshooting.md for:
- Platform-specific issues (Illumina, PacBio, Oxford Nanopore)
- Library prep problem diagnosis
- Downstream analysis impact assessment
Skill ID: 205 | Version: 1.0 | License: MIT
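Skill name: fastqc-report-interpreter
Detailed description:
FASTQC Report Interpreter
Analyze FASTQC quality control reports for Next-Generation Sequencing (NGS) data to assess data quality and identify issues.
Quick Start
```python
from scripts.fastqc_interpreter import FASTQCInterpreter

interpreter = FASTQCInterpreter()

# Analyze a report
analysis = interpreter.analyze("sample_fastqc.html")
print(f"Overall quality: {analysis.quality_status}")
print(f"Issues found: {analysis.issues}")
```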
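Core Capabilities
1. Quality Metrics Analysis
```python
# Parse the metrics from FastQC's text report
metrics = interpreter.parse_metrics("fastqc_data.txt")
```
Key Metrics:
| Metric | Good | Warning | Fail |
|---|---|---|---|
| Per base sequence quality | Q > 28 | Q 20-28 | Q < 20 |
| Per sequence quality scores | Peak at Q30 | Peak Q20-30 | Peak < Q20 |
| Per base N content | < 5% | 5-20% | > 20% |
| Sequence duplication | < 20% | 20-50% | > 50% |
| Adapter content | < 5% | 5-10% | > 10% |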
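2. Issue Diagnosis
```python
issues = interpreter.diagnose_issues(metrics)
for issue in issues:
    print(f"{issue.severity}: {issue.description}")
    print(f"Recommendation: {issue.recommendation}")
```
Common Issues:
Low Quality at Read Ends
- Cause: Phasing effects, reagent depletion
- Solution: Trim last 10-20 bases
Adapter Contamination
- Cause: Incomplete adapter removal
- Solution: Re-run cutadapt/Trimmomatic with stricter parameters
High Duplication
- Cause: PCR over-amplification, low input
- Solution: Use deduplication; consider library prep optimization
Per Base Sequence Content Bias
- Cause: Adapter dimers, non-random priming
- Solution: Check for adapter contamination; randomize primers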
3. Batch Analysis
```python
batch_results = interpreter.analyze_batch(
    fastqc_files=["sample1_fastqc.html", "sample2_fastqc.html", ...],
    output_summary="batch_summary.csv",
)
```
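4. Recommendation Generation
```python
recommendations = interpreter.get_recommendations(
    analysis,
    application="rna_seq",  # or "dna_seq", "chip_seq"
    quality_threshold="high",
)
```
Application-Specific Thresholds:
- RNA-seq: Duplication up to 40% is acceptable (transcript abundance)
- DNA-seq: Strict quality requirements (variant calling)
- ChIP-seq: Moderate quality, focus on enrichment metrics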
CLI Usage
```bash
# Analyze a single report
python scripts/fastqc_interpreter.py --input sample_fastqc.html

# Batch analysis
python scripts/fastqc_interpreter.py --batch *fastqc.html --output report.pdf

# Use custom thresholds
python scripts/fastqc_interpreter.py --input fastqc.html --application rna_seq
```
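Output Interpretation
PASS (Green): Proceed with analysis
WARNING (Yellow): Review but likely acceptable
FAIL (Red): Requires action before downstream analysis
Troubleshooting Guide
See references/troubleshooting.md for:
- Platform-specific issues (Illumina, PacBio, Oxford Nanopore)
- Library prep problem diagnosis
- Downstream analysis impact assessment
Skill ID: 205 | Version: 1.0 | License: MIT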