Clinical Data Cleaner
Clean, validate, and standardize clinical trial data to meet CDISC SDTM standards for regulatory submissions to FDA or EMA.
Quick Start
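```python
from scripts.main import ClinicalDataCleaner

# Initialize for the Demographics (DM) domain
cleaner = ClinicalDataCleaner(domain="DM")

# Clean with default settings (raw_data is the raw DM dataset, loaded beforehand)
cleaned = cleaner.clean(raw_data)

# Save the cleaned data together with the audit trail
cleaner.save_report("output.csv")
```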
Core Capabilities
1. SDTM Domain Validation
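```python
cleaner = ClinicalDataCleaner(domain="DM")  # or "LB", "VS"
is_valid, missing = cleaner.validate_domain(data)
```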
Required Fields:
- DM: STUDYID, USUBJID, SUBJID, RFSTDTC, RFENDTC, SITEID, AGE, SEX, RACE
- LB: STUDYID, USUBJID, LBTESTCD, LBCAT, LBORRES, LBORRESU, LBSTRESC, LBDTC
- VS: STUDYID, USUBJID, VSTESTCD, VSORRES, VSORRESU, VSSTRESC, VSDTC
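The required-field check can be sketched in a few lines. This is a minimal illustration rather than the skill's own validation code: `REQUIRED_FIELDS` and `find_missing_fields` are hypothetical names, and the input is assumed to be a plain list of column names.

```python
# Required variables per domain, copied from the list above.
REQUIRED_FIELDS = {
    "DM": ["STUDYID", "USUBJID", "SUBJID", "RFSTDTC", "RFENDTC",
           "SITEID", "AGE", "SEX", "RACE"],
    "LB": ["STUDYID", "USUBJID", "LBTESTCD", "LBCAT", "LBORRES",
           "LBORRESU", "LBSTRESC", "LBDTC"],
    "VS": ["STUDYID", "USUBJID", "VSTESTCD", "VSORRES", "VSORRESU",
           "VSSTRESC", "VSDTC"],
}


def find_missing_fields(columns, domain):
    """Return (is_valid, missing) for a dataset's column names."""
    if domain not in REQUIRED_FIELDS:
        raise ValueError(f"Unsupported domain: {domain!r}")
    present = set(columns)
    # Keep the order of the required list so reports are stable.
    missing = [name for name in REQUIRED_FIELDS[domain] if name not in present]
    return (not missing, missing)


vs_columns = ["STUDYID", "USUBJID", "VSTESTCD", "VSORRES", "VSDTC"]
is_valid, missing = find_missing_fields(vs_columns, "VS")
print(is_valid, missing)  # False ['VSORRESU', 'VSSTRESC']
```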
2. Missing Value Handling
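```python
cleaner = ClinicalDataCleaner(
    domain="DM",
    missing_strategy="median",  # mean, median, mode, forward, drop
)
cleaned = cleaner.handle_missing_values(data)
```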
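To make the five strategy names used in this document (mean, median, mode, forward, drop) concrete, here is a rough standard-library sketch of what each could do to a single numeric column. The helper `fill_missing` is hypothetical, missing values are assumed to be represented as `None`, and the skill itself may behave differently.

```python
from statistics import mean, median, mode

_REDUCERS = {"mean": mean, "median": median, "mode": mode}


def fill_missing(values, strategy="median"):
    """Return a new list with None entries handled according to `strategy`."""
    observed = [v for v in values if v is not None]
    if strategy == "drop":
        return observed
    if strategy == "forward":
        filled, last = [], None
        for v in values:
            if v is not None:
                last = v
            filled.append(last)  # leading gaps stay None: nothing to carry yet
        return filled
    if strategy not in _REDUCERS:
        raise ValueError(f"Unknown strategy: {strategy!r}")
    if not observed:
        return list(values)  # nothing observed, so nothing to impute from
    fill = _REDUCERS[strategy](observed)
    return [fill if v is None else v for v in values]


ages = [34, None, 51, 47, None, 62]
print(fill_missing(ages, "median"))   # [34, 49.0, 51, 47, 49.0, 62]
print(fill_missing(ages, "forward"))  # [34, 34, 51, 47, 47, 62]
print(fill_missing(ages, "drop"))     # [34, 51, 47, 62]
```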
3. Outlier Detection
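```python
cleaner = ClinicalDataCleaner(
    domain="LB",
    outlier_method="domain",  # iqr, zscore, domain
    outlier_action="flag",    # flag, remove, cap
)
flagged = cleaner.detect_outliers(data)
```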
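For the two statistical methods named in this document (`iqr` and `zscore`), the following sketch shows one common formulation of each: Tukey's 1.5 x IQR fences and a sample z-score cutoff. The function names, the multiplier `k`, and the threshold of 3.0 are assumptions chosen for illustration, not settings confirmed by the skill.

```python
from statistics import mean, quantiles, stdev


def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences)."""
    q1, _q2, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v < low or v > high for v in values]


def zscore_outliers(values, threshold=3.0):
    """Flag values whose absolute sample z-score exceeds `threshold`."""
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return [False] * len(values)
    return [abs(v - mu) / sigma > threshold for v in values]


glucose = [88, 92, 95, 97, 101, 104, 110, 410]
print(iqr_outliers(glucose))     # only the last value (410) is flagged
print(zscore_outliers(glucose))  # nothing is flagged at threshold 3.0
```

The z-score result is worth noting: with only eight values, a sample z-score cannot exceed (n-1)/sqrt(n), about 2.47, so a cutoff of 3.0 can never fire. The IQR fences are less sensitive to this masking effect in small samples.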
Clinical Thresholds:
| Parameter | Range | Unit |
|---|---|---|
| Glucose | 50-500 | mg/dL |
| Hemoglobin | 5-20 | g/dL |
| Systolic BP | 70-220 | mmHg |
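The thresholds in the table can be combined with the three outlier actions (flag, remove, cap) roughly as follows. This is a sketch under stated assumptions: the record layout with `param` and `value` keys and the helper `apply_domain_thresholds` are hypothetical, and values are assumed to be already expressed in the units shown in the table.

```python
# Plausible ranges copied from the table above: (low, high, unit).
CLINICAL_RANGES = {
    "Glucose": (50, 500, "mg/dL"),
    "Hemoglobin": (5, 20, "g/dL"),
    "Systolic BP": (70, 220, "mmHg"),
}


def apply_domain_thresholds(records, action="flag"):
    """Flag, remove, or cap values that fall outside the clinical range."""
    if action not in ("flag", "remove", "cap"):
        raise ValueError(f"Unknown action: {action!r}")
    result = []
    for rec in records:
        low, high, _unit = CLINICAL_RANGES[rec["param"]]
        value = rec["value"]
        is_outlier = not (low <= value <= high)
        if action == "flag":
            result.append({**rec, "outlier": is_outlier})
        elif action == "remove":
            if not is_outlier:
                result.append(dict(rec))
        else:  # cap: clamp the value to the nearest bound
            result.append({**rec, "value": min(max(value, low), high)})
    return result


labs = [
    {"param": "Glucose", "value": 95},
    {"param": "Glucose", "value": 720},
    {"param": "Hemoglobin", "value": 3.1},
]
print([r["outlier"] for r in apply_domain_thresholds(labs, "flag")])  # [False, True, True]
print([r["value"] for r in apply_domain_thresholds(labs, "cap")])     # [95, 500, 5]
print(len(apply_domain_thresholds(labs, "remove")))                   # 1
```

Flagging keeps every original value in place, which is usually easier to defend in an audit trail than removing or capping.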
4. Date Standardization
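```python
standardized = cleaner.standardize_dates(data)
# Converts to ISO 8601 format: 2023-01-15T09:30:00
```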
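One way to normalize mixed date strings to ISO 8601 is to try a list of known input formats in order. The sketch below assumes a small, hypothetical set of formats (including US-style month/day/year) and a hypothetical helper `to_iso8601`. Date-only inputs are returned without an invented time component.

```python
from datetime import datetime

# Candidate input formats, tried in order. This list is an assumption;
# extend it to match the formats actually present in the raw data.
CANDIDATE_FORMATS = [
    "%Y-%m-%dT%H:%M:%S",
    "%Y-%m-%d %H:%M:%S",
    "%Y-%m-%d %H:%M",
    "%m/%d/%Y %H:%M",  # assumes US month/day order
    "%Y-%m-%d",
    "%m/%d/%Y",
]


def to_iso8601(raw):
    """Return an ISO 8601 string, or None if no candidate format matches."""
    text = raw.strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            parsed = datetime.strptime(text, fmt)
        except ValueError:
            continue
        if "%H" not in fmt:
            return parsed.date().isoformat()  # date only: do not invent a time
        return parsed.isoformat()
    return None


print(to_iso8601("01/15/2023 09:30"))  # 2023-01-15T09:30:00
print(to_iso8601("2023-01-15"))        # 2023-01-15
print(to_iso8601("not a date"))        # None
```

Returning `None` for unparseable input, instead of raising, lets a caller collect every failure for review before deciding how to handle it.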
5. Complete Pipeline
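```python
cleaner = ClinicalDataCleaner(
    domain="DM",
    missing_strategy="median",
    outlier_method="iqr",
    outlier_action="flag",
)
cleaned_data = cleaner.clean(data)
cleaner.save_report("output.csv")
```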
Output Files:
- output.csv - Cleaned SDTM data
- output.report.json - Audit trail for regulatory submission
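The audit trail's schema is not documented here, so the following is only a guess at what such a report might contain: one entry per cleaning action, serialized as JSON next to the cleaned CSV. The field names (`step`, `column`, `rows_affected`) and both helper functions are hypothetical.

```python
import json
from pathlib import Path


def report_path_for(output_csv):
    """Derive the report path from the CSV path: output.csv -> output.report.json."""
    return Path(output_csv).with_suffix(".report.json")


def build_report(domain, actions):
    """Assemble a JSON-serializable audit record (hypothetical schema)."""
    return {
        "domain": domain,
        "n_actions": len(actions),
        "actions": actions,
    }


actions = [
    {"step": "missing_values", "column": "AGE",
     "strategy": "median", "rows_affected": 2},
    {"step": "outliers", "column": "AGE",
     "method": "iqr", "action": "flag", "rows_affected": 1},
]
report = build_report("DM", actions)
print(report_path_for("output.csv"))  # output.report.json
print(json.dumps(report, indent=2))
```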
CLI Usage
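```bash
# Clean demographics data
python scripts/main.py \
  --input dm_raw.csv \
  --domain DM \
  --output dm_clean.csv \
  --missing-strategy median \
  --outlier-method iqr \
  --outlier-action flag

# Clean laboratory data using clinical thresholds
python scripts/main.py \
  --input lb_raw.csv \
  --domain LB \
  --output lb_clean.csv \
  --outlier-method domain
```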
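The command-line flags used in this document could be declared with `argparse` along these lines. The flag names and allowed values come from the examples in this document. The defaults and the `build_parser` helper are assumptions, and the real `scripts/main.py` may be organized differently.

```python
import argparse


def build_parser():
    """Declare the CLI flags shown in this document (defaults are assumed)."""
    parser = argparse.ArgumentParser(description="Clean SDTM domain data.")
    parser.add_argument("--input", required=True, help="Raw input CSV")
    parser.add_argument("--domain", required=True, choices=["DM", "LB", "VS"])
    parser.add_argument("--output", required=True, help="Cleaned output CSV")
    parser.add_argument("--missing-strategy", default="median",
                        choices=["mean", "median", "mode", "forward", "drop"])
    parser.add_argument("--outlier-method", default="iqr",
                        choices=["iqr", "zscore", "domain"])
    parser.add_argument("--outlier-action", default="flag",
                        choices=["flag", "remove", "cap"])
    return parser


args = build_parser().parse_args(
    ["--input", "lb_raw.csv", "--domain", "LB",
     "--output", "lb_clean.csv", "--outlier-method", "domain"]
)
print(args.domain, args.missing_strategy, args.outlier_method, args.outlier_action)
# LB median domain flag
```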
Common Patterns
See references/common-patterns.md for detailed examples:
- Regulatory Submission Preparation
- Interim Analysis Data Preparation
- Database Migration Cleanup
- External Lab Data Integration
Troubleshooting
See references/troubleshooting.md for solutions to:
- Validation failures
- Date parsing errors
- Memory errors with large datasets
- Outlier detection issues
Quality Checklist
Pre-Cleaning:
- [ ] IACUC approval obtained (animal studies)
- [ ] Sample size adequately powered
- [ ] Randomization method documented
Post-Cleaning:
- [ ] Validate against CDISC SDTM IG
- [ ] Review all cleaning actions in audit trail
- [ ] Test import to analysis software
References
- references/sdtm_ig_guide.md - CDISC SDTM Implementation Guide
- references/domainspecs.json - Domain-specific field requirements
- references/outlierthresholds.json - Clinical outlier thresholds
- references/common-patterns.md - Detailed usage patterns
- references/troubleshooting.md - Problem-solving guide
Skill ID: 189 | Version: 2.0 | License: MIT
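Clinical Data Cleaner
Clean, validate, and standardize clinical trial data to meet CDISC SDTM standards for regulatory submissions to FDA or EMA.
Quick Start
```python
from scripts.main import ClinicalDataCleaner

# Initialize for the Demographics (DM) domain
cleaner = ClinicalDataCleaner(domain="DM")

# Clean with default settings (raw_data is the raw DM dataset, loaded beforehand)
cleaned = cleaner.clean(raw_data)

# Save the cleaned data together with the audit trail
cleaner.save_report("output.csv")
```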
Core Capabilities
1. SDTM Domain Validation
```python
cleaner = ClinicalDataCleaner(domain="DM")  # or "LB", "VS"
is_valid, missing = cleaner.validate_domain(data)
```
Required Fields:
- DM: STUDYID, USUBJID, SUBJID, RFSTDTC, RFENDTC, SITEID, AGE, SEX, RACE
- LB: STUDYID, USUBJID, LBTESTCD, LBCAT, LBORRES, LBORRESU, LBSTRESC, LBDTC
- VS: STUDYID, USUBJID, VSTESTCD, VSORRES, VSORRESU, VSSTRESC, VSDTC
2. Missing Value Handling
```python
cleaner = ClinicalDataCleaner(
    domain="DM",
    missing_strategy="median",  # mean, median, mode, forward, drop
)
cleaned = cleaner.handle_missing_values(data)
```
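3. Outlier Detection
```python
cleaner = ClinicalDataCleaner(
    domain="LB",
    outlier_method="domain",  # iqr, zscore, domain
    outlier_action="flag",    # flag, remove, cap
)
flagged = cleaner.detect_outliers(data)
```
Clinical Thresholds:
| Parameter | Range | Unit |
|---|---|---|
| Glucose | 50-500 | mg/dL |
| Hemoglobin | 5-20 | g/dL |
| Systolic BP | 70-220 | mmHg |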
4. Date Standardization
```python
standardized = cleaner.standardize_dates(data)
# Converts to ISO 8601 format: 2023-01-15T09:30:00
```
5. Complete Pipeline
```python
cleaner = ClinicalDataCleaner(
    domain="DM",
    missing_strategy="median",
    outlier_method="iqr",
    outlier_action="flag",
)
cleaned_data = cleaner.clean(data)
cleaner.save_report("output.csv")
```
Output Files:
- output.csv - Cleaned SDTM data
- output.report.json - Audit trail for regulatory submission
CLI Usage
```bash
# Clean demographics data
python scripts/main.py \
  --input dm_raw.csv \
  --domain DM \
  --output dm_clean.csv \
  --missing-strategy median \
  --outlier-method iqr \
  --outlier-action flag

# Clean laboratory data using clinical thresholds
python scripts/main.py \
  --input lb_raw.csv \
  --domain LB \
  --output lb_clean.csv \
  --outlier-method domain
```
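Common Patterns
See references/common-patterns.md for detailed examples:
- Regulatory Submission Preparation
- Interim Analysis Data Preparation
- Database Migration Cleanup
- External Lab Data Integration
Troubleshooting
See references/troubleshooting.md for solutions to:
- Validation failures
- Date parsing errors
- Memory errors with large datasets
- Outlier detection issues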
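Quality Checklist
Pre-Cleaning:
- [ ] IACUC approval obtained (animal studies)
- [ ] Sample size adequately powered
- [ ] Randomization method documented
Post-Cleaning:
- [ ] Validate against CDISC SDTM IG
- [ ] Review all cleaning actions in audit trail
- [ ] Test import to analysis software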
References
- references/sdtm_ig_guide.md - CDISC SDTM Implementation Guide
- references/domainspecs.json - Domain-specific field requirements
- references/outlierthresholds.json - Clinical outlier thresholds
- references/common-patterns.md - Detailed usage patterns
- references/troubleshooting.md - Problem-solving guide
Skill ID: 189 | Version: 2.0 | License: MIT