HIPAA Compliance Auditor
A clinical-grade PII/PHI detection and de-identification tool for healthcare text data.
Overview
This skill analyzes text for HIPAA-protected identifiers and automatically redacts or anonymizes them. It combines regex patterns, NLP entity recognition, and contextual analysis to identify the 18 HIPAA identifier categories.
Features
- 18 HIPAA Identifiers Detection: Names, dates, SSN, MRN, phone/fax, email, geographic data, etc.
- Automatic De-identification: Replace PII with semantic tokens (e.g., [PATIENT_NAME], [DATE_1])
- Context-Aware Detection: Distinguishes between similar patterns (dates vs. lab values)
- Audit Logging: Track all redaction actions for compliance documentation
- Confidence Scoring: Flag uncertain detections for manual review
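The confidence-scoring feature can be pictured as a small triage step. The sketch below is illustrative only: the `Detection` dataclass and the `triage` function are hypothetical names, not the tool's API, and the policy (redact at or above the threshold, route everything else to manual review) is an assumption. Only the 0.7 default comes from the parameter table.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """Hypothetical record for one detected identifier."""
    category: str      # e.g. "PATIENT_NAME"
    start: int         # character offset where the match begins
    end: int           # character offset just past the match
    confidence: float  # detector score in the range 0.0-1.0


def triage(detections, threshold=0.7):
    """Split detections into auto-redact and manual-review lists (assumed policy)."""
    accepted = [d for d in detections if d.confidence >= threshold]
    review = [d for d in detections if d.confidence < threshold]
    return accepted, review


found = [
    Detection("PATIENT_NAME", 8, 18, 0.95),
    Detection("DATE", 31, 41, 0.55),  # ambiguous: could be a lab value
]
accepted, review = triage(found)
print([d.category for d in accepted])
print([d.category for d in review])
```

Keeping low-confidence hits in a separate list, rather than silently dropping them, matches the stated goal of flagging uncertain detections for a human reviewer.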
Usage
Command Line
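```bash
python scripts/main.py --input patient_text.txt --output deidentified.txt
python scripts/main.py --text "Patient Zhang San, SSN 123-45-6789..." --audit-log audit.json
```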
Python API
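```python
from scripts.main import HIPAAAuditor

auditor = HIPAAAuditor()
result = auditor.deidentify("Patient Zhang San was admitted on 2024-01-15...")
print(result.cleaned_text)  # De-identified output
print(result.detected_pii)  # List of detected PII entities
```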
Parameters
| Parameter | Type | Default | Required | Description |
|---|---|---|---|---|
| `--input`, `-i` | string | - | No | Path to input text file |
| `--text` | string | - | No | Direct text input (alternative to file) |
| `--output`, `-o` | string | - | No | Path for de-identified output file |
| `--audit-log` | string | - | No | Path for JSON audit log |
| `--confidence` | float | 0.7 | No | Minimum confidence threshold (0.0-1.0) |
| `--preserve-structure` | bool | true | No | Maintain document structure |
| `--custom-patterns` | string | - | No | Path to custom regex patterns JSON |
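As a rough guide to how these flags fit together, here is a minimal `argparse` sketch. It is not taken from `scripts/main.py`; the `build_parser` and `str_to_bool` helpers are hypothetical, and how the real script parses `--preserve-structure` is an assumption.

```python
import argparse


def str_to_bool(value: str) -> bool:
    """Parse 'true'/'false' style strings (assumed format for --preserve-structure)."""
    lowered = value.lower()
    if lowered in {"true", "1", "yes"}:
        return True
    if lowered in {"false", "0", "no"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    """Mirror the parameter table; every flag is optional, as documented."""
    parser = argparse.ArgumentParser(description="De-identify clinical text")
    parser.add_argument("--input", "-i", help="Path to input text file")
    parser.add_argument("--text", help="Direct text input (alternative to file)")
    parser.add_argument("--output", "-o", help="Path for de-identified output file")
    parser.add_argument("--audit-log", help="Path for JSON audit log")
    parser.add_argument("--confidence", type=float, default=0.7,
                        help="Minimum confidence threshold (0.0-1.0)")
    parser.add_argument("--preserve-structure", type=str_to_bool, default=True,
                        help="Maintain document structure")
    parser.add_argument("--custom-patterns", help="Path to custom regex patterns JSON")
    return parser


args = build_parser().parse_args(["--text", "SSN 123-45-6789", "--confidence", "0.8"])
print(args.confidence, args.preserve_structure, args.audit_log)
```

The table marks both `--input` and `--text` as optional; in practice one of them presumably has to be supplied, which a real parser could enforce with a mutually exclusive group.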
HIPAA Identifier Categories Detected
1. Names (patient, relatives, employers)
2. Geographic subdivisions smaller than state
3. Dates (except year) related to an individual
4. Phone numbers
5. Fax numbers
6. Email addresses
7. SSN
8. Medical record numbers
9. Health plan beneficiary numbers
10. Account numbers
11. Certificate/license numbers
12. Vehicle identifiers
13. Device identifiers
14. URLs
15. IP addresses
16. Biometric identifiers
17. Full-face photos
18. Any other unique identifying numbers
Output Format
De-identified Text
Original identifiers are replaced with semantic tags:
- `[PATIENT_NAME_1]`, `[PATIENT_NAME_2]` ...
- `[DATE_1]`, `[DATE_2]` ...
- `[SSN_1]`
- `[PHONE_1]`, `[PHONE_2]` ...
- `[EMAIL_1]`
- `[MRN_1]` (Medical Record Number)
- `[ADDRESS_1]`
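One way to produce numbered tags like these is a per-category counter. The `TokenAllocator` class below is a hypothetical sketch, not the tool's replacement engine; in particular, reusing the same tag when a value repeats is an assumption about the intended behavior.

```python
from collections import defaultdict


class TokenAllocator:
    """Hand out numbered tags such as [DATE_1], [DATE_2] per category."""

    def __init__(self):
        self._seen = {}                    # (category, value) -> tag
        self._counters = defaultdict(int)  # category -> last index used

    def token_for(self, category: str, value: str) -> str:
        key = (category, value)
        if key not in self._seen:
            self._counters[category] += 1
            self._seen[key] = f"[{category}_{self._counters[category]}]"
        return self._seen[key]


alloc = TokenAllocator()
tokens = [
    alloc.token_for("DATE", "2024-01-15"),
    alloc.token_for("DATE", "2024-02-01"),
    alloc.token_for("DATE", "2024-01-15"),  # repeated value reuses its tag
    alloc.token_for("PHONE", "555-0100"),
]
print(tokens)
```

Reusing a tag for a repeated value keeps co-references readable in the output (the same admission date stays recognizably the same date) without revealing the value itself.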
Audit Log JSON
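```json
{
  "timestamp": "2024-01-15T10:30:00Z",
  "input_hash": "sha256:abc123...",
  "detections": [
    {
      "type": "PATIENT_NAME",
      "position": [10, 18],
      "confidence": 0.95,
      "replacement": "[PATIENT_NAME_1]",
      "original_length": 8
    }
  ],
  "statistics": {
    "total_pii_found": 5,
    "categories_detected": ["NAME", "DATE", "PHONE", "SSN"]
  }
}
```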
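A log with this shape could be assembled roughly as follows. This is a sketch under assumptions: `build_audit_log` is a hypothetical helper, and the field names follow the audit log format documented in this section rather than any confirmed implementation.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_audit_log(original_text, detections):
    """Assemble an audit record that never stores the original identifiers."""
    digest = hashlib.sha256(original_text.encode("utf-8")).hexdigest()
    categories = sorted({d["type"] for d in detections})
    return {
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "input_hash": f"sha256:{digest}",
        "detections": detections,
        "statistics": {
            "total_pii_found": len(detections),
            "categories_detected": categories,
        },
    }


text = "Patient John Doe was admitted."
detections = [{
    "type": "PATIENT_NAME",
    "position": [8, 16],          # character offsets of "John Doe"
    "confidence": 0.95,
    "replacement": "[PATIENT_NAME_1]",
    "original_length": 8,
}]
log = build_audit_log(text, detections)
print(json.dumps(log, indent=2))
```

Recording only a hash of the input plus positions and lengths lets an auditor verify which document was processed and what was redacted, while the log itself stays free of PHI.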
Technical Architecture
1. Preprocessing: Normalize text encoding, handle line breaks
2. Regex Engine: Pattern matching for structured identifiers (SSN, phone, email, MRN)
3. NLP Pipeline: spaCy NER for names, organizations, locations
4. Context Filter: Remove false positives (e.g., "Dr. Smith" vs. "smith fracture")
5. Replacement Engine: Sequential replacement with semantic tokens
6. Validation: Ensure no original PII remains in output
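The regex, replacement, and validation stages above can be sketched in a few lines of standard-library Python. The patterns and function names here are hypothetical and deliberately simplistic; the NLP and context-filter stages are omitted, and the real pattern definitions are assumed to live in the patterns file listed under References.

```python
import re

# Illustrative patterns only; production patterns would be far more thorough.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
}


def find_spans(text):
    """Regex stage: collect (start, end, category) for every match."""
    spans = []
    for category, pattern in PATTERNS.items():
        spans.extend((m.start(), m.end(), category) for m in pattern.finditer(text))
    return sorted(spans)


def deidentify(text):
    """Replacement stage: walk spans left to right, numbering tags per category."""
    out, cursor, counters = [], 0, {}
    for start, end, category in find_spans(text):
        if start < cursor:  # skip matches overlapping an earlier replacement
            continue
        counters[category] = counters.get(category, 0) + 1
        out.append(text[cursor:start])
        out.append(f"[{category}_{counters[category]}]")
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


def validate(cleaned):
    """Validation stage: True only if no pattern still matches the output."""
    return not any(p.search(cleaned) for p in PATTERNS.values())


sample = "SSN 123-45-6789, call 555-867-5309 or mail jdoe@example.org."
cleaned = deidentify(sample)
print(cleaned)
print(validate(cleaned))
```

Building the output from slices in a single left-to-right pass avoids the offset drift that occurs when replacements of different lengths are applied to the string one at a time.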
Dependencies
- Python 3.9+
- spaCy (`en_core_web_trf` or `en_core_web_lg`)
- regex (for advanced pattern matching)
- Presidio (optional, for enhanced PII detection)
See references/requirements.txt for the full dependency list.
Limitations & Warnings
⚠️ CRITICAL: This tool is designed as a helper, not a replacement for human review.
- Context-dependent PII (e.g., rare disease names + location) may not be fully detected
- Unstructured narrative text may contain identifying information not caught by patterns
- Always perform manual QA on output before HIPAA-compliant release
- AI Autonomous Acceptance Status: Requires Manual Review
References
- `references/hipaa_safe_harbor_guide.pdf` - HIPAA Safe Harbor de-identification standards
- `references/pii_patterns.json` - Complete regex pattern definitions
- `references/test_cases/` - Sample clinical texts with expected outputs
- `references/requirements.txt` - Python dependencies
Technical Difficulty: High
Complex NLP pipelines, contextual disambiguation, regulatory compliance requirements.
Risk Assessment
| Risk Indicator | Assessment | Level |
|---|---|---|
| Code Execution | Python/R scripts executed locally | Medium |
| Network Access | No external API calls | Low |
| File System Access | Read input files, write output files | Medium |
| Instruction Tampering | Standard prompt guidelines | Low |
| Data Exposure | Output files saved to workspace | Low |
Security Checklist
- [ ] No hardcoded credentials or API keys
- [ ] No unauthorized file system access (../)
- [ ] Output does not expose sensitive information
- [ ] Prompt injection protections in place
- [ ] Input file paths validated (no ../ traversal)
- [ ] Output directory restricted to workspace
- [ ] Script execution in sandboxed environment
- [ ] Error messages sanitized (no stack traces exposed)
- [ ] Dependencies audited
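The path-related items in this checklist (no `../` traversal, output restricted to the workspace) can be checked with a small helper along these lines. `resolve_inside` is a hypothetical function, not part of the tool, and the `workspace` directory name is an assumption.

```python
from pathlib import Path


def resolve_inside(workspace, user_path):
    """Resolve user_path and reject anything that lands outside workspace."""
    root = Path(workspace).resolve()
    candidate = (root / user_path).resolve()  # collapses any ../ segments
    if candidate != root and root not in candidate.parents:
        raise ValueError(f"path escapes workspace: {user_path}")
    return candidate


safe = resolve_inside("workspace", "out/deidentified.txt")
print(safe.name)

try:
    resolve_inside("workspace", "../secrets.txt")
    escaped = True
except ValueError:
    escaped = False
print(escaped)
```

Comparing fully resolved paths, rather than searching the raw string for `../`, also catches absolute paths and symlinked directories that would otherwise slip past a textual check.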
Prerequisites
```bash
# Python dependencies
pip install -r requirements.txt
```
Evaluation Criteria
Success Metrics
- [ ] Successfully executes main functionality
- [ ] Output meets quality standards
- [ ] Handles edge cases gracefully
- [ ] Performance is acceptable
Test Cases
1. Basic Functionality: Standard input → Expected output
2. Edge Case: Invalid input → Graceful error handling
3. Performance: Large dataset → Acceptable processing time
Lifecycle Status
- Current Stage: Draft
- Next Review Date: 2026-03-06
- Known Issues: None
- Planned Improvements:
  - Performance optimization
  - Additional feature support
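HIPAA Compliance Auditor
A clinical-grade PII/PHI detection and de-identification tool for healthcare text data.
Overview
This skill analyzes text for HIPAA-protected identifiers and automatically redacts or anonymizes them. It combines regex patterns, NLP entity recognition, and contextual analysis to identify the 18 HIPAA identifier categories.
Features
- 18 HIPAA Identifiers Detection: Names, dates, SSN, MRN, phone/fax, email, geographic data, etc.
- Automatic De-identification: Replace PII with semantic tokens (e.g., [PATIENT_NAME], [DATE_1])
- Context-Aware Detection: Distinguishes between similar patterns (dates vs. lab values)
- Audit Logging: Track all redaction actions for compliance documentation
- Confidence Scoring: Flag uncertain detections for manual review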
Usage
Command Line
```bash
python scripts/main.py --input patient_text.txt --output deidentified.txt
python scripts/main.py --text "Patient Zhang San, SSN 123-45-6789..." --audit-log audit.json
```
Python API
```python
from scripts.main import HIPAAAuditor

auditor = HIPAAAuditor()
result = auditor.deidentify("Patient Zhang San was admitted on 2024-01-15...")
print(result.cleaned_text)  # De-identified output
print(result.detected_pii)  # List of detected PII entities
```
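Parameters
| Parameter | Type | Default | Required | Description |
|---|---|---|---|---|
| `--input`, `-i` | string | - | No | Path to input text file |
| `--text` | string | - | No | Direct text input (alternative to file) |
| `--output`, `-o` | string | - | No | Path for de-identified output file |
| `--audit-log` | string | - | No | Path for JSON audit log |
| `--confidence` | float | 0.7 | No | Minimum confidence threshold (0.0-1.0) |
| `--preserve-structure` | bool | true | No | Maintain document structure |
| `--custom-patterns` | string | - | No | Path to custom regex patterns JSON |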
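HIPAA Identifier Categories Detected
1. Names (patient, relatives, employers)
2. Geographic subdivisions smaller than state
3. Dates (except year) related to an individual
4. Phone numbers
5. Fax numbers
6. Email addresses
7. SSN
8. Medical record numbers
9. Health plan beneficiary numbers
10. Account numbers
11. Certificate/license numbers
12. Vehicle identifiers
13. Device identifiers
14. URLs
15. IP addresses
16. Biometric identifiers
17. Full-face photos
18. Any other unique identifying numbers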
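Output Format
De-identified Text
Original identifiers are replaced with semantic tags:
- `[PATIENT_NAME_1]`, `[PATIENT_NAME_2]` ...
- `[DATE_1]`, `[DATE_2]` ...
- `[SSN_1]`
- `[PHONE_1]`, `[PHONE_2]` ...
- `[EMAIL_1]`
- `[MRN_1]` (Medical Record Number)
- `[ADDRESS_1]`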
Audit Log JSON
```json
{
  "timestamp": "2024-01-15T10:30:00Z",
  "input_hash": "sha256:abc123...",
  "detections": [
    {
      "type": "PATIENT_NAME",
      "position": [10, 18],
      "confidence": 0.95,
      "replacement": "[PATIENT_NAME_1]",
      "original_length": 8
    }
  ],
  "statistics": {
    "total_pii_found": 5,
    "categories_detected": ["NAME", "DATE", "PHONE", "SSN"]
  }
}
```
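Technical Architecture
1. Preprocessing: Normalize text encoding, handle line breaks
2. Regex Engine: Pattern matching for structured identifiers (SSN, phone, email, MRN)
3. NLP Pipeline: spaCy NER for names, organizations, locations
4. Context Filter: Remove false positives (e.g., "Dr. Zhang" vs. "Zhang's fracture")
5. Replacement Engine: Sequential replacement with semantic tokens
6. Validation: Ensure no original PII remains in output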
Dependencies
- Python 3.9+
- spaCy (`en_core_web_trf` or `en_core_web_lg`)
- regex (for advanced pattern matching)
- Presidio (optional, for enhanced PII detection)
See references/requirements.txt for the full dependency list.
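Limitations & Warnings
⚠️ IMPORTANT: This tool is designed as a helper, not a replacement for human review.
- Context-dependent PII (e.g., rare disease names + location) may not be fully detected
- Unstructured narrative text may contain identifying information not caught by patterns
- Always perform manual QA on output before HIPAA-compliant release
- AI Autonomous Acceptance Status: Requires Manual Review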
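References
- `references/hipaa_safe_harbor_guide.pdf` - HIPAA Safe Harbor de-identification standards
- `references/pii_patterns.json` - Complete regex pattern definitions
- `references/test_cases/` - Sample clinical texts with expected outputs
- `references/requirements.txt` - Python dependencies
Technical Difficulty: High
Complex NLP pipelines, contextual disambiguation, regulatory compliance requirements.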
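Risk Assessment
| Risk Indicator | Assessment | Level |
|---|---|---|
| Code Execution | Python/R scripts executed locally | Medium |
| Network Access | No external API calls | Low |
| File System Access | Read input files, write output files | Medium |
| Instruction Tampering | Standard prompt guidelines | Low |
| Data Exposure | Output files saved to workspace | Low |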
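Security Checklist
- [ ] No hardcoded credentials or API keys
- [ ] No unauthorized file system access (../)
- [ ] Output does not expose sensitive information
- [ ] Prompt injection protections in place
- [ ] Input file paths validated (no ../ traversal)
- [ ] Output directory restricted to workspace
- [ ] Script execution in sandboxed environment
- [ ] Error messages sanitized (no stack traces exposed)
- [ ] Dependencies audited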
Prerequisites
```bash
# Python dependencies
pip install -r requirements.txt
```
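Evaluation Criteria
Success Metrics
- [ ] Successfully executes main functionality
- [ ] Output meets quality standards
- [ ] Handles edge cases gracefully
- [ ] Performance is acceptable
Test Cases
1. Basic Functionality: Standard input → Expected output
2. Edge Case: Invalid input → Graceful error handling
3. Performance: Large dataset → Acceptable processing time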
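Lifecycle Status
- Current Stage: Draft
- Next Review Date: 2026-03-06
- Known Issues: None
- Planned Improvements:
  - Performance optimization
  - Additional feature support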