HIPAA Compliance Auditor

A clinical-grade PII/PHI detection and de-identification tool for healthcare text data.

Overview

This skill analyzes text for HIPAA-protected identifiers and automatically redacts or anonymizes them. It combines regex patterns, NLP entity recognition, and contextual analysis to identify the 18 HIPAA identifier categories.

Features

- 18 HIPAA Identifiers Detection: Names, dates, SSN, MRN, phone/fax, email, geographic data, etc.
- Automatic De-identification: Replaces PII with semantic tokens (e.g., [PATIENT_NAME], [DATE_1])
- Context-Aware Detection: Distinguishes between similar patterns (dates vs. lab values)
- Audit Logging: Tracks all redaction actions for compliance documentation
- Confidence Scoring: Flags uncertain detections for manual review
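
The context-aware detection and confidence scoring features could look roughly like the sketch below. It is not the tool's actual implementation: the `Detection` class, both regexes, the confidence values 0.9 and 0.3, and the helper names are hypothetical. Only the 0.7 review threshold mirrors the documented `--confidence` default.

```python
import re
from dataclasses import dataclass


@dataclass
class Detection:
    """Hypothetical record for one candidate identifier."""
    text: str
    start: int
    end: int
    confidence: float


# Illustrative patterns only; the real pattern set is not shown in this README.
DATE_LIKE = re.compile(r"\b\d{1,2}/\d{1,2}(?:/\d{2,4})?\b")
LAB_CONTEXT = re.compile(r"\b(?:BP|blood pressure|ratio|mmHg)\b", re.IGNORECASE)


def score_date_candidates(text, window=10):
    """Score date-like matches, lowering confidence near lab-value vocabulary."""
    detections = []
    for match in DATE_LIKE.finditer(text):
        context = text[max(0, match.start() - window): match.end() + window]
        confidence = 0.3 if LAB_CONTEXT.search(context) else 0.9
        detections.append(
            Detection(match.group(), match.start(), match.end(), confidence)
        )
    return detections


def split_by_threshold(detections, threshold=0.7):
    """Separate detections to redact from those flagged for manual review."""
    accepted = [d for d in detections if d.confidence >= threshold]
    review = [d for d in detections if d.confidence < threshold]
    return accepted, review


note = "Admitted 01/15/2024 for chest pain. Later, BP 98/60 mmHg."
accepted, review = split_by_threshold(score_date_candidates(note))
print([d.text for d in accepted], [d.text for d in review])
```

Here the blood pressure reading looks like a date to the regex, so the surrounding words decide whether it is redacted or routed to a reviewer.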

Usage

Command Line

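```bash
python scripts/main.py --input patient_text.txt --output deidentified.txt
python scripts/main.py --text "Patient Zhang San, SSN 123-45-6789..." --audit-log audit.json
```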

Python API

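```python
from scripts.main import HIPAAAuditor

auditor = HIPAAAuditor()
result = auditor.deidentify("Patient Zhang San was admitted on 2024-01-15...")
print(result.cleaned_text)  # De-identified output
print(result.detected_pii)  # List of detected PII entities
```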

Parameters

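| Parameter | Type | Default | Required | Description |
| --- | --- | --- | --- | --- |
| --input, -i | string | - | No | Path to input text file |
| --text | string | - | No | Direct text input (alternative to file) |
| --output, -o | string | - | No | Path to de-identified output file |
| --audit-log | string | - | No | Path to JSON audit log |
| --confidence | float | 0.7 | No | Minimum confidence threshold (0.0-1.0) |
| --preserve-structure | boolean | true | No | Preserve document structure |
| --custom-patterns | string | - | No | Path to custom regex patterns JSON |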
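
As an illustration of how the parameters documented in this README might be declared, here is a minimal `argparse` sketch. Whether `scripts/main.py` uses `argparse` at all is an assumption, as are the `build_parser` and `confidence_value` helpers and the choice to make `--input` and `--text` mutually exclusive.

```python
import argparse


def confidence_value(raw):
    """Parse a confidence threshold and reject values outside 0.0-1.0."""
    value = float(raw)
    if not 0.0 <= value <= 1.0:
        raise argparse.ArgumentTypeError("confidence must be between 0.0 and 1.0")
    return value


def build_parser():
    """Hypothetical parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(
        description="De-identify HIPAA identifiers in text"
    )
    # Assumption: exactly one input source is given, a file or inline text.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--input", "-i", help="Path to input text file")
    source.add_argument("--text", help="Direct text input (alternative to file)")
    parser.add_argument("--output", "-o", help="Path to de-identified output file")
    parser.add_argument("--audit-log", help="Path to JSON audit log")
    parser.add_argument("--confidence", type=confidence_value, default=0.7,
                        help="Minimum confidence threshold (0.0-1.0)")
    # BooleanOptionalAction (Python 3.9+) also generates --no-preserve-structure.
    parser.add_argument("--preserve-structure",
                        action=argparse.BooleanOptionalAction, default=True,
                        help="Preserve document structure")
    parser.add_argument("--custom-patterns",
                        help="Path to custom regex patterns JSON")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

Validating the threshold in the type function keeps out-of-range values from ever reaching the detection code.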

Output Format

De-identified Text

Original identifiers are replaced with semantic tags:

- [PATIENT_NAME_1], [PATIENT_NAME_2] ...
- [DATE_1], [DATE_2] ...
- [SSN_1]
- [PHONE_1], [PHONE_2] ...
- [EMAIL_1]
- [MRN_1] (Medical Record Number)
- [ADDRESS_1]
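
One way to produce numbered tags like these is sketched below. The `replace_with_tags` function and its span format are hypothetical, and reusing the same tag for a repeated value is an assumption, since this README does not say how repeats are numbered.

```python
def replace_with_tags(text, spans):
    """Replace non-overlapping (start, end, label) spans with numbered tags.

    Assumption: a repeated value of the same type reuses its earlier tag,
    so both mentions of one patient name map to [PATIENT_NAME_1].
    """
    counters = {}   # label -> highest index handed out so far
    assigned = {}   # (label, original text) -> tag
    ordered = sorted(spans)
    tags = []
    for start, end, label in ordered:
        key = (label, text[start:end])
        if key not in assigned:
            counters[label] = counters.get(label, 0) + 1
            assigned[key] = f"[{label}_{counters[label]}]"
        tags.append(assigned[key])

    # Substitute from right to left so earlier offsets stay valid.
    output = text
    for (start, end, _label), tag in reversed(list(zip(ordered, tags))):
        output = output[:start] + tag + output[end:]
    return output


sample = "John Doe called 555-123-4567; John Doe left."
name_a = sample.index("John Doe")
name_b = sample.index("John Doe", name_a + 1)
phone = sample.index("555-123-4567")
print(replace_with_tags(sample, [
    (name_a, name_a + 8, "PATIENT_NAME"),
    (phone, phone + 12, "PHONE"),
    (name_b, name_b + 8, "PATIENT_NAME"),
]))
```

Numbering is assigned in reading order first and applied afterwards, which keeps the tags stable regardless of the order in which detectors report spans.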

Audit Log JSON

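```json
{
  "timestamp": "2024-01-15T10:30:00Z",
  "input_hash": "sha256:abc123...",
  "detections": [
    {
      "type": "PATIENT_NAME",
      "position": [10, 18],
      "confidence": 0.95,
      "replacement": "[PATIENT_NAME_1]",
      "original_length": 8
    }
  ],
  "statistics": {
    "total_pii_found": 5,
    "categories_detected": ["NAME", "DATE", "PHONE", "SSN"]
  }
}
```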
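
A log with the fields documented in this README could be assembled along the following lines. This is a sketch under assumptions: `build_audit_log` is a hypothetical helper, and the input format of `detections` is invented for illustration. The point it shows is that the log carries a hash and lengths, never the original identifier text.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_audit_log(original_text, detections):
    """Build an audit record that stores no original PII.

    `detections` is assumed to be a list of dicts with the keys
    type, position ([start, end]), confidence, and replacement.
    """
    digest = hashlib.sha256(original_text.encode("utf-8")).hexdigest()
    entries = []
    for det in detections:
        start, end = det["position"]
        entries.append({
            "type": det["type"],
            "position": [start, end],
            "confidence": det["confidence"],
            "replacement": det["replacement"],
            # Only the length is kept, not the matched text itself.
            "original_length": end - start,
        })
    return {
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "input_hash": f"sha256:{digest}",
        "detections": entries,
        "statistics": {
            "total_pii_found": len(entries),
            "categories_detected": sorted({e["type"] for e in entries}),
        },
    }


log = build_audit_log(
    "Patient John Doe",
    [{"type": "PATIENT_NAME", "position": [8, 16],
      "confidence": 0.95, "replacement": "[PATIENT_NAME_1]"}],
)
print(json.dumps(log, indent=2))
```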

Technical Architecture

1. Preprocessing: Normalize text encoding, handle line breaks
2. Regex Engine: Pattern matching for structured identifiers (SSN, phone, email, MRN)
3. NLP Pipeline: spaCy NER for names, organizations, locations
4. Context Filter: Remove false positives (e.g., the person "Dr. Smith" vs. the diagnosis "Smith fracture")
5. Replacement Engine: Sequential replacement with semantic tokens
6. Validation: Ensure no original PII remains in output
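
The stages above can be sketched end to end for the structured identifiers alone. This is a simplified illustration, not the tool's code: the three regexes and every function name are hypothetical, the spaCy NER and context-filter stages are left out to keep the example dependency-free, and validation is approximated by re-running the same detectors on the output.

```python
import re
import unicodedata

# Illustrative patterns; the real definitions are not shown in this README.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def preprocess(text):
    """Stage 1: normalize Unicode forms and line endings."""
    normalized = unicodedata.normalize("NFKC", text)
    return normalized.replace("\r\n", "\n").replace("\r", "\n")


def find_structured(text):
    """Stage 2: collect (start, end, label) hits from every pattern."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((match.start(), match.end(), label))
    return sorted(hits)


def deidentify(text):
    """Stages 5 and 6: replace hits in order, then validate the output."""
    clean = preprocess(text)
    counters, pieces, cursor = {}, [], 0
    for start, end, label in find_structured(clean):
        if start < cursor:  # skip a hit that overlaps one already replaced
            continue
        counters[label] = counters.get(label, 0) + 1
        pieces.append(clean[cursor:start])
        pieces.append(f"[{label}_{counters[label]}]")
        cursor = end
    pieces.append(clean[cursor:])
    output = "".join(pieces)
    # Validation: no detector should fire on the de-identified text.
    if find_structured(output):
        raise ValueError("structured identifier still present after replacement")
    return output


print(deidentify("SSN 123-45-6789\r\nCall 555-123-4567 or jdoe@example.org."))
```

Re-scanning the output only proves that these patterns no longer match. It says nothing about identifiers the patterns never covered, which is why manual review stays in the loop.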

Dependencies

- Python 3.9+
- spaCy (en_core_web_trf or en_core_web_lg)
- regex (for advanced pattern matching)
- Presidio (optional, for enhanced PII detection)

See references/requirements.txt for full dependency list.

Limitations & Warnings

⚠️ CRITICAL: This tool is designed as a helper, not a replacement for human review.

- Context-dependent PII (e.g., rare disease names + location) may not be fully detected
- Unstructured narrative text may contain identifying information not caught by patterns
- Always perform manual QA on output before HIPAA-compliant release
- AI Autonomous Acceptance Status: Requires Manual Review

References

- references/hipaa_safe_harbor_guide.pdf - HIPAA Safe Harbor de-identification standards
- references/pii_patterns.json - Complete regex pattern definitions
- references/test_cases/ - Sample clinical texts with expected outputs
- references/requirements.txt - Python dependencies

Technical Difficulty: High

Complex NLP pipelines, contextual disambiguation, regulatory compliance requirements.

Risk Assessment

| Risk Indicator | Assessment | Level |
| --- | --- | --- |
| Code Execution | Python/R scripts executed locally | Medium |
| Network Access | | |

Security Checklist

- [ ] No hardcoded credentials or API keys
- [ ] No unauthorized file system access (../)
- [ ] Output does not expose sensitive information
- [ ] Prompt injection protections in place
- [ ] Input file paths validated (no ../ traversal)
- [ ] Output directory restricted to workspace
- [ ] Script execution in sandboxed environment
- [ ] Error messages sanitized (no stack traces exposed)
- [ ] Dependencies audited
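
The path-related items in this checklist can be enforced with a small guard like the one sketched here. `resolve_in_workspace` is a hypothetical helper, not part of the documented API. It relies on `Path.is_relative_to`, which is available from Python 3.9, the minimum version this README lists.

```python
from pathlib import Path


def resolve_in_workspace(user_path, workspace):
    """Resolve a user-supplied path and reject anything outside the workspace.

    Resolving first collapses "../" segments and symlinks, so the check
    runs against the real target rather than the raw string.
    """
    root = Path(workspace).resolve()
    candidate = (root / user_path).resolve()
    if not candidate.is_relative_to(root):
        # Keep the message generic so it does not echo attacker-controlled input.
        raise ValueError("path is outside the permitted workspace")
    return candidate


if __name__ == "__main__":
    print(resolve_in_workspace("notes/patient_text.txt", "."))
```

An absolute `user_path` replaces the workspace prefix when joined, so it is rejected by the same check unless it already points inside the workspace.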

Prerequisites

```bash
# Python dependencies
pip install -r requirements.txt
```

Evaluation Criteria

Success Metrics

- [ ] Successfully executes main functionality
- [ ] Output meets quality standards
- [ ] Handles edge cases gracefully
- [ ] Performance is acceptable

Test Cases

1. Basic Functionality: Standard input → Expected output
2. Edge Case: Invalid input → Graceful error handling
3. Performance: Large dataset → Acceptable processing time

Lifecycle Status

- Current Stage: Draft
- Next Review Date: 2026-03-06
- Known Issues: None
- Planned Improvements:
  - Performance optimization
  - Additional feature support

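HIPAA Compliance Auditor

A clinical-grade PII/PHI detection and de-identification tool for healthcare text data.

Overview

This skill analyzes text for HIPAA-protected identifiers and automatically redacts or anonymizes them. It combines regex patterns, NLP entity recognition, and contextual analysis to identify the 18 HIPAA identifier categories.

Features

- 18 HIPAA Identifiers Detection: Names, dates, SSN, MRN, phone/fax, email, geographic data, etc.
- Automatic De-identification: Replaces PII with semantic tokens (e.g., [PATIENT_NAME], [DATE_1])
- Context-Aware Detection: Distinguishes between similar patterns (dates vs. lab values)
- Audit Logging: Tracks all redaction actions for compliance documentation
- Confidence Scoring: Flags uncertain detections for manual review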

Usage

Command Line

```bash
python scripts/main.py --input patient_text.txt --output deidentified.txt
python scripts/main.py --text "Patient Zhang San, SSN 123-45-6789..." --audit-log audit.json
```

Python API

```python
from scripts.main import HIPAAAuditor

auditor = HIPAAAuditor()
result = auditor.deidentify("Patient Zhang San was admitted on 2024-01-15...")
print(result.cleaned_text)  # De-identified output
print(result.detected_pii)  # List of detected PII entities
```

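Parameters

| Parameter | Type | Default | Required | Description |
| --- | --- | --- | --- | --- |
| --input, -i | string | - | No | Path to input text file |
| --text | string | - | No | Direct text input (alternative to file) |
| --output, -o | string | - | No | Path to de-identified output file |
| --audit-log | string | - | No | Path to JSON audit log |
| --confidence | float | 0.7 | No | Minimum confidence threshold (0.0-1.0) |
| --preserve-structure | boolean | true | No | Preserve document structure |
| --custom-patterns | string | - | No | Path to custom regex patterns JSON |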

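HIPAA Identifier Categories Detected

1. Names (patients, relatives, employers)
2. Geographic subdivisions smaller than a state
3. Dates related to an individual (except year)
4. Telephone numbers
5. Fax numbers
6. Email addresses
7. Social Security numbers
8. Medical record numbers
9. Health plan beneficiary numbers
10. Account numbers
11. Certificate/license numbers
12. Vehicle identifiers
13. Device identifiers
14. URLs
15. IP addresses
16. Biometric identifiers
17. Full-face photographs
18. Any other unique identifying number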

Output Format

De-identified Text

Original identifiers are replaced with semantic tags:

- [PATIENT_NAME_1], [PATIENT_NAME_2] ...
- [DATE_1], [DATE_2] ...
- [SSN_1]
- [PHONE_1], [PHONE_2] ...
- [EMAIL_1]
- [MRN_1] (Medical Record Number)
- [ADDRESS_1]

Audit Log JSON

```json
{
  "timestamp": "2024-01-15T10:30:00Z",
  "input_hash": "sha256:abc123...",
  "detections": [
    {
      "type": "PATIENT_NAME",
      "position": [10, 18],
      "confidence": 0.95,
      "replacement": "[PATIENT_NAME_1]",
      "original_length": 8
    }
  ],
  "statistics": {
    "total_pii_found": 5,
    "categories_detected": ["NAME", "DATE", "PHONE", "SSN"]
  }
}
```

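Technical Architecture

1. Preprocessing: Normalize text encoding, handle line breaks
2. Regex Engine: Pattern matching for structured identifiers (SSN, phone, email, MRN)
3. NLP Pipeline: spaCy NER for names, organizations, locations
4. Context Filter: Remove false positives (e.g., the person "Dr. Smith" vs. the diagnosis "Smith fracture")
5. Replacement Engine: Sequential replacement with semantic tokens
6. Validation: Ensure no original PII remains in output

Dependencies

- Python 3.9+
- spaCy (en_core_web_trf or en_core_web_lg)
- regex (for advanced pattern matching)
- Presidio (optional, for enhanced PII detection)

See references/requirements.txt for full dependency list.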

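Limitations & Warnings

⚠️ CRITICAL: This tool is designed as a helper, not a replacement for human review.

- Context-dependent PII (e.g., rare disease names + location) may not be fully detected
- Unstructured narrative text may contain identifying information not caught by patterns
- Always perform manual QA on output before HIPAA-compliant release
- AI Autonomous Acceptance Status: Requires Manual Review

References

- references/hipaa_safe_harbor_guide.pdf - HIPAA Safe Harbor de-identification standards
- references/pii_patterns.json - Complete regex pattern definitions
- references/test_cases/ - Sample clinical texts with expected outputs
- references/requirements.txt - Python dependencies

Technical Difficulty: High

Complex NLP pipelines, contextual disambiguation, regulatory compliance requirements.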

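Risk Assessment

| Risk Indicator | Assessment | Level |
| --- | --- | --- |
| Code Execution | Python/R scripts executed locally | Medium |
| Network Access | | |

Security Checklist

- [ ] No hardcoded credentials or API keys
- [ ] No unauthorized file system access (../)
- [ ] Output does not expose sensitive information
- [ ] Prompt injection protections in place
- [ ] Input file paths validated (no ../ traversal)
- [ ] Output directory restricted to workspace
- [ ] Script execution in sandboxed environment
- [ ] Error messages sanitized (no stack traces exposed)
- [ ] Dependencies audited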

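Prerequisites

```bash
# Python dependencies
pip install -r requirements.txt
```

Evaluation Criteria

Success Metrics

- [ ] Successfully executes main functionality
- [ ] Output meets quality standards
- [ ] Handles edge cases gracefully
- [ ] Performance is acceptable

Test Cases

1. Basic Functionality: Standard input → Expected output
2. Edge Case: Invalid input → Graceful error handling
3. Performance: Large dataset → Acceptable processing time

Lifecycle Status

- Current Stage: Draft
- Next Review Date: 2026-03-06
- Known Issues: None
- Planned Improvements:
  - Performance optimization
  - Additional feature support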
