Pandas Data Processing Skill
English | 简体中文
This skill provides comprehensive pandas data processing capabilities through executable scripts and reference documentation. Use this skill whenever tasks involve data manipulation, cleaning, analysis, or transformation of tabular data.
When to Use This Skill
Activate this skill when the user requests:
- - Data cleaning operations (handling missing values, duplicates, outliers)
- Data analysis and statistical summaries
- Format conversions (CSV ↔ Excel ↔ JSON ↔ Parquet)
- Data transformation (filtering, sorting, aggregation, pivoting)
- Merging or combining multiple datasets
- Generating data quality reports
- Any pandas DataFrame operations
Core Capabilities
1. Data Cleaning (scripts/data_cleaner.py)
Handles common data cleaning tasks with a single command:
Usage:
CODEBLOCK0
Available Options:
- -
--remove-duplicates: Remove duplicate rows - INLINECODE2 : Handle missing values
- Strategies:
drop,
fill,
forward,
backward,
mean,
median
- -
--fill-value [value]: Custom fill value for missing data - INLINECODE10 : Remove outliers using IQR or Z-score method
- INLINECODE11 : Choose
iqr or zscore (default: iqr) - INLINECODE14 : Standardize column names (lowercase, underscores)
Example:
CODEBLOCK1
2. Data Analysis (scripts/data_analyzer.py)
Generates comprehensive data analysis reports:
Usage:
CODEBLOCK2
Available Options:
- -
--output, -o [file]: Save report to file - INLINECODE17 : Output format (
json or text, default: json)
Report Includes:
- - Basic information (rows, columns, memory usage)
- Data type distribution
- Missing values analysis
- Numeric column statistics (mean, std, min, max, quartiles, skewness, kurtosis)
- Categorical column statistics (unique values, value counts)
- Correlation analysis
- Outlier detection
Example:
CODEBLOCK3
3. Data Transformation (scripts/data_transformer.py)
Performs various data transformation operations through subcommands:
Convert Format
python scripts/data_transformer.py convert input.csv output.xlsx
Supports: CSV, Excel (.xlsx/.xls), JSON, Parquet, HTML
Merge Files
CODEBLOCK5
Filter Data
CODEBLOCK6
Sort Data
CODEBLOCK7
Select Columns
CODEBLOCK8
Reference Documentation
The references/ directory contains detailed documentation:
references/common_operations.md
Comprehensive reference covering:
- - Data reading/saving (CSV, Excel, JSON, SQL, Parquet)
- Data exploration (head, info, describe, dtypes)
- Data selection and filtering (loc, iloc, boolean indexing, query)
- Data cleaning (handling missing/duplicate values, type conversion)
- Data transformation (apply, map, sorting, column operations)
- Groupby and aggregation operations
- Pivot tables
- Merging and joining (concat, merge, join)
- Time series operations
- String operations
- Performance optimization tips
When to use: When Claude needs to understand pandas syntax or find the right method for a specific operation.
references/data_cleaning_best_practices.md
Best practices guide covering:
- - Data quality check checklist
- Missing value handling strategies with decision tree
- Outlier detection methods (IQR, Z-Score, percentile)
- Data type optimization for memory efficiency
- String cleaning techniques
- Date/time standardization
- Complete cleaning pipeline template
- Common problems and solutions
- Data validation methods
When to use: When designing a data cleaning workflow or deciding on the best approach for specific data quality issues.
Workflow Guidelines
Step 1: Initial Assessment
Always start by analyzing the data:
python scripts/data_analyzer.py input_file.csv -o analysis_report.json
Review the report to understand data quality, types, missing values, and potential issues.
Step 2: Plan Cleaning Strategy
Based on the analysis report:
- - Identify missing value strategy (reference:
data_cleaning_best_practices.md) - Determine if duplicates should be removed
- Decide on outlier handling approach
- Plan any necessary type conversions
Step 3: Execute Cleaning
Run the data cleaner with appropriate options:
CODEBLOCK10
Step 4: Transform as Needed
Apply any transformations (filtering, sorting, format conversion, merging):
CODEBLOCK11
Step 5: Validate Results
Re-run analysis on the cleaned data to verify improvements:
CODEBLOCK12
Common Patterns
Pattern 1: Quick Data Quality Report
CODEBLOCK13
Pattern 2: Standard Cleaning Pipeline
CODEBLOCK14
Pattern 3: Excel to CSV with Filtering
CODEBLOCK15
Pattern 4: Merge Multiple CSVs
CODEBLOCK16
Dependencies
Ensure pandas is installed:
CODEBLOCK17
Optional for specific formats:
CODEBLOCK18
Tips for Effective Use
- 1. Start with analysis: Always run the analyzer first to understand the data
- Incremental cleaning: Apply cleaning operations step by step, verify each step
- Preserve originals: Never overwrite original data files
- Check references: Consult reference docs for complex operations or best practices
- Validate results: Use the analyzer to verify cleaning effectiveness
- Memory efficiency: For large files, consider using the data type optimization techniques in the reference docs
- Combine operations: Chain multiple transformer commands for complex workflows
Limitations
- - Scripts work with single-machine memory constraints (for very large datasets, consider Dask)
- Time series resampling and rolling operations require custom pandas code
- Complex statistical modeling beyond basic descriptive statistics requires additional libraries
- For advanced visualizations, use matplotlib/seaborn directly
Troubleshooting
Import errors: Ensure pandas and dependencies are installed
Memory errors: Process data in chunks or optimize dtypes (see references)
Encoding issues: Add encoding='utf-8' parameter when loading CSVs
Date parsing issues: Use pd.to_datetime() with explicit format string
For detailed pandas operations and troubleshooting, always refer to references/common_operations.md and references/data_cleaning_best_practices.md.
Pandas 数据处理技能
本技能通过可执行脚本和参考文档提供全面的 pandas 数据处理能力。当任务涉及表格数据的操作、清洗、分析或转换时,请使用本技能。
何时使用本技能
当用户提出以下请求时,激活本技能:
- - 数据清洗操作(处理缺失值、重复值、异常值)
- 数据分析和统计汇总
- 格式转换(CSV ↔ Excel ↔ JSON ↔ Parquet)
- 数据转换(筛选、排序、聚合、透视)
- 合并或组合多个数据集
- 生成数据质量报告
- 任何 pandas DataFrame 操作
核心能力
1. 数据清洗 (scripts/data_cleaner.py)
通过单个命令处理常见的数据清洗任务:
使用方法:
bash
python scripts/data_cleaner.py input.csv output.csv [选项]
可用选项:
- - --remove-duplicates:删除重复行
- --handle-missing [策略]:处理缺失值
- 策略:drop(删除)、fill(填充)、forward(前向填充)、backward(后向填充)、mean(均值)、median(中位数)
- - --fill-value [值]:缺失数据的自定义填充值
- --remove-outliers:使用 IQR 或 Z-score 方法删除异常值
- --outlier-method [方法]:选择 iqr 或 zscore(默认:iqr)
- --standardize-columns:标准化列名(小写、下划线)
示例:
bash
python scripts/datacleaner.py data.csv cleaneddata.csv \
--remove-duplicates \
--handle-missing mean \
--remove-outliers \
--standardize-columns
2. 数据分析 (scripts/data_analyzer.py)
生成全面的数据分析报告:
使用方法:
bash
python scripts/data_analyzer.py input.csv [选项]
可用选项:
- - --output, -o [文件]:将报告保存到文件
- --format [格式]:输出格式(json 或 text,默认:json)
报告包含:
- - 基本信息(行数、列数、内存使用)
- 数据类型分布
- 缺失值分析
- 数值列统计(均值、标准差、最小值、最大值、四分位数、偏度、峰度)
- 分类列统计(唯一值、值计数)
- 相关性分析
- 异常值检测
示例:
bash
python scripts/dataanalyzer.py salesdata.csv -o report.json --format json
3. 数据转换 (scripts/data_transformer.py)
通过子命令执行各种数据转换操作:
格式转换
bash
python scripts/data_transformer.py convert input.csv output.xlsx
支持:CSV、Excel (.xlsx/.xls)、JSON、Parquet、HTML
合并文件
bash
python scripts/data_transformer.py merge file1.csv file2.csv file3.csv \
--output merged.csv \
--how outer \
--on key_column
筛选数据
bash
python scripts/data_transformer.py filter data.csv \
--query age > 18 and city == Beijing \
--output filtered.csv
排序数据
bash
python scripts/data_transformer.py sort data.csv \
--by sales quantity \
--descending \
--output sorted.csv
选择列
bash
python scripts/data_transformer.py select data.csv \
--columns name age city \
--output selected.csv
参考文档
references/ 目录包含详细文档:
references/common_operations.md
全面的参考文档,涵盖:
- - 数据读取/保存(CSV、Excel、JSON、SQL、Parquet)
- 数据探索(head、info、describe、dtypes)
- 数据选择和筛选(loc、iloc、布尔索引、query)
- 数据清洗(处理缺失/重复值、类型转换)
- 数据转换(apply、map、排序、列操作)
- Groupby 和聚合操作
- 数据透视表
- 合并和连接(concat、merge、join)
- 时间序列操作
- 字符串操作
- 性能优化技巧
何时使用: 当 Claude 需要理解 pandas 语法或为特定操作找到正确方法时。
references/datacleaningbest_practices.md
最佳实践指南,涵盖:
- - 数据质量检查清单
- 带决策树的缺失值处理策略
- 异常值检测方法(IQR、Z-Score、百分位数)
- 数据类型优化以提高内存效率
- 字符串清洗技术
- 日期/时间标准化
- 完整清洗流程模板
- 常见问题及解决方案
- 数据验证方法
何时使用: 当设计数据清洗工作流或决定特定数据质量问题的最佳方法时。
工作流程指南
步骤 1:初步评估
始终从分析数据开始:
bash
python scripts/data
analyzer.py inputfile.csv -o analysis_report.json
审查报告以了解数据质量、类型、缺失值和潜在问题。
步骤 2:规划清洗策略
基于分析报告:
- - 确定缺失值策略(参考:datacleaningbest_practices.md)
- 判断是否需要删除重复值
- 决定异常值处理方法
- 规划必要的类型转换
步骤 3:执行清洗
使用适当的选项运行数据清洗器:
bash
python scripts/data_cleaner.py input.csv cleaned.csv [选项]
步骤 4:按需转换
应用任何转换(筛选、排序、格式转换、合并):
bash
python scripts/data_transformer.py [子命令] [选项]
步骤 5:验证结果
对清洗后的数据重新运行分析以验证改进:
bash
python scripts/data
analyzer.py cleaned.csv -o finalreport.json
常见模式
模式 1:快速数据质量报告
bash
python scripts/data_analyzer.py data.csv --format text
模式 2:标准清洗流程
bash
python scripts/data
cleaner.py rawdata.csv clean_data.csv \
--standardize-columns \
--remove-duplicates \
--handle-missing median \
--remove-outliers
模式 3:Excel 转 CSV 并筛选
bash
转换
python scripts/data_transformer.py convert data.xlsx data.csv
筛选
python scripts/data_transformer.py filter data.csv \
--query status == active \
--output filtered.csv
模式 4:合并多个 CSV
bash
python scripts/data_transformer.py merge *.csv \
--output combined.csv
依赖项
确保已安装 pandas:
bash
pip install pandas numpy openpyxl
特定格式的可选依赖:
bash
pip install pyarrow # 支持 Parquet
pip install xlrd # 支持旧版 Excel 文件 (.xls)
有效使用技巧
- 1. 从分析开始: 始终先运行分析器以了解数据
- 增量清洗: 逐步应用清洗操作,验证每一步
- 保留原始文件: 切勿覆盖原始数据文件
- 查阅参考文档: 对于复杂操作或最佳实践,查阅参考文档
- 验证结果: 使用分析器验证清洗效果
- 内存效率: 对于大文件,考虑使用参考文档中的数据类型优化技术
- 组合操作: 对于复杂工作流,串联多个转换器命令
局限性
- - 脚本受单机内存限制(对于非常大的数据集,考虑使用 Dask)
- 时间序列重采样和滚动操作需要自定义 pandas 代码
- 超出基本描述性统计的复杂统计建模需要额外库
- 对于高级可视化,直接使用 matplotlib/seaborn
故障排除
导入错误: 确保已安装 pandas 和依赖项
内存错误: 分块处理数据或优化数据类型(参见参考文档)
编码问题: 加载 CSV 时添加 encoding=utf-8 参数
日期解析问题: 使用带显式格式字符串的 pd.to_datetime()
有关详细的 pandas 操作和故障排除,请始终参考 references/commonoperations.md 和 references/datacleaningbestpractices.md。