
data-cleaning-annotation-workflow

Complete workflow for time series datasets (Energy, Manufacturing, Climate) from Kaggle to the Data Annotation platform (data.smlcrm.com). Includes downloading, cleaning with pandas, uploading RAW files with metadata, configuring columns (Time/Target/Covariate/Group), setting units (kWh, kVarh, tCO2, ratio, seconds), and assigning groups by selecting all variables and applying all group tags. Use when finding Kaggle datasets, cleaning for ML, uploading with metadata, configuring types/units, and assigning groups.

Author: admin | Source: ClawHub
Version: v1.0.0 | Security check: passed | Downloads: 763 | Favorites: 0


# Simulacrum Data Annotation Workflow

Complete end-to-end workflow for time series dataset preparation and annotation on the Data Annotation platform (data.smlcrm.com).

## What This Skill Does

This skill captures the precise workflow for processing time series datasets (Energy, Manufacturing, Climate) from discovery to CLEAN status:

1. **Find Dataset**: Search Kaggle for Energy/Manufacturing/Climate time series data
2. **Download**: Get CSV files via browser or Kaggle CLI
3. **Clean**: Run a Python/pandas script to handle missing values, duplicates, and formatting
4. **Upload RAW**: Upload the original CSV with metadata (name, domain, source URL, description)
5. **Configure Headers**: Set column types (Time, Target, Covariate, Group) and units
6. **Assign Groups**: Select ALL variables (target + covariates) and apply ALL group tags
7. **Upload Cleaned**: Final upload → **CLEAN** status

## Supported Domains

- **Energy**: Power consumption, utilities, renewable energy, grid data
- **Manufacturing**: Industrial processes, steel production, emissions, equipment data
- **Climate**: CO2 emissions, environmental monitoring, weather correlation data

## Quick Start

For the full pipeline from Kaggle to annotated dataset:

```
1. Find dataset on Kaggle
2. Download (browser or kaggle CLI)
3. Clean with scripts/clean_dataset.py
4. Upload RAW dataset to data.smlcrm.com (with metadata)
5. Click "Clean" and upload cleaned file
6. Configure column metadata (types, units)
7. Assign groups to variables
8. Upload cleaned dataset → CLEAN status
```

## Workflow Steps

### Step 1: Find and Download Dataset

**From Kaggle (Browser Method):**

1. Navigate to kaggle.com/datasets
2. Search for a relevant dataset (e.g., "steel industry energy consumption", "manufacturing emissions", "climate CO2")
3. Review the data description, file list, and preview
4. Click the "Download" button
5. Extract the CSV file from the downloaded zip

**Alternative: Kaggle CLI**

```bash
# Install if needed: pip install kaggle
# Verify configuration: kaggle competitions list
scripts/download_kaggle.sh <dataset-name> [output-dir]

# Example:
scripts/download_kaggle.sh csafrit2/steel-industry-energy-consumption
```

### Step 2: Clean the Dataset

**Always run the cleaning script before upload:**

```bash
python3 scripts/clean_dataset.py <input.csv> [-o <output.csv>]
```

**What the script does:**

- Strips whitespace from column names
- Removes duplicate rows
- Fills missing numeric values with the median
- Fills missing categorical values with the mode or 'Unknown'
- Converts timestamp columns to datetime format
- Outputs a column summary for metadata configuration

**Output:**

- Cleaned CSV file ready for upload
- Column summary printed to the console (save this for metadata configuration)

### Step 3: Upload Raw Dataset to Platform

1. Navigate to data.smlcrm.com/dashboard
2. Click the **"Upload Dataset"** button
3. Fill in metadata for the RAW dataset:
   - **Name**: Descriptive dataset name
   - **Domain**: Category (Energy, Manufacturing, Climate, etc.)
   - **Source URL**: Kaggle or original source URL
   - **Description**: Brief summary of the dataset
4. Upload the **original/raw** CSV file (not cleaned yet)
5. Click **Upload**

**Result:** The dataset appears in the list with **RAW** status.

### Step 4: Upload Cleaned File & Configure Metadata

1. Find the RAW dataset in the list
2. Click the **"Clean"** button
3. Upload the **cleaned** CSV file (from Step 2)
4. Configure headers for each column:

| Setting | Description |
|---------|-------------|
| **Name** | Column name (editable) |
| **Units** | Measurement units (kWh, °C, %, ratio, tCO2, etc.) |
| **Type** | Time / Target / Covariate / Group |

**Column Type Guide:**

- **Time**: Timestamp/datetime columns (usually required)
- **Target**: Variable to predict (at least one required)
- **Covariate**: Input features/independent variables
- **Group**: Categorical segment variables (WeekStatus, Day_of_week, Load_Type, etc.)

**Bulk Configuration:**

- Select multiple rows via checkboxes
- Use the "Apply" dropdown to set the type for selected columns
- Set units individually or in bulk

**Common Unit Patterns:**

- Energy: kWh, MWh, MW
- Power: kVarh, kW
- Emissions: tCO2, kgCO2
- Ratios: ratio, %
- Time: seconds, minutes, hours

### Step 5: Assign Groups to Variables

**Purpose:** Group variables define how data is segmented for analysis.

**Exact Workflow:**

1. **Select ALL variables** by checking their checkboxes:
   - Target variable(s)
   - ALL covariate variables
2. **Apply ALL group tags** to the selected variables:
   - Click the first group tag (e.g., WeekStatus) → all selected get this group
   - Click the second group tag (e.g., Day_of_week) → all selected get this group
   - Click the third group tag (e.g., Load_Type) → all selected get this group
   - Continue for all available group tags
3. **Result:** All variables have all groups assigned (e.g., "WeekStatus × Day_of_week × Load_Type")

**Important:** Assign groups to BOTH target variables AND all covariates.

### Step 6: Final Upload

1. Click the **"Upload Cleaned Dataset"** button
2. Wait for processing
3. The dataset status changes from **RAW** → **CLEAN**
4. Verify that the data points count is correct

## Example: Steel Industry Energy Dataset

**Source:** https://www.kaggle.com/datasets/csafrit2/steel-industry-energy-consumption

**Metadata:**

- **Name:** Steel Industry Energy Consumption (South Korea)
- **Domain:** Energy
- **Data Points:** 350,400

**Column Configuration:**

| Column | Type | Units |
|--------|------|-------|
| Timestamps | Time | - |
| Usage_kWh | Target | kWh |
| Lagging_Current_Reactive.Power_kVarh | Covariate | kVarh |
| Leading_Current_Reactive_Power_kVarh | Covariate | kVarh |
| CO2(tCO2) | Covariate | tCO2 |
| Lagging_Current_Power_Factor | Covariate | ratio |
| Leading_Current_Power_Factor | Covariate | ratio |
| NSM | Covariate | seconds |
| WeekStatus | Group | - |
| Day_of_week | Group | - |
| Load_Type | Group | - |

**Group Assignment:**

1. Select: Usage_kWh, Lagging_Current_Reactive.Power_kVarh, Leading_Current_Reactive_Power_kVarh, CO2(tCO2), Lagging_Current_Power_Factor, Leading_Current_Power_Factor, NSM
2. Click: WeekStatus → all selected get WeekStatus
3. Click: Day_of_week → all selected get Day_of_week
4. Click: Load_Type → all selected get Load_Type
5. Final: All variables show "WeekStatus × Day_of_week × Load_Type"

## Reference Materials

For detailed platform configuration guidance, see [references/platform_guide.md](references/platform_guide.md).

## Troubleshooting

**"Next" button disabled:**

- Check that at least one Time column is set
- Check that at least one Target column is set
- Verify that all columns have types assigned

**Groups not appearing:**

- Columns must be marked as "Group" type first
- Proceed to the next step after setting Group types

**Upload fails:**

- Re-run the cleaning script
- Check the CSV format (comma-delimited)
- Verify there are no empty column names

## Scripts

| Script | Purpose |
|--------|---------|
| `scripts/clean_dataset.py` | Clean and prepare a CSV for upload |
| `scripts/download_kaggle.sh` | Download datasets via the Kaggle CLI |

## Platform URL

Data Annotation Platform: https://data.smlcrm.com
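The cleaning behavior described in Step 2 can be sketched in pandas. The actual `scripts/clean_dataset.py` is not shown on this page, so the function names and the name-based timestamp heuristic below are assumptions; this is a minimal sketch of the documented steps, not the shipped script:

```python
"""Sketch of the documented cleaning steps: strip column names, drop
duplicates, impute numerics with the median and categoricals with the
mode (or 'Unknown'), parse timestamp-like columns, and summarize columns
for metadata configuration."""
import pandas as pd


def clean_dataset(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Strip whitespace from column names.
    df.columns = [c.strip() for c in df.columns]
    # Remove duplicate rows.
    df = df.drop_duplicates().reset_index(drop=True)
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            # Fill missing numeric values with the column median.
            df[col] = df[col].fillna(df[col].median())
        elif "date" in col.lower() or "time" in col.lower():
            # Heuristic (assumption): treat date/time-named columns as timestamps.
            df[col] = pd.to_datetime(df[col], errors="coerce")
        else:
            # Fill missing categorical values with the mode, or 'Unknown'.
            mode = df[col].mode()
            fill = mode.iloc[0] if not mode.empty else "Unknown"
            df[col] = df[col].fillna(fill)
    return df


def column_summary(df: pd.DataFrame) -> pd.DataFrame:
    # Per-column summary used to decide Time/Target/Covariate/Group types.
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "unique": df.nunique(),
    })
```

Running `column_summary` on the cleaned frame gives the dtype/missing/unique overview that the platform's header-configuration step (Step 4) asks for.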

Tags: skill, ai

Install via Conversation

This skill can be installed via conversation on the following platforms:

OpenClaw WorkBuddy QClaw Kimi Claude

Method 1: Install SkillHub and the skill

Help me install SkillHub and the data-cleaning-annotation-workflow-1776419994 skill

Method 2: Set SkillHub as the preferred installation source

Set SkillHub as my preferred skill installation source, then help me install the data-cleaning-annotation-workflow-1776419994 skill

Install via Command Line

skillhub install data-cleaning-annotation-workflow-1776419994

Download Zip Package

⬇ Download data-cleaning-annotation-workflow v1.0.0

File size: 8.04 KB | Published: 2026-04-17 19:31

v1.0.0 (latest) 2026-04-17 19:31
- Initial release of complete end-to-end workflow for preparing, cleaning, and annotating time series datasets (Energy, Manufacturing, Climate) using the Data Annotation platform.
- Step-by-step instructions for finding datasets on Kaggle, downloading, cleaning via pandas scripts, and uploading both raw and cleaned files with full metadata.
- Detailed guidance on configuring column types (Time, Target, Covariate, Group), setting measurement units, and bulk-assigning group tags to all relevant variables.
- Workflow explicitly covers group assignment for both targets and covariates, emphasizing all-variables-to-all-groups mapping.
- Troubleshooting section and script usage notes included for common platform and data preparation issues.
