fox-data-analyst

Data visualization, report generation, SQL queries, and spreadsheet automation. Transform your AI agent into a data-savvy analyst that turns raw data into actionable insights.

Author: admin | Source: ClawHub
Version: v1.0.0
Security check: passed
Downloads: 81
Favorites: 0

# Data Analyst Skill 📊

**Turn your AI agent into a data analysis powerhouse.** Query databases, analyze spreadsheets, create visualizations, and generate insights that drive decisions.

---

## What This Skill Does

✅ **SQL Queries** — Write and execute queries against databases
✅ **Spreadsheet Analysis** — Process CSV, Excel, Google Sheets data
✅ **Data Visualization** — Create charts, graphs, and dashboards
✅ **Report Generation** — Automated reports with insights
✅ **Data Cleaning** — Handle missing data, outliers, formatting
✅ **Statistical Analysis** — Descriptive stats, trends, correlations

---

## Quick Start

1. Configure your data sources in `TOOLS.md`:

   ```markdown
   ### Data Sources
   - Primary DB: [Connection string or description]
   - Spreadsheets: [Google Sheets URL / local path]
   - Data warehouse: [BigQuery/Snowflake/etc.]
   ```

2. Set up your workspace:

   ```bash
   ./scripts/data-init.sh
   ```

3. Start analyzing!

---

## SQL Query Patterns

### Common Query Templates

**Basic Data Exploration**

```sql
-- Row count
SELECT COUNT(*) FROM table_name;

-- Sample data
SELECT * FROM table_name LIMIT 10;

-- Column statistics
SELECT
  COUNT(*) as count,
  COUNT(DISTINCT column_name) as unique_values,
  MIN(column_name) as min_val,
  MAX(column_name) as max_val
FROM table_name;
```

**Time-Based Analysis**

```sql
-- Daily aggregation
SELECT
  DATE(created_at) as date,
  COUNT(*) as daily_count,
  SUM(amount) as daily_total
FROM transactions
GROUP BY DATE(created_at)
ORDER BY date DESC;

-- Month-over-month comparison
SELECT
  DATE_TRUNC('month', created_at) as month,
  COUNT(*) as count,
  LAG(COUNT(*)) OVER (ORDER BY DATE_TRUNC('month', created_at)) as prev_month,
  (COUNT(*) - LAG(COUNT(*)) OVER (ORDER BY DATE_TRUNC('month', created_at)))
    * 100.0
    / NULLIF(LAG(COUNT(*)) OVER (ORDER BY DATE_TRUNC('month', created_at)), 0) as growth_pct
FROM transactions
GROUP BY DATE_TRUNC('month', created_at)
ORDER BY month;
```

**Cohort Analysis**

```sql
-- User cohort by signup month
SELECT
  DATE_TRUNC('month', u.created_at) as cohort_month,
  DATE_TRUNC('month', o.created_at) as activity_month,
  COUNT(DISTINCT u.id) as users
FROM users u
LEFT JOIN orders o ON u.id = o.user_id
GROUP BY cohort_month, activity_month
ORDER BY cohort_month, activity_month;
```
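The cohort query returns long-format rows, one per cohort/activity-month pair; to read it as a retention matrix you still need to pivot. A minimal pandas sketch, assuming the query result was exported to a hypothetical `cohorts.csv` with the column names used above:

```python
import pandas as pd

# Long-format cohort counts from the SQL above (hypothetical export).
# Expected columns: cohort_month, activity_month, users.
df = pd.read_csv('cohorts.csv', parse_dates=['cohort_month', 'activity_month'])
df = df.dropna(subset=['activity_month'])  # users with no orders yet have NaT here

# Months elapsed since signup, so every cohort aligns on period 0, 1, 2, ...
df['period'] = (
    (df['activity_month'].dt.year - df['cohort_month'].dt.year) * 12
    + (df['activity_month'].dt.month - df['cohort_month'].dt.month)
)

# Rows = signup cohort, columns = months since signup, values = active users.
matrix = df.pivot_table(index='cohort_month', columns='period',
                        values='users', aggfunc='sum')

# Divide by the period-0 cohort size to get retention rates.
retention = matrix.div(matrix[0], axis=0).round(3)
print(retention)
```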
**Funnel Analysis**

```sql
-- Conversion funnel
WITH funnel AS (
  SELECT
    COUNT(DISTINCT CASE WHEN event = 'page_view' THEN user_id END) as views,
    COUNT(DISTINCT CASE WHEN event = 'signup' THEN user_id END) as signups,
    COUNT(DISTINCT CASE WHEN event = 'purchase' THEN user_id END) as purchases
  FROM events
  WHERE date >= CURRENT_DATE - INTERVAL '30 days'
)
SELECT
  views,
  signups,
  ROUND(signups * 100.0 / NULLIF(views, 0), 2) as signup_rate,
  purchases,
  ROUND(purchases * 100.0 / NULLIF(signups, 0), 2) as purchase_rate
FROM funnel;
```

---

## Data Cleaning

### Common Data Quality Issues

| Issue | Detection | Solution |
|-------|-----------|----------|
| **Missing values** | `IS NULL` or empty string | Impute, drop, or flag |
| **Duplicates** | `GROUP BY` with `HAVING COUNT(*) > 1` | Deduplicate with rules |
| **Outliers** | Z-score > 3 or IQR method | Investigate, cap, or exclude |
| **Inconsistent formats** | Sample and pattern match | Standardize with transforms |
| **Invalid values** | Range checks, referential integrity | Validate and correct |

### Data Cleaning SQL Patterns

```sql
-- Find duplicates
SELECT email, COUNT(*)
FROM users
GROUP BY email
HAVING COUNT(*) > 1;

-- Find nulls
SELECT
  COUNT(*) as total,
  SUM(CASE WHEN email IS NULL THEN 1 ELSE 0 END) as null_emails,
  SUM(CASE WHEN name IS NULL THEN 1 ELSE 0 END) as null_names
FROM users;

-- Standardize text
UPDATE products SET category = LOWER(TRIM(category));

-- Remove outliers (IQR method)
WITH stats AS (
  SELECT
    PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY value) as q1,
    PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY value) as q3
  FROM data
)
SELECT * FROM data, stats
WHERE value BETWEEN q1 - 1.5*(q3-q1) AND q3 + 1.5*(q3-q1);
```
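The issues table above lists a z-score check alongside the IQR method, but only IQR is shown in SQL; the z-score version is usually easier in pandas. A minimal sketch, assuming a hypothetical `data.csv` with a numeric `value` column:

```python
import pandas as pd

df = pd.read_csv('data.csv')  # hypothetical file with a numeric 'value' column

# Z-score: how many standard deviations each point sits from the mean.
z = (df['value'] - df['value'].mean()) / df['value'].std()

# Flag rather than silently drop: investigate outliers before removing them.
df['is_outlier'] = z.abs() > 3
print(df[df['is_outlier']])
```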
### Data Cleaning Checklist

```markdown
# Data Quality Audit: [Dataset]

## Row-Level Checks
- [ ] Total row count: [X]
- [ ] Duplicate rows: [X]
- [ ] Rows with any null: [X]

## Column-Level Checks
| Column | Type | Nulls | Unique | Min | Max | Issues |
|--------|------|-------|--------|-----|-----|--------|
| [col] | [type] | [n] | [n] | [v] | [v] | [notes] |

## Data Lineage
- Source: [Where data came from]
- Last updated: [Date]
- Known issues: [List]

## Cleaning Actions Taken
1. [Action and reason]
2. [Action and reason]
```

---

## Spreadsheet Analysis

### CSV/Excel Processing with Python

```python
import pandas as pd

# Load data
df = pd.read_csv('data.csv')  # or pd.read_excel('data.xlsx')

# Basic exploration
print(df.shape)       # (rows, columns)
print(df.info())      # Column types and nulls
print(df.describe())  # Numeric statistics

# Data cleaning
df = df.drop_duplicates()
df['date'] = pd.to_datetime(df['date'])
df['amount'] = df['amount'].fillna(0)

# Analysis
summary = df.groupby('category').agg({
    'amount': ['sum', 'mean', 'count'],
    'quantity': 'sum'
}).round(2)

# Export
summary.to_csv('analysis_output.csv')
```

### Common Pandas Operations

```python
# Filtering
filtered = df[df['status'] == 'active']
filtered = df[df['amount'] > 1000]
filtered = df[df['date'].between('2024-01-01', '2024-12-31')]

# Aggregation
by_category = df.groupby('category')['amount'].sum()
pivot = df.pivot_table(values='amount', index='month',
                       columns='category', aggfunc='sum')

# Window functions
df['running_total'] = df['amount'].cumsum()
df['pct_change'] = df['amount'].pct_change()
df['rolling_avg'] = df['amount'].rolling(window=7).mean()

# Merging
merged = pd.merge(df1, df2, on='id', how='left')
```

---

## Data Visualization

### Chart Selection Guide

| Data Type | Best Chart | Use When |
|-----------|------------|----------|
| Trend over time | Line chart | Showing patterns/changes over time |
| Category comparison | Bar chart | Comparing discrete categories |
| Part of whole | Pie/Donut | Showing proportions (≤5 categories) |
| Distribution | Histogram | Understanding data spread |
| Correlation | Scatter plot | Relationship between two variables |
| Many categories | Horizontal bar | Ranking or comparing many items |
| Geographic | Map | Location-based data |

### Python Visualization with Matplotlib/Seaborn

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Set style
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette("husl")

# Line chart (trends)
plt.figure(figsize=(10, 6))
plt.plot(df['date'], df['value'], marker='o')
plt.title('Trend Over Time')
plt.xlabel('Date')
plt.ylabel('Value')
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig('trend.png', dpi=150)

# Bar chart (comparisons)
plt.figure(figsize=(10, 6))
sns.barplot(data=df, x='category', y='amount')
plt.title('Amount by Category')
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig('comparison.png', dpi=150)

# Heatmap (correlations)
plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm', center=0)
plt.title('Correlation Matrix')
plt.tight_layout()
plt.savefig('correlation.png', dpi=150)
```

### ASCII Charts (Quick Terminal Visualization)

When you can't generate images, use ASCII:

```
Revenue by Month (in $K)
========================
Jan: ████████████████ 160
Feb: ██████████████████ 180
Mar: ████████████████████████ 240
Apr: ██████████████████████ 220
May: ██████████████████████████ 260
Jun: ████████████████████████████ 280
```

---

## Report Generation

### Standard Report Template

```markdown
# [Report Name]

**Period:** [Date range]
**Generated:** [Date]
**Author:** [Agent/Human]

## Executive Summary
[2-3 sentences with key findings]

## Key Metrics
| Metric | Current | Previous | Change |
|--------|---------|----------|--------|
| [Metric] | [Value] | [Value] | [+/-X%] |

## Detailed Analysis
### [Section 1]
[Analysis with supporting data]

### [Section 2]
[Analysis with supporting data]

## Visualizations
[Insert charts]

## Insights
1. **[Insight]**: [Supporting evidence]
2. **[Insight]**: [Supporting evidence]

## Recommendations
1. [Actionable recommendation]
2. [Actionable recommendation]

## Methodology
- Data source: [Source]
- Date range: [Range]
- Filters applied: [Filters]
- Known limitations: [Limitations]

## Appendix
[Supporting data tables]
```

### Automated Report Script

```bash
#!/bin/bash
# generate-report.sh

# Pull latest data
python scripts/extract_data.py --output data/latest.csv

# Run analysis
python scripts/analyze.py --input data/latest.csv --output reports/

# Generate report
python scripts/format_report.py --template weekly \
  --output reports/weekly-$(date +%Y-%m-%d).md

echo "Report generated: reports/weekly-$(date +%Y-%m-%d).md"
```

---

## Statistical Analysis

### Descriptive Statistics

| Statistic | What It Tells You | Use Case |
|-----------|-------------------|----------|
| **Mean** | Average value | Central tendency |
| **Median** | Middle value | Robust to outliers |
| **Mode** | Most common | Categorical data |
| **Std Dev** | Spread around mean | Variability |
| **Min/Max** | Range | Data boundaries |
| **Percentiles** | Distribution shape | Benchmarking |

### Quick Stats with Python

```python
# Full descriptive statistics
stats = df['amount'].describe()
print(stats)

# Additional stats
print(f"Median: {df['amount'].median()}")
print(f"Mode: {df['amount'].mode()[0]}")
print(f"Skewness: {df['amount'].skew()}")
print(f"Kurtosis: {df['amount'].kurtosis()}")

# Correlation
correlation = df['sales'].corr(df['marketing_spend'])
print(f"Correlation: {correlation:.3f}")
```

### Statistical Tests Quick Reference

| Test | Use Case | Python |
|------|----------|--------|
| T-test | Compare two means | `scipy.stats.ttest_ind(a, b)` |
| Chi-square | Categorical independence | `scipy.stats.chi2_contingency(table)` |
| ANOVA | Compare 3+ means | `scipy.stats.f_oneway(a, b, c)` |
| Pearson | Linear correlation | `scipy.stats.pearsonr(x, y)` |
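As a worked example of the first row in the table, here is a sketch of a two-sample t-test; the `orders.csv` file and its `group`/`amount` columns are illustrative, not part of the skill:

```python
import pandas as pd
from scipy import stats

df = pd.read_csv('orders.csv')  # hypothetical: 'group' and 'amount' columns

a = df.loc[df['group'] == 'A', 'amount']
b = df.loc[df['group'] == 'B', 'amount']

# Welch's two-sample t-test: is the difference in means significant?
t_stat, p_value = stats.ttest_ind(a, b, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Difference is significant at the 5% level.")
```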
---

## Analysis Workflow

### Standard Analysis Process

1. **Define the Question**
   - What are we trying to answer?
   - What decisions will this inform?

2. **Understand the Data**
   - What data is available?
   - What's the structure and quality?

3. **Clean and Prepare**
   - Handle missing values
   - Fix data types
   - Remove duplicates

4. **Explore**
   - Descriptive statistics
   - Initial visualizations
   - Identify patterns

5. **Analyze**
   - Deep dive into findings
   - Statistical tests if needed
   - Validate hypotheses

6. **Communicate**
   - Clear visualizations
   - Actionable insights
   - Recommendations

### Analysis Request Template

```markdown
# Analysis Request

## Question
[What are we trying to answer?]

## Context
[Why does this matter? What decision will it inform?]

## Data Available
- [Dataset 1]: [Description]
- [Dataset 2]: [Description]

## Expected Output
- [Deliverable 1]
- [Deliverable 2]

## Timeline
[When is this needed?]

## Notes
[Any constraints or considerations]
```

---

## Scripts

### data-init.sh

Initialize your data analysis workspace.

### query.sh

Quick SQL query execution.

```bash
# Run query from file
./scripts/query.sh --file queries/daily-report.sql

# Run inline query
./scripts/query.sh "SELECT COUNT(*) FROM users"

# Save output to file
./scripts/query.sh --file queries/export.sql --output data/export.csv
```

### analyze.py

Python analysis toolkit.

```bash
# Basic analysis
python scripts/analyze.py --input data/sales.csv

# With specific analysis type
python scripts/analyze.py --input data/sales.csv --type cohort

# Generate report
python scripts/analyze.py --input data/sales.csv --report weekly
```

---

## Integration Tips

### With Other Skills

| Skill | Integration |
|-------|-------------|
| **Marketing** | Analyze campaign performance, content metrics |
| **Sales** | Pipeline analytics, conversion analysis |
| **Business Dev** | Market research data, competitor analysis |

### Common Data Sources

- **Databases:** PostgreSQL, MySQL, SQLite
- **Warehouses:** BigQuery, Snowflake, Redshift
- **Spreadsheets:** Google Sheets, Excel, CSV
- **APIs:** REST endpoints, GraphQL
- **Files:** JSON, Parquet, XML
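Most of these sources land in pandas the same way, which keeps the SQL patterns and the spreadsheet workflow above in one pipeline. A sketch using SQLite from the standard library; the `analytics.db` file and `transactions` table are assumptions:

```python
import sqlite3

import pandas as pd

# Any DB-API connection works here; SQLAlchemy engines cover Postgres, MySQL, etc.
conn = sqlite3.connect('analytics.db')  # hypothetical database file

# Run one of the SQL patterns above and land the result in a DataFrame.
daily = pd.read_sql(
    "SELECT DATE(created_at) AS date, COUNT(*) AS daily_count "
    "FROM transactions GROUP BY DATE(created_at) ORDER BY date DESC",
    conn,
)
conn.close()

print(daily.head())
daily.to_csv('daily_counts.csv', index=False)  # feeds the reporting step above
```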
---

## Best Practices

1. **Start with the question** — Know what you're trying to answer
2. **Validate your data** — Garbage in = garbage out
3. **Document everything** — Queries, assumptions, decisions
4. **Visualize appropriately** — Right chart for right data
5. **Show your work** — Methodology matters
6. **Lead with insights** — Not just data dumps
7. **Make it actionable** — "So what?" → "Now what?"
8. **Version your queries** — Track changes over time

---

## Common Mistakes

❌ **Confirmation bias** — Looking for data to support a conclusion
❌ **Correlation ≠ causation** — Be careful with claims
❌ **Cherry-picking** — Using only favorable data
❌ **Ignoring outliers** — Investigate before removing
❌ **Over-complicating** — Simple analysis often wins
❌ **No context** — Numbers without comparison are meaningless

---

## License

**License:** MIT — use freely, modify, distribute.

---

*"The goal is to turn data into information, and information into insight." — Carly Fiorina*
Tags

skill ai

Install via Conversation

This skill can be installed via conversation on the following platforms:

OpenClaw WorkBuddy QClaw Kimi Claude

Option 1: Install SkillHub and the skill

Help me install SkillHub and the fox-data-analyst-1775946466 skill

Option 2: Set SkillHub as the preferred installation source

Set SkillHub as my preferred skill installation source, then help me install the fox-data-analyst-1775946466 skill

Install via Command Line

skillhub install fox-data-analyst-1775946466

Download Zip Package

⬇ Download fox-data-analyst v1.0.0

File size: 9.72 KB | Published: 2026-4-12 10:02

v1.0.0 (latest) · 2026-4-12 10:02
Initial release: Transform your agent into a data analysis powerhouse with SQL, spreadsheets, visualizations, and report generation.

- Execute SQL queries and automate database analysis
- Process, clean, and analyze spreadsheet data (CSV, Excel, Google Sheets)
- Generate charts, graphs, dashboards, and ASCII visualizations
- Produce automated reports with statistical summaries and actionable insights
- Includes sample code, query patterns, and quick-start instructions
