Experiment Designer

Design, prioritize, and evaluate product experiments with clear hypotheses and defensible decisions.

When To Use

Use this skill for:

- A/B and multivariate experiment planning
- Hypothesis writing and success criteria definition
- Sample size and minimum detectable effect planning
- Experiment prioritization with ICE scoring
- Reading statistical output for product decisions

Core Workflow

1. Write hypothesis in If/Then/Because format

- If we change `[intervention]`
- Then `[metric]` will change by `[expected direction/magnitude]`
- Because `[behavioral mechanism]`

2. Define metrics before running test

- Primary metric: single decision metric
- Guardrail metrics: quality/risk protection
- Secondary metrics: diagnostics only

3. Estimate sample size

- Baseline conversion or baseline mean
- Minimum detectable effect (MDE)
- Significance level (alpha) and power

Use:

    python3 scripts/sample_size_calculator.py --baseline-rate 0.12 --mde 0.02 --mde-type absolute
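To make the relationship between these inputs concrete, here is a minimal sketch of the standard normal-approximation formula for a two-sided test comparing two proportions. It is not the implementation of `scripts/sample_size_calculator.py`, whose internals are not shown here; the function name `sample_size_per_variant` and its parameters are hypothetical, and equal allocation between two variants is assumed.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline_rate, mde_abs, alpha=0.05, power=0.8):
    """Approximate users needed per variant for a two-sided two-proportion z-test.

    Assumes two variants with equal allocation and an absolute MDE.
    """
    p1 = baseline_rate
    p2 = baseline_rate + mde_abs  # treatment rate if the MDE is exactly realized
    if not (0 < p1 < 1 and 0 < p2 < 1) or mde_abs == 0:
        raise ValueError("rates must lie in (0, 1) and the MDE must be non-zero")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_power) ** 2 * variance_sum / mde_abs ** 2
    return math.ceil(n)


n = sample_size_per_variant(baseline_rate=0.12, mde_abs=0.02)
print(f"per variant: {n}, total for two variants: {2 * n}")
```

Because the MDE enters the formula squared in the denominator, halving the MDE roughly quadruples the required sample size. This is why the MDE should be tied to a business threshold rather than set optimistically small.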

4. Prioritize experiments with ICE

- Impact: potential upside
- Confidence: evidence quality
- Ease: cost/speed/complexity

ICE Score = (Impact × Confidence × Ease) / 10
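As an illustration, the formula can be applied to a small backlog and used to sort it. This sketch assumes each component is rated on a 1 to 10 scale with higher meaning better (so a higher Ease means cheaper or faster), which the text above does not specify. The `Idea` class and the example ideas and ratings are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Idea:
    name: str
    impact: float      # potential upside (assumed 1-10)
    confidence: float  # evidence quality (assumed 1-10)
    ease: float        # cost/speed/complexity, higher = easier (assumed 1-10)

    @property
    def ice_score(self) -> float:
        # ICE Score = (Impact x Confidence x Ease) / 10
        return (self.impact * self.confidence * self.ease) / 10


def rank_by_ice(ideas):
    """Return ideas sorted from highest to lowest ICE score."""
    return sorted(ideas, key=lambda idea: idea.ice_score, reverse=True)


backlog = [
    Idea("Shorter signup form", impact=7, confidence=6, ease=8),
    Idea("New pricing page", impact=9, confidence=4, ease=3),
    Idea("Checkout button copy", impact=4, confidence=7, ease=10),
]
for idea in rank_by_ice(backlog):
    print(f"{idea.name}: {idea.ice_score:.1f}")
```

Because the score is a product, a low rating on any one component pulls the whole score down sharply. In this example the highest-impact idea ranks last because its confidence and ease ratings are both low.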

5. Launch with stopping rules

- Decide fixed sample size or fixed duration in advance
- Avoid repeated peeking at results unless the analysis uses a method designed for interim looks (for example, sequential testing)
- Monitor guardrails continuously
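The cost of unplanned peeking can be shown with a small A/A simulation: both arms share the same true conversion rate, so every "significant" result is a false positive. This is a rough sketch under assumed parameters (300 simulated tests, 1,000 users per arm, 5 evenly spaced looks, a pooled two-proportion z-test). The function names are hypothetical.

```python
import random
from statistics import NormalDist


def z_test_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    if se == 0:
        return 1.0  # no variation observed, so no evidence of a difference
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate_false_positives(runs=300, n_per_arm=1000, looks=5,
                             rate=0.1, alpha=0.05, seed=7):
    """Return (fixed_rate, peeking_rate) of false positives across A/A tests."""
    rng = random.Random(seed)
    checkpoints = [n_per_arm * (i + 1) // looks for i in range(looks)]
    fixed_hits = peeking_hits = 0
    for _run in range(runs):
        conv_a = conv_b = seen = 0
        significant_at = []
        for checkpoint in checkpoints:
            # Add users until the next look; both arms have the same true rate.
            for _user in range(checkpoint - seen):
                conv_a += rng.random() < rate
                conv_b += rng.random() < rate
            seen = checkpoint
            significant_at.append(z_test_p_value(conv_a, seen, conv_b, seen) < alpha)
        fixed_hits += significant_at[-1]      # test only once, at the planned end
        peeking_hits += any(significant_at)   # stop at the first significant look
    return fixed_hits / runs, peeking_hits / runs


fixed_rate, peeking_rate = simulate_false_positives()
print(f"false positives, single final test: {fixed_rate:.3f}")
print(f"false positives, stop at first significant look: {peeking_rate:.3f}")
```

The peeking rate can never be lower than the fixed-sample rate, because the final look is one of the looks. In practice it is usually well above the nominal alpha, which is the reason to fix the stopping rule before launch.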

6. Interpret results

- Statistical significance is not business significance
- Compare the point estimate and its confidence interval to the decision threshold
- Investigate novelty effects and segment heterogeneity
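One way to make the comparison against a decision threshold explicit is sketched below. It computes a Wald (normal-approximation) interval for the difference in conversion rates and classifies the result against a practical-significance threshold. The function names, the three labels, and the example counts are all hypothetical, and the Wald interval is only one of several reasonable choices.

```python
from statistics import NormalDist


def diff_confidence_interval(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Point estimate and Wald interval for the rate difference (B - A)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = (p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b) ** 0.5
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    diff = p_b - p_a
    return diff, diff - z * se, diff + z * se


def decide(lower, upper, threshold):
    """Classify an interval against a practical-significance threshold."""
    if lower >= threshold:
        return "meets threshold"   # whole interval is at or above the threshold
    if upper < threshold:
        return "below threshold"   # whole interval is below the threshold
    return "inconclusive"          # interval straddles the threshold


# Hypothetical counts: control 1,200/10,000 vs. variant 1,400/10,000.
diff, lower, upper = diff_confidence_interval(1200, 10_000, 1400, 10_000)
print(f"estimate {diff:.4f}, 95% CI [{lower:.4f}, {upper:.4f}]")
print(decide(lower, upper, threshold=0.01))
```

The same result can be statistically significant (interval excludes zero) and still "inconclusive" or "below threshold" in business terms, depending on where the threshold sits relative to the interval.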

Hypothesis Quality Checklist

- [ ] Contains explicit intervention and audience
- [ ] Specifies measurable metric change
- [ ] States plausible causal reason
- [ ] Includes expected minimum effect
- [ ] Defines failure condition

Common Experiment Pitfalls

- Underpowered tests leading to false negatives
- Running too many simultaneous changes without isolation
- Changing targeting or implementation mid-test
- Stopping early on random spikes
- Ignoring sample ratio mismatch and instrumentation drift
- Declaring success from p-value without effect-size context
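Sample ratio mismatch is one pitfall that can be checked mechanically. The sketch below runs a chi-square goodness-of-fit test on the observed split between two variants, using only the standard library. The function name `srm_p_value`, the example counts, and the 0.001 alert level are assumptions for illustration, not values taken from this skill.

```python
import math


def srm_p_value(n_control, n_treatment, expected_treatment_share=0.5):
    """Chi-square goodness-of-fit p-value (1 degree of freedom) for the split."""
    total = n_control + n_treatment
    exp_treatment = total * expected_treatment_share
    exp_control = total - exp_treatment
    chi2 = ((n_control - exp_control) ** 2 / exp_control
            + (n_treatment - exp_treatment) ** 2 / exp_treatment)
    # With 1 degree of freedom, the chi-square survival function
    # equals erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


SRM_ALERT_LEVEL = 0.001  # assumed alert level; choose one that suits your traffic

p = srm_p_value(n_control=50_700, n_treatment=49_300)
if p < SRM_ALERT_LEVEL:
    print(f"possible sample ratio mismatch (p = {p:.2e}); check assignment and logging")
else:
    print(f"split looks consistent with the design (p = {p:.3f})")
```

A very small p-value here points to a problem in assignment or instrumentation rather than a treatment effect, so the experiment's metric results should not be trusted until the cause is found.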

Statistical Interpretation Guardrails

- p-value < alpha indicates evidence against null, not guaranteed truth.
- A confidence interval that crosses zero (no effect) means the direction of the effect is uncertain.
- Wide intervals imply low precision, even when the result is significant.
- Use practical significance thresholds tied to business impact.

See:

- `references/experiment-playbook.md`
- `references/statistics-reference.md`

Tooling

`scripts/sample_size_calculator.py`

Computes required sample size (per variant and total) from:

- baseline rate
- MDE (absolute or relative)
- significance level (alpha)
- statistical power

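Example:

    python3 scripts/sample_size_calculator.py \
      --baseline-rate 0.10 \
      --mde 0.015 \
      --mde-type absolute \
      --alpha 0.05 \
      --power 0.8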
