Experiment Designer
Design, prioritize, and evaluate product experiments with clear hypotheses and defensible decisions.
When To Use
Use this skill for:
- A/B and multivariate experiment planning
- Hypothesis writing and success criteria definition
- Sample size and minimum detectable effect planning
- Experiment prioritization with ICE scoring
- Reading statistical output for product decisions
Core Workflow
1. Write the hypothesis in If/Then/Because format
   - If we change `[intervention]`
   - Then `[metric]` will change by `[expected direction/magnitude]`
   - Because `[behavioral mechanism]`
2. Define metrics before running the test
   - Primary metric: single decision metric
   - Guardrail metrics: quality/risk protection
   - Secondary metrics: diagnostics only
3. Estimate sample size
   - Baseline conversion or baseline mean
   - Minimum detectable effect (MDE)
   - Significance level (alpha) and power

Use:

    python3 scripts/sample_size_calculator.py --baseline-rate 0.12 --mde 0.02 --mde-type absolute
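As a rough illustration of what step 3 computes, the three inputs map onto the standard normal-approximation formula for comparing two proportions. This is a sketch under stated assumptions, not the code in scripts/sample_size_calculator.py: `sample_size_per_variant` is a hypothetical helper, the test is assumed to be two-sided with two equally sized variants, and the defaults of alpha 0.05 and power 0.8 are assumptions. The script may use a different formula and return somewhat different numbers.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(baseline_rate, mde, mde_type="absolute",
                            alpha=0.05, power=0.8):
    """Approximate users per variant for a two-sided two-proportion z-test.

    Hypothetical helper for illustration; not the project's script.
    """
    if mde_type == "relative":
        # A relative MDE is a fraction of the baseline (0.10 means 10% of baseline).
        mde = baseline_rate * mde
    p1 = baseline_rate
    p2 = baseline_rate + mde
    if mde == 0 or not (0 < p1 < 1 and 0 < p2 < 1):
        raise ValueError("rates must lie in (0, 1) and the MDE must be non-zero")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)           # quantile for the target power
    variance = p1 * (1 - p1) + p2 * (1 - p2)       # unpooled variance of the difference
    return ceil((z_alpha + z_beta) ** 2 * variance / mde ** 2)


n = sample_size_per_variant(0.12, 0.02)
print(f"per variant: {n}, total for two variants: {2 * n}")
```

Because the required sample size scales with 1 / MDE², halving the MDE roughly quadruples the traffic needed, which is why the MDE should be fixed before launch rather than adjusted afterwards.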
4. Prioritize experiments with ICE
   - Impact: potential upside
   - Confidence: evidence quality
   - Ease: cost/speed/complexity

ICE Score = (Impact × Confidence × Ease) / 10
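A minimal sketch of ICE ranking follows. The 1-10 scale for each component, the `Idea` class, and the backlog entries are all assumptions made up for illustration; the source only fixes the formula itself.

```python
from dataclasses import dataclass


@dataclass
class Idea:
    name: str
    impact: int      # potential upside, assumed 1-10 scale
    confidence: int  # evidence quality, assumed 1-10 scale
    ease: int        # cost/speed/complexity, assumed 1-10 (higher is easier)

    @property
    def ice(self) -> float:
        # ICE Score = (Impact x Confidence x Ease) / 10
        return (self.impact * self.confidence * self.ease) / 10


# Hypothetical backlog entries.
backlog = [
    Idea("Shorter signup form", impact=7, confidence=6, ease=8),
    Idea("New pricing page", impact=9, confidence=4, ease=3),
    Idea("Checkout copy tweak", impact=3, confidence=7, ease=10),
]

# Highest score first.
ranked = sorted(backlog, key=lambda idea: idea.ice, reverse=True)
for idea in ranked:
    print(f"{idea.ice:5.1f}  {idea.name}")
```

Because the components are multiplied, a single low score (here, the low ease of the pricing page) pulls an otherwise high-impact idea down the list.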
5. Launch with stopping rules
   - Decide on a fixed sample size or fixed duration in advance
   - Avoid repeatedly peeking at results unless the analysis uses a method designed for interim looks
   - Monitor guardrails continuously
6. Interpret results
   - Statistical significance is not business significance
   - Compare point estimate + confidence interval to decision threshold
   - Investigate novelty effects and segment heterogeneity
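To make step 6 concrete, the sketch below compares a confidence interval for the lift against a decision threshold. It assumes a conversion-rate metric, a Wald (normal-approximation) interval, and a simple three-way rule; `diff_ci`, `decide`, the counts, and the 1-point threshold are hypothetical and chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist


def diff_ci(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Point estimate and Wald confidence interval for rate_b - rate_a."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return diff, diff - z * se, diff + z * se


def decide(low, high, threshold):
    """Compare the interval to the smallest lift worth shipping (assumed rule)."""
    if low >= threshold:
        return "ship"          # even the pessimistic end clears the threshold
    if high < threshold:
        return "do not ship"   # even the optimistic end falls short
    return "inconclusive"      # interval straddles the threshold


# Hypothetical counts: control 1200/10000, variant 1330/10000.
diff, low, high = diff_ci(1200, 10_000, 1330, 10_000)
print(f"lift = {diff:.4f}, 95% CI = [{low:.4f}, {high:.4f}]")
print(decide(low, high, threshold=0.01))
```

With these made-up counts the interval excludes zero, so the result is statistically significant, yet its lower end sits below the 1-point threshold, so the business decision remains inconclusive.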
Hypothesis Quality Checklist
- [ ] Contains explicit intervention and audience
- [ ] Specifies measurable metric change
- [ ] States plausible causal reason
- [ ] Includes expected minimum effect
- [ ] Defines failure condition
Common Experiment Pitfalls
- Underpowered tests leading to false negatives
- Running too many simultaneous changes without isolation
- Changing targeting or implementation mid-test
- Stopping early on random spikes
- Ignoring sample ratio mismatch and instrumentation drift
- Declaring success from p-value without effect-size context
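Sample ratio mismatch, one of the pitfalls listed above, can be checked with a chi-square goodness-of-fit test on the assignment counts. The sketch below assumes a two-arm test with an intended 50/50 split; `srm_p_value`, the counts, and the 0.001 alert threshold are illustrative assumptions, not values from the source.

```python
from math import erfc, sqrt


def srm_p_value(n_control, n_treatment, expected_control_share=0.5):
    """Chi-square goodness-of-fit p-value (1 degree of freedom) for a two-arm split."""
    total = n_control + n_treatment
    exp_control = total * expected_control_share
    exp_treatment = total - exp_control
    chi2 = ((n_control - exp_control) ** 2 / exp_control
            + (n_treatment - exp_treatment) ** 2 / exp_treatment)
    # With 1 degree of freedom, the chi-square survival function is erfc(sqrt(x / 2)).
    return erfc(sqrt(chi2 / 2))


ALERT_THRESHOLD = 0.001  # assumed; a strict cutoff keeps false alarms rare

for counts in [(50_000, 50_100), (50_000, 51_200)]:
    p = srm_p_value(*counts)
    status = "possible SRM, check assignment and logging" if p < ALERT_THRESHOLD else "ok"
    print(counts, f"p = {p:.4g}", status)
```

A very small p-value here says nothing about the treatment effect; it suggests the assignment or logging pipeline is broken, so the experiment's results should not be trusted until the cause is found.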
Statistical Interpretation Guardrails
- A p-value below alpha indicates evidence against the null hypothesis, not guaranteed truth.
- A confidence interval that crosses zero (no effect) means the direction of the effect is uncertain.
- Wide intervals imply low precision even when the result is significant.
- Use practical significance thresholds tied to business impact.
See:
- references/experiment-playbook.md
- references/statistics-reference.md
Tooling
scripts/sample_size_calculator.py
Computes required sample size (per variant and total) from:
- baseline rate
- MDE (absolute or relative)
- significance level (alpha)
- statistical power
Example:
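
    python3 scripts/sample_size_calculator.py \
      --baseline-rate 0.10 \
      --mde 0.015 \
      --mde-type absolute \
      --alpha 0.05 \
      --power 0.8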