Experiment Designer
Design statistically rigorous A/B tests and interpret results with recommendations.
Installation
- Make sure Claude is on your device and in your terminal.
Skills load from
~/.claude/skills/when Claude Code starts up — so you need it on your machine first. If you don't have it yet, install it once with the command below, then runclaudein any terminal to verify.One-time setupnpm i -g @anthropic-ai/claude-codeAlready have it? Skip ahead.
- Paste into Claude Code or into your terminal.
This copies the whole skill folder into
~/.claude/skills/experiment-designer-mohitagw15856/— the SKILL.md plus any scripts, reference docs, or templates the skill ships with. Safe default: works for every skill.Faster alternative (instruction-only skills)
Skips the clone and grabs only the SKILL.md file. Don't use this if the skill ships Python scripts, reference markdowns, or asset templates — they won't be downloaded and the skill will fail when it tries to load them.
Quick install (SKILL.md only)Sign up to copy - Restart Claude Code.
Quit and reopen Claude Code (or any other agent that loads from
~/.claude/skills/). New skills are picked up on startup. - Just ask Claude.
Skills auto-activate when your request matches the skill's description — no slash command needed. Trigger phrases live in the skill's own frontmatter; you can read them in the “What this skill does” section above.
Prefer to read the source first? Open on GitHub.
When Claude uses it
Design statistically rigorous A/B tests and interpret experiment results. Use when asked to design an experiment, run an A/B test, calculate sample size, interpret test results, or assess whether an experiment was successful. Produces a complete experiment design with hypothesis, sample size, run time, success criteria, and risk flags — or a results interpretation with ship/iterate/kill recommendation.
What this skill does
Experiment Designer Skill
Produce rigorous experiment designs from product hypotheses, and interpret results with statistical and practical significance — so you can defend every decision to a sceptical engineering lead or data scientist.
Required Inputs
Ask the user for these if not provided: For experiment design:
- Hypothesis (what change, what metric, what expected movement)
- Current baseline metric value
- Minimum detectable effect (MDE) — the smallest lift worth caring about
- Available daily sample size
For results interpretation:
- Control and variant results (raw numbers or percentages)
- P-value or confidence interval
- Run duration (days)
- Any anomalies observed during the test
Two-Phase Process
Phase 1: Experiment Design
- Restate hypothesis as: "If we [change], we expect [metric] to [move by X%] because [reason]"
- Define control and variant clearly
- Select primary metric (one only) and secondary guardrail metrics (2-3 max)
- Calculate required sample size from MDE and baseline
- Estimate run time in days
- Set pre-defined success criteria before the test runs — no moving goalposts
- Flag design risks: novelty effects, seasonal confounds, multiple testing issues, network effects, sample ratio mismatch
Phase 2: Results Interpretation
- Assess statistical significance (p < 0.05 threshold)
- Assess practical significance: was the lift meaningful for the business, not just real?
- Interpret confidence intervals
- Investigate confounding factors
- Recommend: Ship / Iterate / Kill / Run follow-up test
- Validate — Confirm the test ran for the full planned duration. Flag if it was stopped early (peeking problem). Confirm sample ratio mismatch did not occur.
Output Structure
[Design or Results header based on phase]
Hypothesis: "If we [change], we expect [metric] to [move by X%] because [reason]"
Primary metric: [One metric only] Guardrail metrics: [2-3 max] Required sample size: [n per variant] Estimated run time: [days] Pre-defined success threshold: [specific number] Design risk flags: [any concerns]
Results (Phase 2 only): Statistical significance: [p-value and conclusion] Practical significance: [lift size vs. business threshold] Recommendation: Ship / Iterate / Kill / Follow-up — [rationale]
Quality Checks
- Hypothesis specifies the change, the metric, the direction, and the reason
- Primary metric is singular — guardrail metrics are secondary
- Success criteria are defined before the test launches (not after seeing results)
- Test was not stopped early (or flagged clearly if it was)
- Practical significance assessed separately from statistical significance
- Sample ratio mismatch is checked in results interpretation
Anti-Patterns
- Do not define success criteria after seeing preliminary results — post-hoc success definitions are HARKing (Hypothesising After Results are Known) and invalidate the experiment
- Do not stop a test early because the result looks significant — early stopping dramatically inflates false positive rates; the test must run to the planned sample size
- Do not treat statistical significance as the same as practical significance — a p < 0.05 result with a 0.1% lift is real but may not be worth shipping
- Do not run the same experiment on the same population multiple times without correction — multiple testing inflates the chance of a false positive proportionally
- Do not use more than one primary metric — multiple primary metrics require multiple hypothesis corrections and make the ship/kill decision ambiguous
Related skills
App Store Listing Audit
coreyhaines31
Analyze your app listing against best practices and get a prioritized optimization plan.
Co-Marketing Partnerships
coreyhaines31
Find ideal partners and plan joint marketing campaigns with other companies.
Cold Email Writer
coreyhaines31
Write B2B cold emails and follow-up sequences designed to get replies.
Community-Led Growth
coreyhaines31
Build and grow online communities to drive product adoption and customer loyalty.