A/B testing is the gold standard for understanding causation, not just correlation. Done right, it’s the engine behind compounding growth at the world’s best product companies.
What Is an A/B Test?
An A/B test (also called a split test or controlled experiment) is a randomised experiment where you expose two groups of users to different versions of an experience — a control (A) and a variant (B) — and measure which performs better on a predefined metric.
The Five Steps of a Good Experiment
1. Define Your Hypothesis
A good hypothesis has three parts: observation, proposed change, and expected outcome.
Example: “Because our pricing page has too much text, simplifying it to three bullet points per tier will increase free-to-paid conversion by 15%.“
2. Choose Your Primary Metric
Pick one metric that your test is designed to move. Having multiple primary metrics increases your false positive rate. Secondary metrics are fine to monitor, but shouldn’t drive the decision.
3. Calculate Sample Size
Before running your test, calculate how many users you need per variant to detect your minimum detectable effect (MDE) at your desired statistical power (typically 80%) and significance level (typically 95%).
Running tests before reaching sample size is the single most common mistake teams make.
4. Run the Test
Expose users to variants randomly. Do not look at results until you’ve hit your sample size — this is called “peeking” and inflates false positive rates significantly.
5. Interpret Results Correctly
Statistical significance does not mean practical significance. A result can be statistically significant (p < 0.05) but so small it doesn’t matter for your business. Always look at the effect size alongside the p-value.
Common Mistakes to Avoid
- Stopping early when you see a positive result
- Running tests too long (introduces time-of-week effects)
- Testing multiple things at once in a single variant
- Not segmenting results by device type, user plan, or acquisition source
- Ignoring novelty effects — new features often get a short-term lift that fades
Bayesian vs. Frequentist Testing
Most teams use frequentist testing (p-values, confidence intervals). Bayesian testing is increasingly popular because it lets you incorporate prior knowledge and gives more intuitive output (“there’s a 94% probability that B beats A”).
For most product teams, the choice of method matters less than the discipline of running tests correctly.