Is the difference real?
You launch a new model to half of your users and keep the old one for the other half. The new model looks slightly better. Before celebrating, ask whether that gap is a real effect or just random noise.
The hypothesis test
- The null hypothesis says there is no true difference between the variants.
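- The p-value is the probability of seeing a gap at least this large if the null hypothesis were true.
- A small p-value, conventionally below 0.05, means a gap this large would be rare under the null, so noise alone is an unlikely explanation.
- A confidence interval gives the range of true effect sizes consistent with the data. If a 95% interval excludes zero, the gap is significant at the 0.05 level.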
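As a minimal sketch of the test described above, assume the metric is a conversion rate (a success or failure per user) and that samples are large enough for a normal approximation. The function name two_proportion_test and the example counts are illustrative assumptions, not part of the original note.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_test(success_a, n_a, success_b, n_b, confidence=0.95):
    """Two-sided z-test for a difference in rates, plus a confidence interval.

    Hypothetical helper: assumes a binary per-user outcome and large samples.
    """
    p_a = success_a / n_a
    p_b = success_b / n_b
    diff = p_b - p_a

    # Under the null both variants share one rate, so pool them for the test.
    pooled = (success_a + success_b) / (n_a + n_b)
    se_null = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = diff / se_null
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # The interval does not assume the null, so use the unpooled standard error.
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    ci = (diff - z_crit * se, diff + z_crit * se)
    return diff, p_value, ci


# Made-up counts: old model converts 1000 of 10000, new model 1100 of 10000.
diff, p_value, ci = two_proportion_test(1000, 10000, 1100, 10000)
print(f"gap={diff:.4f}  p-value={p_value:.4f}  95% CI=({ci[0]:.4f}, {ci[1]:.4f})")
```

With these made-up counts the p-value comes out at roughly 0.02 and the interval sits entirely above zero, so the two readings agree. For metrics that are not simple rates, a different test (for example a t-test on per-user means) would be the appropriate choice.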
Pitfalls to avoid
- Peeking at results repeatedly and stopping as soon as they look significant inflates the false positive rate, so fix the sample size before the test starts.
- Testing many metrics raises the chance that at least one looks significant by luck, so correct for multiple comparisons.
- A statistically significant gap can still be too small to matter, so weigh practical significance too.
- Underpowered tests with too few users can miss real effects, so estimate the sample size you need before launching.
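Two of these pitfalls have simple numerical guards, sketched below under stated assumptions. The first function applies a Bonferroni correction, one common (and conservative) way to correct for multiple comparisons. The second estimates users per variant with the standard normal-approximation formula for comparing two rates. The names bonferroni_significant and users_per_variant, and the defaults of alpha = 0.05 and 80% power, are illustrative choices rather than anything the note prescribes.

```python
from math import ceil
from statistics import NormalDist


def bonferroni_significant(p_values, alpha=0.05):
    """Flag which metrics stay significant after a Bonferroni correction."""
    # Split the overall false positive budget evenly across all metrics.
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]


def users_per_variant(base_rate, min_lift, alpha=0.05, power=0.8):
    """Approximate users needed per variant to detect an absolute lift.

    Hypothetical helper: two-sided test on rates, normal approximation.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    new_rate = base_rate + min_lift
    variance = base_rate * (1 - base_rate) + new_rate * (1 - new_rate)
    return ceil((z_alpha + z_power) ** 2 * variance / min_lift ** 2)


# Three metrics tested at once: only the first survives the correction.
print(bonferroni_significant([0.01, 0.04, 0.20]))

# Users per variant to detect a lift from 10% to 11% conversion.
print(users_per_variant(base_rate=0.10, min_lift=0.01))
```

Note that the required sample size grows with the inverse square of the lift: halving the smallest effect you care about roughly quadruples the users needed, which is one reason to decide in advance what gap is practically significant.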
Key idea
A/B testing checks whether a model improvement beats random noise using p-values and confidence intervals. Guard against peeking, multiple comparisons, underpowered tests, and tiny effects that are significant but not worth shipping.