Statistical Significance in A/B Tests

Is the difference real?

You launch a new model to half your users and keep the old one for the other half. The new model looks slightly better on your metric. Before celebrating, you must ask whether that gap reflects a real effect or just random noise.

The hypothesis test

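The null hypothesis says there is no true difference between the variants.
The p-value is the probability of seeing a gap at least this large if the null hypothesis were true.
A small p-value, conventionally below 0.05, means a gap this large would be rare under the null hypothesis, which is evidence that the difference is not just noise.
A confidence interval shows the range of true effect sizes that are plausible given the data; if it includes zero, the test cannot rule out "no difference."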
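
As a minimal sketch of these ideas, the snippet below runs a two-sided two-proportion z-test and builds a 95% confidence interval for the difference in conversion rates. The function name `two_proportion_test`, the choice of a z-test with a Wald interval, and the counts in the example are all assumptions made for illustration; the right test depends on your metric.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_test(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Two-sided z-test for a difference in rates, plus a Wald confidence interval."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a

    # Under the null both arms share one rate, so pool them for the test's standard error.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se_null = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = diff / se_null
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # The interval does not assume the null, so it uses the unpooled standard error.
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return diff, p_value, (diff - z_crit * se, diff + z_crit * se)


# Hypothetical counts: old model (A) converts 1000 of 10000 users, new model (B) 1080 of 10000.
diff, p_value, ci = two_proportion_test(conv_a=1000, n_a=10000, conv_b=1080, n_b=10000)
print(f"lift={diff:.4f}  p={p_value:.3f}  95% CI=({ci[0]:.4f}, {ci[1]:.4f})")
```

With these made-up counts the new model is ahead by 0.8 percentage points, yet the p-value lands a little above 0.05 and the interval straddles zero, so the gap could still be noise.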

Pitfalls to avoid

Peeking at results repeatedly and stopping as soon as the gap looks significant inflates the false-positive rate, so fix the sample size in advance or use a sequential testing method.
Testing many metrics raises the chance that at least one looks significant by luck, so correct for multiple comparisons.
A statistically significant gap can still be too small to matter, so weigh practical significance too.
Underpowered tests with too few users often miss real effects, so estimate the required sample size before launching.
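
Two of these pitfalls can be sketched in a few lines: a Bonferroni correction for multiple comparisons and a rough sample-size estimate to avoid an underpowered test. The helper names `bonferroni_significant` and `users_per_arm`, the p-values, and the base rate are hypothetical. Bonferroni is only one (conservative) correction, and the sample-size formula is the usual normal approximation for two proportions, not an exact calculation.

```python
from math import ceil
from statistics import NormalDist


def bonferroni_significant(p_values, alpha=0.05):
    """Flag a metric as significant only if its p-value is below alpha / number of metrics."""
    threshold = alpha / len(p_values)
    return {name: p < threshold for name, p in p_values.items()}


def users_per_arm(base_rate, min_lift, alpha=0.05, power=0.8):
    """Approximate users needed per arm to detect an absolute lift in a rate."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_power = NormalDist().inv_cdf(power)          # desired chance of catching a real effect
    p1, p2 = base_rate, base_rate + min_lift
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / min_lift ** 2)


# Made-up p-values for three metrics tracked in the same experiment.
metrics = {"click_rate": 0.012, "session_length": 0.03, "retention": 0.20}
flags = bonferroni_significant(metrics)
print(flags)

# Hypothetical planning numbers: 10% base rate, smallest lift worth shipping is 0.8 points.
needed = users_per_arm(base_rate=0.10, min_lift=0.008)
print(needed)
```

Here `session_length` would pass an uncorrected 0.05 threshold but fails once the threshold is divided across three metrics, and detecting a lift that small at 80% power takes on the order of tens of thousands of users per arm.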

Key idea

A/B testing checks whether a model improvement is larger than random noise can explain, using p-values and confidence intervals. Guard against peeking, multiple comparisons, underpowered tests, and tiny effects that are significant but not worth shipping.
