Machine Learning

Statistical Significance in A/B Tests

Tell a real model improvement from random noise before you ship.

Is the difference real?

You launch a new model to half your users and keep the old one for the other half. The new model looks slightly better. Before celebrating, ask whether that gap reflects a real effect or is just random noise.

The hypothesis test

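  • The null hypothesis says there is no true difference between the variants.
  • The p-value is the probability of seeing a gap at least this large if the null hypothesis were true. It is not the probability that the null hypothesis is true.
  • A small p-value, conventionally below 0.05, means a gap this large would be rare under the null hypothesis, so noise alone is an unlikely explanation.
  • A confidence interval gives the range of true effects that are plausible given the data. If a 95% interval excludes zero, the difference is significant at roughly the 0.05 level.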
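
The ideas above can be sketched with a two-proportion z-test, one common choice when the metric is a success rate such as clicks or conversions. This is a minimal illustration under assumptions, not a prescribed method: the function name `two_proportion_test` and the user counts are hypothetical, and the normal approximation assumes reasonably large samples in both groups.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_test(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Two-sided z-test and confidence interval for the rate difference (B minus A)."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    diff = p_b - p_a

    # Under the null both groups share one rate, so pool them for the test's standard error.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se_null = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = diff / se_null
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # The interval does not assume the null, so it uses the unpooled standard error.
    se_diff = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (diff - z_crit * se_diff, diff + z_crit * se_diff)
    return diff, p_value, ci


# Hypothetical counts: 10,000 users per group, 1,000 vs 1,080 successes.
diff, p_value, ci = two_proportion_test(1000, 10_000, 1080, 10_000)
print(f"diff={diff:.4f}  p={p_value:.3f}  95% CI=({ci[0]:.4f}, {ci[1]:.4f})")
```

With these made-up counts the new model is ahead by 0.8 percentage points, yet the p-value lands above 0.05 and the interval includes zero, so the gap could plausibly be noise.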

Pitfalls to avoid

  • Peeking at results repeatedly and stopping as soon as they look significant inflates false positives, because every extra look gives noise another chance to cross the threshold.
  • Testing many metrics multiplies the chance that at least one looks significant by luck, so correct for multiple comparisons.
  • A statistically significant gap can still be too small to matter, so weigh practical significance too.
  • Underpowered tests with too few users often miss real effects.
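
The peeking pitfall is easy to see in a simulation. The sketch below runs A/A tests, where both groups share the same true rate, and compares two stopping rules: declaring a winner at any of several interim looks versus testing once at the planned end. All parameters (number of looks, users per look, the 10% base rate, the seed) are assumptions chosen for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist


def z_test_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided pooled z-test p-value for a difference in rates."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variation observed, so no evidence of a difference
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate_peeking(n_sims=500, looks=5, users_per_look=200, rate=0.10, alpha=0.05, seed=0):
    """Run A/A tests (no true difference) and count false positives two ways."""
    rng = random.Random(seed)
    stopped_early = 0  # significant at ANY look
    final_only = 0     # significant at the single planned final look
    for _sim in range(n_sims):
        conv_a = conv_b = n = 0
        any_significant = False
        for _look in range(looks):
            # Each look adds a fresh batch of users to both groups.
            conv_a += sum(rng.random() < rate for _ in range(users_per_look))
            conv_b += sum(rng.random() < rate for _ in range(users_per_look))
            n += users_per_look
            p = z_test_p_value(conv_a, n, conv_b, n)
            any_significant = any_significant or p < alpha
        stopped_early += any_significant
        final_only += p < alpha  # p now holds the final look's p-value
    return stopped_early / n_sims, final_only / n_sims


peeking_rate, fixed_rate = simulate_peeking()
print(f"false positive rate with peeking: {peeking_rate:.3f}")
print(f"false positive rate at one planned look: {fixed_rate:.3f}")
```

Because the final look is one of the peeks, the peeking rate can never be lower than the single-look rate. In practice it comes out well above the nominal 0.05, even though no real difference exists in any simulated test.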

Key idea

A/B testing checks whether a model improvement stands out from random noise, using p-values and confidence intervals. Guard against peeking, multiple comparisons, and effects that are significant but too small to be worth shipping.
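
As one way to guard against multiple comparisons, the Bonferroni correction divides the significance threshold by the number of metrics tested. It is a simple, conservative choice among several corrections. The metric names and p-values below are hypothetical.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag metrics whose p-value clears the Bonferroni threshold alpha / m."""
    threshold = alpha / len(p_values)  # m = number of metrics tested
    return {name: p < threshold for name, p in p_values.items()}


# Hypothetical p-values for four metrics tracked in one experiment.
p_values = {"click_rate": 0.004, "conversion": 0.03, "latency": 0.20, "retention": 0.049}
flags = bonferroni_significant(p_values)
print(flags)
```

With four metrics the threshold drops from 0.05 to 0.0125, so only `click_rate` stays significant. `conversion` and `retention` would have passed an uncorrected 0.05 cutoff.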

Check yourself

1. What does a p-value below 0.05 suggest?

2. Why is repeated peeking and early stopping dangerous?