Proving a model is better
Offline metrics suggest a new model is good. An A/B test confirms it by randomly assigning users to the old model (A) or the new model (B) and comparing a business metric on live traffic. Randomization balances confounders across the two groups in expectation, so a difference in the metric can be attributed to the model rather than to who happened to see it.
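One common way to implement the random assignment is to hash a user ID into a bucket, so the same user always sees the same model. The sketch below is a minimal illustration under assumptions: the function name `assign_variant`, the experiment label `"model-test"`, and the 50/50 split are all hypothetical and not taken from any particular platform.

```python
import hashlib


def assign_variant(user_id: str, experiment: str = "model-test",
                   treatment_share: float = 0.5) -> str:
    """Deterministically map a user to 'A' (old model) or 'B' (new model).

    The experiment name is mixed into the hash so that different
    experiments split the same users independently of each other.
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Use the first 8 bytes of the hash as an integer bucket in 0..9999.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return "B" if bucket < treatment_share * 10_000 else "A"


if __name__ == "__main__":
    for uid in ["user-1", "user-2", "user-3"]:
        print(uid, assign_variant(uid))
```

Hashing instead of drawing a fresh random number on each request keeps a user's experience stable across sessions, which matters when the metric is measured per user.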
Designing the test
- Pick a single primary metric decided in advance, plus guardrail metrics.
- Compute the needed sample size from the smallest effect you want to detect, the metric's variance, the significance level, and the desired power.
- Run for full business cycles to absorb weekday and weekend effects.
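The sample size step can be sketched for the common case where the primary metric is a conversion rate. This uses the standard normal-approximation formula for comparing two proportions; the function name, the default significance level of 0.05, and the default power of 0.8 are assumptions chosen for illustration.

```python
import math
from statistics import NormalDist


def sample_size_per_group(baseline_rate: float, min_detectable_lift: float,
                          alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate users needed in EACH arm for a two-sided test.

    baseline_rate: expected conversion rate of model A, e.g. 0.10
    min_detectable_lift: absolute lift to detect, e.g. 0.01 for +1 point
    """
    p1 = baseline_rate
    p2 = baseline_rate + min_detectable_lift
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)           # quantile for desired power
    # Sum of the Bernoulli variances of the two arms.
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    return math.ceil(n)


if __name__ == "__main__":
    # Hypothetical numbers: 10% baseline, detect an absolute lift of 1 point.
    print(sample_size_per_group(0.10, 0.01))
```

Because the required size grows with the inverse square of the lift, halving the effect you want to detect roughly quadruples the traffic needed. Dividing the result by daily traffic gives a duration, which can then be rounded up to whole business cycles.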
Reading the result
- Use a significance test and report a confidence interval, not just a point estimate.
- Beware peeking: stopping early when results look good inflates false positives.
- A statistically significant but tiny lift may not be worth the operational cost.
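As a sketch of the first point, here is a two-proportion z-test that returns the observed lift, a confidence interval, and a p-value. It assumes the primary metric is a conversion rate and that samples are large enough for the normal approximation; the function name and the counts in the usage example are hypothetical, not real results.

```python
import math
from statistics import NormalDist


def two_proportion_test(conv_a: int, n_a: int, conv_b: int, n_b: int,
                        alpha: float = 0.05):
    """Compare conversion rates of A and B.

    Returns (lift, (ci_low, ci_high), p_value), where lift is the
    absolute difference p_b - p_a.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    lift = p_b - p_a

    # Test statistic: pooled standard error under the null of no difference.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = lift / se_pool
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # Confidence interval: unpooled standard error around the observed lift.
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    margin = NormalDist().inv_cdf(1 - alpha / 2) * se
    return lift, (lift - margin, lift + margin), p_value


if __name__ == "__main__":
    # Hypothetical counts, evaluated once at the planned sample size.
    lift, (low, high), p = two_proportion_test(1000, 10_000, 1100, 10_000)
    print(f"lift={lift:.4f}  CI=({low:.4f}, {high:.4f})  p={p:.4f}")
```

The interval speaks to the last bullet as well: if even its upper end is too small to justify the operational cost, a significant p-value alone is not a reason to ship. The test is only valid when run once at the preplanned sample size; running it repeatedly and stopping on the first significant result is the peeking problem.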
Key idea
An A/B test randomly splits users between the old and new models to measure a causal lift on a preregistered metric, with the sample size set in advance and no early stopping based on peeking.