The Ab Test For Models

Proving a model is better

Offline metrics suggest a new model is good. An A B test confirms it by randomly assigning users to the old model A or the new model B and comparing a business metric on live traffic. Randomization removes confounders so the difference is causal.

Designing the test

Pick a single primary metric decided in advance, plus guardrail metrics.
Compute the needed sample size from the effect you want to detect and your variance.
Run for full business cycles to absorb weekday and weekend effects.

Reading the result

Use a significance test and report a confidence interval, not just a point estimate.
Beware peeking, stopping early when results look good inflates false positives.
A statistically significant but tiny lift may not be worth the operational cost.

Key idea

An A B test randomly splits users between old and new models to measure a causal lift on a preregistered metric, with sample size set in advance and no early peeking.

The Ab Test For Models

Proving a model is better

Designing the test

Reading the result

Key idea

Check yourself