← Lessons

quiz vs the machine

Gold1430

Machine Learning

The Ab Test For Models

Randomly splitting users to prove a new model truly beats the old one.

5 min read · core · beat Gold to climb

Proving a model is better

Offline metrics suggest a new model is good. An A B test confirms it by randomly assigning users to the old model A or the new model B and comparing a business metric on live traffic. Randomization removes confounders so the difference is causal.

Designing the test

  • Pick a single primary metric decided in advance, plus guardrail metrics.
  • Compute the needed sample size from the effect you want to detect and your variance.
  • Run for full business cycles to absorb weekday and weekend effects.

Reading the result

  • Use a significance test and report a confidence interval, not just a point estimate.
  • Beware peeking, stopping early when results look good inflates false positives.
  • A statistically significant but tiny lift may not be worth the operational cost.

Key idea

An A B test randomly splits users between old and new models to measure a causal lift on a preregistered metric, with sample size set in advance and no early peeking.

Check yourself

Answer to earn rating on the learn ladder.

1. Why does randomized assignment matter in a model A B test?

2. Why is peeking and stopping early a problem?

3. What is a guardrail metric for?