Machine Learning

AB Testing in Production

Comparing two models on live traffic to measure real impact.

Why test live

Offline metrics predict real-world impact but do not prove it. An AB test splits live traffic so that a new model and the current model run side by side on comparable users, letting you directly measure the business metric that matters.

How it works

  • Users are randomly assigned to a control or treatment group using a stable key, such as a hashed user ID.
  • The control sees model A, the treatment sees model B.
  • You measure the target metric per group, such as conversion or click rate.
  • A statistical test estimates whether the observed difference is likely real or just noise.
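
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the function names, the experiment name "model_b_rollout", and the choice of a SHA-256 hash and a pooled two-proportion z-test are all assumptions made for the example, and the counts in the usage lines are made up.

```python
import hashlib
import math


def assign_group(user_id: str, experiment: str = "model_b_rollout",
                 treatment_share: float = 0.5) -> str:
    """Map a user to 'control' or 'treatment' deterministically.

    Hashing the experiment name together with the user ID gives a stable,
    roughly uniform bucket, so the same user always lands in the same group.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    return "treatment" if bucket < treatment_share else "control"


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # 2 * (1 - Phi(|z|))
    return z, p_value


# Hypothetical usage with made-up counts: model A (control) vs model B (treatment).
group = assign_group("user-42")
z, p_value = two_proportion_z_test(conv_a=200, n_a=2000, conv_b=260, n_b=2000)
```

Including the experiment name in the hash is one way to keep assignments independent across experiments, so a user who is in treatment for one test is not automatically in treatment for the next.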

Getting it right

  • Randomization must be consistent per user so a person stays in one group.
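  • The test must run until it reaches a sample size fixed in advance, large enough to detect the smallest effect you care about. Stopping the moment the result looks significant is a form of peeking.
  • Pick one primary metric ahead of time to avoid cherry-picking after the fact.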
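
To make "long enough" concrete, the required sample size can be estimated before the test starts. The sketch below uses the standard normal-approximation formula for comparing two proportions; the function name, the default significance level of 0.05, the default power of 0.8, and the example baseline and lift are assumptions chosen for illustration.

```python
import math
from statistics import NormalDist


def required_sample_size(baseline_rate: float, min_detectable_lift: float,
                         alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate users needed per group for a two-sided test of two proportions.

    baseline_rate: expected conversion rate of the control (model A).
    min_detectable_lift: smallest absolute improvement worth detecting.
    """
    p1 = baseline_rate
    p2 = baseline_rate + min_detectable_lift
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_power = NormalDist().inv_cdf(power)          # critical value for power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2
    return math.ceil(n)


# Hypothetical example: 10% baseline conversion, detect a 1 point absolute lift.
n_per_group = required_sample_size(baseline_rate=0.10, min_detectable_lift=0.01)
```

Dividing the result by the expected daily traffic per group gives a rough test duration. Halving the detectable lift roughly quadruples the required sample, which is why the target effect should be chosen deliberately.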

Pitfalls

Beware of peeking at results early and stopping as soon as they look significant, which inflates false positives. Also watch guardrail metrics, such as latency or error rate, that the new model might quietly harm.
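
A small simulation can show why peeking is a problem. The sketch below runs many AA tests, where both groups share the same true conversion rate, so every "significant" result is a false positive. It compares checking after every batch against checking once at the end. All names and parameter values are assumptions chosen for illustration, and the exact rates depend on the random seed.

```python
import math
import random


def is_significant(conv_a: int, n_a: int, conv_b: int, n_b: int,
                   alpha: float = 0.05) -> bool:
    """Pooled two-proportion z-test, two-sided, at the given alpha."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return False  # no variation observed yet, nothing to test
    z = (conv_b / n_b - conv_a / n_a) / se
    return math.erfc(abs(z) / math.sqrt(2)) < alpha


def simulate_peeking(experiments: int = 300, looks: int = 10,
                     users_per_look: int = 100, rate: float = 0.10,
                     seed: int = 0):
    """Return (false positive rate with peeking, false positive rate at final look)."""
    rng = random.Random(seed)
    flagged_by_peeking = 0
    flagged_at_end = 0
    for _ in range(experiments):
        conv_a = conv_b = n = 0
        peeked = False
        significant = False
        for _ in range(looks):
            # Both groups draw from the same rate: there is no real difference.
            conv_a += sum(rng.random() < rate for _ in range(users_per_look))
            conv_b += sum(rng.random() < rate for _ in range(users_per_look))
            n += users_per_look
            significant = is_significant(conv_a, n, conv_b, n)
            peeked = peeked or significant  # would have stopped at any look
        flagged_by_peeking += peeked
        flagged_at_end += significant  # only the final look counts
    return flagged_by_peeking / experiments, flagged_at_end / experiments


peeking_rate, final_rate = simulate_peeking()
print(f"false positives with peeking: {peeking_rate:.1%}, at final look only: {final_rate:.1%}")
```

The final-look rate should land near the nominal 5%, while the peeking rate is typically several times higher, because each extra look is another chance for noise to cross the threshold.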

Key idea

AB testing randomly splits live traffic between two models and uses a statistical test on a preregistered metric to judge whether one truly performs better or the difference is just noise.

Check yourself

1. What does an AB test let you measure that offline evaluation cannot?

2. Why must user assignment to a group be consistent per user?

3. Why is peeking at results early a problem?