Why test live
Offline metrics predict real-world impact but do not prove it. An A/B test splits live traffic so that a new model and the current model run side by side, letting you measure the business metric that matters.
How it works
- Users are randomly assigned to a control or treatment group by a stable key.
- The control sees model A, the treatment sees model B.
- You measure the target metric per group, such as conversion or click rate.
- A statistical test estimates whether the observed difference is larger than random noise alone would explain.
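The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the experiment name, the SHA-256 bucketing scheme, and the choice of a pooled two-proportion z-test are all assumptions made for the example, and other tests may suit other metrics better.

```python
import hashlib
import math


def assign_group(user_id: str, experiment: str = "model-b-rollout",
                 treatment_share: float = 0.5) -> str:
    """Deterministically map a user to 'control' or 'treatment'.

    Hashing the experiment name together with the user id (both
    hypothetical here) keeps a user in the same group on every request.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    # Use the first 8 bytes as an integer and reduce it to 10,000 buckets.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return "treatment" if bucket < int(treatment_share * 10_000) else "control"


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both groups convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided tail probability of a standard normal.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    print(assign_group("user-42"))
    # Made-up counts: 100 of 1000 control users convert, 130 of 1000 treatment.
    print(two_proportion_z_test(100, 1000, 130, 1000))
```

Hashing on a stable key rather than drawing a random number per request is what keeps assignment consistent across sessions.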
Getting it right
- Randomization must be consistent per user so a person stays in one group.
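- The test must run until it reaches a sample size fixed in advance, large enough to detect the smallest effect you care about, rather than stopping as soon as the result looks significant.
- Pick one primary metric ahead of time to avoid cherry-picking.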
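How long is long enough can be estimated before the test starts. The sketch below uses the standard normal-approximation formula for comparing two proportions; the function name, the baseline rate, and the minimum detectable effect are illustrative assumptions, not values from any real experiment.

```python
import math
from statistics import NormalDist


def users_per_group(baseline_rate: float, min_detectable_effect: float,
                    alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate users needed in EACH group for a two-sided test.

    min_detectable_effect is an absolute lift, e.g. 0.02 for 10% -> 12%.
    """
    p1 = baseline_rate
    p2 = baseline_rate + min_detectable_effect
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_power = NormalDist().inv_cdf(power)          # quantile for desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2
    return math.ceil(n)


if __name__ == "__main__":
    # Hypothetical numbers: 10% baseline conversion, detect a 2-point lift.
    print(users_per_group(0.10, 0.02))
```

Dividing the result by expected daily traffic per group gives a rough duration. Halving the detectable effect roughly quadruples the required sample.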
Pitfalls
Beware peeking at results early, which inflates false positives, and ignoring guardrail metrics that the new model might quietly harm.
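The effect of peeking can be seen in a small simulation. The sketch below runs many A/A experiments, in which both groups share the same true conversion rate, so every "significant" result is a false positive. It compares declaring a winner at any interim look against checking only once at the end. All parameters (number of experiments, looks, rates) are arbitrary choices for illustration.

```python
import math
import random


def p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    if pooled in (0.0, 1.0):
        return 1.0  # no variation observed, so no evidence of a difference
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (conv_b / n_b - conv_a / n_a) / se
    return math.erfc(abs(z) / math.sqrt(2))


def false_positive_rates(n_experiments: int = 500, group_size: int = 1000,
                         looks: int = 10, rate: float = 0.10,
                         alpha: float = 0.05, seed: int = 0):
    """Return (rate when stopping at any significant look, rate at final look)."""
    rng = random.Random(seed)
    step = group_size // looks
    flagged_with_peeking = 0
    flagged_at_end = 0
    for _ in range(n_experiments):
        conv_a = conv_b = 0
        significant_at_any_look = False
        for look in range(1, looks + 1):
            # Add one batch of users to each group, same true rate in both.
            conv_a += sum(rng.random() < rate for _ in range(step))
            conv_b += sum(rng.random() < rate for _ in range(step))
            n = look * step
            significant = p_value(conv_a, n, conv_b, n) < alpha
            significant_at_any_look = significant_at_any_look or significant
        flagged_with_peeking += significant_at_any_look
        flagged_at_end += significant  # outcome of the final look only
    return flagged_with_peeking / n_experiments, flagged_at_end / n_experiments


if __name__ == "__main__":
    peeking, final_only = false_positive_rates()
    print(f"false positives with peeking: {peeking:.1%}")
    print(f"false positives, final look only: {final_only:.1%}")
```

The final-look rate should land near the chosen alpha, while the peeking rate comes out noticeably higher. If interim looks are genuinely needed, sequential testing methods designed for them are the usual remedy.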
Key idea
A/B testing randomly splits live traffic between two models and uses a statistical test on a preregistered metric to judge whether one truly performs better.