What it is
An A/B test randomly splits users into groups. The control group sees the current model and the treatment group sees the new model. You then compare a business metric between the groups to decide which model is better.
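One common way to implement the split is to hash each user ID into a bucket, so a returning user always lands in the same group. The sketch below assumes this hashing approach; the function name `assign_group`, the experiment label `model_v2_test`, and the 50/50 default are hypothetical choices for illustration, not part of any particular platform.

```python
import hashlib


def assign_group(user_id: str,
                 experiment: str = "model_v2_test",  # hypothetical experiment label
                 treatment_share: float = 0.5) -> str:
    """Deterministically map a user to 'control' or 'treatment'.

    Hashing the experiment name together with the user ID gives a
    pseudo-random but repeatable bucket, and keeps assignments in one
    experiment independent of assignments in another.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    # Take the first 32 bits of the hash and scale them into [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "treatment" if bucket < treatment_share else "control"


# Usage: the same user always receives the same answer.
group = assign_group("user-123")
```

Hashing is pseudo-random rather than truly random, but for assignment purposes it behaves like a coin flip while avoiding the need to store each user's group.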
Why randomization matters
Random assignment makes the groups similar on average, so a difference in the metric can be attributed to the model rather than to who happened to use it. Without randomization, a confounding factor such as time of day could create a difference that has nothing to do with the model.
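The time-of-day example can be made concrete with a little arithmetic. The sketch below uses made-up conversion rates and assumes the new model has no effect at all, then compares a non-random rollout (new model served only in the evening) with a random split. All numbers and names are hypothetical.

```python
def expected_rate(mix, rates):
    """Expected conversion rate of a group, given its share of each time segment."""
    return sum(mix[segment] * rates[segment] for segment in rates)


# Hypothetical conversion rates by time of day. They are the same for both
# models, so the true effect of the new model is exactly zero.
rates = {"day": 0.05, "evening": 0.15}

# Non-random rollout: control runs during the day, treatment in the evening.
control_mix = {"day": 1.0, "evening": 0.0}
treatment_mix = {"day": 0.0, "evening": 1.0}
biased_diff = expected_rate(treatment_mix, rates) - expected_rate(control_mix, rates)

# Random assignment: both groups inherit the same mix in expectation.
population_mix = {"day": 0.5, "evening": 0.5}
random_diff = expected_rate(population_mix, rates) - expected_rate(population_mix, rates)

print(f"non-random rollout: apparent lift = {biased_diff:.2f}")
print(f"random assignment:  expected lift = {random_diff:.2f}")
```

Under the non-random rollout the new model appears to win by about ten percentage points even though it changes nothing. Under random assignment the expected difference is zero, which matches the true effect.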
Reading the result
You do not just compare raw averages, because noise can produce a difference by chance.
- You compute the difference and its statistical significance, often as a p-value
- You decide ahead of time the sample size needed to detect a meaningful effect
- You define a clear primary metric so you are not fishing across many metrics
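The first two steps above can be sketched for the common case where the primary metric is a conversion rate. This assumes a two-sided two-proportion z-test and the standard normal-approximation sample size formula; other metrics call for other tests. The function names and the defaults (5% significance level, 80% power) are illustrative assumptions.

```python
from math import ceil, sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_c, n_c, conv_t, n_t):
    """Return (difference, z, p_value) for treatment rate minus control rate.

    conv_* are conversion counts and n_* are group sizes. Assumes both
    groups are large and the pooled rate is strictly between 0 and 1.
    """
    p_c = conv_c / n_c
    p_t = conv_t / n_t
    pooled = (conv_c + conv_t) / (n_c + n_t)
    se = sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_t))
    z = (p_t - p_c) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided
    return p_t - p_c, z, p_value


def sample_size_per_group(baseline, min_effect, alpha=0.05, power=0.8):
    """Approximate users needed per group to detect an absolute lift of min_effect."""
    p1 = baseline
    p2 = baseline + min_effect
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / min_effect ** 2)


# Usage with made-up numbers: plan the test first, then analyze it once.
needed = sample_size_per_group(baseline=0.10, min_effect=0.02)
diff, z, p_value = two_proportion_z_test(conv_c=100, n_c=1000, conv_t=130, n_t=1000)
print(f"need about {needed} users per group")
print(f"observed lift = {diff:.3f}, p-value = {p_value:.3f}")
```

Computing the sample size before the test starts is what makes the p-value interpretable: the analysis runs once, at the planned size, on the one metric chosen in advance.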
Common pitfalls
- Peeking and stopping early when a result looks good inflates false positives
- A novelty effect can make a new model look better simply because it is new
- Testing many metrics at once raises the chance of a fluke win
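The peeking pitfall is easy to demonstrate with a simulation. The sketch below runs A/A tests, where both groups share the same true conversion rate, so every significant result is a false positive. It compares checking once at the end with checking after every batch and stopping at the first significant result. The batch size, number of looks, and conversion rate are arbitrary assumptions.

```python
import random
from math import sqrt
from statistics import NormalDist


def is_significant(conv_a, conv_b, n, alpha=0.05):
    """Two-sided two-proportion z-test for two groups of equal size n."""
    pooled = (conv_a + conv_b) / (2 * n)
    if pooled in (0.0, 1.0):
        return False  # no variation in the data, nothing to test
    se = sqrt(pooled * (1 - pooled) * 2 / n)
    z = (conv_b / n - conv_a / n) / se
    return 2 * (1 - NormalDist().cdf(abs(z))) < alpha


def simulate_peeking(n_experiments=500, looks=10, batch=100, rate=0.1, seed=0):
    """Return (false positive rate with peeking, false positive rate at the final look)."""
    rng = random.Random(seed)
    flagged_with_peeking = 0
    flagged_at_end = 0
    for _experiment in range(n_experiments):
        conv_a = conv_b = n = 0
        any_look = False
        last_look = False
        for _look in range(looks):
            # Both groups draw from the same rate: there is no real effect.
            conv_a += sum(rng.random() < rate for _ in range(batch))
            conv_b += sum(rng.random() < rate for _ in range(batch))
            n += batch
            last_look = is_significant(conv_a, conv_b, n)
            any_look = any_look or last_look
        flagged_with_peeking += any_look
        flagged_at_end += last_look
    return flagged_with_peeking / n_experiments, flagged_at_end / n_experiments


peek_rate, final_rate = simulate_peeking()
print(f"false positives with peeking:   {peek_rate:.1%}")
print(f"false positives, one final look: {final_rate:.1%}")
```

The single final look should stay near the nominal 5% level, while stopping at the first significant look pushes the false positive rate well above it. Testing many metrics inflates false positives in the same way, because each extra metric is another chance for a fluke.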
Key idea
An A/B test uses random assignment and a significance test so that a measured difference can be attributed to the model and is unlikely to be due to chance.