The only real proof
Offline gains are a hypothesis. An A/B test splits live traffic between the current model and the candidate to estimate the causal effect of the change.
How it works
- Control users keep the current model
- Treatment users get the new model
- Randomize assignment so the groups are comparable
- Measure the primary business metric over a fixed window
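One common way to implement the split is to hash each user ID into a bucket, so a user always lands in the same group without storing any assignment state. The sketch below is illustrative only: the function name `assign_variant`, the experiment label, and the 50/50 share are assumptions, not part of any particular platform.

```python
import hashlib


def assign_variant(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a user to 'control' or 'treatment'."""
    # Salt with the experiment name so separate experiments get unrelated splits.
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # Turn the first 8 hex digits into a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "treatment" if bucket < treatment_share else "control"


# The same user always gets the same answer for a given experiment.
print(assign_variant("user-123", "ranker_v2"))
```

Hashing on the user ID rather than the request keeps each user's experience consistent for the whole measurement window.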
Statistical care
- Pick the primary metric and sample size before you start
- Run until the planned sample size is reached; do not stop early because an interim peek looks significant
- Watch guardrail metrics so a win on one metric does not hide harm elsewhere
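To make the pre-planning concrete, here is a rough sketch for a conversion-rate metric, using the standard normal approximation for a two-sided two-proportion z-test. The function names, the 10% baseline rate, the one-point minimum detectable effect, and the default alpha and power are all assumptions chosen for illustration; other metric types need a different formula.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_arm(p_base: float, mde_abs: float,
                        alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate users needed per arm to detect an absolute lift of mde_abs."""
    p_new = p_base + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = NormalDist().inv_cdf(power)
    variance = p_base * (1 - p_base) + p_new * (1 - p_new)
    return ceil((z_alpha + z_power) ** 2 * variance / mde_abs ** 2)


def two_proportion_p_value(conv_c: int, n_c: int, conv_t: int, n_t: int) -> float:
    """Two-sided p-value for a difference in conversion rates (pooled z-test)."""
    p_c, p_t = conv_c / n_c, conv_t / n_t
    pooled = (conv_c + conv_t) / (n_c + n_t)
    se = sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_t))
    z = (p_t - p_c) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Fix the target before launch (assumed: 10% baseline, +1 point lift).
n_required = sample_size_per_arm(p_base=0.10, mde_abs=0.01)
print(n_required)
```

The test in `two_proportion_p_value` is meant to be run once, after both arms reach `n_required`. Checking it repeatedly and stopping at the first significant result inflates the false positive rate.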
Beware interference
In marketplaces or social graphs, treatment users can affect control users, which breaks the independence between groups. Use cluster-based or geo-based splits when that risk is real.
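A cluster split can be sketched by hashing the cluster key (for example a region) instead of the user ID, so everyone who is likely to interact lands in the same group. The region names, the `user_region` mapping, and the function name below are hypothetical.

```python
import hashlib


def assign_cluster_variant(cluster_id: str, experiment: str) -> str:
    """Assign a whole cluster (e.g. a geo region) to one variant."""
    key = f"{experiment}:{cluster_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    return "treatment" if int(digest, 16) % 2 == 0 else "control"


# Hypothetical mapping from users to their region.
user_region = {"u1": "berlin", "u2": "berlin", "u3": "lisbon", "u4": "lisbon"}

# Every user inherits the variant of their region.
assignments = {
    user: assign_cluster_variant(region, "ranker_v2")
    for user, region in user_region.items()
}
print(assignments)
```

With this design the unit of randomization is the cluster, so the analysis should compare clusters rather than individual users, and the number of clusters drives statistical power.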
Key idea
An A/B test is the causal verdict: randomize traffic, pre-register the metric and sample size, and guard against peeking and interference.