Two evaluation worlds
Offline evaluation uses logged data. Online evaluation uses live traffic. They answer different questions.
- Offline: does the model predict well on held-out data?
- Online: does the model improve the real business metric with real users?
Offline pitfalls
- Temporal leakage: training on the future and testing on the past
- Distribution shift: logged data differs from live traffic
- Proxy mismatch: the offline metric does not track the business goal
When the system's job is to predict the future, always split by time: train on earlier data and test on later data. Random splits leak future information into training.
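A time-based split can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the function name `time_split`, the `ts` field, and the sample rows are all hypothetical, and it assumes each logged row carries a timestamp.

```python
from datetime import datetime

def time_split(rows, cutoff, ts_key="ts"):
    """Return (train, test): train is strictly before cutoff, test is at or after it."""
    train = [r for r in rows if r[ts_key] < cutoff]
    test = [r for r in rows if r[ts_key] >= cutoff]
    return train, test

# Hypothetical logged rows, deliberately out of order.
rows = [
    {"ts": datetime(2024, 1, 5), "label": 0},
    {"ts": datetime(2024, 3, 1), "label": 1},
    {"ts": datetime(2024, 2, 10), "label": 1},
    {"ts": datetime(2024, 3, 20), "label": 0},
]

train, test = time_split(rows, cutoff=datetime(2024, 3, 1))

# Every training row precedes every test row, so nothing from the
# test period can leak into training.
latest_train = max(r["ts"] for r in train)
earliest_test = min(r["ts"] for r in test)
```

A random shuffle of the same rows could place the March 20 row in training and the January 5 row in test, which is exactly the leak the split is meant to prevent.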
Bridge the gap
A model that wins offline can still lose online because of latency, feedback effects, or a metric proxy that did not hold. Use offline as a cheap filter and online as the final judge.
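The filter-then-judge workflow might look like the sketch below. Everything here is an assumption for illustration: the function names, the candidate names, and the scores are made up, and a real online decision would also need a proper experiment design and a significance check, which this sketch omits.

```python
def shortlist_offline(offline_scores, baseline_score, min_gain=0.0):
    """Cheap filter: keep candidates whose offline score beats the baseline by more than min_gain."""
    return [name for name, score in offline_scores.items()
            if score - baseline_score > min_gain]

def pick_online(shortlist, online_lift):
    """Final judge: return the shortlisted candidate with the highest positive online lift, or None."""
    best = None
    for name in shortlist:
        lift = online_lift.get(name)
        if lift is None or lift <= 0:
            continue  # not tested online, or did not improve the business metric
        if best is None or lift > online_lift[best]:
            best = name
    return best

# Hypothetical numbers: "b" has the best offline score but loses online.
offline_scores = {"a": 0.81, "b": 0.84, "c": 0.78}
online_lift = {"a": 0.012, "b": -0.004}  # measured change in the business metric

shortlist = shortlist_offline(offline_scores, baseline_score=0.80)
winner = pick_online(shortlist, online_lift)
```

Candidate "c" never reaches live traffic, which is the saving the offline filter buys; candidate "b" shows why the offline ranking alone is not the final decision.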
Key idea
Offline evaluation screens candidates cheaply; online evaluation on live traffic is the only verdict that counts.