Two kinds of evaluation
Offline evaluation measures a model on a fixed dataset before deployment. You compute metrics such as accuracy or area under the curve (AUC) on a held-out test set. It is fast, cheap, and repeatable.
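As a rough illustration of the offline step, the sketch below computes accuracy and AUC on a tiny held-out set in plain Python. The labels, scores, and the 0.5 decision threshold are made up for the example, and AUC is assumed here to mean area under the ROC curve, computed as the chance that a random positive outscores a random negative.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def roc_auc(y_true, scores):
    """Chance that a random positive outscores a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical held-out test set: true labels and model scores.
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.2, 0.4, 0.7, 0.6, 0.1]

# Assumed decision threshold of 0.5 to turn scores into class predictions.
y_pred = [1 if s >= 0.5 else 0 for s in scores]

test_accuracy = accuracy(y_true, y_pred)
test_auc = roc_auc(y_true, scores)
print(f"accuracy={test_accuracy:.3f} auc={test_auc:.3f}")
```

Because the dataset is fixed, rerunning this gives the same numbers every time, which is what makes offline evaluation repeatable.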
Online evaluation measures the model on live traffic after deployment. You watch business metrics such as click rate or revenue while real users interact with it.
Why offline is not enough
A great offline score does not guarantee a good online result.
- Offline data can differ from live traffic, so the offline score can be optimistic
- Offline metrics like accuracy may not match the goal, which might be revenue or retention
- A model can change user behavior, which offline data never captures
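The second point, that an offline metric may not match the goal, can be seen in a toy comparison. This is only a sketch: the labels, per-user revenue values, and the two models' predictions are invented, and revenue is assumed to be earned only when a user who would convert is also targeted by the model.

```python
# Hypothetical offline records: did the user convert, and what is a conversion worth?
labels = [1, 1, 1, 0, 0, 0]
revenue_if_converted = [1, 1, 100, 0, 0, 0]

# Hypothetical predictions from two candidate models (1 = target this user).
preds_a = [1, 1, 0, 0, 0, 0]
preds_b = [0, 0, 1, 0, 0, 1]


def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def captured_revenue(y_true, y_pred, values):
    """Revenue from users who convert and were targeted (assumed payoff rule)."""
    return sum(v for t, p, v in zip(y_true, y_pred, values) if t == 1 and p == 1)


acc_a = accuracy(labels, preds_a)
acc_b = accuracy(labels, preds_b)
rev_a = captured_revenue(labels, preds_a, revenue_if_converted)
rev_b = captured_revenue(labels, preds_b, revenue_if_converted)

print(f"model A: accuracy={acc_a:.2f} revenue={rev_a}")
print(f"model B: accuracy={acc_b:.2f} revenue={rev_b}")
```

In this made-up case model A wins on accuracy while model B captures far more revenue, so ranking by accuracy alone would pick the wrong model for a revenue goal.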
How they work together
Offline evaluation is a filter. It rejects bad candidates cheaply so only promising models reach live traffic. Online evaluation is the final judge because it measures what actually matters on real users, usually through a controlled experiment such as an A/B test.
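The two stages can be sketched end to end. Everything specific here is an assumption for illustration: the model names, the offline AUC values, the 0.75 cutoff, the click counts, and the choice of a two-proportion z-test as the online experiment's analysis.

```python
import math


def passes_offline_filter(offline_auc, min_auc=0.75):
    """Stage 1: keep only candidates above an assumed offline AUC cutoff."""
    return offline_auc >= min_auc


def two_proportion_z_test(clicks_a, users_a, clicks_b, users_b):
    """Stage 2: compare click rates of control (a) and treatment (b).

    Returns the z statistic and a two-sided p-value under the usual
    normal approximation with a pooled click rate.
    """
    rate_a = clicks_a / users_a
    rate_b = clicks_b / users_b
    pooled = (clicks_a + clicks_b) / (users_a + users_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / users_a + 1 / users_b))
    z = (rate_b - rate_a) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical offline AUC scores for three candidate models.
candidates = {"model_a": 0.71, "model_b": 0.82, "model_c": 0.79}
shortlisted = [name for name, auc in candidates.items() if passes_offline_filter(auc)]

# Hypothetical live experiment: current model (control) vs. one shortlisted model.
z, p_value = two_proportion_z_test(
    clicks_a=1000, users_a=10000,  # control
    clicks_b=1100, users_b=10000,  # treatment
)
print(shortlisted, f"z={z:.2f}", f"p={p_value:.4f}")
```

The offline filter costs almost nothing to run on every candidate, while the experiment spends real traffic, which is why only the shortlisted models reach it.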
Key idea
Offline evaluation cheaply filters candidates, but only online evaluation on real traffic proves business value.