Two different worlds
Recommenders are tuned offline on logged data, then judged online by live user behavior. A frustrating reality is the offline-online gap: a model with better offline metrics sometimes performs worse in a live experiment, and vice versa.
Where the gap comes from
- Distribution shift: offline data was collected by the old system, so it reflects past choices, not what the new model would show.
- Feedback effects: online, the new model changes what users see, generating data the offline evaluation never had.
- Metric mismatch: offline accuracy may not track the online goal, such as long-term retention or revenue.
- Position and selection bias: logged labels are tangled with where and whether items were shown.
Closing the gap
- Use counterfactual (off-policy) estimators that reweight logged data toward the new policy.
- Run an online evaluation through A/B testing and treat it as the source of truth.
- Choose offline metrics that correlate with online outcomes, validated over many past launches.
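The reweighting idea in the first bullet can be sketched with inverse propensity scoring (IPS), one common off-policy estimator. This is a minimal illustration, not a production implementation: the function names, the `clip` parameter, and the assumption that the logs store the probability with which the old system showed each item are all hypothetical.

```python
def _weights(logging_probs, target_probs, clip=None):
    """Importance weights: how much more (or less) likely the new policy
    is to show each logged item than the old policy was."""
    weights = []
    for p_log, p_new in zip(logging_probs, target_probs):
        if p_log <= 0:
            raise ValueError("logging probability must be positive")
        w = p_new / p_log
        if clip is not None:
            w = min(w, clip)  # optional cap on very large weights
        weights.append(w)
    return weights


def ips_estimate(rewards, logging_probs, target_probs, clip=None):
    """IPS estimate of the mean reward the new policy would have earned."""
    if not rewards or not (len(rewards) == len(logging_probs) == len(target_probs)):
        raise ValueError("inputs must be non-empty and of equal length")
    weights = _weights(logging_probs, target_probs, clip)
    return sum(w * r for w, r in zip(weights, rewards)) / len(rewards)


def snips_estimate(rewards, logging_probs, target_probs, clip=None):
    """Self-normalized IPS: divides by the sum of weights instead of n."""
    if not rewards or not (len(rewards) == len(logging_probs) == len(target_probs)):
        raise ValueError("inputs must be non-empty and of equal length")
    weights = _weights(logging_probs, target_probs, clip)
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("new policy has no overlap with the logged data")
    return sum(w * r for w, r in zip(weights, rewards)) / total_weight


# Made-up log: four impressions, each shown with probability 0.5 by the old system.
rewards = [1, 0, 1, 0]           # 1 = click, 0 = no click
logging_probs = [0.5, 0.5, 0.5, 0.5]
target_probs = [1.0, 0.0, 1.0, 0.0]  # new policy only shows the clicked items

print(ips_estimate(rewards, logging_probs, target_probs))
print(snips_estimate(rewards, logging_probs, target_probs))
```

Clipping the weights lowers variance at the cost of some bias, and the self-normalized variant is usually more stable when a few weights dominate. Neither removes the need for the online test; they only make the offline number less tied to what the old system chose to show.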
The discipline
Treat offline metrics as a filter that decides which models earn an expensive online test, not as the final verdict. Track how well offline gains predict online gains so the filter keeps improving.
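One simple way to track how well offline gains predict online gains is to keep a record of past launches and measure agreement between the two. The sketch below assumes each launch is summarized by an offline metric delta and an online metric delta against the same baseline; the function names and the sample numbers are made up for illustration.

```python
from math import sqrt


def sign_agreement(offline_deltas, online_deltas):
    """Fraction of launches where offline and online deltas agree on
    whether the change was an improvement (delta > 0)."""
    pairs = list(zip(offline_deltas, online_deltas))
    if not pairs:
        raise ValueError("need at least one launch")
    agree = sum(1 for off, on in pairs if (off > 0) == (on > 0))
    return agree / len(pairs)


def pearson(xs, ys):
    """Pearson correlation between offline and online deltas."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two or more paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("deltas must not all be identical")
    return cov / sqrt(var_x * var_y)


# Hypothetical history of four launches (illustrative values only).
offline_deltas = [0.1, -0.2, 0.3, 0.4]   # e.g. change in an offline ranking metric
online_deltas = [0.5, -0.1, -0.2, 0.1]   # e.g. change in the online goal metric

print(sign_agreement(offline_deltas, online_deltas))
print(pearson(offline_deltas, online_deltas))
```

If agreement is low for a given offline metric, that metric is a weak filter and a candidate for replacement; recomputing these numbers after each launch shows whether the filter is improving.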
Key idea
The offline-online gap arises from distribution shift, feedback effects, metric mismatch, and position and selection bias; offline metrics filter candidates, while online A/B tests remain the source of truth.