Machine Learning

The Offline-Online Metric Gap

Why a model that wins offline can still lose in production.

6 min read · advanced

Two different worlds

Recommenders are tuned offline on logged data, then judged online by live user behavior. A frustrating reality is the offline-online gap: a model with better offline metrics sometimes performs worse in a live experiment, and vice versa.

Where the gap comes from

  • Distribution shift: offline data was collected by the old system, so it reflects past choices, not what the new model would show.
  • Feedback effects: online, the new model changes what users see, generating data the offline evaluation never had.
  • Metric mismatch: an offline metric such as accuracy may not track the online goal, such as long-term retention or revenue.
  • Position and selection bias: logged labels are tangled with where and whether items were shown.
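
To make the position-bias bullet concrete, here is a minimal Python sketch. Every number and item name in it is hypothetical, and it assumes a simple examination model in which a click happens only if the user first looks at the slot. Under those assumptions, raw logged click rates rank the items differently from their true appeal.

```python
# Hypothetical illustration of position bias in logged clicks.
# Assumption: P(click) = P(user examines the slot) * P(click | examined).

# Probability that a user looks at each slot (assumed values).
EXAMINE_PROB = {1: 1.0, 2: 0.6, 3: 0.3}

# True appeal of each item if the user actually looks at it (assumed values).
TRUE_RELEVANCE = {"A": 0.10, "B": 0.20, "C": 0.30}

# Slot in which the old system showed each item.
LOGGED_SLOT = {"A": 1, "B": 2, "C": 3}


def logged_ctr(item: str) -> float:
    """Click rate the logs would show, given where the item was placed."""
    return EXAMINE_PROB[LOGGED_SLOT[item]] * TRUE_RELEVANCE[item]


def position_corrected_ctr(item: str) -> float:
    """Divide out the examination probability to recover relevance."""
    return logged_ctr(item) / EXAMINE_PROB[LOGGED_SLOT[item]]


def rank(score) -> list:
    """Items sorted from highest to lowest score."""
    return sorted(TRUE_RELEVANCE, key=score, reverse=True)


naive_ranking = rank(logged_ctr)                  # what raw logs suggest
corrected_ranking = rank(position_corrected_ctr)  # after removing slot effect
true_ranking = rank(TRUE_RELEVANCE.get)

print("naive:    ", naive_ranking)
print("corrected:", corrected_ranking)
print("true:     ", true_ranking)
```

In this toy setup the best item, C, looks worst in the raw logs only because the old system buried it in the last slot. In a real system the examination probabilities are not known in advance and would have to be estimated.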

Closing the gap

  • Use counterfactual or off-policy estimators that reweight logged data toward the new policy.
  • Hold out an online evaluation through A/B testing as the source of truth.
  • Choose offline metrics that correlate with online outcomes, validated over many past launches.
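
One common off-policy estimator is inverse propensity scoring (IPS), sketched below in Python. It is offered as an illustration of the reweighting idea in the first bullet, not as the method any particular system uses. The `LoggedEvent` type, the `ips_estimate` function, the `max_weight` clipping parameter, and all data are hypothetical, and the sketch assumes the old policy's display probabilities were logged and that there is no user context.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class LoggedEvent:
    action: str          # item the old policy showed
    reward: float        # observed outcome, e.g. 1.0 for a click
    logging_prob: float  # probability the old policy showed this item


def ips_estimate(
    events: Sequence[LoggedEvent],
    target_prob: Callable[[str], float],
    max_weight: float = 100.0,
) -> float:
    """Estimate the new policy's average reward from old-policy logs."""
    total = 0.0
    for e in events:
        # Reweight each event by how much more (or less) often the new
        # policy would have shown the logged item; clip to limit variance.
        weight = min(target_prob(e.action) / e.logging_prob, max_weight)
        total += weight * e.reward
    return total / len(events)


# Hypothetical logs: the old policy showed A 80% of the time and B 20%.
logs = (
    [LoggedEvent("A", 1.0, 0.8)] * 8 + [LoggedEvent("A", 0.0, 0.8)] * 72
    + [LoggedEvent("B", 1.0, 0.2)] * 6 + [LoggedEvent("B", 0.0, 0.2)] * 14
)

# Hypothetical new policy: flips the exposure to A 20%, B 80%.
new_policy = {"A": 0.2, "B": 0.8}

logged_average = sum(e.reward for e in logs) / len(logs)
ips_value = ips_estimate(logs, lambda action: new_policy[action])

print(f"average reward in the logs: {logged_average:.2f}")
print(f"IPS estimate for new policy: {ips_value:.2f}")
```

Here A is clicked 10% of the time and B 30%, so a policy that shows B more often should do better than the logs' raw average suggests, and the reweighted estimate reflects that. The estimate is only as good as the logged probabilities, and large weights make it noisy, which is why the sketch clips them.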

The discipline

Treat offline metrics as a filter that decides which models earn an expensive online test, not as the final verdict. Track how well offline gains predict online gains so the filter keeps improving.
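
Tracking how well offline gains predict online gains can start as simply as the Python sketch below. The launch history is invented for illustration, and the two summaries shown (sign agreement and Spearman rank correlation) are just one reasonable choice; the helper names are hypothetical.

```python
# Hypothetical launch history: (offline metric delta, online metric delta).
past_launches = [
    (0.020, 0.5),
    (0.025, -0.2),
    (0.035, 1.1),
    (-0.005, -0.4),
    (0.015, 0.3),
]


def sign_agreement(pairs):
    """Fraction of launches where offline and online moved the same way."""
    same = sum((off > 0) == (on > 0) for off, on in pairs)
    return same / len(pairs)


def ranks(values):
    """Rank positions (1 = smallest); assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


def spearman(pairs):
    """Spearman rank correlation via the no-ties shortcut formula."""
    n = len(pairs)
    offline_ranks = ranks([off for off, _ in pairs])
    online_ranks = ranks([on for _, on in pairs])
    d_squared = sum((a - b) ** 2 for a, b in zip(offline_ranks, online_ranks))
    return 1 - 6 * d_squared / (n * (n * n - 1))


agreement = sign_agreement(past_launches)
rho = spearman(past_launches)

print(f"sign agreement: {agreement:.2f}")
print(f"spearman rho:   {rho:.2f}")
```

If these numbers drift toward chance level over time, that is a signal the offline metric has stopped being a useful filter and should be revisited. With only a handful of launches, both summaries are very noisy, so they are best read as a trend rather than a precise score.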

Key idea

The offline-online gap arises from distribution shift, feedback effects, and metric mismatch; offline metrics filter candidates, while online A/B tests remain the source of truth.

Check yourself

1. What is a core cause of the offline-online metric gap?

2. How should offline metrics be used given this gap?