Machine Learning

Offline Recsys Evaluation

Measuring recommender quality on logged data before any live test.

5 min read · advanced

Why offline first

Live experiments are slow and risky, so teams first judge recommenders offline on logged interactions. The goal is a metric that predicts online lift well enough to filter ideas before an A/B test.

Ranking metrics

  • Precision at K is the share of the top K items that are relevant; recall at K is the share of all relevant items that land in the top K.
  • NDCG rewards placing relevant items higher, with a logarithmic position discount.
  • MAP (mean average precision) takes precision at the rank of each relevant item, averages those values per list, then averages across users.

NDCG and MAP reward order, which matters since users scan from the top; precision and recall at K only check whether an item made the top K.
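
The metrics above can be sketched in a few lines of Python. This is a minimal illustration, assuming binary relevance (an item is either relevant or not) and a single user's ranked list; the function names and the toy lists are made up for the example, not taken from any library.

```python
import math

def precision_at_k(ranked, relevant, k):
    """Share of the top-k recommended items that are relevant."""
    return sum(1 for item in ranked[:k] if item in relevant) / k

def recall_at_k(ranked, relevant, k):
    """Share of all relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for item in ranked[:k] if item in relevant) / len(relevant)

def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG: log2-discounted gain divided by the ideal gain."""
    dcg = sum(
        1.0 / math.log2(pos + 2)  # pos is 0-based, so rank 1 gets log2(2) = 1
        for pos, item in enumerate(ranked[:k])
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

def average_precision(ranked, relevant):
    """Precision at the rank of each relevant item, averaged over relevant items."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

# Toy data (hypothetical): same three items in the top 3, different order.
relevant = {"a", "c"}
good = ["a", "c", "b"]  # relevant items first
bad = ["b", "a", "c"]   # relevant items pushed down

print(precision_at_k(good, relevant, 3), precision_at_k(bad, relevant, 3))
print(ndcg_at_k(good, relevant, 3), ndcg_at_k(bad, relevant, 3))
print(average_precision(good, relevant), average_precision(bad, relevant))
```

Both lists get the same precision at 3, while NDCG and average precision score the first list higher because its relevant items sit nearer the top. MAP would be the mean of average_precision over all users.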

Splitting the data

  • A temporal split trains on the past and tests on the future, mirroring deployment.
  • A random split lets the model train on interactions that happened after the ones it is tested on, so it leaks future signal and tends to overstate quality.
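
A temporal split can be sketched as below. The event fields (user, item, ts), the cutoff value, and the helper leaks_future are assumptions made for illustration; real logs will have their own schema and timestamp type.

```python
def temporal_split(interactions, cutoff):
    """Train on events strictly before the cutoff, test on events at or after it."""
    train = [row for row in interactions if row["ts"] < cutoff]
    test = [row for row in interactions if row["ts"] >= cutoff]
    return train, test

def leaks_future(train, test):
    """True if any training event is later than the earliest test event."""
    if not train or not test:
        return False
    return max(row["ts"] for row in train) > min(row["ts"] for row in test)

# Hypothetical log: ts is an integer timestamp.
events = [
    {"user": "u1", "item": "a", "ts": 10},
    {"user": "u2", "item": "b", "ts": 50},
    {"user": "u1", "item": "c", "ts": 90},
    {"user": "u2", "item": "a", "ts": 120},
    {"user": "u1", "item": "d", "ts": 150},
]

train, test = temporal_split(events, cutoff=100)
print(len(train), len(test), leaks_future(train, test))

# A split that ignores time, as a random split can: a late event lands in train.
mixed_train = [events[0], events[4]]
mixed_test = [events[2]]
print(leaks_future(mixed_train, mixed_test))
```

The time-ordered split never trains on anything later than its test data, while the mixed split does, which is the leak the second bullet describes.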

The hard part

  • Logs are biased by the old system, so an item never shown looks irrelevant even if it was great.
  • Counterfactual estimators reweight logged data to approximate how a new policy would have scored, but they have high variance.
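
One common counterfactual estimator is inverse propensity scoring (IPS), sketched here as an example of the reweighting idea. It assumes each log row stores the probability (propensity) with which the old system showed the item; the row fields, the new_policy_prob callable, and the optional weight clipping are illustrative assumptions, not a specific library's API.

```python
def ips_estimate(logs, new_policy_prob, clip=None):
    """Inverse propensity scoring estimate of the new policy's average reward.

    Each logged reward is weighted by how much more (or less) likely the new
    policy is to take the logged action than the old policy was.
    """
    total = 0.0
    for row in logs:
        weight = new_policy_prob(row["context"], row["action"]) / row["propensity"]
        if clip is not None:
            weight = min(weight, clip)  # clipping lowers variance but adds bias
        total += weight * row["reward"]
    return total / len(logs)

# Hypothetical logs: the old system showed "a" 80% of the time and "b" 20%.
logs = [
    {"context": "u1", "action": "a", "propensity": 0.8, "reward": 0},
    {"context": "u2", "action": "a", "propensity": 0.8, "reward": 1},
    {"context": "u3", "action": "a", "propensity": 0.8, "reward": 0},
    {"context": "u4", "action": "a", "propensity": 0.8, "reward": 0},
    {"context": "u5", "action": "b", "propensity": 0.2, "reward": 1},
]

def always_b(context, action):
    """Hypothetical new policy that always recommends item "b"."""
    return 1.0 if action == "b" else 0.0

print(ips_estimate(logs, always_b))
print(ips_estimate(logs, always_b, clip=2.0))
```

Here the whole estimate rests on the single logged row where "b" was shown, which carries a weight of 5. That dependence on a few heavily weighted rows is where the high variance comes from, and clipping the weight trades some of it for bias.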

The honest conclusion

  • Offline metrics rank ideas but rarely give the true online number, so a live test still decides.

Key idea

Offline evaluation uses temporally split logs and ranking metrics like NDCG to filter recommender ideas, but logging bias means a live A/B test still makes the final call.

Check yourself

1. Why use a temporal split rather than a random split?

2. What does NDCG reward over plain precision at K?

3. Why do offline metrics not fully replace a live test?