Offline Evaluation of Recommender Systems

Why offline first

Live experiments are slow and risky, so teams first judge recommenders offline on logged interactions. The goal is a metric that predicts online lift well enough to filter ideas before an A/B test.

Ranking metrics

Recall at K is the share of a user's relevant items that appear in the top K; precision at K is the share of the top K that is relevant.
NDCG rewards placing relevant items higher, with a logarithmic position discount.
MAP takes the precision measured at the rank of each relevant item in the list, averages it per user, then averages across users.

NDCG and MAP reward order, which matters since users scan from the top. Recall and precision at K only check what lands inside the cutoff, not where.
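
As a rough illustration of the four metrics for a single user, here is a minimal sketch assuming binary relevance (an item is either relevant or not). The function names and the toy item ids are made up for this example; MAP would be the mean of average_precision over all users.

```python
import math


def precision_at_k(ranked, relevant, k):
    """Share of the top-k slots filled by relevant items."""
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / k


def recall_at_k(ranked, relevant, k):
    """Share of the user's relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG with a log2 position discount."""
    # pos is 0-based, so rank 1 gets discount 1 / log2(2) = 1.
    dcg = sum(
        1.0 / math.log2(pos + 2)
        for pos, item in enumerate(ranked[:k])
        if item in relevant
    )
    # Ideal ordering: every relevant item packed at the top.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


def average_precision(ranked, relevant):
    """Mean of precision measured at the rank of each relevant item."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


# Two rankings with the same items in the top 3 but a different order.
relevant = {"a", "c"}
good_order = ["a", "c", "b"]
poor_order = ["b", "a", "c"]

for name, ranked in [("good", good_order), ("poor", poor_order)]:
    print(
        name,
        round(precision_at_k(ranked, relevant, 3), 3),
        round(recall_at_k(ranked, relevant, 3), 3),
        round(ndcg_at_k(ranked, relevant, 3), 3),
        round(average_precision(ranked, relevant), 3),
    )
```

Both rankings get the same precision and recall at 3, while NDCG and average precision drop for the ranking that pushes the relevant items down.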

Splitting the data

A temporal split trains on the past and tests on the future, mirroring deployment.
A random split lets the model train on interactions that happened after some of the test interactions, so it leaks future signal and overstates quality.
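
A temporal split can be sketched in a few lines. This assumes each logged interaction carries a timestamp and that one global cutoff is acceptable; the Interaction type, its field names, and the toy log are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Interaction(NamedTuple):
    user: str
    item: str
    timestamp: int  # assumed to be comparable, e.g. Unix seconds


def temporal_split(
    interactions: List[Interaction], cutoff: int
) -> Tuple[List[Interaction], List[Interaction]]:
    """Train on everything before the cutoff, test on everything at or after it."""
    train = [x for x in interactions if x.timestamp < cutoff]
    test = [x for x in interactions if x.timestamp >= cutoff]
    return train, test


log = [
    Interaction("u1", "i1", 100),
    Interaction("u2", "i2", 200),
    Interaction("u1", "i3", 300),
    Interaction("u2", "i1", 400),
]

train, test = temporal_split(log, cutoff=300)
print(len(train), len(test))
```

Every training interaction precedes every test interaction, so nothing the model sees at training time comes from the period it is tested on.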

The hard part

Logs are biased by the system that produced them: an item that was never shown collects no interactions, so it looks irrelevant even if it was great.
Counterfactual estimators reweight logged data to approximate how a new policy would have scored, but they have high variance.
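
One common estimator of this kind is inverse propensity scoring (IPS). The sketch below assumes the log stores the probability with which the old system showed each item, which many real logs do not. The function name, the tuple layout, and the optional clip parameter are choices made for this illustration, not a reference implementation.

```python
def ips_estimate(logs, new_policy_prob, clip=None):
    """Estimate the average reward a new policy would have earned.

    logs: iterable of (context, action, reward, logging_prob) tuples,
          where logging_prob is the old policy's probability of the action.
    new_policy_prob: function (context, action) -> probability under the new policy.
    clip: optional cap on the importance weight; lowers variance, adds bias.
    """
    total = 0.0
    n = 0
    for context, action, reward, logging_prob in logs:
        if logging_prob <= 0:
            raise ValueError("logging probability must be positive")
        # Upweight actions the new policy favors more than the old one did.
        weight = new_policy_prob(context, action) / logging_prob
        if clip is not None:
            weight = min(weight, clip)
        total += weight * reward
        n += 1
    return total / n if n else 0.0


# Toy log: the old policy showed "a" or "b" with equal probability.
logs = [
    ("u1", "a", 1.0, 0.5),
    ("u2", "b", 0.0, 0.5),
    ("u3", "a", 0.0, 0.5),
    ("u4", "b", 1.0, 0.5),
]


def always_a(context, action):
    """Hypothetical new policy that always recommends item "a"."""
    return 1.0 if action == "a" else 0.0


print(ips_estimate(logs, always_a))            # 0.5
print(ips_estimate(logs, always_a, clip=1.0))  # 0.25
```

The variance comes from the weights: when the old system rarely showed an item the new policy favors, logging_prob is small and a single logged reward can dominate the estimate. Clipping the weight tames that at the cost of bias, as the second call shows.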

The honest conclusion

Offline metrics rank ideas but rarely give the true online number, so a live test still decides.

Key idea

Offline evaluation uses temporally split logs and ranking metrics like NDCG to filter recommender ideas, but logging bias means a live A/B test still makes the final call.