Why offline first
Live experiments are slow and risky, so teams first judge recommenders offline on logged interactions. The goal is a metric that predicts online lift well enough to filter ideas before an A/B test.
Ranking metrics
- Recall at K and precision at K count relevant items in the top K.
- NDCG rewards placing relevant items higher, with a logarithmic position discount.
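- MAP averages the precision measured at each relevant item's rank, then takes the mean over users.
NDCG and MAP reward order, which matters since users scan from the top. Recall and precision at K only count hits and ignore order within the top K.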
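The metrics above can be sketched in a few lines. This is a minimal illustration, assuming binary relevance (an item is either relevant or not) and a single user; the function names and the toy item IDs are hypothetical, not taken from any particular library.

```python
import math

def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k slots filled by relevant items."""
    return sum(1 for item in ranked[:k] if item in relevant) / k

def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for item in ranked[:k] if item in relevant) / len(relevant)

def ndcg_at_k(ranked, relevant, k):
    """DCG with a log2 position discount, normalized by the best possible DCG."""
    dcg = sum(
        1.0 / math.log2(pos + 2)  # pos is 0-based, so rank 1 gets log2(2) = 1
        for pos, item in enumerate(ranked[:k])
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

def average_precision(ranked, relevant):
    """Mean of precision values taken at the rank of each relevant item."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

# Two rankings with the same hits in the top 3 but a different order.
relevant = {"a", "c"}
good_order = ["a", "c", "b", "d"]
poor_order = ["b", "a", "c", "d"]

same_precision = precision_at_k(good_order, relevant, 3) == precision_at_k(poor_order, relevant, 3)
ndcg_prefers_good = ndcg_at_k(good_order, relevant, 3) > ndcg_at_k(poor_order, relevant, 3)
```

Both rankings have the same precision at 3, while NDCG scores the first one higher because its relevant items sit nearer the top. MAP would be the mean of `average_precision` over all test users.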
Splitting the data
- A temporal split trains on the past and tests on the future, mirroring deployment.
- A random split lets the model train on interactions that happened after some test events, which leaks future signal and tends to overstate quality.
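A temporal split can be sketched as follows. This is a minimal illustration, assuming each logged interaction is a `(user, item, timestamp)` tuple with a numeric timestamp; the record layout, the `temporal_split` name, and the sample data are hypothetical.

```python
def temporal_split(interactions, cutoff):
    """Train on everything strictly before the cutoff, test on the rest."""
    train = [row for row in interactions if row[2] < cutoff]
    test = [row for row in interactions if row[2] >= cutoff]
    return train, test

# Toy log: (user, item, timestamp), deliberately not in time order.
log = [
    ("u1", "a", 100),
    ("u2", "b", 250),
    ("u1", "c", 180),
    ("u3", "a", 300),
]

train, test = temporal_split(log, cutoff=200)

# Every training event precedes every test event, so nothing from the
# test period can leak into training.
no_leak = max(t for _, _, t in train) < min(t for _, _, t in test)
```

In practice the cutoff is often chosen so the test window covers a fixed period, such as the last week of logs, but that choice depends on the data volume and is not prescribed here.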
The hard part
- Logs are biased by the old system, so an item never shown looks irrelevant even if it was great.
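- Counterfactual estimators, such as inverse propensity scoring, reweight logged data to approximate how a new policy would have scored, but the estimates have high variance, especially when the new policy differs sharply from the one that produced the logs.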
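One common counterfactual estimator is inverse propensity scoring (IPS), sketched below. It assumes the old system logged the probability with which it showed each item, which many real logs lack. The record layout, the `ips_estimate` name, the optional `clip` parameter, and the toy policy are all hypothetical choices for illustration.

```python
def ips_estimate(logs, new_policy_prob, clip=None):
    """Estimate the average reward a new policy would have earned.

    logs: iterable of (context, action, reward, logging_prob) records,
          where logging_prob is the old system's probability of that action.
    new_policy_prob: function (context, action) -> probability under the new policy.
    clip: optional cap on the importance weight, trading variance for bias.
    """
    logs = list(logs)
    total = 0.0
    for context, action, reward, logging_prob in logs:
        weight = new_policy_prob(context, action) / logging_prob
        if clip is not None:
            weight = min(weight, clip)
        total += weight * reward
    return total / len(logs)

# Toy log with binary rewards (1.0 = click, 0.0 = no click).
logs = [
    ("u1", "a", 1.0, 0.5),
    ("u2", "b", 0.0, 0.5),
    ("u3", "a", 1.0, 0.25),
]

def always_a(context, action):
    """Hypothetical new policy that always recommends item 'a'."""
    return 1.0 if action == "a" else 0.0

unclipped = ips_estimate(logs, always_a)
clipped = ips_estimate(logs, always_a, clip=3.0)
```

On this tiny log the unclipped estimate exceeds 1.0 even though every reward is at most 1.0, which shows how a few large weights can dominate the result. Clipping the weights pulls the estimate down at the cost of some bias.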
The honest conclusion
- Offline metrics rank ideas but rarely give the true online number, so a live test still decides.
Key idea
Offline evaluation uses temporally split logs and ranking metrics like NDCG to filter recommender ideas, but logging bias means a live A/B test still makes the final call.