Machine Learning

The Recommendation Evaluation

Reading NDCG, MAP, and recall to judge ranked lists.

Judging an ordered list

Recommendation quality is about the order of results, not just whether an item is relevant. Specialized metrics reward putting relevant items high and penalize burying them, since users rarely scroll far.

Recall and precision at k

  • Recall at k: of all relevant items, the fraction that appear in the top k.
  • Precision at k: of the top k shown, the fraction that are relevant.
  • These ignore order within the top k, so they are coarse but easy to read.
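
The two definitions above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names and the toy item IDs are made up for the example, the ranked list is assumed to contain no duplicates, and precision divides by k even when fewer than k items are shown (one common convention).

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked items that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for item in ranked[:k] if item in relevant)
    # Assumption: divide by k even if the list has fewer than k items.
    return hits / k


def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant items that appear in the top k."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not relevant:
        return 0.0  # assumption: define recall as 0 when nothing is relevant
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


# Hypothetical toy data: five recommended items, three relevant ones.
ranked = ["a", "b", "c", "d", "e"]
relevant = {"a", "c", "f"}

print(precision_at_k(ranked, relevant, 3))  # 2 of the top 3 are relevant
print(recall_at_k(ranked, relevant, 3))     # 2 of the 3 relevant items were found

# Reordering inside the top 3 changes neither score.
reordered = ["c", "b", "a", "d", "e"]
print(precision_at_k(reordered, relevant, 3) == precision_at_k(ranked, relevant, 3))
```

Both scores stay the same when the top three items are shuffled, which is the coarseness noted in the last bullet.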

Mean average precision

For each user, average precision (AP) averages the precision measured at each rank where a relevant item appears. MAP is the mean of AP over users. It rewards placing relevant items earlier, so it captures order, but it treats relevance as binary.
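
A small Python sketch may make the two levels of averaging concrete. It assumes one common convention, dividing each user's precision sum by the total number of relevant items for that user; some implementations divide by the number of relevant items actually retrieved instead. The function names and toy data are illustrative only.

```python
def average_precision(ranked, relevant):
    """Mean of precision values taken at each rank holding a relevant item."""
    if not relevant:
        return 0.0  # assumption: AP is 0 when the user has no relevant items
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    # Assumption: normalize by all relevant items, so missed items cost score.
    return precision_sum / len(relevant)


def mean_average_precision(rankings, relevants):
    """Average AP over users; inputs are parallel lists, one entry per user."""
    scores = [average_precision(r, rel) for r, rel in zip(rankings, relevants)]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical toy data: two users with the same relevant set.
user_1 = ["a", "b", "c", "d"]   # relevant items at ranks 1 and 3
user_2 = ["b", "a", "d", "c"]   # relevant items at ranks 2 and 4
relevant = {"a", "c"}

print(average_precision(user_1, relevant))  # (1/1 + 2/3) / 2
print(average_precision(user_2, relevant))  # (1/2 + 2/4) / 2
print(mean_average_precision([user_1, user_2], [relevant, relevant]))
```

The first list scores higher because its relevant items sit at ranks 1 and 3 rather than 2 and 4, even though both lists contain the same two relevant items.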

Normalized discounted cumulative gain

  • Discounted cumulative gain (DCG) sums each item's relevance divided by a discount that grows with its rank, commonly log2(rank + 1), so deeper items count less.
  • NDCG divides DCG by the ideal DCG, the DCG of the best possible ordering, so the score reaches one when the list is perfectly ordered.
  • It handles graded relevance and applies a position discount, which makes it a widely used default for ranking quality.
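
As a rough sketch of the bullets above, the following Python computes DCG and NDCG from a list of graded relevance scores given in ranked order. Several choices here are assumptions rather than something the lesson specifies: the discount is log2(rank + 1), the gain is the raw relevance grade (another common variant uses 2^rel - 1), and the ideal ordering is built only from the items in the list, whereas a full evaluation would usually build it from all judged items for the user.

```python
import math


def dcg(relevances, k=None):
    """Sum of relevance / log2(rank + 1) over the first k positions."""
    rels = relevances if k is None else relevances[:k]
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(rels, start=1))


def ndcg(relevances, k=None):
    """DCG divided by the DCG of the same grades sorted best-first."""
    # Assumption: the ideal list reuses only the grades present in `relevances`.
    ideal = sorted(relevances, reverse=True)
    ideal_dcg = dcg(ideal, k)
    if ideal_dcg == 0:
        return 0.0  # no relevant items, so there is nothing to rank well
    return dcg(relevances, k) / ideal_dcg


# Hypothetical graded relevance (0 = irrelevant, 3 = highly relevant).
best_first = [3, 2, 1, 0]
worst_first = [0, 1, 2, 3]

print(ndcg(best_first))   # perfectly ordered list
print(ndcg(worst_first))  # same items, reversed: lower score
```

Both lists contain the same items, so precision and recall over the full list are identical; only the position discount separates them.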

Choosing a metric

Recall at k suits retrieval where coverage matters. NDCG and MAP suit ranking where order matters. Always pair offline metrics with online tests, since offline gains do not guarantee online wins.

Key idea

Recall at k measures coverage of the top k, while MAP and NDCG reward placing relevant items high. NDCG also handles graded relevance and applies a position discount, which makes it a common default for ordered lists.

Check yourself

1. What makes NDCG well suited to evaluating ranked lists?

2. Which metric is most natural for the retrieval stage?

3. Why must offline ranking metrics be paired with online tests?