TF-IDF and BM25 Ranking

Scoring how relevant a document is to a query, not just whether it matches.

Beyond matching

An inverted index tells you which documents match. Ranking tells you how well. The classic intuition has two parts.

Term frequency rewards documents where the query word appears often.
Inverse document frequency rewards rare words and discounts common ones, because a word that appears in every document carries little signal.

Multiplying the two gives the familiar TF-IDF weight: a term scores high when it is frequent in the document but rare across the collection.
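The weighting described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name tf_idf_scores, the whitespace tokenization, the toy documents, and the choice of log(N / df) as the inverse document frequency are all assumptions made for the example (real engines use several variants).

```python
import math
from collections import Counter


def tf_idf_scores(query, docs):
    """Score each tokenized document against the query with plain tf * idf.

    Hypothetical helper for illustration; idf = log(N / df) is one common choice.
    """
    n_docs = len(docs)
    # Document frequency: how many documents contain each term at least once.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts within this document
        score = 0.0
        for term in query:
            if df[term] == 0:
                continue  # term appears nowhere, so it contributes nothing
            idf = math.log(n_docs / df[term])
            score += tf[term] * idf
        scores.append(score)
    return scores


# Toy collection, tokenized by whitespace (an assumption for the sketch).
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the quantum cat".split(),
]
scores = tf_idf_scores(["the", "quantum"], docs)
```

Here "the" occurs in every document, so its idf is log(1) = 0 and it adds nothing to any score. Only the third document contains the rare word "quantum", so it is the only one with a nonzero score.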

Why BM25 wins

Raw term frequency grows without bound, so a page that repeats a word a thousand times scores absurdly high, and long documents pick up more matches simply by containing more words. BM25 adds two corrections:

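Saturation means each extra occurrence adds less than the one before, so a term's contribution approaches a ceiling. It is controlled by a parameter usually called k1, commonly set between 1.2 and 2.0.
Length normalization scales term frequency by the document's length relative to the collection average, so long documents do not win just by holding more words. It is controlled by a parameter usually called b, which ranges from 0 (no normalization) to 1 (full normalization) and is commonly set to 0.75.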
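The two corrections can be sketched as follows. This is a hedged illustration of the commonly cited BM25 formula, not any particular engine's code. The function name bm25_scores, the toy documents, and the defaults k1 = 1.2 and b = 0.75 are assumptions for the example. The idf form with 1 added inside the logarithm is one widely used variant, chosen here because it keeps scores non-negative.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score tokenized documents with a common form of BM25 (illustrative sketch)."""
    n_docs = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n_docs
    # Document frequency: how many documents contain each term at least once.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        # Length normalization: above 1 for longer-than-average documents,
        # below 1 for shorter ones; b = 0 switches it off entirely.
        norm = 1 - b + b * len(doc) / avg_len
        score = 0.0
        for term in query:
            n_t = df[term]
            if n_t == 0:
                continue  # term appears nowhere, so it contributes nothing
            # Assumed idf variant: always positive, smaller for common terms.
            idf = math.log(1 + (n_docs - n_t + 0.5) / (n_t + 0.5))
            f = tf[term]
            # Saturation: this fraction approaches (k1 + 1) as f grows.
            score += idf * f * (k1 + 1) / (f + k1 * norm)
        scores.append(score)
    return scores


# Saturation: equal-length documents with 1, 10, and 100 occurrences of "cat".
same_length_docs = [
    ["cat"] * 1 + ["pad"] * 99,
    ["cat"] * 10 + ["pad"] * 90,
    ["cat"] * 100,
    ["dog"] * 100,
]
s = bm25_scores(["cat"], same_length_docs)

# Length normalization: one occurrence each, in a short and a long document.
t = bm25_scores(["cat"], [["cat", "pad"], ["cat"] + ["pad"] * 9])
```

In the first example the scores rise with each extra occurrence, but the document with 100 occurrences scores less than k1 + 1 = 2.2 times the document with a single occurrence. In the second, the short document outranks the long one even though both contain the query word exactly once.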

These corrections make BM25 the default lexical scorer in many search engines, including Lucene-based ones such as Elasticsearch, because it tends to match human relevance judgments better than raw TF-IDF.

Key idea

BM25 scores relevance by combining term frequency, term rarity, and document length, with saturation so repeated words cannot dominate.

Check yourself

1. What does inverse document frequency reward?

2. What problem does BM25 saturation solve?

3. Why does BM25 apply length normalization?