Beyond matching
An inverted index tells you which documents match a query. Ranking tells you how well each one matches. The classic intuition has two parts:
- Term frequency rewards documents where the query word appears often.
- Inverse document frequency rewards rare words and discounts common ones, because a word that appears in every document carries little signal.
Multiplying the two gives the familiar tf-idf weight.
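To make the product concrete, here is a minimal sketch of one common tf-idf variant: raw term count multiplied by log(N / df), where N is the number of documents and df is the number of documents containing the term. The function name tf_idf, the toy corpus, and the choice of this particular variant are assumptions for illustration; real engines use several different smoothing and scaling schemes.

```python
import math
from collections import Counter

def tf_idf(term, doc_tokens, corpus):
    """Raw term count times log(N / df). One variant among many (assumed here)."""
    tf = Counter(doc_tokens)[term]                # occurrences in this document
    df = sum(1 for d in corpus if term in d)      # documents containing the term
    if df == 0:
        return 0.0                                # term never seen: no signal
    idf = math.log(len(corpus) / df)              # rare terms get a larger weight
    return tf * idf

# Hypothetical toy corpus, already tokenized.
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "chased", "the", "dog"],
]

print(tf_idf("the", corpus[2], corpus))     # 0.0, because "the" is in every document
print(tf_idf("chased", corpus[2], corpus))  # log(3), since only one document has it
```

Note that "the" appears twice in the third document yet scores zero, while a single occurrence of "chased" scores higher: the rarity term decides how much each occurrence is worth.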
Why BM25 wins
Plain term frequency grows without limit, so a page that repeats a word a thousand times scores absurdly high, and long documents pick up matches simply because they contain more words. BM25 corrects both problems:
- Saturation means each extra occurrence adds less than the one before, controlled by a parameter usually written k1.
- Length normalization stops long documents from winning just by holding more words, controlled by a parameter often called b.
These corrections make BM25 the default lexical scorer in many search engines, largely because its rankings tend to agree with human relevance judgments better than raw tf-idf.
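The two corrections can be sketched in a few lines. This is an illustrative version, not any particular engine's implementation: the function name bm25_score and the toy corpus are hypothetical, the saturation parameter is named k1 as in most formulations, the values k1=1.2 and b=0.75 are commonly used defaults assumed here, and the idf formula is one widely used smoothed variant.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_tokens, corpus, k1=1.2, b=0.75):
    """Sum of per-term BM25 contributions for one document (illustrative sketch)."""
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    counts = Counter(doc_tokens)
    # Length normalization: grows with document length; b=0 switches it off.
    norm = 1 - b + b * len(doc_tokens) / avg_len
    score = 0.0
    for term in query_terms:
        tf = counts[term]
        if tf == 0:
            continue
        df = sum(1 for d in corpus if term in d)
        # Smoothed idf variant (assumed); always positive.
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        # Saturation: this fraction approaches (k1 + 1) as tf grows.
        score += idf * tf * (k1 + 1) / (tf + k1 * norm)
    return score

# Hypothetical toy corpus, already tokenized.
corpus = [
    ["cat"] + ["filler"] * 9,    # one occurrence, 10 tokens
    ["cat"] * 1000,              # a thousand occurrences
    ["dog"] * 10,
    ["cat"] + ["filler"] * 99,   # one occurrence, 100 tokens
]

# Saturation in isolation (b=0 disables length normalization).
one = bm25_score(["cat"], corpus[0], corpus, b=0.0)
many = bm25_score(["cat"], corpus[1], corpus, b=0.0)
print(many / one)  # about 2.2, not 1000

# Length normalization: same term count, different document lengths.
short = bm25_score(["cat"], corpus[0], corpus)
long_doc = bm25_score(["cat"], corpus[3], corpus)
print(short > long_doc)  # True
```

With k1 = 1.2, no amount of repetition can push a term's contribution past 2.2 times its idf, and with b = 0.75 the 100-token document scores below the 10-token one even though both contain the query word once.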
Key idea
BM25 scores relevance by combining term frequency, term rarity, and document length, with saturation so repeated words cannot dominate.