Term Frequency Normalization

Why raw term counts mislead and how length normalization fixes it.

The problem with raw counts

If you score documents by raw term frequency, long documents win unfairly. A long page tends to mention any given word more often simply because it has more words, not because it is more relevant.

Two corrections

  • Saturation makes each extra occurrence count less than the last. After a few mentions, the term is clearly relevant; more does not prove much.
  • Length normalization divides by a function of document length so a short focused document can compete with a long one.
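
The first correction, saturation, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a definitive scoring function: the name `saturate`, the form `tf / (tf + k)`, and the constant `k = 1.2` are all choices made here for clarity.

```python
def saturate(tf: float, k: float = 1.2) -> float:
    """Map a raw count into [0, 1) so each extra occurrence adds less.

    k is an assumed saturation constant: a larger k saturates more slowly.
    """
    return tf / (tf + k)


# Print the score after each mention and how much that mention added.
for count in range(1, 6):
    gain = saturate(count) - saturate(count - 1)
    print(f"mention {count}: score {saturate(count):.3f} (+{gain:.3f})")
```

The gain printed for each mention is smaller than the one before it, and the score never reaches 1 no matter how often the term repeats.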

A worked intuition

Imagine a query term appears five times in a short article and five times in a giant manual. The short article is more likely to be about that term, because the term makes up a larger share of its text. Length normalization captures that intuition by scaling the count by the document's length relative to the average document length.
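
To put numbers on that intuition, here is a small Python sketch. The document lengths (300 and 30,000 words), the corpus average (1,000 words), and the function name `normalized_count` are hypothetical values picked for illustration.

```python
def normalized_count(tf: float, doc_len: int, avg_doc_len: float) -> float:
    """Scale the raw count by document length relative to the corpus average."""
    return tf / (doc_len / avg_doc_len)


AVG_DOC_LEN = 1_000  # assumed corpus average, in words

# Both documents mention the term five times; only their lengths differ.
short_article = normalized_count(5, doc_len=300, avg_doc_len=AVG_DOC_LEN)
giant_manual = normalized_count(5, doc_len=30_000, avg_doc_len=AVG_DOC_LEN)

print(f"short article: {short_article:.2f}")
print(f"giant manual:  {giant_manual:.2f}")
```

With these assumed lengths, the same five mentions are worth about 16.7 in the article and about 0.17 in the manual, so the short article ranks well ahead.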

The balance knob

Normalization is tunable. Too strong, and you over-penalize long but genuinely thorough documents. Too weak, and verbose pages dominate. Good search systems tune this for their corpus.
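
One common way to expose that knob, used in BM25-style ranking, is a parameter `b` between 0 and 1 that blends no normalization with full normalization. The sketch below assumes that form. The function names, the two example documents, and the corpus average are hypothetical.

```python
def length_factor(doc_len: int, avg_doc_len: float, b: float) -> float:
    """Blend between no normalization (b=0) and full normalization (b=1)."""
    return 1.0 - b + b * (doc_len / avg_doc_len)


def normalized_tf(tf: float, doc_len: int, avg_doc_len: float, b: float) -> float:
    """Divide the raw count by the blended length factor."""
    return tf / length_factor(doc_len, avg_doc_len, b)


AVG_DOC_LEN = 1_000  # assumed corpus average, in words

# Hypothetical pair: a thorough 4,000-word guide with 12 mentions
# and a focused 400-word note with 3 mentions.
for b in (0.0, 0.5, 1.0):
    guide = normalized_tf(12, 4_000, AVG_DOC_LEN, b)
    note = normalized_tf(3, 400, AVG_DOC_LEN, b)
    winner = "guide" if guide > note else "note"
    print(f"b={b:.1f}: guide={guide:.2f} note={note:.2f} -> {winner}")
```

With these numbers the long guide wins at `b = 0` and `b = 0.5`, while the short note wins at `b = 1`. Where the crossover should sit depends on the corpus, which is why the setting is tuned instead of fixed.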

Key idea

Saturation and length normalization turn raw term counts into a fair signal so short focused documents are not buried by long verbose ones.

Check yourself

1. Why do raw term counts favor long documents?

2. What does length normalization let a short document do?