The problem with raw counts
If you score documents by raw term frequency, long documents win unfairly. A long page tends to mention any given word more often simply because it contains more words.
Two corrections
- Saturation makes each extra occurrence count less than the last. After a few mentions, the term is clearly relevant; further mentions add little new evidence.
- Length normalization divides by a function of document length so a short focused document can compete with a long one.
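The two corrections can be sketched as small functions. This is a minimal illustration, not a definitive formula: the function names, the constants k = 1.2 and b = 0.75, and the exact functional forms are assumptions borrowed from the common BM25-style scoring family, not something these notes prescribe.

```python
def saturate(tf: float, k: float = 1.2) -> float:
    """Saturation: each extra occurrence adds less than the one before.

    The result rises with tf but never reaches k + 1.
    k is an assumed tuning constant.
    """
    return tf * (k + 1) / (tf + k)


def length_factor(doc_len: float, avg_len: float, b: float = 0.75) -> float:
    """Length normalization: a factor above 1 for longer-than-average
    documents and below 1 for shorter ones.

    b is an assumed constant in [0, 1]: 0 disables normalization, 1 applies it fully.
    """
    return 1 - b + b * (doc_len / avg_len)


# Gains shrink as the count grows.
for tf in (1, 2, 3, 10, 100):
    print(tf, round(saturate(tf), 3))
```

In the combined BM25-style form, the length factor multiplies k inside the denominator, so a long document needs more occurrences to reach the same score.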
A worked intuition
Imagine a query term appears five times in a short article and five times in a giant manual. The short article is more about that term. Length normalization captures that intuition by scaling the count according to how the document's length compares with the average document length in the collection.
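To put numbers on this intuition, here is a small sketch using a BM25-style term score. The document lengths (200 and 20,000 words), the average length (1,000 words), and the constants k = 1.2 and b = 0.75 are all made up for illustration.

```python
def term_score(tf: float, doc_len: float, avg_len: float,
               k: float = 1.2, b: float = 0.75) -> float:
    """BM25-style term score combining saturation and length normalization.

    All parameter values here are illustrative assumptions.
    """
    norm = 1 - b + b * (doc_len / avg_len)  # > 1 for long docs, < 1 for short ones
    return tf * (k + 1) / (tf + k * norm)


AVG_LEN = 1_000  # assumed average document length in the collection

short_article = term_score(tf=5, doc_len=200, avg_len=AVG_LEN)
giant_manual = term_score(tf=5, doc_len=20_000, avg_len=AVG_LEN)

print(f"short article: {short_article:.3f}")
print(f"giant manual:  {giant_manual:.3f}")
```

With these assumed numbers the article scores roughly 2.0 and the manual roughly 0.47, even though both contain the term five times.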
The balance knob
Normalization is tunable. Too strong, and you over-penalize long but genuinely thorough documents. Too weak, and verbose pages dominate. Good search systems tune this for their corpus.
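The effect of the knob can be seen by sweeping it. In this sketch the knob is the parameter b of a BM25-style score (an assumed form), and the two documents are hypothetical: a thorough 20,000-word manual with 40 mentions and a focused 200-word article with 3 mentions, in a collection averaging 1,000 words.

```python
def term_score(tf: float, doc_len: float, avg_len: float,
               k: float = 1.2, b: float = 0.75) -> float:
    """BM25-style term score; b controls the strength of length normalization."""
    norm = 1 - b + b * (doc_len / avg_len)
    return tf * (k + 1) / (tf + k * norm)


AVG_LEN = 1_000  # assumed collection average


def compare(b: float) -> tuple[float, float]:
    """Return (manual score, article score) for a given normalization strength."""
    manual = term_score(tf=40, doc_len=20_000, avg_len=AVG_LEN, b=b)
    article = term_score(tf=3, doc_len=200, avg_len=AVG_LEN, b=b)
    return manual, article


for b in (0.0, 0.25, 0.5, 0.75, 1.0):
    manual, article = compare(b)
    winner = "manual" if manual > article else "article"
    print(f"b={b:.2f}  manual={manual:.3f}  article={article:.3f}  -> {winner}")
```

At b = 0 length is ignored and the manual wins on sheer count; at b = 1 the article wins. Where the crossover should sit depends on the corpus, which is why the value is tuned rather than fixed.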
Key idea
Saturation and length normalization turn raw term counts into a fair signal so short focused documents are not buried by long verbose ones.