TF-IDF Weighting

Plain word counts have a flaw. Common words like "the" and "is" appear everywhere, so they dominate the vector yet carry little meaning. TF-IDF fixes this by reweighting counts to favor words that are both frequent in a document and rare across the corpus.

The score multiplies two factors:

Term frequency: how often a word appears in the current document
Inverse document frequency: large when a word appears in few documents and small when it appears in many
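
One common way to write the two factors is sketched below. It assumes raw counts for term frequency and a logarithmic inverse document frequency; the text does not fix a particular variant, and the symbols t, d, N, and df are introduced here for illustration.

```latex
\[
\text{tf-idf}(t, d) = \text{tf}(t, d) \cdot \text{idf}(t),
\qquad
\text{idf}(t) = \log \frac{N}{\text{df}(t)}
\]
```

Here tf(t, d) is the number of times term t occurs in document d, N is the number of documents in the corpus, and df(t) is the number of documents that contain t. Under this variant, with N = 1000 and the natural logarithm, a word found in 10 documents has idf = log(100), about 4.61, while a word found in 990 documents has idf of about 0.01.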

A word that shows up in nearly every document gets a tiny inverse document frequency, so its weight is crushed toward zero. A word that appears in just a handful of documents but often within one of them gets a high weight, marking it as distinctive.

The result is a vector that highlights the words that make a document special. The word "galaxy" in an astronomy article scores high, while "the" scores low everywhere.
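
The contrast between a distinctive word and a common one can be reproduced with a minimal sketch. It assumes whitespace tokenization, raw-count term frequency, and idf = log(N / df); the function name tf_idf and the three toy documents are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {word: weight} dict per document.

    Assumes raw-count term frequency and idf = log(N / df),
    which is only one of several common variants.
    """
    n_docs = len(docs)
    tokenized = [doc.lower().split() for doc in docs]

    # Document frequency: how many documents contain each word.
    df = Counter(word for tokens in tokenized for word in set(tokens))

    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw counts within this document
        weights.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
    return weights


# Hypothetical toy corpus: "the" and "is" occur in every document.
docs = [
    "the galaxy is a spiral galaxy",
    "the recipe is simple",
    "the match is over",
]
scores = tf_idf(docs)

print(scores[0]["galaxy"])  # 2 * log(3): frequent here, absent elsewhere
print(scores[0]["the"])     # 1 * log(1) = 0.0: present in every document
```

With this particular idf, a word that occurs in every document gets a weight of exactly zero. Library implementations often add smoothing so that such words keep a small nonzero weight.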

TF-IDF is a workhorse in search ranking and text classification. It is cheap, needs no training beyond counting, and turns the bag of words into a far more discriminating signal. It still ignores order and meaning, but the weighting alone often lifts accuracy noticeably.

Key idea

TF-IDF scales word counts by how rare a word is across documents, lifting distinctive terms and suppressing common ones.
