Machine Learning

TF IDF Weighting

Boosting rare informative words over common ones.

4 min read · intro

Plain word counts have a flaw: common words such as "the" and "is" appear in nearly every document, so they dominate the vector yet carry little meaning. TF IDF fixes this by reweighting counts to favor words that are frequent within a document but rare across the corpus.

The score multiplies two factors:

  • Term frequency, how often a word appears in the current document
  • Inverse document frequency, which is large when a word appears in few documents and small when it appears in many
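One common way to write this product is sketched below. It assumes a raw or length-normalized count for term frequency and a plain logarithm for inverse document frequency. The symbols N (number of documents in the corpus) and df(t) (number of documents containing term t) are introduced here for illustration, and libraries often add smoothing or normalization on top of this form.

```latex
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t),
\qquad
\mathrm{idf}(t) = \log \frac{N}{\mathrm{df}(t)}
```

With this variant, a term found in every document has df(t) = N, so idf(t) = log 1 = 0.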

A word that shows up in nearly every document gets a tiny inverse document frequency, so its weight is crushed toward zero. A word that appears in just a handful of documents but often within one of them gets a high weight, marking it as distinctive.

The result is a vector that highlights the words that make a document special. The word "galaxy" in an astronomy article scores high, while "the" scores low everywhere.
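The weighting can be tried on a toy corpus with nothing but the standard library. This is a minimal sketch, not a reference implementation: the three sentences are invented for illustration, term frequency is taken as count divided by document length, and inverse document frequency as the unsmoothed log(N / df). Real libraries typically smooth and normalize differently.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration.
docs = [
    "the galaxy is a vast galaxy of stars",
    "the cat sat on the mat",
    "the market is up and the mood is good",
]

tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: how many documents contain each word at least once.
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))


def tf_idf(tokens):
    """Return {word: tf * idf} for one tokenized document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    counts = Counter(tokens)
    total = len(tokens)
    return {
        word: (count / total) * math.log(n_docs / doc_freq[word])
        for word, count in counts.items()
    }


scores = tf_idf(tokenized[0])
for word in ("galaxy", "is", "the"):
    print(word, round(scores[word], 4))
```

In this corpus "the" occurs in all three documents, so its inverse document frequency is log(1) = 0 and its weight vanishes. "galaxy" occurs twice in the first document and nowhere else, so it gets the largest weight of the three words printed.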

TF IDF is a workhorse in search ranking and text classification. It is cheap, needs no training beyond counting, and turns the bag of words into a far more discriminating signal. It still ignores order and meaning, but the weighting alone often lifts accuracy noticeably.
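As a rough illustration of the search-ranking use, the sketch below scores each document by summing the TF IDF weights of the query words it contains. The corpus, the `score` and `rank` helpers, and the simple additive scoring are all assumptions made for this example. Production search systems usually add length normalization, smoothing, or cosine similarity.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration.
docs = [
    "the telescope tracked the distant galaxy",
    "the chef cooked the pasta",
    "the galaxy cluster and the galaxy core",
]

tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: how many documents contain each word at least once.
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))


def score(query, tokens):
    """Sum tf * idf over the query words present in one document."""
    counts = Counter(tokens)
    total = 0.0
    for word in query.split():
        if word in counts:
            tf = counts[word] / len(tokens)
            idf = math.log(n_docs / doc_freq[word])
            total += tf * idf
    return total


def rank(query):
    """Return document indices ordered from best to worst match."""
    scored = [(score(query, tokens), index) for index, tokens in enumerate(tokenized)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [index for _, index in scored]


ranking = rank("the galaxy")
print(ranking)
```

Because "the" appears in every document here, it contributes nothing to any score, and the ordering is decided entirely by "galaxy". The document that never mentions it lands last.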

Key idea

TF IDF scales word counts by how rare a word is across documents, lifting distinctive terms and suppressing common ones.

Check yourself

1. What does inverse document frequency reward?

2. Why does TF IDF suppress words like "the"?

3. TF IDF combines its two factors by