TF IDF Vectorization

Weight word counts by how rare a word is across the whole corpus.

Plain word counts overvalue common words like "the" and "is". TF IDF fixes this by multiplying a word's frequency inside a document by a measure of how rare the word is across the corpus.

The two parts

  • Term frequency measures how often a word appears in one document, often the raw count or a scaled count.
  • Inverse document frequency measures how informative a word is, based on how many documents contain it: words found in many documents get a low weight, and rare words get a high weight.
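
The lesson does not commit to a specific formula, so the following is one common textbook formulation, given as an assumption. Here f_{t,d} is the raw count of term t in document d, N is the number of documents in the corpus, and df(t) is the number of documents that contain t. These symbols are introduced for illustration.

```latex
\mathrm{tf}(t, d) = f_{t,d}, \qquad
\mathrm{idf}(t) = \log \frac{N}{\mathrm{df}(t)}, \qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
```

Under this definition a term that appears in every document has idf(t) = log 1 = 0. Libraries often use smoothed or rescaled variants, so exact values differ between implementations.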

The product is large when a word is frequent in one document but rare overall, which marks it as distinctive for that document.
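
A minimal sketch of that computation, assuming raw counts for term frequency and log(N / df) for inverse document frequency (one common choice, not the only one). The toy corpus, the whitespace tokenization, and the helper name `tf_idf` are all made up for illustration.

```python
import math
from collections import Counter

# Toy corpus; the documents and the whitespace tokenization are assumptions.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents that contain each word.
df = Counter(word for tokens in tokenized for word in set(tokens))


def tf_idf(word, tokens):
    """Raw count of `word` in one document times log(N / df).

    Hypothetical helper; only call it for words that occur in the corpus,
    otherwise df[word] is 0 and the division fails.
    """
    tf = tokens.count(word)
    idf = math.log(n_docs / df[word])
    return tf * idf


# Weights for every distinct word in the first document.
weights = {word: tf_idf(word, tokenized[0]) for word in set(tokenized[0])}

for word, weight in sorted(weights.items(), key=lambda item: -item[1]):
    print(f"{word:>4}  {weight:.3f}")
```

In the first document, "the" occurs twice but appears in all three documents, so its weight is zero. "mat" occurs once but appears nowhere else, so it receives the largest weight.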

Why it helps

TF IDF downweights stopwords automatically, without a hand-built list, and it highlights the words that characterize a document. It produces informative weights for search ranking, clustering, and classification while keeping the simple, sparse vector form of bag of words.
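
As a rough illustration of both points, the sketch below builds one weight vector per document over a shared vocabulary. It assumes raw counts and an unsmoothed log(N / df) weighting; the corpus and all variable names are invented for the example.

```python
import math
from collections import Counter

# Toy corpus on three unrelated topics; an assumption for illustration.
docs = [
    "the stock market fell",
    "the team won the match",
    "the recipe needs fresh basil",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Shared vocabulary: one vector position per distinct word.
vocab = sorted({word for tokens in tokenized for word in tokens})

# Inverse document frequency per word, unsmoothed log(N / df).
df = Counter(word for tokens in tokenized for word in set(tokens))
idf = {word: math.log(n_docs / df[word]) for word in vocab}

# One row per document, one column per vocabulary word.
vectors = []
for tokens in tokenized:
    counts = Counter(tokens)  # missing words count as 0
    vectors.append([counts[word] * idf[word] for word in vocab])

# The column for "the" is zero everywhere, with no stopword list involved.
the_col = vocab.index("the")
stopword_weights = [row[the_col] for row in vectors]

# Each document touches only a few of the vocabulary positions.
nonzero_per_doc = [sum(1 for x in row if x != 0) for row in vectors]
```

Here "the" appears in every document, so its weight is exactly zero under this formula. A stopword missing from a few documents would instead get a small positive weight, and smoothed variants used by some libraries never reach zero. Each vector has 11 positions but only 3 or 4 nonzero entries, which is the sparsity inherited from bag of words.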

Key idea

TF IDF scores a word high when it is frequent in a document yet rare across the corpus, emphasizing distinctive terms over common ones.

Check yourself

1. What does inverse document frequency reward?

2. Why does TF IDF reduce the influence of stopwords?

3. When is a TF IDF weight largest?