TF-IDF Vectorization

Plain word counts overvalue common words like "the" and "is". TF-IDF fixes this by multiplying a word's frequency inside a document by a measure of how rare the word is across the corpus.

The two parts

Term frequency (TF) measures how often a word appears in one document, usually as the raw count or a scaled count.
Inverse document frequency (IDF) measures how informative a word is, based on how many documents contain it: words that appear in many documents get a low weight, and rare words get a high weight.
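
One common way to write the two parts down is sketched here. The symbols are assumptions introduced for illustration: t is a term, d a document, N the number of documents in the corpus, and df(t) the number of documents that contain t. Libraries often use smoothed or rescaled variants of these formulas.

```latex
\begin{align*}
\mathrm{tf}(t, d) &= \text{number of times } t \text{ occurs in } d \\
\mathrm{idf}(t) &= \log \frac{N}{\mathrm{df}(t)} \\
\text{tf-idf}(t, d) &= \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
\end{align*}
```

With this variant, a word that appears in every document has idf = log(N / N) = 0, so its weight is zero no matter how often it occurs.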

The product is large when a word is frequent in one document but rare overall, which marks it as distinctive for that document.
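
A minimal sketch of this product in plain Python follows. It assumes TF is the raw count, IDF is log(N / df), and tokens are produced by lowercasing and splitting on whitespace. The function name tfidf and the three toy documents are made up for illustration.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {word: weight} dict per document.

    Assumes tf = raw count and idf = log(N / df), one common variant.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: the number of documents that contain each word.
    df = Counter(word for tokens in tokenized for word in set(tokens))

    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw count of each word in this document
        weights.append(
            {word: count * math.log(n_docs / df[word]) for word, count in tf.items()}
        )
    return weights


# Toy corpus, invented for this example.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang",
]
weights = tfidf(docs)
```

In the first document, "the" occurs twice but appears in all three documents, so its weight is zero. "mat" occurs once and only in that document, so it gets the largest weight there, while "cat", which is shared with one other document, lands in between.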

Why it helps

TF-IDF downweights stopwords automatically, without a hand-built list, and it highlights the words that characterize a document. It produces sparse but informative weights for search ranking, clustering, and classification, while keeping the simple vector form of bag-of-words.
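
To make the search-ranking use concrete, here is a rough sketch that stores each document as a sparse dictionary of nonzero weights and ranks documents against a query by cosine similarity. It assumes raw-count TF, IDF = log(N / df), and whitespace tokenization. The helper names build_index, cosine, and rank are hypothetical, and cosine similarity is one common scoring choice rather than the only one.

```python
import math
from collections import Counter


def build_index(docs):
    """Return (idf, vectors); each vector keeps only its nonzero weights."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    df = Counter(word for tokens in tokenized for word in set(tokens))
    idf = {word: math.log(n_docs / df[word]) for word in df}
    vectors = [
        {w: c * idf[w] for w, c in Counter(tokens).items() if idf[w] > 0}
        for tokens in tokenized
    ]
    return idf, vectors


def cosine(a, b):
    """Cosine similarity between two sparse {word: weight} vectors."""
    dot = sum(value * b.get(word, 0.0) for word, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, idf, vectors):
    """Return document indices ordered from best to worst match."""
    counts = Counter(query.lower().split())
    # Words unseen in the corpus, or present in every document, carry no weight.
    q_vec = {w: c * idf[w] for w, c in counts.items() if idf.get(w, 0.0) > 0}
    scores = [cosine(q_vec, vec) for vec in vectors]
    return sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)


# Toy corpus, invented for this example.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang",
]
idf, vectors = build_index(docs)
order = rank("the dog", idf, vectors)
```

The word "the" drops out of every vector because its weight is zero, so the query "the dog" is matched on "dog" alone and the second document ranks first. Each vector stores only a handful of the nine vocabulary words, which is what sparse means here.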

Key idea

TF-IDF scores a word high when it is frequent in a document yet rare across the corpus, emphasizing distinctive terms over common ones.
