TF-IDF Vectorization
Plain word counts overvalue common words like "the" and "is". TF-IDF fixes this by multiplying a word's frequency inside a document by a measure of how rare the word is across the corpus.
The two parts
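- Term frequency (TF) measures how often a word appears in one document, usually as the raw count or a scaled count (for example, divided by the document length).
- Inverse document frequency (IDF) measures how informative a word is, based on how many documents contain it. A common form is the logarithm of the total number of documents divided by the number of documents containing the word. Words that appear in many documents get a low weight; rare words get a high weight.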
The product is large when a word is frequent in one document but rare overall, which marks it as distinctive for that document.
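The two parts and their product can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes whitespace tokenization, raw counts for TF, and IDF defined as log(N / df). Libraries often use smoothed variants. The function name tf_idf and the toy corpus are made up for the example.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Assumptions: tf is the raw count, idf is log(N / df).
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)  # term frequency as raw counts
        weights.append(
            {term: count * math.log(n_docs / df[term])
             for term, count in counts.items()}
        )
    return weights

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang",
]
weights = tf_idf(docs)
```

With this toy corpus, "the" occurs in every document, so its IDF is log(3/3) = 0 and its weight vanishes even though it is the most frequent word in the first document. "mat" occurs in only one document and therefore outweighs "cat", which occurs in two.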
Why it helps
TF-IDF downweights stopwords automatically, without a hand-built list, and it highlights the words that characterize a document. It produces sparse but informative weights for search ranking, clustering, and classification, while keeping the simple vector form of bag-of-words: one dimension per vocabulary term, with zeros for terms the document does not contain.
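As a rough illustration of the search-ranking use, the sketch below turns each document into a fixed-length vector over the vocabulary and ranks documents against a query by cosine similarity. It is one possible setup rather than a standard recipe: the corpus, the helper names vectorize and cosine, the unsmoothed log(N / df) weighting, and the choice of cosine similarity are all assumptions made for the example.

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang in the tree",
]
tokenized = [doc.split() for doc in docs]
vocab = sorted({term for tokens in tokenized for term in tokens})
n_docs = len(docs)

# Document frequency and (unsmoothed) inverse document frequency.
df = Counter(term for tokens in tokenized for term in set(tokens))
idf = {term: math.log(n_docs / df[term]) for term in vocab}

def vectorize(tokens):
    """Map tokens to a TF-IDF vector with one entry per vocabulary term."""
    counts = Counter(tokens)  # missing terms count as 0
    return [counts[term] * idf[term] for term in vocab]

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

doc_vectors = [vectorize(tokens) for tokens in tokenized]
query = vectorize("the dog".split())
scores = [cosine(query, vec) for vec in doc_vectors]
best = max(range(n_docs), key=scores.__getitem__)  # index of the top match
```

Because "the" appears in every document, its weight is zero and it contributes nothing to the match; the ranking is driven entirely by "dog", so the second document scores highest. Most entries in each document vector are zero, which is why practical implementations store these vectors in a sparse format.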
Key idea
TF-IDF scores a word highly when it is frequent in a document yet rare across the corpus, emphasizing distinctive terms over common ones.