The Bag of Words Model
The bag of words model is one of the simplest ways to turn text into numbers a machine learning model can work with. You build a vocabulary of every distinct word in your corpus, then represent each document as a vector of counts, one entry per vocabulary word.
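The construction described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production pipeline: it assumes lowercase, whitespace-only tokenization, and the helper names tokenize, build_vocabulary, and vectorize are made up for this example.

```python
from collections import Counter


def tokenize(text):
    # Assumption: lowercase and split on whitespace; punctuation is not handled.
    return text.lower().split()


def build_vocabulary(corpus):
    # Sorted list of every distinct token, so each word gets a stable column.
    return sorted({token for doc in corpus for token in tokenize(doc)})


def vectorize(doc, vocabulary):
    # One count per vocabulary word; words missing from the document count as 0.
    counts = Counter(tokenize(doc))
    return [counts[word] for word in vocabulary]


corpus = ["the cat sat", "the cat saw the dog"]
vocabulary = build_vocabulary(corpus)
vectors = [vectorize(doc, vocabulary) for doc in corpus]

print(vocabulary)  # ['cat', 'dog', 'sat', 'saw', 'the']
print(vectors)     # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

Sorting the vocabulary is a design choice for reproducibility: every document is then measured against the same columns in the same order.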
The name reveals the key assumption. You pour all the words of a document into a bag and shake it, so order is lost. The phrases "the dog bit the man" and "the man bit the dog" produce identical vectors, even though they describe opposite events.
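A quick check of this claim, again assuming simple lowercase whitespace tokenization (the bag helper is hypothetical and exists only for this sketch):

```python
from collections import Counter


def bag(text):
    # Assumption: lowercase, whitespace tokenization; order is thrown away
    # because a Counter stores only word -> count.
    return Counter(text.lower().split())


a = bag("the dog bit the man")
b = bag("the man bit the dog")
same = a == b

print(a["the"], a["dog"], a["bit"], a["man"])  # 2 1 1 1
print(same)                                    # True
```

Both sentences reduce to the same counts, so any model built on these features cannot tell them apart.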
What survives is which words appear and how often. For many tasks that is surprisingly powerful:
- Topic detection, where the word "finance" signals a finance article
- Spam filtering, where words like "winner" and "free" raise suspicion
- A fast baseline before reaching for heavier models
The downsides matter too. Vectors are sparse and high-dimensional: the vocabulary can hold tens of thousands of words, yet any single document uses only a small fraction of them, so most entries are zero. Word order and grammar are discarded, along with any meaning that depends on them. Synonyms get separate columns even though they share meaning.
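To make the sparsity and synonym points concrete, here is a small sketch that stores only the nonzero entries of each vector as a column-to-count mapping. The toy vocabulary and the sparse_vector helper are invented for illustration; real vocabularies are far larger, which makes the zero entries dominate even more.

```python
from collections import Counter


def sparse_vector(doc, index):
    # Keep only nonzero columns: {column_index: count}.
    # Assumption: lowercase whitespace tokenization; unknown words are skipped.
    counts = Counter(doc.lower().split())
    return {index[word]: n for word, n in counts.items() if word in index}


# Hypothetical toy vocabulary; "car" and "automobile" are synonyms.
vocabulary = ["car", "automobile", "drives", "fast", "the", "road", "on"]
index = {word: column for column, word in enumerate(vocabulary)}

v1 = sparse_vector("the car drives fast", index)
v2 = sparse_vector("the automobile drives fast", index)

print(sorted(v1.items()))  # [(0, 1), (2, 1), (3, 1), (4, 1)]
print(sorted(v2.items()))  # [(1, 1), (2, 1), (3, 1), (4, 1)]

# Fraction of vocabulary columns that are nonzero for the first document.
density = len(v1) / len(vocabulary)
print(density)
```

The two sentences mean nearly the same thing, yet "car" lands in column 0 and "automobile" in column 1, so the representation treats them as unrelated features.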
Despite these limits, bag of words remains a strong starting point. It is cheap to compute, easy to explain, and often good enough for classification.
Key idea
Bag of words represents a document as word counts over a fixed vocabulary, discarding order to gain a simple numeric feature vector.