The Cosine Similarity For Text

Once text is a vector, you need a way to judge how similar two pieces of text are. Cosine similarity measures the angle between two vectors rather than the distance between their tips. It equals one when the vectors point the same way, zero when they are perpendicular, and minus one when they point in opposite directions. For word-count or TF-IDF vectors, whose entries are never negative, the score stays between zero and one.

The angle focus matters for text because of length. A short tweet and a long article about the same topic produce vectors of very different magnitude. Euclidean distance would call them far apart simply because one has bigger counts. Cosine ignores magnitude and asks only about direction, so it captures topic overlap regardless of length.
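To make the length effect concrete, here is a minimal sketch that compares the two measures. The three count vectors and the tiny vocabulary are made up for illustration; they are not taken from any real corpus.

```python
import math

def euclidean_distance(a, b):
    # Straight-line distance between the tips of the two vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    # Dot product divided by the product of the two magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical word counts over the vocabulary ["cat", "dog", "fish"].
tweet = [1, 2, 0]
article = [10, 20, 0]    # same word proportions, ten times the counts
other_tweet = [0, 1, 2]  # different topic, similar length to the tweet

dist_same_topic = euclidean_distance(tweet, article)
dist_other_topic = euclidean_distance(tweet, other_tweet)
cos_same_topic = cosine_similarity(tweet, article)
cos_other_topic = cosine_similarity(tweet, other_tweet)

print(f"Euclidean: same topic {dist_same_topic:.2f}, other topic {dist_other_topic:.2f}")
print(f"Cosine:    same topic {cos_same_topic:.2f}, other topic {cos_other_topic:.2f}")
```

In this toy data, Euclidean distance places the tweet closer to the off-topic tweet than to the on-topic article, while cosine similarity ranks the on-topic article highest.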

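Concretely, cosine similarity is the dot product of the two vectors divided by the product of their magnitudes (their Euclidean norms, not the number of words in the documents). Dividing by the magnitudes performs an implicit normalization: each vector is effectively scaled to unit length before the comparison.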
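The formula can be sketched directly on word counts. This is a minimal illustration, assuming whitespace tokenization and raw counts as weights; the function name, the example sentences, and the choice to return zero for an empty document are assumptions made here, not part of the definition.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as {term: weight}."""
    # Dot product: only terms present in both vectors contribute.
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        # The angle is undefined for a zero vector; returning 0.0 is a
        # convention chosen for this sketch.
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical documents, tokenized by splitting on whitespace.
doc_a = Counter("the cat sat on the mat".split())
doc_b = Counter("the cat lay on the rug".split())

score = cosine_similarity(doc_a, doc_b)
print(f"{score:.2f}")
```

Storing vectors as dictionaries keeps the sketch close to how sparse text vectors behave: most vocabulary terms are absent from any one document, so only shared terms need to be visited for the dot product.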

High cosine means the documents share many weighted words
Low cosine means they barely overlap in content
It pairs naturally with TF-IDF or embedding vectors

A practical tip is to normalize vectors once to unit length up front. Then cosine similarity reduces to a plain dot product, which is fast and lets search systems compare a query against millions of documents efficiently.
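The normalize-once tip might look like the following sketch. The document names, vectors, and the `search` helper are hypothetical, and a real search system would typically use an array library or a vector index rather than plain Python lists.

```python
import math

def normalize(vec):
    # Scale a vector to unit length; a zero vector is returned unchanged.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical document vectors, normalized once when the index is built.
documents = {"pets": [3, 1, 0], "fishing": [0, 1, 4], "mixed": [2, 2, 2]}
index = {name: normalize(vec) for name, vec in documents.items()}

def search(query_vec, index):
    # With unit-length vectors, the dot product is the cosine similarity.
    q = normalize(query_vec)
    scores = ((dot(q, vec), name) for name, vec in index.items())
    return sorted(scores, reverse=True)

ranked = search([6, 2, 0], index)
for score, name in ranked:
    print(f"{name}: {score:.3f}")
```

The square roots and divisions are paid once per document at indexing time, so each query costs only one normalization plus one dot product per candidate.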

Key idea

Cosine similarity compares the angle between text vectors, capturing topic overlap independent of document length.

