Machine Learning

The Cosine Similarity For Text

Comparing documents by the angle between vectors.

4 min read · advanced

Once text is represented as a vector, you need a way to judge whether two pieces are similar. Cosine similarity measures the angle between two vectors rather than the distance between their tips. It equals 1 when the vectors point the same way, 0 when they are perpendicular, and -1 when they point in opposite directions. For vectors with no negative entries, such as word counts or TF-IDF weights, the value stays between 0 and 1.

The focus on angle matters for text because documents vary in length. A short tweet and a long article about the same topic produce vectors of very different magnitude. Euclidean distance would call them far apart simply because one has bigger counts. Cosine ignores magnitude and asks only about direction, so it captures topic overlap regardless of length.
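The length effect can be illustrated with a minimal sketch. The word counts below are made up for illustration, and the "article" is simply the "tweet" scaled by 50, an idealized assumption that real documents only approximate.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between the tips of two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical word counts over a three-word vocabulary.
tweet = [2, 1, 0]
article = [count * 50 for count in tweet]  # same word mix, 50x the counts

distance = euclidean_distance(tweet, article)
similarity = cosine_similarity(tweet, article)

print(f"Euclidean distance: {distance:.2f}")
print(f"Cosine similarity:  {similarity:.4f}")
```

Under these assumptions the Euclidean distance is large (roughly 110) even though the two vectors point in exactly the same direction, while the cosine similarity is 1 up to floating-point rounding.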

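Concretely, cosine similarity is the dot product of the two vectors divided by the product of their magnitudes (Euclidean norms). Dividing by the magnitudes performs an implicit normalization: each vector is effectively rescaled to unit length before the comparison.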
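Written as a formula, with a and b standing for the two vectors (the symbols are chosen here for illustration, not taken from the lesson), the definition and a small worked example look like this:

```latex
\cos\theta
  = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}
  = \frac{\sum_i a_i b_i}{\sqrt{\sum_i a_i^2}\,\sqrt{\sum_i b_i^2}}

% Worked example with a = (3, 4) and b = (4, 3):
a \cdot b = 3 \cdot 4 + 4 \cdot 3 = 24, \qquad
\lVert a \rVert = \lVert b \rVert = \sqrt{3^2 + 4^2} = 5, \qquad
\cos\theta = \frac{24}{5 \cdot 5} = 0.96
```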

  • High cosine means the documents share many weighted words
  • Low cosine means they barely overlap in content
  • It pairs naturally with TF-IDF or embedding vectors
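As a rough illustration of high versus low overlap, the sketch below scores three made-up sentences using raw word counts. The helper names `word_counts` and `cosine_from_counts` are hypothetical, the whitespace tokenizer is a simplification, and raw counts stand in for the weighted values a TF-IDF scheme would supply.

```python
import math
from collections import Counter

def word_counts(text):
    """Very simple tokenizer: lowercase and split on whitespace."""
    return Counter(text.lower().split())

def cosine_from_counts(counts_a, counts_b):
    """Cosine similarity between two sparse word-count vectors."""
    # Only words present in both documents contribute to the dot product.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[word] * counts_b[word] for word in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention assumed here for empty documents
    return dot / (norm_a * norm_b)

doc_a = word_counts("the cat sat on the mat")
doc_b = word_counts("the cat lay on the mat")
doc_c = word_counts("stock prices fell sharply today")

sim_ab = cosine_from_counts(doc_a, doc_b)  # many shared words
sim_ac = cosine_from_counts(doc_a, doc_c)  # no shared words

print(f"a vs b: {sim_ab:.3f}")
print(f"a vs c: {sim_ac:.3f}")
```

In this toy case the repeated word "the" contributes more than half of the shared score. Weighting schemes such as TF-IDF reduce the influence of such common words.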

A practical tip is to normalize vectors once to unit length up front. Then cosine similarity reduces to a plain dot product, which is fast and lets search systems compare a query against millions of documents efficiently.
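The tip above can be sketched as follows. The document names, vectors, and the `search` helper are hypothetical, and the linear scan is only a stand-in for whatever index a real search system would use.

```python
import math

def normalize(vector):
    """Return a unit-length copy of the vector (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        return list(vector)
    return [x / norm for x in vector]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical document vectors, normalized once at indexing time.
documents = {
    "doc_sports": [3.0, 0.0, 1.0],
    "doc_finance": [0.0, 4.0, 1.0],
    "doc_mixed": [2.0, 2.0, 0.0],
}
index = {name: normalize(vec) for name, vec in documents.items()}

def search(query_vector, index, top_k=2):
    """Rank documents by cosine similarity using plain dot products."""
    q = normalize(query_vector)
    # Both sides are unit length, so the dot product is the cosine.
    scored = [(name, dot(q, vec)) for name, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

results = search([1.0, 0.0, 0.0], index)
for name, score in results:
    print(f"{name}: {score:.4f}")
```

The norms are computed once per document rather than once per comparison, so each query costs one normalization plus one dot product per document.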

Key idea

Cosine similarity compares the angle between text vectors, capturing topic overlap independent of document length.

Check yourself

1. What does cosine similarity measure?

2. Why is cosine preferred over Euclidean distance for text?

3. After normalizing vectors to unit length, what does cosine similarity become?