Beyond keywords
Keyword search misses results that express the same meaning in different words. Semantic search represents text as embeddings: vectors positioned so that texts with similar meanings sit close together.
How it works
- An encoder model turns each document into a vector at index time.
- The same encoder turns the query into a vector at query time.
- Retrieval finds documents whose vectors are nearest to the query vector.
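The three steps above can be sketched in a few lines. This is a toy illustration, not a real encoder: `WORD_VECTORS` is a hypothetical hand-made table standing in for a trained model, and the names `encode`, `cosine`, and `search` are made up for this example.

```python
import math

# Hypothetical hand-made word vectors over three "concept" axes:
# (vehicle, cost, food). A real system would use a trained encoder model.
WORD_VECTORS = {
    "car": (1.0, 0.0, 0.0),
    "automobile": (1.0, 0.0, 0.0),
    "cheap": (0.0, 1.0, 0.0),
    "affordable": (0.0, 1.0, 0.0),
    "pasta": (0.0, 0.0, 1.0),
    "recipe": (0.0, 0.0, 1.0),
}

def encode(text):
    """Toy encoder: sum the known word vectors, then scale to unit length."""
    total = [0.0, 0.0, 0.0]
    for word in text.lower().split():
        vec = WORD_VECTORS.get(word)
        if vec:
            total = [t + v for t, v in zip(total, vec)]
    norm = math.sqrt(sum(t * t for t in total))
    return [t / norm for t in total] if norm else total

def cosine(a, b):
    """Cosine similarity of two unit-length vectors (a plain dot product)."""
    return sum(x * y for x, y in zip(a, b))

def search(query, index, k=1):
    """Encode the query with the same encoder and return the k nearest documents."""
    q = encode(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

# Index time: encode every document once.
documents = ["affordable automobile", "pasta recipe"]
index = [(doc, encode(doc)) for doc in documents]

# Query time: the query shares no words with the best match.
results = search("cheap car", index)
print(results)
```

The query "cheap car" retrieves "affordable automobile" even though the two share no keywords, because both map to the same point in the toy vector space.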
Approximate nearest neighbor
Exact nearest neighbor search, in its simplest form, compares the query against every stored vector, which is too slow over millions of vectors. Systems therefore use an approximate nearest neighbor (ANN) index. It trades a small amount of accuracy for a large speedup by searching only the promising regions of the vector space.
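One way to picture "searching only promising regions" is a bucketed index: vectors are grouped around a few centroids, and a query scans only the buckets whose centroids are closest. The sketch below assumes this partitioning scheme for illustration; the source does not name a specific index type. All names (`build_index`, `ann_search`, `nprobe`) are hypothetical, and the centroids are simply the first few vectors rather than trained ones.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_index(vectors, centroids):
    """Assign each vector id to the bucket of its nearest centroid."""
    buckets = {c: [] for c in range(len(centroids))}
    for vid, vec in enumerate(vectors):
        nearest = min(range(len(centroids)),
                      key=lambda c: sq_dist(vec, centroids[c]))
        buckets[nearest].append(vid)
    return buckets

def ann_search(query, vectors, centroids, buckets, nprobe=1):
    """Scan only the nprobe buckets whose centroids are closest to the query.

    Returns (best vector id or None, number of vectors actually compared).
    """
    order = sorted(range(len(centroids)),
                   key=lambda c: sq_dist(query, centroids[c]))
    candidates = [vid for c in order[:nprobe] for vid in buckets[c]]
    if not candidates:
        return None, 0
    best = min(candidates, key=lambda vid: sq_dist(query, vectors[vid]))
    return best, len(candidates)

# Synthetic data: 1000 random 2-D points (assumption, for illustration only).
rng = random.Random(0)
vectors = [[rng.random(), rng.random()] for _ in range(1000)]
centroids = vectors[:8]  # stand-in for trained centroids
buckets = build_index(vectors, centroids)

query = [0.5, 0.5]
best, scanned = ann_search(query, vectors, centroids, buckets, nprobe=1)
print(f"compared {scanned} of {len(vectors)} vectors")
```

Raising `nprobe` scans more buckets, which moves the result closer to the exact answer at the cost of more comparisons. With `nprobe` equal to the number of centroids, the search degenerates into an exact scan.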
Trade-offs
- Strength: recall on paraphrases and concepts that keyword search misses.
- Weakness: exact term matching, such as specific codes or names, where keyword search excels.
Because of this, embeddings are usually combined with keyword retrieval rather than replacing it.
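The notes do not say how the two retrievers are combined. One commonly used option is reciprocal rank fusion, which merges ranked lists using only each document's rank in each list. The sketch below assumes that method; the document ids and the function name `reciprocal_rank_fusion` are hypothetical, and `k=60` is a conventional default rather than a value from the source.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists: each document scores the sum of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

# Hypothetical result lists, best match first.
keyword_hits = ["doc_sku_123", "doc_b", "doc_c"]              # exact-term matches
semantic_hits = ["doc_b", "doc_paraphrase", "doc_sku_123"]    # nearest vectors

fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
print(fused)
```

Documents that rank well in both lists rise to the top, while a document found by only one retriever, such as an exact code match or a paraphrase, still appears in the merged results.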
Key idea
Semantic search encodes text as vectors and retrieves nearest neighbors approximately, capturing meaning that keyword matching misses.