What it is
Reranking is a second stage in retrieval that reorders an initial candidate list by relevance. A fast retriever pulls many candidates, then a slower, more accurate model scores each one against the query and sorts them.
Why two stages
A single vector search is fast but coarse. It encodes the query and each document into separate embeddings and compares only those vectors, so it misses fine interactions between individual query terms and document terms.
- The retriever casts a wide net cheaply, returning maybe the top hundred.
- The reranker examines each query and document pair together and outputs a precise score.
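The two stages above can be sketched in a few lines. This is a toy illustration, not a real system: `embed` is a bag-of-words stand-in for an embedding model, `toy_pair_score` is a hypothetical stand-in for a learned reranker, and all function names are assumptions made for the example.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(query, doc_vecs, k):
    """Stage 1: cheap similarity against precomputed document vectors."""
    q = embed(query)
    order = sorted(range(len(doc_vecs)),
                   key=lambda i: cosine(q, doc_vecs[i]), reverse=True)
    return order[:k]

def toy_pair_score(query, doc):
    """Hypothetical pair scorer: counts query word pairs that appear
    adjacent and in the same order in the document. It reads query and
    document together, which separate word-count vectors cannot do."""
    q, d = query.lower().split(), doc.lower().split()
    doc_bigrams = set(zip(d, d[1:]))
    return sum(1 for bigram in zip(q, q[1:]) if bigram in doc_bigrams)

def rerank(query, docs, candidates, score_pair, top_n):
    """Stage 2: score each (query, document) pair jointly, then sort."""
    scored = [(score_pair(query, docs[i]), i) for i in candidates]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [i for _, i in scored[:top_n]]

docs = [
    "dog bites man",
    "yesterday a man bites dog near the old station",
    "the weather is sunny today",
]
doc_vecs = [embed(d) for d in docs]  # computed once, offline

query = "man bites dog"
shortlist = retrieve(query, doc_vecs, k=2)
top = rerank(query, docs, shortlist, toy_pair_score, top_n=1)
print(shortlist)  # [0, 1]
print(top)        # [1]
```

The retriever prefers document 0 because it shares the same words, ignoring their order. The pair scorer sees that only document 1 says who bit whom, and moves it to the top.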
Cross encoders
The most common reranker is a cross encoder. It feeds the query and a document into one transformer so attention links every query token to every document token. This is far more accurate than comparing separate embeddings, but the score depends on the query, so nothing can be precomputed per document. For that reason it runs only on the shortlist.
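A small sketch of why the joint input is both richer and more expensive. It runs no model. It only builds the combined sequence, lists the token pairs attention can relate, and counts forward passes per query. The `[CLS]` and `[SEP]` markers follow the BERT-style convention and are an assumption, since other models use different markers. The function names and corpus sizes are hypothetical.

```python
def joint_input(query_tokens, doc_tokens):
    """One sequence for one transformer pass (BERT-style markers, assumed)."""
    return ["[CLS]", *query_tokens, "[SEP]", *doc_tokens, "[SEP]"]

def cross_pairs(query_tokens, doc_tokens):
    """Every (query token, document token) pair joint attention can relate."""
    return [(q, d) for q in query_tokens for d in doc_tokens]

def query_time_passes(num_docs, shortlist_size):
    """Model forward passes needed per query under each strategy."""
    return {
        # Documents were embedded offline, so only the query is encoded.
        "separate_embeddings": 1,
        # Nothing is reusable: one joint pass per document in the corpus.
        "cross_encoder_everywhere": num_docs,
        # One query encoding for retrieval, then one joint pass per candidate.
        "two_stage": 1 + shortlist_size,
    }

q = "cheap flights".split()
d = "low cost airline tickets".split()

print(joint_input(q, d))
print(len(cross_pairs(q, d)))                 # 8
print(query_time_passes(1_000_000, 100))
```

With a million documents and a shortlist of one hundred, the two-stage setup needs 101 passes per query instead of a million.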
The result is high recall from the retriever and high precision at the top from the reranker, which matters because an LLM usually reads only the first few passages.
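Recall and precision at the top can be made concrete with two small functions. The document ids, the relevance labels, and both orderings below are invented for illustration and are not measured results.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of all relevant documents found in the first k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the first k results that are relevant."""
    if k <= 0:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / k

relevant = {"d3", "d7"}                                 # made-up labels
retriever_order = ["d1", "d3", "d2", "d5", "d7", "d4"]  # made-up ranking
reranked_order = ["d3", "d7", "d1", "d2", "d5", "d4"]   # made-up ranking

print(recall_at_k(retriever_order, relevant, 6))     # 1.0
print(precision_at_k(retriever_order, relevant, 2))  # 0.5
print(precision_at_k(reranked_order, relevant, 2))   # 1.0
```

In this made-up case the shortlist already contains both relevant passages, so recall is complete. Reranking changes only their position, which is what raises precision in the first two slots.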
Key idea
Reranking adds a precise cross encoder pass over a cheap retriever's shortlist, pushing the most relevant passages to the top where the LLM reads them.