Why a second pass
The vector search that finds candidates uses a bi-encoder, which embeds the query and each passage separately and then compares the vectors. That is fast but coarse, because the passage never sees the query while it is being encoded. A cross-encoder reranker refines the ranking of the top candidates by reading the query and each passage together.
How a cross-encoder differs
- A bi-encoder produces one vector per text in advance, so it scales to millions of passages but cannot model fine-grained interactions between query and passage tokens.
- A cross-encoder feeds the query and one passage into a single model at once, letting every query token attend to every passage token, then outputs a relevance score.
This joint attention captures subtle relevance that separate embeddings miss. The price is that nothing can be precomputed: the model must run once per query-passage pair, so it is far too slow to score the whole index.
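The difference in where the comparison happens can be sketched with toy stand-ins. Everything here is an illustrative assumption rather than a real model: `embed` is a bag-of-words substitute for a bi-encoder, and `joint_score` is a longest-common-subsequence substitute for a cross-encoder's learned attention. The point is only the shape of the two interfaces: one text in and a vector out, versus a (query, passage) pair in and a score out.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bi-encoder: encodes ONE text by itself as a bag-of-words vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Compare two vectors that were produced independently of each other."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def joint_score(query: str, passage: str) -> float:
    """Toy cross-encoder: inspects every (query token, passage token) pair.

    Returns the longest common subsequence of tokens as a fraction of the
    query length, so word order affects the score.
    """
    q, p = query.lower().split(), passage.lower().split()
    if not q:
        return 0.0
    # lcs[i][j] holds the LCS length of q[:i] and p[:j].
    lcs = [[0] * (len(p) + 1) for _ in range(len(q) + 1)]
    for i, q_tok in enumerate(q, start=1):
        for j, p_tok in enumerate(p, start=1):
            if q_tok == p_tok:
                lcs[i][j] = lcs[i - 1][j - 1] + 1
            else:
                lcs[i][j] = max(lcs[i - 1][j], lcs[i][j - 1])
    return lcs[len(q)][len(p)] / len(q)


query = "dog bites man"
same_order = "dog bites man"
reversed_order = "man bites dog"

# Bi-encoder path: passage vectors could be computed long before the query arrives.
query_vec = embed(query)
bi_same = cosine(query_vec, embed(same_order))
bi_reversed = cosine(query_vec, embed(reversed_order))

# Cross-encoder path: each score needs the query and the passage together.
cross_same = joint_score(query, same_order)
cross_reversed = joint_score(query, reversed_order)

print("bi-encoder   :", bi_same, bi_reversed)
print("cross-encoder:", cross_same, cross_reversed)
```

In this toy the two passages receive the same bi-encoder score, while the joint scorer separates them. Real bi-encoders do capture word order within each text, so the gap is exaggerated here; what carries over is that only the joint scorer compares query tokens against passage tokens directly.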
The retrieve-then-rerank pattern
The bi-encoder cheaply narrows millions of passages to a few dozen. The cross-encoder then rescores and reorders only those, so the most relevant passages sit at the top before generation.
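A minimal sketch of the two stages, assuming both scorers are plain functions of (query, passage). The names `retrieve_then_rerank`, `token_overlap`, and `phrase_match` are hypothetical, and the two toy scorers merely stand in for a bi-encoder similarity and a cross-encoder model. A production first stage would search precomputed vectors with a nearest-neighbor index rather than scoring every passage in a loop.

```python
from typing import Callable, Sequence

Scorer = Callable[[str, str], float]


def retrieve_then_rerank(
    query: str,
    passages: Sequence[str],
    cheap_score: Scorer,
    precise_score: Scorer,
    shortlist_size: int = 50,
    top_k: int = 5,
) -> list[str]:
    # Stage 1: rank the whole collection with the cheap scorer, keep a shortlist.
    shortlist = sorted(
        passages, key=lambda p: cheap_score(query, p), reverse=True
    )[:shortlist_size]
    # Stage 2: the expensive scorer runs only on the shortlist.
    reranked = sorted(
        shortlist, key=lambda p: precise_score(query, p), reverse=True
    )
    return reranked[:top_k]


def token_overlap(query: str, passage: str) -> float:
    """Toy stand-in for bi-encoder similarity: count of shared tokens."""
    return float(len(set(query.lower().split()) & set(passage.lower().split())))


precise_calls = 0  # counts how often the expensive scorer is invoked


def phrase_match(query: str, passage: str) -> float:
    """Toy stand-in for a cross-encoder: rewards the exact query phrase."""
    global precise_calls
    precise_calls += 1
    return 1.0 if query.lower() in passage.lower() else 0.0


passages = [
    "the man bites the dog in the park",
    "dog bites man on main street",
    "cats sleep all day",
    "stock markets fell today",
]

best = retrieve_then_rerank(
    "dog bites man", passages, token_overlap, phrase_match,
    shortlist_size=2, top_k=1,
)
print(best, "| expensive scorer calls:", precise_calls)
```

The first stage ties the two dog passages on token overlap, and the second stage breaks the tie in favor of the one containing the query phrase. The expensive scorer is called once per shortlisted passage, not once per passage in the collection.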
Why it matters
This two-stage design gets the accuracy of joint encoding while paying its cost only on a small shortlist: the bi-encoder supplies recall across the whole index, and the cross-encoder supplies precision at the top of the ranking.
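The size of the saving can be made concrete with back-of-the-envelope arithmetic. All three figures below are hypothetical, chosen only for illustration; the per-pair latency in particular is an assumption and not a measurement of any model.

```python
# Hypothetical figures, used only to make the arithmetic concrete.
num_passages = 1_000_000  # passages in the index
shortlist_size = 50       # candidates kept by the bi-encoder
ms_per_pair = 5           # assumed cross-encoder time per (query, passage) pair

# Cross-encoder cost per query if it scored everything vs. only the shortlist.
full_index_ms = num_passages * ms_per_pair
shortlist_ms = shortlist_size * ms_per_pair
reduction = full_index_ms // shortlist_ms

print(f"whole index: {full_index_ms / 1000:.0f} s per query")
print(f"shortlist:   {shortlist_ms} ms per query")
print(f"reduction:   {reduction}x fewer cross-encoder passes")
```

The reduction factor equals the index size divided by the shortlist size, whatever the per-pair latency is. This ignores batching and the cost of the first-stage search, which is small by comparison when passage vectors are precomputed.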
Key idea
A cross-encoder reranker reads the query and a passage jointly to score relevance precisely. It is applied only to the bi-encoder shortlist, so accuracy rises without scoring the whole index.