The funnel
Scoring every document with an expensive model is impractical at scale, so search systems use a funnel: cheap retrieval pulls in many candidates, then progressively costlier rankers reorder and prune a shrinking set.
Typical stages
- Retrieval uses BM25 or vector search to fetch thousands of candidates cheaply.
- First-pass ranking applies a lightweight model to cut the set to hundreds.
- Reranking applies a heavy model, such as a cross encoder, to a few dozen for the final order.
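The stages above can be sketched as a chain of (scorer, keep) pairs, where each stage scores only what the previous one kept. This is a minimal illustration, not a production design: the three scorers are toy stand-ins invented for this example (a real system would plug in BM25 or vector search, a light ranker, and a cross encoder), and the stage sizes are shrunk to fit a five-document corpus.

```python
from typing import Callable, List, Sequence, Tuple

Scorer = Callable[[str, str], float]


def top_k(query: str, docs: Sequence[str], scorer: Scorer, k: int) -> List[str]:
    """Return the k highest-scoring documents, best first."""
    return sorted(docs, key=lambda d: scorer(query, d), reverse=True)[:k]


def funnel(query: str, corpus: Sequence[str],
           stages: Sequence[Tuple[Scorer, int]]) -> List[str]:
    """Run each (scorer, k) stage on the survivors of the previous stage."""
    candidates = list(corpus)
    for scorer, k in stages:
        candidates = top_k(query, candidates, scorer, k)
    return candidates


# Toy scorers (hypothetical stand-ins, ordered from cheapest to costliest).
def term_overlap(query: str, doc: str) -> float:
    """Stand-in for retrieval: count of shared terms."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def overlap_ratio(query: str, doc: str) -> float:
    """Stand-in for a light ranker: fraction of doc words that are query terms."""
    words = doc.lower().split()
    terms = set(query.lower().split())
    return sum(w in terms for w in words) / max(len(words), 1)


def phrase_match(query: str, doc: str) -> float:
    """Stand-in for a heavy reranker: rewards the exact query phrase."""
    return 1.0 if query.lower() in doc.lower() else 0.0


corpus = [
    "neural reranking with cross encoders",
    "reranking neural models for search",
    "cooking pasta at home",
    "neural networks overview",
    "bm25 retrieval basics",
]

# Each stage keeps fewer candidates than the one before it.
stages = [(term_overlap, 4), (overlap_ratio, 3), (phrase_match, 1)]
results = funnel("neural reranking", corpus, stages)
```

The point of the structure is that the costliest scorer runs on the fewest documents: here `phrase_match` sees three candidates, not five, and in a real funnel the gap would be thousands versus a few dozen.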
Why a cross encoder is last
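A cross encoder reads the query and document together, so it can model how they interact directly. That makes it accurate but slow: every query-document pair needs its own forward pass, and nothing can be precomputed before the query arrives. Running it on thousands of candidates would be too expensive, so it only sees the small set that survived the earlier stages.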
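A rough back-of-the-envelope calculation shows why the candidate count matters so much here. The batch size and per-batch latency below are assumed figures chosen only for illustration, not measurements of any particular model, and the function name is hypothetical.

```python
import math


def cross_encoder_latency_ms(num_candidates: int, ms_per_batch: float,
                             batch_size: int) -> float:
    """Estimate latency when batches of (query, doc) pairs run one after another."""
    batches = math.ceil(num_candidates / batch_size)
    return batches * ms_per_batch


MS_PER_BATCH = 20.0  # assumed latency of one forward pass over a batch
BATCH_SIZE = 32      # assumed number of pairs per batch

# Reranking everything retrieval returned versus only the survivors.
all_retrieved_ms = cross_encoder_latency_ms(5000, MS_PER_BATCH, BATCH_SIZE)
survivors_ms = cross_encoder_latency_ms(50, MS_PER_BATCH, BATCH_SIZE)
```

Under these assumptions, 5000 candidates need 157 batches (about 3.1 seconds) while 50 candidates need 2 batches (40 milliseconds). The exact numbers will differ by model and hardware, but the cost grows linearly with the number of pairs scored.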
Balancing the funnel
Each stage trades cost for quality. Widen early stages to improve recall; spend compute late to improve precision at the top. The art is sizing each stage so the budget lands where it matters most.
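One way to reason about sizing is to fix a latency budget and ask how deep the reranker can go once the first pass has taken its share. The sketch below assumes both rankers cost a constant amount per document and ignores retrieval cost; the per-document costs and the budget are made-up values, and the function name is hypothetical.

```python
def affordable_rerank_depth(budget_us: int, first_pass_docs: int,
                            first_pass_us_per_doc: int,
                            rerank_us_per_doc: int) -> int:
    """Documents the reranker can score with what the first pass leaves over."""
    remaining_us = budget_us - first_pass_docs * first_pass_us_per_doc
    return max(0, remaining_us // rerank_us_per_doc)


BUDGET_US = 100_000       # assumed total ranking budget: 100 ms
FIRST_PASS_US = 10        # assumed light-model cost per document
RERANK_US = 1_000         # assumed heavy-model cost per document

# A narrower first pass leaves more budget for reranking, and vice versa.
depth_narrow = affordable_rerank_depth(BUDGET_US, 2_000, FIRST_PASS_US, RERANK_US)
depth_wide = affordable_rerank_depth(BUDGET_US, 5_000, FIRST_PASS_US, RERANK_US)
```

With these figures, scoring 2000 documents in the first pass leaves room to rerank 80, while scoring 5000 leaves room for only 50. Whether the extra recall from the wider first pass is worth the shallower rerank is the trade-off to be tuned.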
Key idea
A reranking funnel spends little compute on many candidates and lots on few, putting the most expensive model only where it decides the top results.