Why latency is hard
A query touches many shards and stages, and the user waits for the slowest one. Median latency can look fine while the tail, the slowest few percent of requests, ruins the experience.
Core techniques
- Caching stores results for popular queries so they skip retrieval entirely.
- Early termination stops scanning a posting list once enough strong candidates are found.
- Tiered indexes put high-quality documents in a small, fast tier that is searched first.
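The three techniques above can be combined in one lookup path. The following is a minimal sketch, not a production design. It assumes each tier is a plain dict mapping a query term to a posting list of (doc_id, score) pairs sorted by descending score. The names `LRUCache`, `scan_with_early_termination`, `search`, and the `k` and `min_score` parameters are all hypothetical and chosen for illustration.

```python
from __future__ import annotations

from collections import OrderedDict


class LRUCache:
    """Tiny least-recently-used cache for query results (not thread-safe)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, query: str) -> list[str] | None:
        if query not in self._items:
            return None
        self._items.move_to_end(query)  # mark as recently used
        return self._items[query]

    def put(self, query: str, results: list[str]) -> None:
        self._items[query] = results
        self._items.move_to_end(query)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used


def scan_with_early_termination(postings, k, min_score):
    """Scan a score-sorted posting list, stopping once k strong hits are found.

    Returns (doc_ids, number_of_postings_examined).
    """
    hits = []
    scanned = 0
    for doc_id, score in postings:
        scanned += 1
        if score < min_score:
            break  # list is sorted by score, so nothing stronger follows
        hits.append(doc_id)
        if len(hits) >= k:
            break  # enough strong candidates: stop early
    return hits, scanned


def search(query, tiers, cache, k=3, min_score=0.5):
    """Check the cache, then search tiers in order (small fast tier first)."""
    cached = cache.get(query)
    if cached is not None:
        return cached  # popular query: retrieval is skipped entirely
    results = []
    for tier in tiers:
        postings = tier.get(query, [])
        hits, _ = scan_with_early_termination(postings, k - len(results), min_score)
        results.extend(h for h in hits if h not in results)
        if len(results) >= k:
            break  # the later, larger tiers are never touched
    cache.put(query, results)
    return results
```

In this sketch a later tier is consulted only when the earlier tiers return fewer than `k` strong hits, so a query that the fast tier answers well never pays for the larger one.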
Taming the tail
Scatter-gather waits for every shard, so a single slow shard delays the whole query. A hedged request sends a duplicate to another replica if the first has not answered within a short delay, then takes whichever response returns first. This trades a little extra load for a much tighter tail.
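A hedged request can be sketched with `asyncio` as follows. This is an illustration under stated assumptions: replicas are modeled as async callables that take a query, `hedged_request` and `fake_replica` are hypothetical names, and `hedge_delay` is a tuning parameter the notes do not specify (a high percentile of observed latency is one common choice). Error handling is omitted, so a failure in the first task to finish propagates to the caller.

```python
import asyncio


async def hedged_request(replicas, query, hedge_delay):
    """Send to the first replica; if it is slow, duplicate to a second one.

    replicas: sequence of at least two async callables taking the query.
    hedge_delay: seconds to wait before sending the duplicate.
    """
    primary = asyncio.create_task(replicas[0](query))
    done, _ = await asyncio.wait({primary}, timeout=hedge_delay)
    if done:
        return primary.result()  # fast path: no extra load generated

    # Primary is slow: send a duplicate and take whichever finishes first.
    backup = asyncio.create_task(replicas[1](query))
    done, pending = await asyncio.wait(
        {primary, backup}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()  # drop the loser so it stops consuming resources
    return done.pop().result()


def fake_replica(name, delay):
    """Build a stand-in replica that answers after `delay` seconds."""
    async def call(query):
        await asyncio.sleep(delay)
        return f"{name}:{query}"
    return call


if __name__ == "__main__":
    slow = fake_replica("slow", 0.5)
    fast = fake_replica("fast", 0.01)
    print(asyncio.run(hedged_request([slow, fast], "q", hedge_delay=0.05)))
```

Because the duplicate is only sent after `hedge_delay` has elapsed, most requests never generate a second call, which is what keeps the extra load small.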
Measure the right thing
Track high percentiles such as p95 and p99, not just the average. A system tuned only for the mean can still feel slow because users remember the worst responses.
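A small worked example shows how far the mean and the tail can diverge. The latency samples below are invented for illustration, and `percentile` is a hypothetical helper using the nearest-rank definition (other definitions interpolate and give slightly different values).

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile for 0 < p <= 100."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Made-up data: 95 requests take 20 ms, 5 requests take 1000 ms.
latencies_ms = [20] * 95 + [1000] * 5

mean_ms = statistics.mean(latencies_ms)   # 69 ms
p50_ms = percentile(latencies_ms, 50)     # 20 ms
p99_ms = percentile(latencies_ms, 99)     # 1000 ms
```

Here the mean of 69 ms and the median of 20 ms both look healthy, while the 99th percentile of 1000 ms reveals that one request in twenty takes a full second.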
Key idea
Latency work mixes caching, early termination, and tiered indexes, plus hedged requests to control the tail that users feel most.