Search Latency Optimization

Why latency is hard

A query fans out across many shards and passes through several stages, and the user waits for the slowest of them. Median latency can look fine while the tail, the slowest few percent of requests, ruins the experience.
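A short calculation shows why fan-out makes the tail matter. The sketch below assumes shards are slow independently of one another, each with the same probability; both are simplifying assumptions, and the function name and the 1% figure are illustrative only.

```python
def prob_slow_query(num_shards: int, p_slow_shard: float) -> float:
    """Probability that at least one of num_shards shards is slow.

    Assumes shards are slow independently with equal probability.
    """
    return 1.0 - (1.0 - p_slow_shard) ** num_shards


# With a 1% chance that any single shard is slow:
for n in (1, 10, 100):
    print(n, round(prob_slow_query(n, 0.01), 3))
# 1 0.01
# 10 0.096
# 100 0.634
```

Under these assumptions, a query that waits on 100 shards hits at least one slow shard about 63% of the time, even though each shard is slow only 1% of the time.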

Core techniques

Caching stores results for popular queries so they skip retrieval entirely.
Early termination stops scanning a posting list once enough strong candidates are found.
Tiered indexes put high-quality documents in a small, fast tier that is searched first.
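The three techniques above can be combined in one toy searcher. This is a minimal sketch, not a production design: the `Posting` layout, the `Searcher` class, the `min_score` threshold, and the sample data are all hypothetical, and the cache is an unbounded dictionary where a real system would likely need eviction and invalidation.

```python
from typing import Dict, List, Tuple

Posting = Tuple[int, float]        # (doc_id, score); hypothetical layout
Index = Dict[str, List[Posting]]   # term -> posting list


def scan(postings: List[Posting], k: int, min_score: float) -> List[Posting]:
    """Early termination: stop once k postings scoring >= min_score are found."""
    hits: List[Posting] = []
    for doc_id, score in postings:
        if len(hits) >= k:
            break  # enough strong candidates, skip the rest of the list
        if score >= min_score:
            hits.append((doc_id, score))
    return hits


class Searcher:
    def __init__(self, fast_tier: Index, full_tier: Index,
                 k: int = 2, min_score: float = 0.5) -> None:
        self.fast_tier, self.full_tier = fast_tier, full_tier
        self.k, self.min_score = k, min_score
        self.cache: Dict[str, List[Posting]] = {}
        self.retrievals = 0  # how many times an index was actually searched

    def search(self, term: str) -> List[Posting]:
        # Caching: a repeated query skips retrieval entirely.
        if term in self.cache:
            return self.cache[term]
        self.retrievals += 1
        # Tiered index: try the small fast tier first.
        hits = scan(self.fast_tier.get(term, []), self.k, self.min_score)
        if len(hits) < self.k:
            # Fall back to the larger tier only for the missing candidates.
            seen = {doc_id for doc_id, _ in hits}
            rest = [p for p in self.full_tier.get(term, []) if p[0] not in seen]
            hits += scan(rest, self.k - len(hits), self.min_score)
        self.cache[term] = hits
        return hits


fast = {"latency": [(1, 0.9), (2, 0.8), (3, 0.7)]}
full = {"latency": [(4, 0.6)], "tail": [(5, 0.2), (6, 0.75)]}
searcher = Searcher(fast, full)
print(searcher.search("latency"))  # [(1, 0.9), (2, 0.8)]
print(searcher.search("tail"))     # [(6, 0.75)]
searcher.search("latency")         # served from the cache
print(searcher.retrievals)         # 2
```

The first query is answered from the fast tier alone and stops after two hits, the second falls through to the full tier, and the repeated query never reaches either index.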

Taming the tail

Scatter-gather waits for every shard, so a single slow shard delays the whole query. Hedged requests send a duplicate to another replica if the first has not answered within a short delay, and take whichever response returns first. This trades a little extra load for a much tighter tail.
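The hedging idea above can be sketched with Python's `asyncio`. Everything named here is an assumption for illustration: `hedged_request`, the `hedge_after` delay, and the two simulated replicas stand in for real network calls, and error handling is left out.

```python
import asyncio
from typing import Awaitable, Callable, List

Replica = Callable[[], Awaitable[str]]  # hypothetical: a call to one replica


async def hedged_request(replicas: List[Replica], hedge_after: float) -> str:
    """Ask replicas[0]; if it is still pending after hedge_after seconds,
    also ask replicas[1] and return whichever answers first."""
    primary = asyncio.ensure_future(replicas[0]())
    done, _ = await asyncio.wait({primary}, timeout=hedge_after)
    if done:
        return primary.result()  # fast path: no duplicate was sent
    backup = asyncio.ensure_future(replicas[1]())
    done, pending = await asyncio.wait(
        {primary, backup}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()  # drop the loser to limit the extra load
    return done.pop().result()


async def slow_replica() -> str:
    await asyncio.sleep(0.5)   # simulated straggler
    return "slow replica"


async def fast_replica() -> str:
    await asyncio.sleep(0.01)  # simulated healthy replica
    return "fast replica"


def demo() -> str:
    return asyncio.run(hedged_request([slow_replica, fast_replica], hedge_after=0.05))


print(demo())  # fast replica
```

Choosing `hedge_after` is the main design decision in this sketch: a short delay sends duplicates often and adds load, while a long delay sends few duplicates but rescues fewer slow requests.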

Measure the right thing

Track high percentiles, such as the 95th and 99th, not just the average. A system tuned only for the mean can still feel slow because users remember the worst responses.
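A small example shows how the mean can hide the tail. The sketch uses the nearest-rank definition of a percentile, which is one of several common definitions, and the latency samples are made up for illustration.

```python
import math
from typing import List


def percentile(samples: List[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical data: 95 fast responses and 5 very slow ones, in milliseconds.
latencies_ms = [20.0] * 95 + [900.0] * 5

mean_ms = sum(latencies_ms) / len(latencies_ms)
print(mean_ms)                       # 64.0
print(percentile(latencies_ms, 50))  # 20.0
print(percentile(latencies_ms, 99))  # 900.0
```

Here the mean of 64 ms and the median of 20 ms both look healthy, while the 99th percentile reveals that some users wait 900 ms.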

Key idea

Latency work combines caching, early termination, and tiered indexes to speed up the typical query, and adds hedged requests to control the tail that users feel most.