Thundering Herd and Jittered Retries

When clients retry in lockstep they hammer a recovering service, so spread retries with jitter.

The thundering herd

A thundering herd happens when many clients wake at the same instant and rush the same resource. A cache key expires, or a service restarts, and thousands of requests arrive in one spike that overwhelms the backend.

Why naive retries make it worse

If every failed client retries after exactly the same delay, the retries also arrive together. The service recovers, gets slammed, and falls over again in a self reinforcing loop.

The fixes

Exponential backoff grows the wait after each failure, thinning the rate of retries.
Jitter adds randomness to each delay so clients no longer line up. Full jitter picks a random wait between zero and the backoff ceiling.
Request coalescing lets one client recompute a hot value while others wait on that single result.

A simple rule

Backoff alone still synchronizes; jitter is what actually scatters the herd.

Key idea

A thundering herd is synchronized load, and the cure is to desynchronize it with exponential backoff plus random jitter so retries spread out instead of stacking into a new spike.

Thundering Herd and Jittered Retries

The thundering herd

Why naive retries make it worse

The fixes

A simple rule

Key idea

Check yourself