A synchronized stampede
The thundering herd problem occurs when many clients wait on the same event, then all wake and act at the same moment, swamping the resource they were waiting for. The surge can knock a service over just as it tries to recover.
Common triggers
- A cache entry expires and every request misses simultaneously, all hitting the database at once.
- A service comes back after an outage and every client reconnects in the same instant.
- A timer fires on the same schedule across many clients, for example a periodic job that every client runs at the top of the hour.
How to tame it
- Jitter: add randomness to timeouts and retry delays so clients spread out rather than synchronizing.
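- Request coalescing: let one request rebuild the cache entry while the others wait for it and reuse its result.
- Exponential backoff: grow the gap between successive retries (for example, doubling it each attempt) so a recovering service is not hit by a wall of requests.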
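Backoff and jitter are usually combined. The following is a minimal sketch in Python, not a definitive implementation. The names backoff_delay and call_with_retries, the 0.5 second base, the 30 second cap, and the choice to retry only on ConnectionError are all assumptions made for illustration. It uses the "full jitter" variant, where each delay is drawn uniformly between zero and an exponentially growing ceiling.

```python
import random
import time


def backoff_delay(attempt, base=0.5, cap=30.0):
    """Return a randomized delay in seconds for a 0-based retry attempt.

    base and cap are illustrative defaults, not recommended values.
    """
    # Exponential backoff: the ceiling doubles each attempt, up to `cap`.
    ceiling = min(cap, base * (2 ** attempt))
    # Full jitter: draw uniformly in [0, ceiling] so that clients which
    # failed at the same moment do not retry at the same moment.
    return random.uniform(0, ceiling)


def call_with_retries(operation, max_attempts=5, sleep=time.sleep):
    """Call `operation`, retrying on ConnectionError with jittered backoff.

    `sleep` is injectable so the function can be exercised without waiting.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            # Out of attempts: surface the last error to the caller.
            if attempt == max_attempts - 1:
                raise
            sleep(backoff_delay(attempt))
```

Without the random draw, every client that failed together would retry together after 0.5, 1, 2, 4 seconds and so on, which only moves the herd later in time rather than dispersing it.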
The unifying fix is to break the synchronization that causes everyone to act in lockstep.
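For the cache-miss case, request coalescing can be sketched as below. This is a hedged illustration in Python using threads. The class name CoalescingCache and its loader callback are hypothetical, and the sketch deliberately omits expiry and eviction. The first caller to miss on a key becomes the leader and runs the loader; callers that arrive while the load is in flight wait on an event and then read the stored value.

```python
import threading


class CoalescingCache:
    """Cache in which concurrent misses on one key trigger a single load."""

    def __init__(self, loader):
        self._loader = loader      # hypothetical callback: key -> value
        self._lock = threading.Lock()
        self._values = {}          # key -> cached value (no expiry here)
        self._inflight = {}        # key -> Event set when the load finishes

    def get(self, key):
        with self._lock:
            if key in self._values:
                return self._values[key]
            event = self._inflight.get(key)
            leader = event is None
            if leader:
                # First miss for this key: this caller performs the load.
                event = threading.Event()
                self._inflight[key] = event

        if leader:
            try:
                # Run the expensive load outside the lock.
                value = self._loader(key)
                with self._lock:
                    self._values[key] = value
                return value
            finally:
                # Wake the waiters whether the load succeeded or failed.
                with self._lock:
                    del self._inflight[key]
                event.set()

        # Follower: wait for the leader instead of hitting the backend.
        event.wait()
        with self._lock:
            if key in self._values:
                return self._values[key]
        # The leader's load failed; try again, possibly as the new leader.
        return self.get(key)
```

With ten threads requesting the same missing key at once, the loader runs once rather than ten times, so the backend sees a single query instead of a burst.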
Key idea
The thundering herd is a synchronized stampede. The cure is to desynchronize clients with jitter and backoff, which spread the load over time, and to collapse duplicate work with coalescing.