The Thundering Herd Problem

A synchronized stampede

The thundering herd problem occurs when many clients wait on the same event, then all wake and act at the same moment, overwhelming the resource they were waiting for. The surge can knock a service over just as it is trying to recover.

Common triggers

A popular cache entry expires, so every concurrent request misses and queries the database at once.
A service comes back after an outage and every client reconnects in the same instant.
A timer fires across many clients on the same schedule.

How to tame it

Jitter: add randomness to timeouts and retry delays so clients spread out rather than synchronizing.
Request coalescing: let one request rebuild the cache entry while the others wait for its result and reuse it.
Exponential backoff: lengthen the gap after each failed retry so a recovering service is not hit by a wall of requests.
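
Jitter and exponential backoff are often combined in a single retry-delay calculation. The following is a minimal sketch, not a definitive implementation: the function name `backoff_delay` and the parameters `base`, `cap`, and `rng` are hypothetical, and the "pick uniformly between zero and the current ceiling" rule is one common jitter strategy among several.

```python
import random


def backoff_delay(attempt, base=0.1, cap=10.0, rng=random):
    """Return a retry delay in seconds for a zero-based attempt number.

    Hypothetical helper: base, cap, and rng are illustrative parameters.
    """
    # Exponential backoff: the ceiling doubles with each attempt, up to cap.
    # The exponent is clamped so very large attempt numbers stay cheap.
    ceiling = min(cap, base * (2 ** min(attempt, 32)))
    # Jitter: choose uniformly in [0, ceiling] so that clients which failed
    # at the same moment do not retry at the same moment.
    return rng.uniform(0.0, ceiling)
```

A caller would sleep for `backoff_delay(attempt)` seconds before retry number `attempt`. Because each client draws its own random delay, retries that started in lockstep drift apart instead of arriving together.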

The unifying fix is to break the synchronization that causes everyone to act in lockstep.
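
Request coalescing can be sketched for a threaded program as follows. This is an illustration under stated assumptions rather than a production design: the `Coalescer` and `_Call` classes and the `load` callback are hypothetical names, and the sketch only deduplicates calls that overlap in time within one process.

```python
import threading


class _Call:
    """State shared by everyone waiting on one in-flight load."""

    def __init__(self):
        self.done = threading.Event()
        self.value = None
        self.error = None


class Coalescer:
    """Merge concurrent loads of the same key into one call (hypothetical)."""

    def __init__(self, load):
        self._load = load        # expensive function, e.g. a database query
        self._lock = threading.Lock()
        self._inflight = {}      # key -> _Call currently being computed

    def get(self, key):
        with self._lock:
            call = self._inflight.get(key)
            leader = call is None
            if leader:
                # First caller for this key becomes the leader.
                call = _Call()
                self._inflight[key] = call
        if leader:
            try:
                call.value = self._load(key)
            except Exception as exc:
                call.error = exc
            finally:
                with self._lock:
                    del self._inflight[key]
                call.done.set()  # wake every waiting follower
        else:
            # Followers wait for the leader instead of loading again.
            call.done.wait()
        if call.error is not None:
            raise call.error
        return call.value
```

With this in place, a burst of threads calling `get` for the same key while a load is in progress triggers a single call to `load`; the rest block until the leader finishes and then share its result or its exception.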

Key idea

The thundering herd is a synchronized stampede, and the cure is jitter, backoff, and coalescing, which together spread the load over time.
