Retry Storms And Jitter

Why naive retries amplify outages and how jitter breaks the synchronized stampede.

The amplification trap

Retries are meant to recover from a transient blip. But when a dependency is already struggling, every client retrying at once creates a retry storm: the extra load pushes the dependency further into overload, which causes more failures and, in turn, more retries.

This is a feedback loop that can keep a service down long after the original cause is gone.
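
To get a rough sense of the amplification, the sketch below estimates how many attempts one logical request produces. It assumes, purely for illustration, that every attempt fails independently with the same probability; the function name and parameters are hypothetical, and real outages tend to have correlated failures.

```python
def expected_attempts(failure_prob: float, max_attempts: int) -> float:
    """Expected number of attempts sent per logical request.

    Assumes each attempt fails independently with probability `failure_prob`
    and the client gives up after `max_attempts` attempts in total.
    Attempt k (0-based) is only sent if the previous k attempts all failed.
    """
    return sum(failure_prob ** k for k in range(max_attempts))


# Healthy dependency: one attempt per request.
healthy = expected_attempts(0.0, max_attempts=4)

# Dependency fully down: every request turns into four attempts.
down = expected_attempts(1.0, max_attempts=4)
```

Under these assumptions the load multiplier grows exactly when the dependency is least able to absorb it, which is the feedback loop described above.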

Exponential backoff

The first fix is exponential backoff: wait longer after each failed attempt, typically doubling the delay each time up to a maximum. This thins out retry traffic so the dependency gets room to recover.
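
A minimal sketch of the delay schedule follows. The base delay of 0.1 seconds and the cap of 10 seconds are illustrative assumptions, not recommended values, and the function name is hypothetical.

```python
def backoff_delay(attempt: int, base: float = 0.1, cap: float = 10.0) -> float:
    """Seconds to wait before retry number `attempt` (0-based).

    The delay doubles with each attempt and is clamped to `cap` so that
    it cannot grow without bound.
    """
    return min(cap, base * (2 ** attempt))


# Delays for the first eight retries: each is double the previous one
# until the cap takes over.
delays = [backoff_delay(n) for n in range(8)]
```

The cap matters in practice: without it, a long outage would leave clients waiting minutes or hours before noticing that the dependency has recovered.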

Why jitter is essential

Backoff alone is not enough. If a thousand clients all failed at the same instant, they all back off by the same amount and retry at the same later instant, recreating the spike. Jitter adds randomness to each delay so that retries are spread across a window of time instead of arriving in synchronized waves.
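
One common way to add jitter, often called "full jitter", is to pick a uniformly random delay between zero and the exponential backoff ceiling. The sketch below assumes that variant; the names and default values are hypothetical, and other jitter strategies exist.

```python
import random


def full_jitter_delay(attempt, base=0.1, cap=10.0, rng=random):
    """Random delay in seconds between 0 and the capped exponential ceiling.

    `rng` is anything with a `uniform(a, b)` method; passing a seeded
    `random.Random` makes the result reproducible in tests.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


# A thousand clients that all failed together and are on their fourth
# retry (attempt index 3, ceiling 0.8 s) now pick different delays.
rng = random.Random(42)
client_delays = [full_jitter_delay(3, rng=rng) for _ in range(1000)]
```

Without jitter all thousand clients would wait exactly 0.8 seconds and arrive together; with it, their retries land at different points inside that 0.8-second window.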

Bounding the damage

Cap the total number of retries so a request cannot retry forever.
Use a retry budget so that retries can make up only a small fraction of normal traffic.
Do not retry errors that will never succeed, such as a malformed request (for example, an HTTP 400 response).
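
The three limits above can be combined in one retry loop. The following is a simplified sketch under stated assumptions: `RetryBudget`, `RetryableError`, `NonRetryableError`, and `call_with_retries` are hypothetical names invented for this example, the budget counts requests over the whole process lifetime rather than a sliding window, and the 10% ratio is illustrative.

```python
import random
import time


class RetryableError(Exception):
    """A failure that might succeed on a later attempt, such as a timeout."""


class NonRetryableError(Exception):
    """A failure that will never succeed, such as a malformed request."""


class RetryBudget:
    """Permit retries only up to `ratio` of the requests seen so far."""

    def __init__(self, ratio=0.1):
        self.ratio = ratio
        self.requests = 0
        self.retries = 0

    def record_request(self):
        self.requests += 1

    def try_spend(self):
        # Refuse if one more retry would exceed the allowed fraction.
        if self.retries + 1 > self.ratio * self.requests:
            return False
        self.retries += 1
        return True


def call_with_retries(op, budget, max_attempts=4, base=0.1, cap=10.0,
                      sleep=time.sleep, rng=random):
    """Call `op()` with capped, budgeted, jittered retries."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    budget.record_request()
    for attempt in range(max_attempts):
        try:
            return op()
        except NonRetryableError:
            # Retrying cannot help, so fail immediately.
            raise
        except RetryableError:
            out_of_attempts = attempt == max_attempts - 1
            if out_of_attempts or not budget.try_spend():
                raise
            # Full jitter: random delay up to the capped exponential ceiling.
            sleep(rng.uniform(0.0, min(cap, base * (2 ** attempt))))
```

The `sleep` and `rng` parameters exist only so that the loop can be exercised without real waiting. A production budget would more likely track traffic over a recent time window, so that old requests do not fund retries indefinitely.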

Key idea