The amplification trap
Retries are meant to recover from a transient blip. But when a dependency is already struggling, every client retrying at once creates a retry storm: the extra load pushes the dependency further down, causing more failures and more retries.
This is a feedback loop that can keep a service down long after the original cause is gone.
Exponential backoff
The first fix is exponential backoff: wait longer after each failed attempt, typically doubling the delay each time up to some maximum. This thins out retry traffic over time, giving the dependency room to recover.
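As a rough illustration, the doubling rule can be written as a small function. This is a minimal sketch, not a definitive implementation: the name `backoff_delay`, the 0.1 second base delay, and the 10 second cap are all assumptions chosen for the example.

```python
def backoff_delay(attempt: int, base: float = 0.1, cap: float = 10.0) -> float:
    """Seconds to wait before retry number `attempt` (0-based).

    `base` and `cap` are illustrative defaults, not recommended values.
    """
    # Double the delay on every attempt, but never exceed the cap.
    return min(cap, base * (2 ** attempt))


# Delays for the first eight retries: each is double the previous one
# until the cap takes over.
delays = [backoff_delay(n) for n in range(8)]
```

The cap keeps a long run of failures from producing waits of minutes or hours; without it the delay would keep doubling indefinitely.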
Why jitter is essential
Backoff alone is not enough. If a thousand clients all fail at the same instant, they all back off by the same amount and retry at the same later instant, recreating the spike. Jitter adds randomness to each delay so that retries are spread across a window of time instead of arriving in synchronized waves.
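One common way to add jitter, often called "full jitter", is to pick a random delay between zero and the exponential backoff ceiling. The sketch below assumes that variant; the function name `full_jitter_delay` and the default `base` and `cap` values are hypothetical.

```python
import random
from typing import Optional


def full_jitter_delay(
    attempt: int,
    base: float = 0.1,
    cap: float = 10.0,
    rng: Optional[random.Random] = None,
) -> float:
    """Random delay in [0, ceiling], where the ceiling doubles per attempt.

    `base`, `cap`, and the "full jitter" strategy are assumptions made
    for this example.
    """
    rng = rng or random.Random()
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


# A thousand clients that all failed together and are on their 4th retry.
# Without jitter they would all wait the same ceiling value; with jitter
# their waits are scattered across the whole interval below it.
rng = random.Random(42)  # seeded only to make the demo repeatable
waits = [full_jitter_delay(3, rng=rng) for _ in range(1000)]
```

The trade-off in this variant is that some clients retry almost immediately, but the aggregate load arrives as a smear over the interval instead of a single spike.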
Bounding the damage
- Cap the total number of retries so a request cannot retry forever.
- Use a retry budget so retries stay a small fraction of normal traffic, no matter how many requests are failing.
- Do not retry requests that can never succeed, such as a malformed request that the server will reject every time.
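The three limits above can be combined in a single retry wrapper. The following is a sketch under stated assumptions, not a reference implementation: `RetryBudget`, `TransientError`, `PermanentError`, and `call_with_retries` are hypothetical names, and the budget is modeled as a simple token counter in which every request earns one token and every retry costs ten, so retries stay at roughly 10% of normal traffic.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical error type for failures worth retrying (e.g. a timeout)."""


class PermanentError(Exception):
    """Hypothetical error type for failures that will never succeed (e.g. a bad request)."""


class RetryBudget:
    """Each request earns 1 token; each retry spends `retry_cost` tokens."""

    def __init__(self, retry_cost: int = 10, max_tokens: int = 100) -> None:
        self.retry_cost = retry_cost
        self.max_tokens = max_tokens
        self.tokens = 0

    def record_request(self) -> None:
        self.tokens = min(self.max_tokens, self.tokens + 1)

    def try_spend(self) -> bool:
        if self.tokens >= self.retry_cost:
            self.tokens -= self.retry_cost
            return True
        return False


def call_with_retries(
    operation: Callable[[], T],
    budget: RetryBudget,
    max_attempts: int = 3,
    base: float = 0.1,
    cap: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `operation`, retrying transient failures within the given limits."""
    budget.record_request()
    for attempt in range(max_attempts):
        try:
            return operation()
        except PermanentError:
            # Retrying cannot help, so fail immediately.
            raise
        except TransientError:
            out_of_attempts = attempt == max_attempts - 1
            # Give up if the attempt cap is hit or the budget is exhausted.
            if out_of_attempts or not budget.try_spend():
                raise
            # Exponential backoff with jitter before the next attempt.
            sleep(random.uniform(0.0, min(cap, base * (2 ** attempt))))
    raise ValueError("max_attempts must be at least 1")
```

Sharing one `RetryBudget` across all calls to the same dependency is what makes the limit meaningful: when most requests are failing, the tokens run out quickly and the client stops adding retry load. The `sleep` parameter exists only so tests can substitute a no-op.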
Key idea
Combine exponential backoff with jitter and retry budgets so retries aid recovery instead of amplifying the outage.