System Design

Retry Storms And Jitter

Why naive retries amplify outages and how jitter breaks the synchronized stampede.

The amplification trap

Retries are meant to recover from a transient blip. But when a dependency is already struggling, every client retrying at once creates a retry storm: the extra load pushes the dependency further down, causing more failures and more retries.

This is a feedback loop that can keep a service down long after the original cause is gone.
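To put a rough number on the amplification, the sketch below estimates how much traffic reaches a dependency when every failed attempt is retried immediately. The function name `offered_load` and its parameters are made up for illustration, and the model assumes each attempt fails independently with the same probability, which is a simplification of a real outage.

```python
def offered_load(base_rps: float, failure_rate: float, max_retries: int) -> float:
    """Estimate requests per second hitting a dependency under naive retries.

    Hypothetical model: every failed attempt is retried right away,
    up to max_retries times, and each attempt fails with failure_rate.
    """
    load = 0.0
    attempt_share = 1.0  # fraction of original requests that reach this attempt
    for _ in range(max_retries + 1):
        load += base_rps * attempt_share
        attempt_share *= failure_rate  # only failed attempts go on to retry
    return load


if __name__ == "__main__":
    # Healthy dependency: retries add nothing.
    print(offered_load(1000, 0.0, 3))
    # Dependency failing every request: 3 retries quadruple the load.
    print(offered_load(1000, 1.0, 3))
```

Under this model, the worse the dependency is doing, the more traffic it receives, which is the feedback loop described above.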

Exponential backoff

The first fix is exponential backoff: wait longer after each failed attempt, typically doubling the delay each time up to some maximum. This thins out retry traffic over time so the dependency gets room to recover.
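A minimal sketch of the delay calculation, assuming a 100 ms starting delay and a 10 s ceiling. Both values and the name `backoff_delay` are illustrative choices, not something the lesson prescribes.

```python
def backoff_delay(attempt: int, base: float = 0.1, cap: float = 10.0) -> float:
    """Seconds to wait before retry number `attempt` (0-based).

    The delay doubles with each attempt and is clamped at `cap`
    so it cannot grow without bound. `base` and `cap` are assumed values.
    """
    return min(cap, base * (2 ** attempt))


if __name__ == "__main__":
    for attempt in range(8):
        print(attempt, backoff_delay(attempt))
```

The cap matters in practice: without it, a handful of failures would push the wait into minutes or hours.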

Why jitter is essential

Backoff alone is not enough. If a thousand clients all fail at the same instant, they all back off by the same amount and retry at the same later instant, recreating the spike. Jitter adds randomness to each delay so retries spread out over time instead of arriving in synchronized waves.
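One common way to add jitter is to pick each delay uniformly between zero and the backoff ceiling, a variant often called "full jitter". The lesson does not say which variant to use, so treat the sketch below as one reasonable option. The name `jittered_delay` and the default values are assumptions.

```python
import random


def jittered_delay(attempt: int, base: float = 0.1, cap: float = 10.0, rng=random) -> float:
    """Random wait in [0, ceiling], where ceiling is the capped exponential backoff.

    `rng` can be the random module or a random.Random instance,
    which makes the function easy to seed in tests.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


if __name__ == "__main__":
    # A thousand clients that failed together now pick different retry times.
    delays = [jittered_delay(3) for _ in range(1000)]
    print(min(delays), max(delays))
```

Without the random draw, all thousand clients would wake up at exactly 0.8 s; with it, they are scattered across the whole 0 to 0.8 s window.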

Bounding the damage

  • Cap the total number of retries so a request cannot retry forever.
  • Use a retry budget so retries can only be a small fraction of normal traffic.
  • Do not retry errors that will never succeed, like a bad request.
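The three limits above can be sketched together as follows. Everything named here (`NonRetryableError`, `RetryBudget`, `call_with_retries`, the 10% ratio) is hypothetical and chosen for illustration. The budget is a simple lifetime counter, whereas a production version would more likely track a sliding time window.

```python
import random
import time


class NonRetryableError(Exception):
    """Hypothetical error type for failures that will never succeed, e.g. a bad request."""


class RetryBudget:
    """Permits retries only while they stay under `ratio` times the requests seen."""

    def __init__(self, ratio: float = 0.1):
        self.ratio = ratio
        self.requests = 0
        self.retries = 0

    def record_request(self) -> None:
        self.requests += 1

    def try_spend(self) -> bool:
        # Refuse the retry if it would push retries above the allowed fraction.
        if self.retries + 1 > self.ratio * self.requests:
            return False
        self.retries += 1
        return True


def call_with_retries(operation, budget, max_retries=3, base=0.1, cap=10.0,
                      sleep=time.sleep, rng=random):
    """Run `operation`, retrying transient failures within the cap and the budget."""
    budget.record_request()
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except NonRetryableError:
            raise  # retrying cannot help, so fail immediately
        except Exception:
            # Give up when the retry cap is reached or the budget is exhausted.
            if attempt == max_retries or not budget.try_spend():
                raise
            # Capped exponential backoff with full jitter.
            sleep(rng.uniform(0.0, min(cap, base * (2 ** attempt))))
```

A single `RetryBudget` would typically be shared by all calls to the same dependency, so that a widespread failure drains the budget quickly and most callers fail fast instead of piling on.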

Key idea

Combine exponential backoff with jitter and retry budgets so retries aid recovery instead of amplifying the outage.

Check yourself

1. Why is jitter needed on top of exponential backoff?

2. What is a retry storm?