System Design

Retries With Backoff and Jitter

Retrying failed calls without stampeding the service you are trying to reach.

5 min read · core

Naive retries make things worse

When a call fails, a retry can recover from a transient blip. But retrying immediately, or on a fixed schedule, is dangerous: if a service is overloaded, every client retrying at once adds load at exactly the moment the service can least handle it.

Exponential backoff

Exponential backoff increases the wait between attempts, typically doubling it each time. The first retry waits briefly, the next waits twice as long, and so on. With a 100 ms base, for example, the waits are 100 ms, 200 ms, 400 ms. This gives a struggling service room to recover instead of hammering it.
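As a minimal sketch of the doubling schedule described above: the function name `backoff_delays` and the parameters `base` and `factor` are illustrative assumptions, not part of any particular library.

```python
def backoff_delays(base, factor, attempts):
    """Return the wait in seconds before each retry, growing geometrically."""
    # Retry i waits base * factor**i, so factor=2.0 doubles the wait each time.
    return [base * factor ** i for i in range(attempts)]


# Assumed values: 100 ms base, doubling, five retries.
print(backoff_delays(base=0.1, factor=2.0, attempts=5))
# [0.1, 0.2, 0.4, 0.8, 1.6]
```

Every client that runs this gets the same schedule, which matters once many clients fail together.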

Add jitter

Backoff alone still has a flaw. Clients that failed at the same moment all back off by the same amount and retry in sync, so their load arrives in synchronized waves. Jitter adds randomness to each wait so that retries spread out over time.
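The lesson does not say how the randomness should be chosen. One common variant, sometimes called "full jitter", picks each wait uniformly between zero and the exponential ceiling. The sketch below assumes that variant; the name `full_jitter_delay` and its parameters are hypothetical.

```python
import random


def full_jitter_delay(attempt, base=0.1, rng=random):
    """Pick a random wait in [0, base * 2**attempt] seconds (assumed "full jitter")."""
    ceiling = base * 2 ** attempt  # the wait plain exponential backoff would use
    return rng.uniform(0.0, ceiling)


# Ten clients that all failed together and are on attempt index 3 (ceiling 0.8 s).
rng = random.Random(42)  # seeded only so the demo is repeatable
waits = [full_jitter_delay(3, rng=rng) for _ in range(10)]
print([round(w, 3) for w in waits])  # values scattered between 0 and 0.8
```

Without jitter, all ten clients would wait exactly 0.8 s and retry together. With it, their retries land at different points across the interval.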

  • Cap the maximum delay so retries do not wait forever.
  • Cap the number of attempts and then give up or fall back.
  • Only retry operations that are safe (read-only) or idempotent (repeating them has the same effect as doing them once).
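The three guardrails above can be combined into one retry loop. This is a sketch under stated assumptions, not a definitive implementation: `TransientError`, `retry_with_backoff`, and all parameter names and defaults are hypothetical, and the jitter is the uniform "full jitter" variant. The code cannot check idempotency, so the caller is responsible for passing only operations that are safe to repeat.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying, such as timeouts."""


def retry_with_backoff(operation, *, max_attempts=5, base=0.1, max_delay=2.0,
                       sleep=time.sleep, rng=random):
    """Call operation(), retrying on TransientError with capped, jittered backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # attempts exhausted: give up so the caller can fall back
            # Cap the exponential ceiling, then pick a random wait below it.
            ceiling = min(max_delay, base * 2 ** attempt)
            sleep(rng.uniform(0.0, ceiling))


# Usage: an idempotent read that fails twice, then succeeds.
calls = {"count": 0}

def flaky_read():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("temporary failure")
    return "ok"

print(retry_with_backoff(flaky_read))  # ok
```

Passing `sleep` and `rng` as parameters is a design choice that lets tests substitute a recorder and a seeded generator instead of waiting on the real clock.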

Key idea

Combine exponential backoff to ease load with jitter to desynchronize clients, and bound both the delay and the number of attempts.

Check yourself

1. Why add jitter on top of exponential backoff?

2. Which operations are safest to retry automatically?