Naive retries make things worse
When a call fails, retrying can help recover from a transient blip. But retrying immediately or on a fixed schedule is dangerous: if a service is overloaded, every client retrying at once adds load exactly when the service can least handle it.
Exponential backoff
Exponential backoff grows the wait between attempts, often doubling each time. The first retry waits a little, the next waits longer, and so on. This gives a struggling service room to recover instead of hammering it.
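As a rough illustration of the doubling schedule, the sketch below computes the wait before each retry. The function name `backoff_delay`, the 0.5-second base, and the factor of 2 are assumptions chosen for the example, not values the text prescribes, and the schedule is left unbounded here for simplicity.

```python
def backoff_delay(attempt: int, base: float = 0.5, factor: float = 2.0) -> float:
    """Seconds to wait before retry number `attempt` (0-based).

    `base` and `factor` are illustrative defaults, not recommended values.
    """
    if attempt < 0:
        raise ValueError("attempt must be non-negative")
    # Each additional attempt multiplies the wait by `factor`.
    return base * (factor ** attempt)


delays = [backoff_delay(n) for n in range(5)]
print(delays)  # [0.5, 1.0, 2.0, 4.0, 8.0]
```

Each wait is twice the previous one, so a service that stays unhealthy sees retries arrive less and less often.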
Add jitter
Backoff alone still has a flaw. If many clients fail at the same moment, they all back off by the same amount and retry in sync, hitting the service in waves. Jitter adds randomness to each wait so that retries spread out over time.
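One common way to add jitter, sometimes called "full jitter", is to pick each wait uniformly at random between zero and the computed backoff. The sketch below assumes that variant. The helper name `full_jitter`, the ten clients, and the 4-second backoff are hypothetical, and the seed exists only to make the run repeatable.

```python
import random


def full_jitter(delay: float, rng: random.Random) -> float:
    """Return a random wait between 0 and `delay` seconds."""
    return rng.uniform(0.0, delay)


# Ten hypothetical clients that failed together and share a 4-second backoff.
rng = random.Random(42)  # seeded only so the sketch is repeatable
waits = [full_jitter(4.0, rng) for _ in range(10)]

# Without jitter every client would wait exactly 4.0 seconds;
# with jitter the waits are scattered across the interval.
print(sorted(round(w, 2) for w in waits))
```

Other variants keep part of the backoff fixed and randomize only the remainder. Which one fits best depends on how much minimum spacing the service needs.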
- Cap the maximum delay so retries do not wait forever.
- Cap the number of attempts and then give up or fall back.
- Only retry idempotent or otherwise safe operations, since a retry may repeat a call that already took effect.
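The guardrails above can be combined with backoff and jitter in a single retry helper. What follows is a minimal sketch under stated assumptions, not a definitive implementation: `retry`, `TransientError`, and every default value are hypothetical, and the caller is assumed to pass only operations that are safe to repeat.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def retry(
    operation: Callable[[], T],
    max_attempts: int = 5,      # cap on the number of attempts (assumed default)
    base: float = 0.5,          # first backoff in seconds (assumed default)
    max_delay: float = 30.0,    # cap on any single wait (assumed default)
    sleep: Callable[[float], None] = time.sleep,
    rng: Optional[random.Random] = None,
) -> T:
    """Call `operation`, retrying on TransientError with capped, jittered backoff.

    The caller is responsible for passing only idempotent or safe operations.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    rng = rng or random.Random()
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # attempts exhausted: give up so the caller can fall back
            # Exponential backoff, bounded by max_delay.
            capped = min(max_delay, base * 2 ** attempt)
            # Jitter: wait a random amount between 0 and the capped backoff.
            sleep(rng.uniform(0.0, capped))
    raise AssertionError("unreachable")


# Usage: a hypothetical read that fails twice, then succeeds.
calls = {"count": 0}


def flaky_read() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("temporary failure")
    return "ok"


waits = []
result = retry(flaky_read, sleep=waits.append)  # record waits instead of sleeping
print(result, calls["count"], len(waits))  # ok 3 2
```

Passing `sleep` and `rng` as parameters is a design choice for testability: the usage example records the waits rather than actually pausing.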
Key idea
Combine exponential backoff to ease load with jitter to desynchronize clients, and bound both the delay and the number of attempts.