When retries turn harmful
Retrying a failed request improves reliability, but naive retries are dangerous. During an overload, every client retrying multiplies the traffic hitting an already struggling service, a retry storm that can keep it down. Retry budgets and deadlines keep retries from doing more harm than good.
The two controls
- A deadline is an absolute time by which the whole operation must finish. It propagates with the request, so each downstream hop knows how much time is left and stops retrying once that time has run out.
- A retry budget caps retries as a fraction of total requests, for example allowing extra attempts only up to ten percent of traffic. When failures spike, the budget runs out and retries stop, protecting the backend.
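The two controls can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the names Deadline, RetryBudget, record_request, and try_spend are hypothetical, the budget counts requests cumulatively (a production version would more likely use a sliding window), and the deadline uses a process-local monotonic clock, so propagating it to another host would mean sending the remaining time rather than the raw value.

```python
import time


class Deadline:
    """Absolute point in time by which the whole operation must finish."""

    def __init__(self, timeout_s, clock=time.monotonic):
        # The clock is injectable so the class can be tested without sleeping.
        self._clock = clock
        self.expires_at = clock() + timeout_s

    def remaining(self):
        # Time left before the deadline, never negative.
        return max(0.0, self.expires_at - self._clock())

    def expired(self):
        return self.remaining() <= 0.0


class RetryBudget:
    """Permits retries only up to `ratio` of the requests seen so far."""

    def __init__(self, ratio=0.1):
        self.ratio = ratio
        self.requests = 0
        self.retries = 0

    def record_request(self):
        # Call once per original (non-retry) request.
        self.requests += 1

    def try_spend(self):
        # Allow the retry only if it keeps retries within ratio * requests.
        if self.retries + 1 <= self.ratio * self.requests:
            self.retries += 1
            return True
        return False
```

A caller would check both before each extra attempt: retry only while the deadline has not expired and try_spend() returns True. With a ratio of 0.1, ten recorded requests earn one retry, so a spike of failures exhausts the budget quickly.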
Retries should use exponential backoff with jitter so clients do not synchronize into pulses. Only idempotent operations are generally safe to retry, since repeating a non-idempotent write can duplicate its effects. The combination of a propagated deadline and a global budget ensures retries help in isolated failures but back off automatically during systemic overload.
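As an illustration of backoff with jitter bounded by a deadline, here is a small Python sketch. It uses the "full jitter" variant, where each delay is drawn uniformly between zero and an exponentially growing cap; that variant is one common choice, not the only one. The function names, the base and cap values, and the decision to treat every exception as retryable are assumptions made for the example, and the operation passed in is assumed to be idempotent. A retry-budget check would sit alongside the deadline check.

```python
import random
import time


def backoff_delay(attempt, base_s=0.1, cap_s=5.0, rng=random.random):
    """Full-jitter delay: uniform in [0, min(cap_s, base_s * 2**attempt))."""
    return rng() * min(cap_s, base_s * (2 ** attempt))


def call_with_retries(op, timeout_s, max_attempts=4,
                      sleep=time.sleep, clock=time.monotonic):
    """Call op() until it succeeds, attempts run out, or the deadline nears.

    Assumes op is idempotent and that any exception is retryable.
    """
    deadline = clock() + timeout_s
    last_error = None
    for attempt in range(max_attempts):
        try:
            return op()
        except Exception as exc:
            last_error = exc
        delay = backoff_delay(attempt)
        # Stop if this was the last attempt or if waiting would
        # carry us past the deadline.
        if attempt == max_attempts - 1 or clock() + delay >= deadline:
            break
        sleep(delay)
    raise last_error
```

Because the delay is random, two clients that fail at the same moment are unlikely to retry at the same moment, which is what breaks up synchronized pulses.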
Key idea
Deadlines bound total time and retry budgets cap retries as a fraction of traffic, together preventing retry storms while still recovering from isolated failures.