Failures Are Normal
A downstream service times out, a database is briefly unavailable, a network blip drops a call. Many job failures are transient, and the same job succeeds on a later attempt. Retries recover from these without human help.
Exponential Backoff
Retrying instantly hammers a struggling dependency and often fails again. Instead, wait exponentially longer after each failure, for example one second, then two, then four, then eight. This gives the dependency time to recover and reduces the load on it.
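The doubling schedule above can be sketched as a small function. This is a minimal illustration, not a prescribed implementation: the name backoff_delay and the parameters base and factor are assumptions chosen for the example.

```python
def backoff_delay(attempt: int, base: float = 1.0, factor: float = 2.0) -> float:
    """Seconds to wait after a failed attempt (attempt is 0-based).

    `base` and `factor` are illustrative defaults: a one-second first
    delay that doubles after every further failure.
    """
    if attempt < 0:
        raise ValueError("attempt must be non-negative")
    return base * factor ** attempt


# Delays after the first four failures.
delays = [backoff_delay(n) for n in range(4)]
print(delays)  # [1.0, 2.0, 4.0, 8.0]
```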
Add Jitter
If many jobs fail at once and all back off on the same schedule, they retry in a synchronized wave, known as a retry storm. Add random jitter to each delay so that retries spread out over time.
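One common way to add jitter is to pick each delay uniformly at random between zero and the exponential ceiling, a variant often called "full jitter". The text above does not say which jitter strategy to use, so treat this as one possible sketch. The function name jittered_delay and its parameters are assumptions made for the example.

```python
import random


def jittered_delay(attempt: int, base: float = 1.0, factor: float = 2.0,
                   rng=random) -> float:
    """Random delay in [0, base * factor ** attempt] seconds.

    `rng` is any object with a `uniform(a, b)` method; passing a seeded
    random.Random instance makes the result reproducible in tests.
    """
    ceiling = base * factor ** attempt
    return rng.uniform(0.0, ceiling)


# Ten jobs that failed together now wake up at different times
# instead of retrying in one synchronized wave.
wake_times = [jittered_delay(attempt=3) for _ in range(10)]
```

Because every job draws its own random delay, jobs that failed at the same moment are unlikely to retry at the same moment.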
Cap the Attempts
Retries must end. Set a maximum attempt count. After the last failed attempt, the job moves to a dead letter queue for inspection instead of looping forever.
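A capped retry loop might look like the sketch below. Everything named here is hypothetical: run_with_retries, the default of five attempts, and the plain list standing in for a dead letter queue are assumptions for illustration, and a real system would use its queueing system's own mechanism.

```python
import time
from typing import Callable, List, Optional, Tuple


def run_with_retries(
    job: Callable[[], object],
    dead_letters: List[Tuple[Callable[[], object], Optional[Exception]]],
    max_attempts: int = 5,
    base: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Run `job` at most `max_attempts` times; return True on success.

    When every attempt fails, append (job, last_error) to `dead_letters`
    (a stand-in for a real dead letter queue) and return False.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error: Optional[Exception] = None
    for attempt in range(max_attempts):
        try:
            job()
            return True
        except Exception as exc:
            last_error = exc
            # Wait before the next attempt, but not after the final one.
            if attempt < max_attempts - 1:
                sleep(base * 2 ** attempt)
    dead_letters.append((job, last_error))
    return False
```

Storing the last error next to the job gives whoever inspects the dead letter queue a starting point for diagnosis.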
Retry Only What Makes Sense
- Transient errors such as timeouts deserve retries.
- Permanent errors such as invalid input will fail every time. Detect these and fail fast rather than wasting attempts.
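The distinction in the list above can be sketched by classifying exceptions before deciding whether to retry. Which exceptions count as transient depends on the application. Treating TimeoutError and ConnectionError as transient and everything else as permanent is an assumption for this example, as are the names is_retryable and attempt_job. Delays between attempts are left out to keep the focus on classification.

```python
from typing import Callable, TypeVar

T = TypeVar("T")

# Assumed classification: these built-in exceptions are treated as transient.
TRANSIENT_ERRORS = (TimeoutError, ConnectionError)


def is_retryable(exc: BaseException) -> bool:
    """True if the error is worth another attempt."""
    return isinstance(exc, TRANSIENT_ERRORS)


def attempt_job(job: Callable[[], T], max_attempts: int = 3) -> T:
    """Retry transient failures; re-raise permanent ones immediately."""
    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except Exception as exc:
            # Fail fast on permanent errors, and give up once the
            # attempts are exhausted.
            if not is_retryable(exc) or attempt == max_attempts:
                raise
    raise ValueError("max_attempts must be at least 1")
```

With this sketch, a job that raises ValueError for invalid input runs once and fails, while a job that raises TimeoutError is tried up to max_attempts times.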
Key idea
Retry transient failures with exponential backoff plus jitter, cap attempts, and send exhausted jobs to a dead letter queue.