System Design

Job Retries With Backoff

Retry failed jobs with exponentially growing delays and jitter to avoid retry storms.

Failures Are Normal

A downstream service times out, a database is briefly unavailable, a network blip drops a call. Many job failures are transient and succeed on a later attempt. Retries recover from these without human help.

Exponential Backoff

Retrying instantly hammers a struggling dependency and often fails again. Instead, wait longer after each failure by doubling the delay: one second, then two, then four, then eight. This gives the dependency time to recover and reduces the load placed on it.
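The doubling schedule can be written as a small function. This is a minimal sketch, not a prescribed implementation: the name `backoff_delay` and the parameters `base` and `factor` are assumptions chosen for illustration.

```python
def backoff_delay(attempt: int, base: float = 1.0, factor: float = 2.0) -> float:
    """Seconds to wait before retry number `attempt` (1-based).

    `base` and `factor` are illustrative defaults that reproduce the
    1, 2, 4, 8 second schedule described in the lesson.
    """
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    # Each additional failure multiplies the previous delay by `factor`.
    return base * factor ** (attempt - 1)


if __name__ == "__main__":
    print([backoff_delay(n) for n in range(1, 5)])  # [1.0, 2.0, 4.0, 8.0]
```

Real systems usually also cap the largest delay so the wait does not grow without bound; that cap is left out here to keep the sketch focused on the doubling.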

Add Jitter

If many jobs fail at once and all back off on the same schedule, they retry in synchronized waves, known as a retry storm. Add random jitter to each delay so the retries spread out over time.
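One common way to add jitter is to pick a random delay between zero and the exponential delay, sometimes called "full jitter". The lesson does not say which jitter strategy to use, so treat this as one possible sketch; `jittered_delay` and its parameters are hypothetical names.

```python
import random
from typing import Optional


def jittered_delay(
    attempt: int,
    base: float = 1.0,
    factor: float = 2.0,
    rng: Optional[random.Random] = None,
) -> float:
    """Random delay in [0, base * factor ** (attempt - 1)] seconds.

    `rng` can be passed in to make the result reproducible in tests;
    by default a fresh random generator is used.
    """
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    if rng is None:
        rng = random.Random()
    ceiling = base * factor ** (attempt - 1)
    # Jobs that failed together now draw different delays, so their
    # retries no longer arrive at the same instant.
    return rng.uniform(0.0, ceiling)
```

The upper bound still grows exponentially, so the dependency keeps getting more breathing room, while the randomness breaks up the synchronized waves.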

Cap the Attempts

Retries must end. Set a maximum attempt count. After the last failed attempt, the job moves to a dead letter queue for inspection instead of looping forever.
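A capped retry loop might look like the sketch below. It is illustrative only: `run_with_retries` is a hypothetical helper, the dead letter queue is modeled as a plain Python list, and the `sleep` parameter exists so the waiting can be replaced in tests.

```python
import random
import time
from typing import Any, Callable, List, Optional, Tuple

Job = Callable[[], Any]


def run_with_retries(
    job: Job,
    dead_letters: List[Tuple[Job, Exception]],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[Any]:
    """Run `job` up to `max_attempts` times, then give up.

    Returns the job's result on success. If every attempt fails, the job
    and its last error are appended to `dead_letters` and None is returned.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    last_error: Optional[Exception] = None
    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except Exception as exc:
            last_error = exc
            # Only wait if another attempt is still allowed.
            if attempt < max_attempts:
                ceiling = base_delay * 2 ** (attempt - 1)
                sleep(random.uniform(0.0, ceiling))
    # Attempts exhausted: park the job for inspection instead of looping.
    assert last_error is not None
    dead_letters.append((job, last_error))
    return None
```

In a real job system the dead letter destination would typically be a durable queue or table rather than an in-memory list, and the job would be stored as serializable data rather than a callable.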

Retry Only What Makes Sense

  • Transient errors such as timeouts deserve retries.
  • Permanent errors such as invalid input will fail every time. Detect these and fail fast rather than wasting attempts.
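The split between transient and permanent errors can be sketched as a small classifier. Which exception types count as transient is an assumption made here for illustration (Python's built-in `TimeoutError` and `ConnectionError`); a real system would list the errors its own dependencies raise. The names `is_retryable` and `attempt_once` are hypothetical.

```python
from typing import Any, Callable, Tuple

# Assumed for this sketch: these built-in errors are treated as transient.
TRANSIENT_ERRORS = (TimeoutError, ConnectionError)


def is_retryable(exc: BaseException) -> bool:
    """True if the error is likely to go away on a later attempt."""
    return isinstance(exc, TRANSIENT_ERRORS)


def attempt_once(job: Callable[[], Any]) -> Tuple[str, Any]:
    """Run `job` once and report what the caller should do next.

    Returns ("ok", result), ("retry", error), or ("fail", error).
    """
    try:
        return ("ok", job())
    except Exception as exc:
        if is_retryable(exc):
            return ("retry", exc)
        # Anything else, such as ValueError for invalid input, is treated
        # as permanent: fail fast instead of spending more attempts.
        return ("fail", exc)
```

Treating unknown errors as permanent is a conservative choice for this sketch; some systems prefer the opposite default and retry anything not explicitly marked permanent.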

Key idea

Retry transient failures with exponential backoff plus jitter, cap attempts, and send exhausted jobs to a dead letter queue.

Check yourself

1. Why use exponential backoff between retries?

2. What does adding jitter to backoff delays prevent?

3. How should a permanent error such as invalid input be handled?