System Design

The Retry and Timeout Budget

Bound waits and retries so failures do not amplify across hops.

5 min read · advanced

Timeouts come first

Every remote call needs a timeout. Without one, a hung dependency can hold the calling thread or connection indefinitely, and enough stuck callers will exhaust the pool. A timeout converts an indefinite wait into a bounded failure that the caller can handle.
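
As a minimal sketch of that idea, the snippet below wraps a call in `asyncio.wait_for`. The names `hung_dependency` and `call_with_timeout` are hypothetical, and the one-hour sleep merely stands in for a dependency that never answers.

```python
import asyncio


async def hung_dependency() -> str:
    # Hypothetical dependency that does not answer in any useful time.
    await asyncio.sleep(3600)
    return "late reply"


async def call_with_timeout(timeout_s: float) -> str:
    try:
        # wait_for cancels the inner call once timeout_s has elapsed.
        return await asyncio.wait_for(hung_dependency(), timeout=timeout_s)
    except asyncio.TimeoutError:
        # The indefinite wait is now a fast failure the caller can handle.
        return "timed out"


result = asyncio.run(call_with_timeout(0.05))
print(result)
```

The caller gets control back after roughly 50 ms instead of waiting on the dependency, and can fall back, fail the request, or decide whether a retry is warranted.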

Retries help and hurt

Retrying a failed call can recover from transient faults such as a dropped connection or a brief overload. But naive retries are dangerous:

  • During an outage, retries multiply load and worsen the failure.
  • Retries at several layers multiply rather than add, producing a retry storm.
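
A rough worked example of the amplification, assuming the worst case where the bottom dependency fails every call and every layer above it uses all of its attempts. The function name `worst_case_calls` is made up for illustration.

```python
def worst_case_calls(attempts_per_layer: int, layers: int) -> int:
    # Each layer makes up to attempts_per_layer calls to the layer below,
    # so the attempts multiply across layers rather than add.
    return attempts_per_layer ** layers


# One layer with 3 attempts (1 initial call + 2 retries).
print(worst_case_calls(3, 1))  # 3
# Three stacked layers, each with 3 attempts.
print(worst_case_calls(3, 3))  # 27
```

Under these assumptions a single user request can turn into 27 calls against a dependency that is already failing.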

Safer retries

  • Use exponential backoff with jitter to spread attempts.
  • Cap the number of retries at a small value.
  • Retry only idempotent operations, those where repeating the call has the same effect as making it once.
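
The three rules above can be sketched together as follows. This is one possible shape, not a definitive implementation: `TransientError`, `backoff_delay`, and `retry_idempotent` are hypothetical names, the defaults are arbitrary, and the jitter used is the "full jitter" variant (a uniform draw between zero and the exponential ceiling).

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def backoff_delay(attempt: int, base_s: float, cap_s: float) -> float:
    # Full jitter: uniform in [0, min(cap_s, base_s * 2**attempt)].
    return random.uniform(0.0, min(cap_s, base_s * (2 ** attempt)))


def retry_idempotent(
    op: Callable[[], T],
    max_retries: int = 3,
    base_s: float = 0.1,
    cap_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    # Only pass operations that are safe to repeat.
    for attempt in range(max_retries + 1):
        try:
            return op()
        except TransientError:
            if attempt == max_retries:
                raise  # retry cap reached: surface the failure
            sleep(backoff_delay(attempt, base_s, cap_s))
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable only so the delays can be observed or skipped in tests. Non-transient exceptions are not caught, so they propagate on the first attempt.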

The budget idea

Set a retry budget: a service may retry only while its recent retries stay under a small fraction, say a few percent, of its recent calls. This caps the extra load that retries can add, no matter how many individual calls fail.
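
One way to sketch such a budget is a sliding window over recent calls. The class name `RetryBudget`, the 5% ratio, and the window size are illustrative assumptions; production implementations often use token buckets or time-based windows instead.

```python
from collections import deque


class RetryBudget:
    """Allow a retry only while recent retries stay under ratio * recent calls."""

    def __init__(self, ratio: float = 0.05, window: int = 1000) -> None:
        self.ratio = ratio
        # Sliding window of recent calls: True marks a retry, False a first attempt.
        self.recent = deque(maxlen=window)

    def record_call(self) -> None:
        # Record a first attempt; these are what earn retry budget.
        self.recent.append(False)

    def try_acquire_retry(self) -> bool:
        retries = sum(self.recent)
        # Would one more retry push the retry share above the ratio?
        if retries + 1 > self.ratio * (len(self.recent) + 1):
            return False  # budget exhausted: fail fast instead of retrying
        self.recent.append(True)
        return True


budget = RetryBudget(ratio=0.05)
for _ in range(100):
    budget.record_call()
granted = sum(budget.try_acquire_retry() for _ in range(20))
print(granted)  # 5
```

With 100 first attempts recorded, only 5 of the 20 requested retries are granted, so even a total outage adds about 5% extra load rather than doubling or tripling it.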

Deadline propagation

Pass a deadline down the call chain so inner hops know the time left and stop work that can never finish in time.
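
A small in-process sketch of the same idea, with hypothetical names (`outer_hop`, `inner_hop`, `DeadlineExceeded`) and a made-up cost estimate. Across real network hops the remaining time is usually sent as a duration, since monotonic clocks on different machines are not comparable.

```python
import time


class DeadlineExceeded(Exception):
    """Raised when too little time remains for the current request."""


def remaining(deadline: float) -> float:
    # deadline is an absolute time.monotonic() value set by the outermost caller.
    return deadline - time.monotonic()


def inner_hop(deadline: float, estimated_cost_s: float) -> str:
    # Stop before starting work that cannot finish in the time left.
    if remaining(deadline) < estimated_cost_s:
        raise DeadlineExceeded("not enough time left for inner_hop")
    return "inner result"


def outer_hop(deadline: float) -> str:
    if remaining(deadline) <= 0:
        raise DeadlineExceeded("deadline passed before outer_hop ran")
    # Pass the same absolute deadline down instead of starting a fresh timeout.
    return inner_hop(deadline, estimated_cost_s=0.05)


result = outer_hop(time.monotonic() + 1.0)
print(result)
```

Because every hop checks the same deadline, time spent in outer layers shrinks the allowance for inner ones, and a request that has already run out of time is dropped before it consumes more capacity.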

Key idea

Bound every call with a timeout, retry only idempotent calls with backoff under a retry budget, and propagate deadlines so failures do not amplify across hops.

Check yourself

1. Why are naive retries dangerous during an outage?

2. What does deadline propagation achieve?

3. Which calls are safe to retry automatically?