The Retry and Timeout Budget

Timeouts come first

Every remote call needs a timeout. Without one, a hung dependency can hold a thread, connection, or other resource indefinitely. A timeout converts that open-ended wait into a bounded failure the caller can handle.
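
As a minimal sketch of bounding a call, the snippet below wraps a stand-in remote call in asyncio.wait_for. The names slow_dependency, call_with_timeout, and run are hypothetical, and the sleep only simulates a slow dependency; a real client would normally set the timeout on its HTTP or RPC library instead.

```python
import asyncio


async def slow_dependency(delay_s: float) -> str:
    # Hypothetical stand-in for a remote call that takes delay_s seconds.
    await asyncio.sleep(delay_s)
    return "ok"


async def call_with_timeout(delay_s: float, timeout_s: float) -> str:
    try:
        # wait_for cancels the inner call once timeout_s has elapsed.
        return await asyncio.wait_for(slow_dependency(delay_s), timeout=timeout_s)
    except asyncio.TimeoutError:
        # The indefinite wait has become a failure the caller can handle.
        return "timed out"


def run(delay_s: float, timeout_s: float) -> str:
    return asyncio.run(call_with_timeout(delay_s, timeout_s))
```

Here run(0.01, 1.0) completes normally, while run(1.0, 0.05) gives up after about 50 ms instead of waiting the full second.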

Retries help and hurt

Retrying a failed call recovers from transient blips. But naive retries are dangerous:

During an outage, retries multiply load and worsen the failure.
Retries at several layers multiply rather than add, which is how a retry storm forms.
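
To make the stacking concrete, here is a small worked calculation. It assumes every layer makes the same number of attempts and that the bottom dependency fails every time; worst_case_calls is a hypothetical helper, not part of any library.

```python
def worst_case_calls(attempts_per_layer: int, layers: int) -> int:
    # Each attempt at one layer triggers a full set of attempts at the
    # layer below, so the totals multiply instead of adding.
    return attempts_per_layer ** layers


# Three layers, each making 1 initial attempt plus 2 retries:
# 3 * 3 * 3 = 27 calls reach the failing dependency for one user request.
calls_at_bottom = worst_case_calls(attempts_per_layer=3, layers=3)
```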

Safer retries

Use exponential backoff with jitter to spread attempts.
Cap the number of retries at a small value, such as two or three.
Retry only idempotent operations, since a retry may repeat a request that already took effect.
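
The three rules above can be sketched together as follows. This is one possible shape, not a definitive implementation: TransientError, backoff_delay, and call_with_retries are hypothetical names, the "full jitter" variant (a uniform delay between zero and the exponential ceiling) is one of several jitter schemes, and the default delays and retry cap are illustrative values.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def backoff_delay(attempt, base_s=0.1, cap_s=2.0, rng=random.random):
    # Exponential ceiling: base_s, 2*base_s, 4*base_s, ... limited to cap_s.
    ceiling = min(cap_s, base_s * (2 ** attempt))
    # Full jitter: pick a random point below the ceiling so that many
    # clients do not retry in lockstep.
    return rng() * ceiling


def call_with_retries(op, idempotent, max_retries=3, sleep=time.sleep):
    # Non-idempotent operations get exactly one attempt.
    retries_allowed = max_retries if idempotent else 0
    for attempt in range(retries_allowed + 1):
        try:
            return op()
        except TransientError:
            if attempt == retries_allowed:
                raise  # retry cap reached: surface the failure
            sleep(backoff_delay(attempt))
```

Passing sleep and rng as parameters is only there to make the sketch easy to exercise without real waiting.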

The budget idea

Set a retry budget: a service may retry only while retries remain under a small fraction, say a few percent, of its recent calls. This caps the extra load that retries can add, however many calls are failing.
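
One way to sketch such a budget is a window over the most recent calls. The RetryBudget class, its method names, the 10% ratio, and the window size are all assumptions made for illustration. The sketch also assumes that "total calls" counts both first attempts and retries.

```python
from collections import deque


class RetryBudget:
    def __init__(self, ratio=0.1, window=1000):
        self.ratio = ratio
        # One entry per recent call: True for a retry, False for a first attempt.
        self.events = deque(maxlen=window)

    def record_call(self):
        # Call this once for every first attempt.
        self.events.append(False)

    def try_acquire_retry(self):
        retries = sum(self.events)  # O(window); fine for a sketch
        total = len(self.events)
        # Allow the retry only if, counting it, retries stay within the
        # allowed fraction of recent calls.
        if retries + 1 <= self.ratio * (total + 1):
            self.events.append(True)
            return True
        return False
```

With a ratio of 0.1 and 100 recent first attempts, this grants 11 retries and then refuses further ones until more ordinary calls arrive, so a total outage adds roughly a tenth more load rather than multiplying it.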

Deadline propagation

Pass a deadline down the call chain so inner hops know the time left and stop work that can never finish in time.
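
A minimal in-process sketch of this idea follows. Deadline, DeadlineExceeded, outer_hop, and inner_hop are hypothetical names, and the work durations stand in for real processing. Between separate processes the remaining time would usually be sent with the request rather than an absolute timestamp, since the hosts' clocks may disagree.

```python
import time


class DeadlineExceeded(Exception):
    pass


class Deadline:
    def __init__(self, budget_s, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + budget_s

    def remaining(self):
        # Seconds left before the deadline, never negative.
        return max(0.0, self._expires_at - self._clock())


def inner_hop(deadline, work_s, sleep=time.sleep):
    # Stop before starting work that can never finish in time.
    if deadline.remaining() < work_s:
        raise DeadlineExceeded("not enough time left for inner work")
    sleep(work_s)  # stand-in for the real work
    return "inner done"


def outer_hop(deadline, outer_work_s, inner_work_s, sleep=time.sleep):
    if deadline.remaining() <= 0.0:
        raise DeadlineExceeded("deadline already passed")
    sleep(outer_work_s)  # stand-in for the outer hop's own work
    # The same deadline object travels down, so the inner hop sees
    # only the time that is actually left.
    return inner_hop(deadline, inner_work_s, sleep=sleep)
```

With a one-second deadline, an outer hop that spends 0.8 s leaves 0.2 s, so an inner hop needing 0.5 s fails immediately instead of doing work whose result nobody will wait for.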

Key idea

Bound every call with a timeout, retry only idempotent calls with backoff under a retry budget, and propagate deadlines so failures do not amplify across hops.