Timeouts come first
Every remote call needs a timeout. Without one, a hung dependency can hold the calling thread, and any connection it owns, indefinitely. A timeout converts an indefinite wait into a fast, handleable failure.
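As a minimal sketch of that conversion, the snippet below bounds a slow call with Python's asyncio.wait_for. The names slow_dependency and call_with_timeout are hypothetical stand-ins, and the sleep only simulates a remote call.

```python
import asyncio


async def slow_dependency(delay_s: float) -> str:
    # Hypothetical stand-in for a remote call that takes delay_s seconds.
    await asyncio.sleep(delay_s)
    return "ok"


async def call_with_timeout(delay_s: float, timeout_s: float) -> str:
    try:
        # wait_for cancels the inner call once timeout_s has elapsed.
        return await asyncio.wait_for(slow_dependency(delay_s), timeout=timeout_s)
    except asyncio.TimeoutError:
        # The indefinite wait becomes a fast failure the caller can handle.
        return "timed out"
```

Here the failure is mapped to a plain string for brevity; real code would more likely raise or return a typed error so callers can decide what to do next.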
Retries help and hurt
Retrying a failed call recovers from transient blips. But naive retries are dangerous:
- During an outage, retries multiply load and worsen the failure.
- Retries at several layers multiply: when each layer retries independently, the attempts compound into a retry storm.
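The amplification can be made concrete with a little arithmetic. The sketch below assumes every layer makes the same number of attempts and that every attempt fails, which is the worst case during an outage; worst_case_calls is a hypothetical helper name.

```python
def worst_case_calls(attempts_per_layer: int, layers: int) -> int:
    # Each layer makes up to attempts_per_layer tries of the layer below it,
    # so the attempts multiply across layers instead of adding.
    return attempts_per_layer ** layers


# Three layers, each making 3 attempts (1 initial call plus 2 retries).
print(worst_case_calls(3, 3))  # prints 27
```

So one user request can reach the failing dependency 27 times, exactly when it is least able to absorb extra load.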
Safer retries
- Use exponential backoff with jitter to spread attempts.
- Cap the number of retries at a small value.
- Retry only idempotent operations, since a retried non-idempotent call may apply its effect twice.
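The three rules above can be sketched together as follows. This is an illustration under assumptions, not a definitive implementation: TransientError, backoff_delay, and call_with_retries are hypothetical names, the base delay and cap are arbitrary, and the jitter scheme shown (a uniform draw between zero and the exponential bound) is one common choice among several.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def backoff_delay(attempt: int, base_s: float = 0.1, cap_s: float = 2.0) -> float:
    # Exponential bound base_s * 2**attempt, capped at cap_s, then jittered
    # by drawing uniformly from [0, bound] to spread attempts apart.
    bound = min(cap_s, base_s * (2 ** attempt))
    return random.uniform(0.0, bound)


def call_with_retries(op, idempotent: bool, max_retries: int = 3, sleep=time.sleep):
    # Non-idempotent operations get exactly one attempt.
    retries_allowed = max_retries if idempotent else 0
    for attempt in range(retries_allowed + 1):
        try:
            return op()
        except TransientError:
            if attempt == retries_allowed:
                raise  # Retry cap reached: surface the failure.
            sleep(backoff_delay(attempt))
```

The sleep parameter is injectable only so the function can be exercised without real delays; callers would normally leave it at the default.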
The budget idea
Set a retry budget: a service may retry only while retries make up less than a small fraction, say a few percent, of its recent calls. This caps how much extra load retries can add, even during an outage.
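One way to sketch such a budget is to remember the last N calls and permit a retry only if it keeps the retry share within the ratio. The RetryBudget class, the 5% ratio, and the count-based window below are all assumptions made for illustration; a production version might use a time-based window or a token bucket instead.

```python
from collections import deque


class RetryBudget:
    def __init__(self, ratio: float = 0.05, window: int = 1000):
        self.ratio = ratio
        # Recent calls: False for a first attempt, True for a retry.
        self.recent = deque(maxlen=window)

    def record_call(self) -> None:
        # Record a first attempt; these are what earn retry allowance.
        self.recent.append(False)

    def try_retry(self) -> bool:
        retries = sum(self.recent)
        # Permit the retry only if, counting it, retries stay within
        # ratio of the recent calls.
        if retries + 1 <= self.ratio * (len(self.recent) + 1):
            self.recent.append(True)
            return True
        return False
```

A caller would invoke record_call() on every first attempt and check try_retry() before each retry, failing fast when it returns False.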
Deadline propagation
Pass a deadline down the call chain so inner hops know the time left and stop work that can never finish in time.
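A minimal in-process sketch of the idea: the outer hop turns its timeout into one absolute deadline and hands it down, and the inner hop refuses work it cannot finish in the time left. The names outer_hop, inner_hop, and DeadlineExceeded, and the 0.2 second work estimate, are hypothetical.

```python
import time


class DeadlineExceeded(Exception):
    """Raised when the time left cannot cover the remaining work."""


def inner_hop(deadline: float, estimated_work_s: float, now=time.monotonic) -> str:
    time_left = deadline - now()
    # Stop before starting work that can never finish in time.
    if time_left < estimated_work_s:
        raise DeadlineExceeded(
            f"{time_left:.3f}s left, need {estimated_work_s:.3f}s"
        )
    return "inner result"


def outer_hop(timeout_s: float, now=time.monotonic) -> str:
    # Fix one absolute deadline at the edge and pass it down unchanged,
    # rather than giving each hop a fresh timeout.
    deadline = now() + timeout_s
    return inner_hop(deadline, estimated_work_s=0.2, now=now)
```

Monotonic clock readings are not comparable between machines, so across a network a hop would typically send the remaining time and let the receiver rebuild a local deadline.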
Key idea
Bound every call with a timeout, retry only idempotent calls with backoff under a retry budget, and propagate deadlines so failures do not amplify across hops.