Retry and Timeout Policies

Setting bounded retries and deadlines in the mesh without overloading downstreams.

Reliability With Limits

Retries and timeouts make calls more reliable, but misconfigured they can turn a small failure into an outage. The mesh lets you declare both as policy and adds guardrails that application code often forgets.

Timeouts First

Every call should have a timeout. Without one, a stuck downstream can hold the caller's connections and threads indefinitely. The mesh enforces a deadline per route, so a call to a hung dependency fails with an error instead of leaking resources.
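
The same idea can be sketched in application code. This is a minimal illustration, not the mesh's implementation: `call_with_deadline`, `DeadlineExceeded`, and the two simulated dependencies are hypothetical names, and the deadline values are arbitrary.

```python
import asyncio


class DeadlineExceeded(Exception):
    """Raised when a call does not finish within its deadline (hypothetical name)."""


async def call_with_deadline(make_call, timeout_s):
    """Await make_call(), but give up after timeout_s seconds."""
    try:
        return await asyncio.wait_for(make_call(), timeout=timeout_s)
    except asyncio.TimeoutError:
        # wait_for cancels the pending call, so the caller is released
        # instead of waiting on the stuck downstream.
        raise DeadlineExceeded(f"no response within {timeout_s}s")


async def hung_dependency():
    # Simulates a downstream that never answers in a useful time.
    await asyncio.sleep(3600)
    return "ok"


async def healthy_dependency():
    # Simulates a downstream that answers quickly.
    await asyncio.sleep(0.01)
    return "ok"
```

Calling `asyncio.run(call_with_deadline(hung_dependency, 0.05))` raises `DeadlineExceeded` after about 50 ms, while the healthy dependency returns its result normally.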

Retries Done Right

  • Retry only idempotent operations, since retrying a non-idempotent write can apply its effect twice.
  • Cap the number of attempts so a failure does not multiply load.
  • Use backoff with jitter to spread retries out in time.
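
The three rules above can be combined in a short sketch. It assumes HTTP-style method names as the idempotency signal and uses "full jitter" (a uniformly random delay up to an exponentially growing ceiling). The names `call_with_retries` and `backoff_delay`, the default values, and the choice of `ConnectionError` as the retryable failure are all illustrative assumptions, not a mesh API.

```python
import random
import time

# Assumption: a subset of the methods HTTP defines as idempotent.
IDEMPOTENT_METHODS = {"GET", "HEAD", "PUT", "DELETE", "OPTIONS"}


def backoff_delay(attempt, base_s=0.1, cap_s=2.0):
    """Full jitter: a random delay in [0, min(cap_s, base_s * 2**attempt)]."""
    ceiling = min(cap_s, base_s * (2 ** attempt))
    return random.uniform(0.0, ceiling)


def call_with_retries(send, method, max_attempts=3, sleep=time.sleep):
    """Call send() up to max_attempts times; non-idempotent methods get one attempt."""
    attempts = max(1, max_attempts) if method in IDEMPOTENT_METHODS else 1
    for attempt in range(attempts):
        try:
            return send()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # attempts exhausted: surface the failure to the caller
            sleep(backoff_delay(attempt))  # spread retries out in time
```

Passing `sleep` as a parameter is a design choice for testability: a test can substitute a no-op so the backoff does not slow it down.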

The dangerous case is the retry storm. Retries multiply across layers: if each of three layers makes up to three attempts, one failing request can hit the bottom layer as many as 3 × 3 × 3 = 27 times. The mesh supports a retry budget that limits retries to a fraction of active requests.
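
One way to picture a retry budget is a counter that admits a retry only while retries stay under a fixed fraction of in-flight requests. The class below is a simplified sketch under that assumption. `RetryBudget`, its method names, and the `ratio` and `min_retries` parameters are hypothetical, and a real mesh may track the budget differently.

```python
class RetryBudget:
    """Admit retries only while they stay within `ratio` of active requests."""

    def __init__(self, ratio=0.2, min_retries=1):
        self.ratio = ratio              # allowed retries per active request
        self.min_retries = min_retries  # floor so low traffic can still retry
        self.active_requests = 0
        self.active_retries = 0

    def start_request(self):
        self.active_requests += 1

    def finish_request(self):
        self.active_requests = max(0, self.active_requests - 1)

    def try_start_retry(self):
        """Return True and count the retry if the budget allows it."""
        limit = max(self.min_retries, int(self.active_requests * self.ratio))
        if self.active_retries >= limit:
            return False  # budget exhausted: fail fast instead of adding load
        self.active_retries += 1
        return True

    def finish_retry(self):
        self.active_retries = max(0, self.active_retries - 1)
```

With `ratio=0.25` and eight requests in flight, only two concurrent retries are admitted; further retry attempts are rejected until one of them finishes.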

Key idea

The mesh enforces per-route timeouts and bounded retries of idempotent operations, with backoff and a retry budget, so a single failure cannot snowball into a retry storm.

Check yourself

1. Which operations are safe to retry automatically?

2. What does a retry budget prevent?

3. Why does every call need a timeout?