Reliability With Limits
Retries and timeouts make calls more reliable, but used carelessly they cause outages. The mesh lets you set these as policy and adds guardrails the app might forget.
Timeouts First
Every call should have a timeout. Without one, a stuck downstream holds the caller forever. The mesh enforces a deadline per route, so a hung dependency returns an error instead of leaking resources.
Retries Done Right
- Retry only idempotent operations, since retrying a write can double an effect.
- Cap the number of attempts so a failure does not multiply load.
- Use backoff with jitter to spread retries out in time.
The dangerous case is the retry storm. If every layer retries three times, a single failure can balloon into many times the traffic. The mesh supports a retry budget that limits retries to a fraction of active requests.
Key idea
The mesh enforces per route timeouts and bounded, idempotent retries with backoff and a retry budget so a single failure cannot snowball into a retry storm.