Recovering from blips
Networks drop packets and instances stumble. A mesh proxy can retry a failed request, often on a different instance, turning a transient failure into a success the caller never notices.
But blind retries are dangerous. Retrying every error multiplies load on a system that is already failing, producing a retry storm. So proxies retry only idempotent or otherwise safe requests, cap the number of attempts, and back off between tries.
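The three safeguards (retry only safe requests, cap attempts, back off) can be sketched in a few lines of Python. This is an illustrative sketch, not any particular proxy's implementation: `TransientError`, `send_with_retries`, the `send` callback, and the default values are all hypothetical names and numbers chosen for the example. It uses exponential backoff with full jitter, one common choice among several.

```python
import random
import time

# HTTP methods defined as idempotent; anything else gets a single attempt.
IDEMPOTENT_METHODS = {"GET", "HEAD", "PUT", "DELETE", "OPTIONS"}


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (reset, timeout, 503)."""


def backoff_delay(attempt, base=0.025, cap=0.25):
    """Full-jitter exponential backoff: a random delay in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))


def send_with_retries(method, instances, send, max_attempts=3, sleep=time.sleep):
    """Try the request on successive instances, retrying only idempotent methods."""
    attempts_allowed = max_attempts if method in IDEMPOTENT_METHODS else 1
    last_error = None
    for attempt in range(attempts_allowed):
        # Move to a different instance on each attempt.
        instance = instances[attempt % len(instances)]
        try:
            return send(instance)
        except TransientError as err:
            last_error = err
            # Sleep only if another attempt will follow.
            if attempt + 1 < attempts_allowed:
                sleep(backoff_delay(attempt))
    raise last_error
```

With `max_attempts=3`, a GET that fails on the first instance is tried on the next one after a short randomized pause, while a POST fails immediately after its single attempt so that a non-idempotent operation is never applied twice.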
Removing bad instances
Outlier detection, a form of passive health checking, watches real responses. If one instance returns errors or times out far more than its peers, the proxy ejects it from the pool for a cooldown period, then tentatively returns it.
- Retries hide brief, request-level failures.
- Outlier detection hides longer, instance-level failures.
- Together they route around trouble automatically.
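The eject-and-cooldown cycle of outlier detection might look roughly like the sketch below. It makes a simplifying assumption: instead of comparing each instance's error rate with its peers, it ejects an instance after a fixed number of consecutive failures. The class name `OutlierDetector`, its parameters, and the default values are hypothetical.

```python
class OutlierDetector:
    """Passive health checking based on observed responses (simplified sketch)."""

    def __init__(self, consecutive_errors=5, cooldown=30.0):
        self.consecutive_errors = consecutive_errors  # failures in a row before ejection
        self.cooldown = cooldown                      # seconds an ejected instance sits out
        self.error_streak = {}                        # instance -> current failure streak
        self.ejected_until = {}                       # instance -> time it may return

    def record(self, instance, ok, now):
        """Record the outcome of one real request to `instance` at time `now`."""
        if ok:
            self.error_streak[instance] = 0
            return
        streak = self.error_streak.get(instance, 0) + 1
        self.error_streak[instance] = streak
        if streak >= self.consecutive_errors:
            # Eject for the cooldown period and start counting afresh.
            self.ejected_until[instance] = now + self.cooldown
            self.error_streak[instance] = 0

    def healthy(self, instances, now):
        """Return the instances that are not currently ejected."""
        return [i for i in instances if self.ejected_until.get(i, 0.0) <= now]
```

Once the cooldown expires, the instance reappears in `healthy()` with a clean streak, so it is readmitted tentatively: another run of failures ejects it again.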
Tuning the balance
Aggressive retries and ejection improve success rates, but retries can mask real problems and ejection shrinks serving capacity. Operators therefore cap retries per request, set retry budgets so retries stay a small fraction of total traffic, and limit the share of instances that can be ejected at once.
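Both limits reduce to simple arithmetic, sketched below under stated assumptions. `RetryBudget`, `may_eject`, and the default ratios are hypothetical, and the counters here never reset, whereas a real budget would typically be computed over a sliding time window.

```python
class RetryBudget:
    """Allow retries only while they stay below a fraction of original requests."""

    def __init__(self, ratio=0.2, min_retries=1):
        self.ratio = ratio              # retries allowed per original request
        self.min_retries = min_retries  # floor so low-traffic callers can still retry
        self.requests = 0
        self.retries = 0

    def record_request(self):
        """Count one original (non-retry) request."""
        self.requests += 1

    def try_acquire_retry(self):
        """Return True and consume budget if one more retry is permitted."""
        allowed = max(self.min_retries, int(self.requests * self.ratio))
        if self.retries < allowed:
            self.retries += 1
            return True
        return False


def may_eject(ejected_count, pool_size, max_ejection_fraction=0.5):
    """Permit an ejection only if the ejected share stays within the limit."""
    return (ejected_count + 1) / pool_size <= max_ejection_fraction
```

With a ratio of 0.2, ten original requests fund at most two retries, so even a total outage adds roughly 20% extra load rather than doubling or tripling it. Likewise, with a limit of 0.5, a four-instance pool never loses more than two instances to ejection, however badly the others behave.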
Key idea
Retries recover transient failures while outlier detection ejects consistently bad instances, both bounded so recovery does not amplify load.