Retries and Outlier Detection in a Mesh

How proxies recover from transient failures without overwhelming a struggling instance.

Recovering from blips

Networks drop packets and instances stumble. A mesh proxy can retry a failed request, often on a different instance, turning a transient failure into a success the caller never notices.

But retries are dangerous if blind. Retrying every error can multiply load on an already failing system, a retry storm. So proxies retry only idempotent or safe requests, cap the number of attempts, and add backoff between tries.

Removing bad instances

Outlier detection, a form of passive health checking, watches real responses. If one instance returns errors or times out far more than its peers, the proxy ejects it from the pool for a cooldown period, then tentatively returns it.

Retries hide brief, request level failures.
Outlier detection hides longer, instance level failures.
Together they route around trouble automatically.

Tuning the balance

Aggressive retries plus ejection improve success rates but can hide real problems and shrink capacity. Operators bound retries, set budgets so retries stay a small fraction of traffic, and limit how many instances can be ejected at once.

Key idea

Retries recover transient failures while outlier detection ejects consistently bad instances, both bounded so recovery does not amplify load.

Retries and Outlier Detection in a Mesh

Recovering from blips

Removing bad instances

Tuning the balance

Key idea

Check yourself