Adaptive Concurrency Limits

Let a service discover its own safe in-flight limit from latency feedback instead of relying on a fixed guess.

The problem with fixed limits

A hardcoded concurrency limit is a guess that goes stale. Set it too low and you waste capacity; set it too high and a slow dependency lets queues build until the service falls over.

How adaptive limits work

Adaptive concurrency limits borrow ideas from network congestion control. The service watches its own latency and adjusts the number of in-flight requests it allows.

When latency stays low, there is spare capacity, so the limit grows.
When latency rises, requests are queuing, so the limit shrinks to drain the backlog.
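
The grow-and-shrink rule above can be sketched with an additive-increase, multiplicative-decrease (AIMD) update, one of the simplest schemes borrowed from congestion control. This is a minimal illustration, not a reference implementation: the class name AimdLimit, the fixed latency threshold, the backoff factor, and the bounds are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class AimdLimit:
    """Illustrative AIMD concurrency limit; all defaults are assumed values."""

    limit: float = 10.0
    min_limit: float = 1.0
    max_limit: float = 200.0
    latency_threshold_s: float = 0.100  # assumed boundary between "low" and "high" latency
    backoff: float = 0.9                # assumed multiplicative decrease factor

    def on_sample(self, latency_s: float) -> int:
        """Feed one observed request latency and return the updated integer limit."""
        if latency_s > self.latency_threshold_s:
            # Latency rose: treat it as queuing and shrink the limit.
            self.limit = max(self.min_limit, self.limit * self.backoff)
        else:
            # Latency is low: assume spare capacity and probe upward by one.
            self.limit = min(self.max_limit, self.limit + 1.0)
        return int(self.limit)


if __name__ == "__main__":
    lim = AimdLimit()
    for latency in (0.020, 0.030, 0.500, 0.020):
        print(latency, lim.on_sample(latency))
```

A fixed threshold is the weak point of this sketch, since it is itself a guess that can go stale. Comparing against a measured baseline avoids that.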

Why latency is the signal

Latency reflects the real state of the system, including slow downstreams the service cannot see directly. Algorithms like the gradient method compare recent latency to a long-term minimum and back off when the ratio worsens.
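
As a rough sketch of a gradient-style update, the function below scales the current limit by the ratio of the long-term minimum latency to the recent latency. The function name gradient_update, the clamp of the ratio to [0.5, 1.0], and the square-root headroom term are assumptions made for this example; real implementations differ in these details.

```python
import math
from typing import Optional


def gradient_update(
    limit: float,
    recent_latency_s: float,
    min_latency_s: float,
    headroom: Optional[float] = None,
    min_limit: float = 1.0,
    max_limit: float = 200.0,
) -> float:
    """Return a new limit from the ratio of minimum to recent latency (illustrative)."""
    if recent_latency_s <= 0 or min_latency_s <= 0:
        raise ValueError("latencies must be positive")

    # The ratio is 1.0 when recent latency matches the long-term minimum and
    # drops below 1.0 as latency worsens. Clamping (an assumption here) keeps
    # a single bad sample from more than halving the limit.
    gradient = max(0.5, min(1.0, min_latency_s / recent_latency_s))

    # Assumed headroom term: lets the limit keep probing upward when healthy.
    if headroom is None:
        headroom = math.sqrt(limit)

    new_limit = limit * gradient + headroom
    return max(min_limit, min(max_limit, new_limit))


if __name__ == "__main__":
    print(gradient_update(100, recent_latency_s=0.05, min_latency_s=0.05))
    print(gradient_update(100, recent_latency_s=0.20, min_latency_s=0.05))
```

When recent latency equals the minimum, the limit only grows by the headroom. When latency is several times the minimum, the limit is cut sharply, which is the "back off when the ratio worsens" behavior described above.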

Benefits and care

Self-tuning: the limit tracks changing hardware and dependency health automatically.
Pairs with shedding: requests beyond the limit are shed quickly rather than queued.
Noisy signals: individual latency samples jitter, so the signal needs smoothing to keep the limit from oscillating wildly.
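
The shedding and smoothing points can be sketched together. Below, a hypothetical SheddingLimiter rejects work immediately once the in-flight count reaches the limit, and a hypothetical Ewma class smooths latency samples with an exponentially weighted moving average before they would be fed to a limit algorithm. The class names and the smoothing factor alpha are assumptions for illustration; EWMA is only one of several possible smoothing choices.

```python
import threading
from typing import Optional


class SheddingLimiter:
    """Admits a request only if the in-flight count is below the limit (illustrative)."""

    def __init__(self, limit: int) -> None:
        self._limit = limit
        self._in_flight = 0
        self._lock = threading.Lock()

    def set_limit(self, limit: int) -> None:
        with self._lock:
            self._limit = max(1, limit)

    def try_acquire(self) -> bool:
        """Return False right away instead of queuing when the limit is reached."""
        with self._lock:
            if self._in_flight >= self._limit:
                return False
            self._in_flight += 1
            return True

    def release(self) -> None:
        with self._lock:
            self._in_flight = max(0, self._in_flight - 1)


class Ewma:
    """Exponentially weighted moving average of latency samples."""

    def __init__(self, alpha: float = 0.2) -> None:
        self.alpha = alpha  # assumed smoothing factor; higher reacts faster
        self.value: Optional[float] = None

    def update(self, sample: float) -> float:
        if self.value is None:
            self.value = sample
        else:
            self.value = self.alpha * sample + (1 - self.alpha) * self.value
        return self.value


if __name__ == "__main__":
    limiter = SheddingLimiter(limit=2)
    print([limiter.try_acquire() for _ in range(3)])  # third request is shed
    limiter.release()
    smoothed = Ewma(alpha=0.5)
    print(smoothed.update(100.0), smoothed.update(200.0))
```

A caller would typically wrap each request in try_acquire and release, return an overload response when try_acquire fails, and periodically call set_limit with the value produced from the smoothed latency.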

Key idea