A limit that tunes itself
A fixed concurrency limit is a guess that goes stale as capacity changes. An adaptive concurrency limit adjusts the number of in-flight requests automatically by watching latency, much as TCP congestion control adjusts its window.
How it senses overload
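The idea rests on Little's law, which says the average number of in-flight requests equals throughput times average latency. When a downstream service saturates, throughput is capped, so added concurrency stops raising throughput and only inflates latency. Rising latency at steady throughput is therefore the signal of overload.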
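To make the relationship concrete, here is a minimal sketch of a toy service under Little's law. The function name `steady_state`, the 50 ms base latency, and the 100 requests-per-second capacity are illustrative assumptions, not measurements from any real system.

```python
def steady_state(concurrency, base_latency_s=0.05, capacity_rps=100.0):
    """Return (throughput_rps, latency_s) for a toy service.

    Assumed model: below saturation every request takes base_latency_s;
    throughput can never exceed capacity_rps.
    """
    # Throughput grows with concurrency until the service saturates.
    throughput = min(concurrency / base_latency_s, capacity_rps)
    # Little's law: concurrency = throughput * latency.
    latency = concurrency / throughput
    return throughput, latency


if __name__ == "__main__":
    for n in (1, 2, 5, 10, 20):
        rps, lat = steady_state(n)
        print(f"concurrency={n:2d}  throughput={rps:6.1f} rps  latency={lat * 1000:6.1f} ms")
```

In this model the service saturates at a concurrency of about 5 (100 requests per second times 0.05 s). Past that point throughput stays flat and each extra in-flight request only adds queueing time.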
- Track a baseline minimum latency, the round trip when the service is unloaded.
- Compare current latency to that baseline. The difference is queueing delay; if it keeps growing, requests are piling up.
- Use a gradient: if measured latency is close to the minimum, raise the limit; if it climbs, lower it.
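The three steps above can be sketched as a small limiter. This is one possible reading, not a reference implementation: the class name `GradientLimit`, the 0.8 "close to baseline" threshold, the step of 1, and the 0.5 floor on a single cut are all assumptions chosen for illustration.

```python
import math


class GradientLimit:
    """Toy gradient-based concurrency limit (illustrative sketch only)."""

    def __init__(self, initial_limit=10, min_limit=1, max_limit=200, threshold=0.8):
        self.limit = float(initial_limit)
        self.min_limit = min_limit
        self.max_limit = max_limit
        self.threshold = threshold      # assumed cutoff for "close to the minimum"
        self.min_latency = math.inf     # baseline: lowest latency seen so far

    def on_sample(self, latency_s):
        """Feed one latency measurement and return the updated integer limit."""
        if latency_s <= 0:
            raise ValueError("latency must be positive")
        # Step 1: track the baseline minimum latency.
        # Simplification: the baseline never ages out in this sketch.
        self.min_latency = min(self.min_latency, latency_s)
        # Step 2: compare current latency to the baseline.
        # The gradient is 1.0 when unloaded and falls toward 0 as queueing grows.
        gradient = self.min_latency / latency_s
        # Step 3: raise the limit near the baseline, lower it when latency climbs.
        if gradient >= self.threshold:
            self.limit += 1
        else:
            # Scale down in proportion to the gradient, but never by more than half.
            self.limit *= max(0.5, gradient)
        self.limit = max(self.min_limit, min(self.max_limit, self.limit))
        return int(self.limit)
```

For example, starting from a limit of 10, two samples at 50 ms raise the limit to 12, and a following sample at 100 ms (gradient 0.5) halves it to 6.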
Additive increase and multiplicative decrease
Like congestion control, a robust scheme increases the limit gently and cuts it sharply at the first sign of trouble. This additive-increase, multiplicative-decrease (AIMD) rhythm probes for more capacity but retreats fast when latency spikes or errors appear.
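A minimal AIMD sketch follows. The class name `AimdLimit`, the 200 ms latency threshold, the step of 1, and the 0.5 backoff ratio are hypothetical values picked for the example; a real deployment would tune them.

```python
class AimdLimit:
    """Toy additive-increase, multiplicative-decrease limit (illustrative sketch)."""

    def __init__(self, initial_limit=10, min_limit=1, max_limit=200,
                 backoff_ratio=0.5, latency_threshold_s=0.2):
        self.limit = initial_limit
        self.min_limit = min_limit
        self.max_limit = max_limit
        self.backoff_ratio = backoff_ratio              # multiplicative cut
        self.latency_threshold_s = latency_threshold_s  # assumed "latency spike" cutoff

    def on_sample(self, latency_s, ok=True):
        """Update the limit from one completed request and return it."""
        if not ok or latency_s > self.latency_threshold_s:
            # Trouble: cut sharply, but never below the floor.
            self.limit = max(self.min_limit, int(self.limit * self.backoff_ratio))
        else:
            # Healthy: probe gently for more capacity.
            self.limit = min(self.max_limit, self.limit + 1)
        return self.limit
```

With the defaults, two healthy samples move the limit from 10 to 12, one slow sample cuts it to 6, and one error cuts it again to 3. The asymmetry is the point: recovery takes many good samples, while backing off takes only one bad one.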
Key idea
An adaptive limit reads queueing delay from latency versus a baseline and grows or shrinks the number of in-flight requests, finding capacity without a brittle fixed guess.