The cliff near full load
A server feels fine at moderate load and then falls off a cliff as it approaches saturation. Queueing theory explains why: as utilization heads toward one hundred percent, waiting time grows toward infinity.
The utilization curve
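For a simple single-server queue with random arrivals and service times, the average wait, measured in multiples of the mean service time, scales roughly with utilization divided by one minus utilization.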
- At fifty percent utilization the factor is about one.
- At ninety percent it is about nine.
- At ninety-nine percent it is about ninety-nine.
So pushing utilization from ninety to ninety-nine percent multiplies queueing delay about elevenfold, from a factor of nine to a factor of ninety-nine. This is why running a latency-sensitive service hot is dangerous.
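The factors above can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes the simple-queue approximation stated in this section; the function name `queueing_factor` is made up for illustration.

```python
def queueing_factor(utilization):
    """Return utilization / (1 - utilization).

    Under the simple-queue approximation this is the average wait
    expressed in multiples of the mean service time.
    """
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return utilization / (1.0 - utilization)


if __name__ == "__main__":
    # Print the factor at the utilization levels discussed in the text.
    for rho in (0.5, 0.9, 0.99):
        print(f"utilization {rho:.2f}: factor {queueing_factor(rho):.1f}")
```

The factor is dimensionless, so multiplying it by a measured mean service time gives a rough estimate of the average queueing delay at that load.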
Why the tail goes first
The averages above understate the pain. Variance in service time means some requests land behind a long one, and near saturation those unlucky requests wait far longer than the mean. Tail latency, for example the ninety-ninth percentile, blows past a latency budget well before the average does.
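One way to see the gap between mean and tail is a small simulation. The sketch below rests on assumptions the text does not spell out: a single first-come-first-served server, exponentially distributed arrival gaps and service times, and a mean service time of one unit. It uses the Lindley recursion (each request's wait is the previous wait plus the previous service time minus the arrival gap, floored at zero). The names `simulate_waits` and `mean_and_p99` are illustrative only.

```python
import random


def simulate_waits(utilization, n_requests=200_000, seed=1):
    """Simulate queueing waits for one FCFS server (assumed exponential model)."""
    rng = random.Random(seed)      # fixed seed so runs are repeatable
    mean_service = 1.0             # assumption: service time averages one unit
    arrival_rate = utilization / mean_service
    waits = []
    wait = 0.0                     # the first request finds an empty queue
    for _ in range(n_requests):
        waits.append(wait)
        service = rng.expovariate(1.0 / mean_service)
        gap = rng.expovariate(arrival_rate)
        # Lindley recursion: the next request waits for whatever work remains.
        wait = max(0.0, wait + service - gap)
    return waits


def mean_and_p99(waits):
    """Return the mean wait and the ninety-ninth percentile wait."""
    ordered = sorted(waits)
    mean = sum(ordered) / len(ordered)
    p99 = ordered[int(0.99 * len(ordered))]
    return mean, p99


if __name__ == "__main__":
    for rho in (0.5, 0.9):
        mean, p99 = mean_and_p99(simulate_waits(rho))
        print(f"utilization {rho}: mean wait {mean:.1f}, p99 wait {p99:.1f}")
```

In this model the ninety-ninth percentile sits several times above the mean at every load, so a fixed latency budget is breached by the tail at a much lower utilization than the one at which the average breaches it.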
- A fan-out request that waits on many backends sees the tail of each, so its odds of hitting at least one slow component grow with fan-out width.
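The fan-out effect can be put in numbers. This sketch assumes each backend is slow independently with the same probability, which real systems only approximate; the function name `prob_any_slow` and the default one percent slow fraction are illustrative choices.

```python
def prob_any_slow(fan_out, slow_fraction=0.01):
    """Probability that at least one of `fan_out` backend calls is slow.

    Assumes each call is slow independently with probability `slow_fraction`.
    """
    if fan_out < 0:
        raise ValueError("fan_out must be non-negative")
    return 1.0 - (1.0 - slow_fraction) ** fan_out


if __name__ == "__main__":
    # With a one percent chance per backend, wider fan-out hits the tail more often.
    for width in (1, 10, 100):
        print(f"fan-out {width}: {prob_any_slow(width):.1%} chance of a slow component")
```

Under these assumptions a request that touches one hundred backends meets a ninety-ninth percentile response from at least one of them more often than not.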
Key idea
Near saturation, queueing delay explodes as utilization approaches one, and variance makes the tail percentiles suffer long before the average, so leave headroom.