Saturation and Tail Latency

Why response times explode near full utilization and why the tail suffers first.

The cliff near full load

A server feels fine at moderate load and then falls off a cliff as it approaches saturation. Queueing theory explains why: as utilization heads toward one hundred percent, waiting time grows toward infinity.

The utilization curve

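For a simple single-server queue, the average wait scales roughly with utilization divided by one minus utilization. In the textbook case of random arrivals and random service times, that ratio is the average time spent queueing, measured in multiples of the average service time.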

At fifty percent utilization the factor is about one.
At ninety percent it is about nine.
At ninety-nine percent it is about ninety-nine.

So pushing utilization from ninety to ninety-nine percent multiplies queueing delay roughly elevenfold, from about nine times the service time to about ninety-nine. This is why running a latency-sensitive service hot is dangerous.
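The factors quoted above are easy to reproduce. The following is a minimal Python sketch of the simple-queue approximation from the text; the function name queueing_factor is made up for illustration and does not come from any library.

```python
def queueing_factor(utilization: float) -> float:
    """Return utilization / (1 - utilization), the simple-queue wait factor.

    The result is the average queueing delay expressed in multiples of the
    average service time, under the simple-queue approximation.
    """
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return utilization / (1.0 - utilization)


if __name__ == "__main__":
    for rho in (0.50, 0.90, 0.99):
        print(f"{rho:.0%} utilization -> factor {queueing_factor(rho):.1f}")
    # 50% utilization -> factor 1.0
    # 90% utilization -> factor 9.0
    # 99% utilization -> factor 99.0
```

The function rejects a utilization of one or more because the formula has no finite answer there: the queue never drains.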

Why the tail goes first

The averages above understate the pain. Variance in service time means some requests land behind a long one, and near saturation those unlucky requests wait far longer than the mean. Tail latency, meaning the slow end of the distribution such as the ninety-ninth percentile, blows up well before the average does.
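A small simulation makes the gap between the mean and the tail visible. This is a sketch under stated assumptions, not a model of any particular server: it assumes one server, first-come-first-served order, and exponentially distributed arrival gaps and service times with a mean service time of one. The names simulate_waits and summarize are hypothetical.

```python
import random


def simulate_waits(utilization, n_requests=200_000, seed=1):
    """Simulate queueing delays for a single first-come-first-served server.

    Assumes exponential arrival gaps and service times (mean service = 1).
    Each request's wait is derived from the previous request's wait:
    next_wait = max(0, wait + service - gap_until_next_arrival).
    """
    rng = random.Random(seed)
    mean_service = 1.0
    arrival_rate = utilization / mean_service  # arrivals per unit time
    wait = 0.0
    waits = []
    for _ in range(n_requests):
        waits.append(wait)
        service = rng.expovariate(1.0 / mean_service)
        gap = rng.expovariate(arrival_rate)
        wait = max(0.0, wait + service - gap)
    return waits


def summarize(waits):
    """Return (mean wait, ninety-ninth percentile wait)."""
    ordered = sorted(waits)
    mean = sum(ordered) / len(ordered)
    p99 = ordered[int(0.99 * len(ordered))]
    return mean, p99


if __name__ == "__main__":
    for rho in (0.5, 0.9):
        mean, p99 = summarize(simulate_waits(rho))
        print(f"utilization {rho:.0%}: mean wait {mean:.1f}, p99 wait {p99:.1f}")
```

Under these assumptions the ninety-ninth percentile should come out several times larger than the mean at both load levels, so a latency target set on the tail is breached at a lower utilization than the same target set on the average.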

A fan-out request that waits on many backends is only as fast as its slowest backend. It samples the tail of each one, so the odds that at least one component is slow grow with the fan-out width.
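The fan-out effect can be put in numbers. The sketch below assumes that backends are slow independently of one another, which real systems only approximate, and that each call has the same chance of landing in the tail. The function name chance_of_slow_request is hypothetical.

```python
def chance_of_slow_request(p_slow_backend: float, fan_out: int) -> float:
    """Probability that at least one of fan_out backend calls is slow.

    Assumes every call is slow independently with probability p_slow_backend.
    """
    if not 0.0 <= p_slow_backend <= 1.0:
        raise ValueError("p_slow_backend must be in [0, 1]")
    if fan_out < 1:
        raise ValueError("fan_out must be at least 1")
    # The request is fast only if every backend call is fast.
    return 1.0 - (1.0 - p_slow_backend) ** fan_out


if __name__ == "__main__":
    for width in (1, 10, 100):
        chance = chance_of_slow_request(0.01, width)
        print(f"fan out {width:>3}: {chance:.1%} of requests hit a slow backend")
```

With a one percent chance per backend, a request that fans out to one hundred backends hits at least one slow call about sixty-three percent of the time, so the backend's ninety-ninth percentile becomes the typical experience for the caller.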

Key idea

Near saturation, queueing delay explodes as utilization approaches one, and variance makes the tail percentiles suffer long before the average, so leave headroom.

