Two goals that pull apart
A scheduler can optimize for latency (how quickly a single request finishes) or for throughput (how much total work completes per second). These goals often conflict, and tuning for one can hurt the other.
Why they conflict
- Batching boosts throughput by amortizing fixed costs over many items, but the first item must wait for the rest of its batch, raising latency.
- Frequent preemption lowers latency for urgent tasks by interrupting often, but the extra context switches reduce total throughput.
- Large time slices improve throughput by switching less, yet they delay other waiting tasks.
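The effects in the list above can be sketched with a simple cost model. This is an illustration under stated assumptions, not a model of any particular scheduler: the function names and the parameters `fixed_cost`, `per_item_cost`, and `switch_cost` are hypothetical, costs are assumed constant, and the time spent waiting for a batch to fill is ignored.

```python
def batch_throughput(batch_size, fixed_cost, per_item_cost):
    """Items completed per unit time when each batch pays one fixed cost."""
    return batch_size / (fixed_cost + batch_size * per_item_cost)


def batch_completion_time(batch_size, fixed_cost, per_item_cost):
    """Time from the start of a batch until its results are ready."""
    return fixed_cost + batch_size * per_item_cost


def useful_fraction(time_slice, switch_cost):
    """Share of CPU time spent on real work when every slice ends in a context switch."""
    return time_slice / (time_slice + switch_cost)


if __name__ == "__main__":
    # Assumed numbers: fixed cost 10 units per batch, 1 unit per item.
    for size in (1, 10, 100):
        print(size,
              batch_throughput(size, 10.0, 1.0),
              batch_completion_time(size, 10.0, 1.0))
    # Assumed numbers: context switch costs 0.1 ms.
    for slice_ms in (1.0, 10.0, 100.0):
        print(slice_ms, useful_fraction(slice_ms, 0.1))
```

In this model, larger batches and longer slices both push the throughput figure up, while the completion time of a batch (and the wait imposed on other tasks by a long slice) grows with them.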
The queueing view
Latency grows sharply as a system approaches full utilization. Pushing a server to 100 percent utilization maximizes throughput but causes queues, and with them waiting time, to explode. Latency-sensitive systems deliberately run with headroom, leaving capacity idle so bursts do not pile up.
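One way to see the shape of this curve is the textbook M/M/1 queue, which is an assumption brought in here for illustration (a single server, random Poisson arrivals, exponentially distributed service times). Under that model the mean time a request spends in the system is 1 / (service_rate * (1 - utilization)). The function name below is hypothetical.

```python
def mm1_mean_response_time(service_rate, utilization):
    """Mean time in system (waiting plus service) for an M/M/1 queue.

    service_rate: requests the server can complete per second.
    utilization:  fraction of time the server is busy, in [0, 1).
    """
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1); the queue is unstable at 1")
    return 1.0 / (service_rate * (1.0 - utilization))


if __name__ == "__main__":
    # Assumed server: 100 requests per second, so 10 ms of pure service time.
    for rho in (0.5, 0.9, 0.99):
        print(rho, mm1_mean_response_time(100.0, rho))
```

For a server that handles 100 requests per second, the mean response time in this model is about 20 ms at 50 percent utilization, about 100 ms at 90 percent, and about 1 s at 99 percent. Real systems will not follow the formula exactly, but the steep rise near full load is the general pattern.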
Picking a point
There is no single best answer. A batch analytics job favors throughput and tolerates delay. An interactive service favors latency and accepts lower peak utilization. Good systems make this trade-off explicit rather than stumbling into it.
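As a sketch of what making the trade-off explicit could look like, the helper below turns a mean response time target into a utilization ceiling. It assumes the textbook M/M/1 queue (single server, Poisson arrivals, exponential service times), where mean response time is 1 / (service_rate * (1 - utilization)); the function name and the example numbers are hypothetical.

```python
def max_utilization_for_target(service_rate, target_response_time):
    """Highest utilization that keeps M/M/1 mean response time at or under the target.

    service_rate:         requests the server can complete per second.
    target_response_time: desired mean time in system, in seconds.
    """
    service_time = 1.0 / service_rate
    if target_response_time < service_time:
        # Even an idle server takes one service time per request.
        raise ValueError("target is shorter than the bare service time")
    # Solve 1 / (service_rate * (1 - rho)) = target for rho.
    return 1.0 - 1.0 / (service_rate * target_response_time)


if __name__ == "__main__":
    # Assumed server: 100 requests per second.
    print(max_utilization_for_target(100.0, 0.05))  # interactive-style target of 50 ms
    print(max_utilization_for_target(100.0, 1.0))   # batch-style target of 1 s
```

Under these assumptions, a 50 ms target caps utilization at about 80 percent, while a 1 s target allows about 99 percent. The first leaves a fifth of the capacity idle as headroom; the second runs nearly flat out.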
Key idea
Latency and throughput pull against each other: batching and high utilization raise throughput but inflate waiting time. Systems therefore choose an operating point, either headroom for fast response or full load for maximum work.