Throughput versus Latency in Serving

The tension between serving many requests at once and answering each one quickly.

Two different goals

A serving system is judged on two metrics that often pull apart. Latency is how long a single request waits for its answer. Throughput is how many requests or tokens the system handles per second across all users. Tuning for one frequently hurts the other.

Why they conflict

Larger batches raise throughput because the GPU does more useful work per pass, but each request waits for the batch to assemble and runs alongside others, so its latency grows. Smaller batches answer one request fast but leave the GPU underused, lowering throughput.
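The effect can be illustrated with a toy cost model. This is a minimal sketch, not a measurement: it assumes one forward pass costs a fixed overhead plus a small per-request cost, and the names batch_step_ms and throughput_rps and the numbers 20 ms and 1 ms are hypothetical. It also ignores the time a request spends waiting for the batch to assemble, which would add further latency.

```python
def batch_step_ms(batch_size, fixed_ms=20.0, per_request_ms=1.0):
    """Toy model (assumed, not measured): time for one pass over a batch.

    fixed_ms is overhead paid once per pass; per_request_ms is the extra
    cost each additional request adds to that pass.
    """
    return fixed_ms + per_request_ms * batch_size


def throughput_rps(batch_size, fixed_ms=20.0, per_request_ms=1.0):
    """Requests completed per second if every pass runs a full batch."""
    step_s = batch_step_ms(batch_size, fixed_ms, per_request_ms) / 1000.0
    return batch_size / step_s


if __name__ == "__main__":
    for b in (1, 8, 32):
        # Each request in the batch experiences the full pass time as latency.
        print(f"batch={b:>2}  latency={batch_step_ms(b):.0f} ms  "
              f"throughput={throughput_rps(b):.1f} req/s")
```

Under these assumed numbers, moving from a batch of 1 to a batch of 32 raises per-request latency from 21 ms to 52 ms while throughput rises more than tenfold, because the fixed overhead is shared across more requests.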

Key measurements

For text generation, two latency components matter:

Time to first token (TTFT): how long before output starts, set mostly by the prefill phase.
Time per output token (TPOT): the steady pace of decoding afterward.
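Both measurements can be computed from timestamps recorded on the client. The sketch below assumes you have the time the request was sent and the arrival time of each output token, all in seconds on the same clock; the function name latency_metrics is hypothetical.

```python
def latency_metrics(request_sent, token_times):
    """Return (time_to_first_token, time_per_output_token) in seconds.

    request_sent: when the request was issued.
    token_times:  arrival time of each output token, in order.
    Time per output token is averaged over the tokens after the first,
    and is None when only one token was produced.
    """
    if not token_times:
        raise ValueError("need at least one token timestamp")
    ttft = token_times[0] - request_sent
    if len(token_times) == 1:
        return ttft, None
    decode_span = token_times[-1] - token_times[0]
    tpot = decode_span / (len(token_times) - 1)
    return ttft, tpot


if __name__ == "__main__":
    ttft, tpot = latency_metrics(0.0, [0.5, 0.75, 1.0, 1.25])
    print(ttft, tpot)  # 0.5 0.25
```

Excluding the first token from the per-token average keeps the prefill cost out of the decoding pace, so the two numbers describe the two phases separately.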

A chat assistant prioritizes low latency, while an offline batch job prioritizes throughput. Operators decide which matters more for the workload, usually express it as a latency budget, and then batch as aggressively as that budget allows.
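Batching up to a budget can be sketched as a simple search. The code assumes you can estimate per-request latency for a given batch size, for example from benchmarks; here a hypothetical linear model, toy_latency_ms, stands in for real measurements, and largest_batch_within_budget is an illustrative name rather than any library's API.

```python
def largest_batch_within_budget(budget_ms, latency_ms, max_batch=256):
    """Return the largest batch size whose estimated latency fits the budget.

    latency_ms: callable mapping batch size -> estimated latency in ms.
    Returns None if even a batch of 1 exceeds the budget.
    Every size is checked, so latency need not grow monotonically.
    """
    best = None
    for batch_size in range(1, max_batch + 1):
        if latency_ms(batch_size) <= budget_ms:
            best = batch_size
    return best


def toy_latency_ms(batch_size):
    """Assumed stand-in for benchmark data: 20 ms overhead + 1 ms per request."""
    return 20.0 + 1.0 * batch_size


if __name__ == "__main__":
    print(largest_batch_within_budget(50.0, toy_latency_ms))  # 30
    print(largest_batch_within_budget(10.0, toy_latency_ms))  # None
```

In practice the latency estimate would come from profiling the actual model and hardware, and the budget would usually be stated at a percentile rather than as an average.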

Key idea

Throughput and latency trade off through batch size, so serving means choosing a latency budget and batching as much as that budget permits.
