Two different goals
A serving system is judged on two metrics that often pull apart. Latency is how fast a single request gets its answer. Throughput is how many requests or tokens the system handles per second across everyone. Tuning for one frequently hurts the other.
Why they conflict
Larger batches raise throughput because the GPU does more useful work per pass, but each request must wait for the batch to assemble and then shares the pass with others, so its latency grows. Smaller batches answer each request quickly but leave the GPU underused, which lowers throughput.
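A toy cost model makes the tension concrete. The sketch below assumes, purely for illustration, that one GPU pass costs a fixed overhead plus a small amount per request in the batch. The function names and the numbers (20 ms fixed, 2 ms per request) are hypothetical, not measurements from any real system, and queueing time while the batch assembles is ignored.

```python
# Toy model (assumed, not measured): pass time = fixed overhead + per-request cost.
def pass_time_ms(batch_size, fixed_ms=20.0, per_request_ms=2.0):
    """Time for one GPU pass over a batch; also each request's latency in this model."""
    return fixed_ms + per_request_ms * batch_size


def throughput_rps(batch_size, fixed_ms=20.0, per_request_ms=2.0):
    """Requests completed per second when passes run back to back."""
    seconds_per_pass = pass_time_ms(batch_size, fixed_ms, per_request_ms) / 1000.0
    return batch_size / seconds_per_pass


if __name__ == "__main__":
    for b in (1, 8, 32):
        print(f"batch={b:>2}  latency={pass_time_ms(b):6.1f} ms  "
              f"throughput={throughput_rps(b):7.1f} req/s")
```

Under these assumptions both columns rise with batch size: the fixed overhead is shared by more requests, which raises throughput, while every request in the batch waits for the longer pass.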
Key measurements
For generation, two components of latency matter:
- Time to first token (TTFT): how long before output starts, set mostly by the prefill phase.
- Time per output token (TPOT): the steady pace of decoding afterward.
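The two measurements combine into a rough end-to-end latency for one request. The sketch below uses a common convention, assumed here rather than taken from the text: the first token arrives after the time to first token, and each remaining token takes one time-per-output-token interval. The function name and the example values are hypothetical.

```python
def end_to_end_latency_s(ttft_s, tpot_s, output_tokens):
    """Approximate total latency: first token, then (n - 1) decode steps."""
    if output_tokens < 1:
        raise ValueError("output_tokens must be at least 1")
    return ttft_s + (output_tokens - 1) * tpot_s


if __name__ == "__main__":
    # Hypothetical request: 0.5 s to first token, 50 ms per token, 101 tokens.
    total = end_to_end_latency_s(ttft_s=0.5, tpot_s=0.05, output_tokens=101)
    print(f"estimated latency: {total:.2f} s")
```

For long outputs the decoding term dominates, while for short answers the time to first token makes up most of the wait.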
A chat assistant prioritizes low latency, while an offline batch job prioritizes throughput. Operators decide which metric matters most for the workload, often fix a latency budget, and then batch as aggressively as that budget allows.
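Batching up to a budget can be sketched as a simple search: try each batch size and keep the largest one whose latency stays within the budget. The helper name, the `max_batch` cap, and the linear latency model in the example are all hypothetical. A real system would measure latency per batch size rather than assume a formula.

```python
def largest_batch_within_budget(budget_ms, latency_ms, max_batch=256):
    """Return the largest batch size whose latency fits the budget, or None."""
    best = None
    for batch_size in range(1, max_batch + 1):
        if latency_ms(batch_size) <= budget_ms:
            best = batch_size
    return best


def toy_latency_ms(batch_size):
    """Assumed model for illustration: 20 ms fixed cost plus 2 ms per request."""
    return 20.0 + 2.0 * batch_size


if __name__ == "__main__":
    print(largest_batch_within_budget(100.0, toy_latency_ms))  # budget met by some sizes
    print(largest_batch_within_budget(10.0, toy_latency_ms))   # no size fits: None
```

The search scans every size rather than stopping at the first failure, so it does not rely on latency rising monotonically with batch size.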
Key idea
Throughput and latency trade off through batch size, so serving means choosing a latency budget and batching as much as that budget permits.