Why batch at inference
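A GPU is most efficient when it processes many inputs at once. Serving one request at a time leaves the device underused, because fixed costs such as kernel launches and reading model weights from memory are paid again for every request. Batching groups requests so that a single kernel launch does more work, which raises throughput (the number of requests served per second).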
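To make the throughput argument concrete, here is a minimal sketch under a toy cost model: each batch pays a fixed overhead plus a per-item cost. The function name and the numbers (5 ms overhead, 0.5 ms per item) are illustrative assumptions, not measurements from any real GPU.

```python
def throughput(batch_size: int, overhead_ms: float = 5.0, per_item_ms: float = 0.5) -> float:
    """Requests per second under an assumed cost model:
    batch time = fixed overhead + per-item cost * batch size."""
    batch_time_ms = overhead_ms + per_item_ms * batch_size
    return batch_size / batch_time_ms * 1000.0


if __name__ == "__main__":
    # Larger batches amortize the fixed overhead over more requests.
    for b in (1, 8, 32):
        print(b, round(throughput(b), 1))
```

In this model throughput keeps rising with batch size but with diminishing returns, since the per-item cost eventually dominates the fixed overhead.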
The latency tradeoff
Batching helps throughput but can hurt latency for an individual request, which may wait for the batch to fill. The serving system must balance the two.
- A larger maximum batch size raises throughput but can add waiting time.
- A short batch timeout caps how long a request waits before the batch flushes, protecting tail latency.
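The two knobs above can be sketched as a small simulation. This is an illustrative model, not the behavior of any particular serving framework: `form_batches`, `max_batch_size`, and `timeout_ms` are hypothetical names, times are integer milliseconds, and model execution time is ignored. A batch opens when its first request arrives and flushes either when it is full or when the timeout since that first arrival expires.

```python
def form_batches(arrivals_ms, max_batch_size, timeout_ms):
    """Group arrival times into batches; return (flush_time_ms, members) pairs."""
    batches = []
    current = []      # arrival times of requests waiting in the open batch
    deadline = None   # flush time forced by the timeout for the open batch
    for t in sorted(arrivals_ms):
        # The open batch timed out before this request arrived: flush it.
        if current and t >= deadline:
            batches.append((deadline, current))
            current = []
        # The first request of a new batch starts the timeout clock.
        if not current:
            deadline = t + timeout_ms
        current.append(t)
        # A full batch flushes immediately, without waiting for the timeout.
        if len(current) == max_batch_size:
            batches.append((t, current))
            current = []
    if current:
        batches.append((deadline, current))
    return batches


def worst_wait_ms(batches):
    """Longest time any request spent waiting for its batch to flush."""
    return max(flush - t for flush, members in batches for t in members)


if __name__ == "__main__":
    batches = form_batches([0, 1, 2, 3, 20], max_batch_size=4, timeout_ms=10)
    print(batches, worst_wait_ms(batches))
```

In this sketch no request waits longer than `timeout_ms`, which is how the timeout bounds queueing delay; a smaller timeout tightens that bound at the price of smaller, less efficient batches.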
Dynamic and continuous batching
- Dynamic batching collects requests that arrive within a small window and runs them together, adapting to live traffic.
- Continuous batching, used for text generation, lets new requests join a running batch as soon as others finish generating, instead of waiting for the whole batch to complete. This keeps the GPU busy and is far more efficient for variable-length outputs, where a fixed batch would otherwise sit idle until its longest sequence finishes.
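The difference for variable-length outputs can be illustrated by counting decode steps in a toy simulation. The assumptions here are simplifications: each request needs a known number of output tokens (at least 1), the GPU has a fixed number of slots, and every step produces one token per active slot. The function names are hypothetical.

```python
from collections import deque


def static_steps(lengths, slots):
    """Fixed batches: each batch runs until its longest request finishes."""
    total = 0
    for i in range(0, len(lengths), slots):
        total += max(lengths[i:i + slots])
    return total


def continuous_steps(lengths, slots):
    """Continuous batching: a freed slot is refilled from the queue at the next step."""
    queue = deque(lengths)
    active = []  # tokens still to generate for each running request
    steps = 0
    while queue or active:
        # Admit waiting requests into any free slots.
        while queue and len(active) < slots:
            active.append(queue.popleft())
        steps += 1
        # Each active request emits one token; finished requests leave the batch.
        active = [r - 1 for r in active if r > 1]
    return steps


if __name__ == "__main__":
    lengths = [2, 8, 2, 2, 2, 2]  # assumed output lengths in tokens
    print(static_steps(lengths, slots=2), continuous_steps(lengths, slots=2))
```

With these assumed lengths and two slots, the static schedule needs 12 steps while the continuous one needs 10, because short requests no longer wait behind the 8-token request in their batch.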
Key idea
Batching raises GPU throughput at some cost to per-request latency; batch size and timeout tune that balance, while continuous batching keeps generation servers efficient under variable-length workloads.