The serving dilemma
Online inference requests arrive one at a time, but running them individually leaves the GPU underused. Dynamic batching lets a server group separate requests that arrive close together into one batch, raising utilization without changing the model.
How it works
The server holds a short queue and forms a batch when one of two limits is hit:
- A maximum batch size is reached.
- A maximum wait time elapses, so latency stays bounded.
This bounded wait is the core trade-off: waiting longer fills bigger batches and lifts throughput, but adds latency to each request.
The batching loop
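The two limits can be sketched as a small collection loop. This is a minimal illustration, not any particular server's implementation: the function name `collect_batch` and the parameters `max_batch_size` and `max_wait_s` are assumptions made for this example, and the wait budget is assumed to start when the first request of a batch arrives.

```python
import queue
import time


def collect_batch(requests, max_batch_size, max_wait_s):
    """Block for the first request, then gather more until a limit is hit.

    All names here are hypothetical; `requests` is a standard queue.Queue.
    """
    batch = [requests.get()]  # wait indefinitely for the first request
    # Assumption: the wait budget is measured from the first arrival.
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # time limit reached: ship a partial batch
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break  # nothing else arrived before the deadline
    return batch
```

With five requests already queued and `max_batch_size=4`, the first call returns four requests immediately because the size limit is hit, and the second call returns the remaining one after the wait time runs out. A serving loop would call this repeatedly and pass each batch to the model.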
Continuous batching for generation
Autoregressive models add a twist: sequences in the same batch finish after different numbers of tokens. With continuous batching the server does not wait for every sequence in a batch to finish. As soon as one sequence completes, a new request takes its slot, keeping the batch full token by token. This improves throughput for token-generation workloads, because short sequences no longer leave slots idle while the longest one finishes.
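The effect of refilling slots can be seen in a toy simulation. The sketch below is illustrative only: it assumes every decode step costs the same regardless of how many slots are occupied, it ignores prompt processing, and the function names and the `lengths` workload are made up for this example.

```python
from collections import deque


def continuous_batching_steps(lengths, max_slots):
    """Count decode steps when freed slots are refilled immediately.

    lengths: tokens each request must generate (hypothetical, all > 0).
    """
    waiting = deque(lengths)
    active = []  # remaining tokens for each occupied slot
    steps = 0
    while waiting or active:
        # Refill free slots before each decode step.
        while waiting and len(active) < max_slots:
            active.append(waiting.popleft())
        # One decode step produces one token for every active sequence.
        active = [n - 1 for n in active]
        steps += 1
        # Finished sequences leave, freeing their slots for the next step.
        active = [n for n in active if n > 0]
    return steps


def static_batching_steps(lengths, max_slots):
    """Baseline: each batch runs until its longest sequence finishes."""
    return sum(
        max(lengths[i:i + max_slots])
        for i in range(0, len(lengths), max_slots)
    )


if __name__ == "__main__":
    workload = [8, 2, 2, 2]  # one long request and three short ones
    print(static_batching_steps(workload, max_slots=2))      # 10
    print(continuous_batching_steps(workload, max_slots=2))  # 8
```

In this workload the static baseline spends 8 steps on the first batch, with one slot idle for 6 of them, then 2 more on the second batch. The continuous version feeds the short requests through the second slot while the long one is still running, so it finishes in 8 steps.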
Key idea
Dynamic batching groups concurrent requests under size and time limits to fill the GPU, and continuous batching keeps generation batches full by swapping in new requests as others finish.