The Inference Batching Dynamic

The serving dilemma

Online inference requests arrive one at a time, but running them individually leaves the GPU underused. Dynamic batching lets a server group separate requests that arrive close together into one batch, raising utilization without changing the model.

How it works

The server holds a short queue and forms a batch when one of two limits is hit:

A maximum batch size is reached.
A maximum wait time elapses, so latency stays bounded.

This bounded wait is the core trade-off: waiting longer fills bigger batches and lifts throughput, but it adds latency to each request.
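
To make the trade-off concrete, here is a rough back-of-the-envelope sketch in Python. It assumes requests arrive at a perfectly steady rate, which real traffic does not, and the function name and parameters (estimate_batch, rate_per_s, max_wait_ms) are made up for illustration.

```python
def estimate_batch(rate_per_s, max_batch_size, max_wait_ms):
    """Estimate batch size and the wait seen by the request that opens the batch.

    Assumes evenly spaced arrivals at rate_per_s (an idealization).
    """
    # Requests that arrive while the first one waits out the full window.
    arrivals = rate_per_s * max_wait_ms // 1000
    size = min(max_batch_size, 1 + arrivals)
    if size < max_batch_size:
        # The time limit fires first: the first request waits the whole window.
        wait_ms = float(max_wait_ms)
    else:
        # The size limit fires first: the batch closes once it is full.
        wait_ms = (max_batch_size - 1) * 1000 / rate_per_s
    return size, wait_ms


print(estimate_batch(200, 8, 10))   # short window: small batch, short wait
print(estimate_batch(200, 8, 50))   # long window: full batch, longer wait
```

Under these assumptions, at 200 requests per second a 10 ms window yields a batch of about 3, while a 50 ms window fills all 8 slots but makes the first request wait about 35 ms.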

The batching loop
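
The two limits above can be sketched as a short Python loop. This is a minimal illustration, not the implementation of any particular serving framework: it assumes requests land on a standard queue.Queue, and the names collect_batch, max_batch_size, and max_wait_s are hypothetical.

```python
import queue
import time


def collect_batch(requests, max_batch_size, max_wait_s):
    """Return one batch, closed by whichever limit is hit first."""
    # Block until at least one request exists; the wait window starts here.
    batch = [requests.get()]
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch_size:        # limit 1: maximum batch size
        remaining = deadline - time.monotonic()
        if remaining <= 0:                    # limit 2: maximum wait time
            break
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break                             # window expired while waiting
    return batch
```

A serving thread would call collect_batch repeatedly and pass each returned batch to the model in a single forward pass. Starting the timer at the first request, rather than at a fixed tick, keeps the added latency of any request at or below max_wait_s.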

Continuous batching for generation

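Autoregressive models add a twist: sequences in the same batch finish after different numbers of tokens, so a batch that runs until its longest sequence is done leaves the finished slots idle. With continuous batching the server does not wait for every sequence in a batch to finish. As soon as one sequence completes, a new request takes its slot, keeping the batch full token by token. This improves throughput for token-generation workloads, most of all when output lengths vary widely.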
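
The effect can be illustrated with a toy Python simulation that only counts decode steps. It is a sketch under simplifying assumptions: every decode step costs the same regardless of how many slots are occupied, all requests are already queued, and each entry in lengths is a positive number of tokens to generate. The function names are hypothetical.

```python
from collections import deque


def static_batching_steps(lengths, slots):
    """Decode steps when each batch runs until its longest sequence finishes."""
    steps = 0
    for i in range(0, len(lengths), slots):
        steps += max(lengths[i:i + slots])
    return steps


def continuous_batching_steps(lengths, slots):
    """Decode steps when a finished sequence's slot is refilled immediately."""
    waiting = deque(lengths)
    active = []                      # tokens still to generate per running sequence
    steps = 0
    while waiting or active:
        # Refill any free slots from the queue before the next decode step.
        while waiting and len(active) < slots:
            active.append(waiting.popleft())
        # One decode step emits one token for every active sequence;
        # sequences that reach zero remaining tokens free their slot.
        active = [r - 1 for r in active if r - 1 > 0]
        steps += 1
    return steps


lengths = [2, 8, 2, 8]               # tokens to generate per request
print(static_batching_steps(lengths, slots=2))
print(continuous_batching_steps(lengths, slots=2))
```

With two slots and these lengths, the static schedule needs 16 decode steps and the continuous schedule needs 12, because the short sequences no longer hold a slot idle while a long neighbor finishes.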

Key idea

Dynamic batching groups concurrent requests under size and time limits to fill the GPU, and continuous batching keeps generation batches full by swapping in new requests as others finish.
