Machine Learning

Dynamic Batching for Inference

Grouping incoming requests on the fly to boost serving throughput.

5 min read · advanced

The serving dilemma

Online inference requests arrive one at a time, but running them individually leaves the GPU underused. Dynamic batching lets a server group separate requests that arrive close together into one batch, raising utilization without changing the model.

How it works

The server holds a short queue and forms a batch when one of two limits is hit:

  • A maximum batch size is reached.
  • A maximum wait time elapses, so latency stays bounded.

This bounded wait is the core trade-off: waiting longer fills bigger batches and lifts throughput, but it adds latency to every request in the batch.

The batching loop
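
A minimal sketch of such a loop in Python, assuming requests arrive on a standard `queue.Queue`. The names `collect_batch`, `max_batch_size`, and `max_wait_s` are illustrative and not taken from any particular serving framework. The wait clock here starts when the first request of a batch arrives, which is one possible design choice among several.

```python
import queue
import time


def collect_batch(requests, max_batch_size=8, max_wait_s=0.01):
    """Gather one batch: stop at max_batch_size or when max_wait_s elapses.

    `requests` is assumed to be a queue.Queue of pending requests.
    """
    # Block until at least one request exists; an empty batch is useless.
    batch = [requests.get()]
    # The time limit is measured from the first request's arrival in the batch.
    deadline = time.monotonic() + max_wait_s

    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # time limit hit: run with what we have
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break  # nothing else arrived before the deadline
    return batch
```

A serving thread would call `collect_batch` in a loop and pass each returned list to the model as one batch. Raising `max_wait_s` tends to produce fuller batches at the cost of added per-request latency, which is the trade-off described above.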

Continuous batching for generation

Autoregressive models add a twist. With continuous batching, the server does not wait for every sequence in a batch to finish. As soon as one sequence completes, a new request takes its slot, so the batch stays full at every decoding step. This can substantially improve throughput for token-generation workloads, because slots freed by short sequences do not sit idle while longer ones finish.
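
The effect can be illustrated with a toy step counter. This is a simplified simulation, not a real scheduler: it assumes each decoding step produces one token for every active sequence, that each request's output length (a positive integer) is known up front, and that admitting a request costs nothing. The function names are hypothetical.

```python
from collections import deque


def static_batching_steps(lengths, slots):
    """Decode steps when each batch runs until its longest sequence ends."""
    steps = 0
    for i in range(0, len(lengths), slots):
        # Short sequences finish early but their slots stay idle.
        steps += max(lengths[i:i + slots])
    return steps


def continuous_batching_steps(lengths, slots):
    """Decode steps when freed slots are refilled before every step."""
    waiting = deque(lengths)
    active = []  # tokens still to generate for each running sequence
    steps = 0
    while waiting or active:
        # Admit waiting requests into any free slots.
        while waiting and len(active) < slots:
            active.append(waiting.popleft())
        # One decode step: every active sequence emits one token.
        active = [r - 1 for r in active]
        # Finished sequences leave and free their slots.
        active = [r for r in active if r > 0]
        steps += 1
    return steps
```

For four requests with output lengths `[2, 8, 2, 8]` and two slots, the static version takes 16 steps while the continuous version takes 12, because the slots freed by the short sequences are reused immediately.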

Key idea

Dynamic batching groups concurrent requests under size and time limits to fill the GPU, and continuous batching keeps generation batches full by swapping in new requests as others finish.

Check yourself


1. What two limits trigger a dynamic batch to run?

2. How does continuous batching help generation workloads?