Why batch at all
A GPU is fastest when it processes many inputs in one pass. Sending one request at a time leaves most of the hardware idle. Batching packs several requests together so each forward pass does more useful work.
The dynamic part
In production, requests arrive at unpredictable times. Dynamic batching holds a short wait window open, collects whatever requests arrive during it, then runs them as one batch. The server forms batches on the fly instead of requiring callers to group inputs themselves.
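As a rough illustration of batches being formed server-side, here is a minimal Python sketch. The class name `WindowBatcher`, the `run_batch` callable (a list of inputs in, a list of outputs out) and the 5 ms default window are all assumptions made for this example, not part of any particular serving framework. It also omits error handling: if `run_batch` raises, the worker thread stops.

```python
import queue
import threading
import time
from concurrent.futures import Future


class WindowBatcher:
    """Callers submit single inputs; a background worker groups them into batches."""

    def __init__(self, run_batch, window_s=0.005):
        self.run_batch = run_batch        # hypothetical: list of inputs -> list of outputs
        self.window_s = window_s          # assumed wait window, in seconds
        self.pending = queue.Queue()
        threading.Thread(target=self._loop, daemon=True).start()

    def submit(self, x):
        """Enqueue one input and return a Future for its individual result."""
        fut = Future()
        self.pending.put((x, fut))
        return fut

    def _loop(self):
        while True:
            items = [self.pending.get()]  # block until the first request arrives
            time.sleep(self.window_s)     # hold the window open
            while True:                   # drain whatever landed in the meantime
                try:
                    items.append(self.pending.get_nowait())
                except queue.Empty:
                    break
            inputs = [x for x, _ in items]
            outputs = self.run_batch(inputs)   # one pass for the whole batch
            for (_, fut), y in zip(items, outputs):
                fut.set_result(y)
```

Each caller only ever calls `submit(x)` and waits on its own future, for example `WindowBatcher(model_fn).submit(x).result()`, where `model_fn` stands in for a batched forward pass. The grouping happens entirely inside the worker loop.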
The core tradeoff
- A longer wait window builds bigger batches and raises throughput.
- A longer wait also adds latency, most of all for the first request in the batch, which sits through the entire window.
- A short window keeps latency low but wastes GPU capacity.
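The tradeoff in the list above can be made concrete with a small simulation. This is a sketch under simplifying assumptions: requests arrive exactly every 2 ms, there is no batch size cap, and GPU compute time is ignored, so "wait" means only the time a request spends waiting for its window to close. The function name `simulate` is made up for this example.

```python
def simulate(arrivals_ms, window_ms):
    """Group sorted arrival times into windows.

    Returns (mean batch size, mean wait in ms before the batch fires).
    """
    batch_sizes, waits = [], []
    i = 0
    while i < len(arrivals_ms):
        close = arrivals_ms[i] + window_ms    # window opens at the first arrival
        j = i
        while j < len(arrivals_ms) and arrivals_ms[j] <= close:
            waits.append(close - arrivals_ms[j])  # time until the batch fires
            j += 1
        batch_sizes.append(j - i)
        i = j
    mean_size = sum(batch_sizes) / len(batch_sizes)
    mean_wait = sum(waits) / len(waits)
    return mean_size, mean_wait


# Assumed workload: 20 requests, one every 2 ms.
arrivals = [2 * k for k in range(20)]
for window in (0, 4, 10):
    size, wait = simulate(arrivals, window)
    print(f"window={window:>2} ms  mean batch={size:.2f}  mean wait={wait:.2f} ms")
```

On this toy workload, widening the window from 0 ms to 10 ms raises the mean batch size from 1 to 5 and the mean wait from 0 ms to 5.4 ms. Real traffic is burstier, so the exact numbers will differ, but the direction of the tradeoff is the same.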
Tuning the window
Operators set a maximum wait time and a maximum batch size. The server fires a batch as soon as either limit is hit, balancing how busy the GPU stays against how long any request waits.
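The "either limit" rule can be sketched in a few lines of Python. This is an illustrative sketch rather than the logic of any specific server: the function name `collect_batch` and the parameter names `max_batch_size` and `max_wait_s` are assumptions, and the wait clock is assumed to start when the first request of the batch arrives.

```python
import queue
import time


def collect_batch(requests, max_batch_size, max_wait_s):
    """Return a batch as soon as it is full or the wait limit expires."""
    batch = [requests.get()]                  # block until the first request arrives
    deadline = time.monotonic() + max_wait_s  # wait clock starts with that request
    while len(batch) < max_batch_size:        # limit 1: batch is full
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break                             # limit 2: wait time used up
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break                             # window expired with no new request
    return batch
```

With five requests already queued, `collect_batch(q, max_batch_size=3, max_wait_s=0.05)` returns the first three immediately because the size limit is hit. A second call returns the remaining two only after roughly 50 ms, because the wait limit fires first.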
Key idea
Dynamic batching trades a small wait for much higher throughput by filling each GPU pass more fully. The maximum wait time and maximum batch size set where you land on the latency-versus-throughput curve.