Why static batching wastes time
The simplest server groups requests into a fixed batch and runs them together until all finish. Because requests generate different numbers of tokens, short ones sit idle waiting for the longest one. The batch runs at the pace of its slowest member, and the slots held by finished requests do no useful work until the whole batch completes.
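A small calculation makes the waste concrete. The sketch below is illustrative only: the function name and the output lengths are made up, and it counts one "slot-step" per sequence per decode step while ignoring prefill cost.

```python
def static_batch_utilization(output_lengths):
    """Return (steps, useful_slot_steps, utilization) for one static batch."""
    # The batch keeps running until its longest request finishes.
    steps = max(output_lengths)
    # Every slot is held for every step, whether or not it still generates.
    total_slot_steps = steps * len(output_lengths)
    # A slot does useful work only while its request is still generating.
    useful_slot_steps = sum(output_lengths)
    return steps, useful_slot_steps, useful_slot_steps / total_slot_steps


# Hypothetical output lengths (in tokens) for four requests in one batch.
steps, useful, util = static_batch_utilization([10, 20, 50, 200])
print(steps, useful, f"{util:.0%}")  # 200 280 35%
```

In this made-up batch, one long request keeps four slots occupied for 200 steps even though only 280 of the 800 slot-steps produce a token.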
Batching at the token level
Continuous batching, also called in-flight batching, makes scheduling decisions at every generation step rather than once per group of requests:
- When a request finishes, it leaves the batch immediately.
- A waiting request can join on the very next step.
- The batch composition changes constantly to stay full.
Because each decode step processes one token per sequence, requests of different lengths mix freely. There is no need to wait for the whole batch to complete.
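The scheduling loop described above can be sketched as a toy simulation. This is a minimal model under stated assumptions, not how any particular server implements it: the function names are hypothetical, each request is reduced to the number of tokens it will generate (at least one), prefill is ignored, and every decode step costs the same regardless of batch size.

```python
from collections import deque


def continuous_batching_steps(output_lengths, max_batch_size):
    """Count decode steps when slots are refilled at every step."""
    waiting = deque(output_lengths)  # tokens still to generate, per queued request
    running = []                     # remaining tokens for each request in the batch
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots before this step.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())
        # One decode step: every running sequence produces one token.
        running = [remaining - 1 for remaining in running]
        steps += 1
        # Finished requests leave immediately, freeing their slots.
        running = [remaining for remaining in running if remaining > 0]
    return steps


def static_batching_steps(output_lengths, max_batch_size):
    """Count decode steps when each fixed batch runs until its longest request ends."""
    steps = 0
    for start in range(0, len(output_lengths), max_batch_size):
        steps += max(output_lengths[start:start + max_batch_size])
    return steps


# Hypothetical workload: eight requests, four slots.
lengths = [200, 10, 20, 50, 30, 40, 10, 60]
print(static_batching_steps(lengths, 4))      # 260
print(continuous_batching_steps(lengths, 4))  # 200
```

In this example the continuous scheduler finishes in the time the longest request needs, because the other seven requests pass through the three remaining slots while it runs.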
Impact
By keeping the batch packed, continuous batching raises throughput and shortens the time new requests wait before starting. It pairs naturally with paged attention, which stores the KV cache in fixed-size blocks so that memory can be allocated and freed cheaply as requests join and leave.
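To illustrate why block-based KV storage makes joining and leaving cheap, here is a hedged sketch of a free-list block allocator. The class, method names, and sizes are hypothetical and only track bookkeeping. A real system also stores the key and value tensors in those blocks.

```python
class BlockAllocator:
    """Toy bookkeeping for a KV cache split into fixed-size blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size              # tokens per block
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}                    # request id -> list of block ids
        self.lengths = {}                         # request id -> tokens stored

    def append_token(self, request_id):
        """Reserve space for one more token, taking a new block only when needed."""
        table = self.block_tables.setdefault(request_id, [])
        length = self.lengths.get(request_id, 0)
        if length == len(table) * self.block_size:  # all owned blocks are full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.lengths[request_id] = length + 1

    def free(self, request_id):
        """Return all of a finished request's blocks to the free list."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)


allocator = BlockAllocator(num_blocks=4, block_size=16)
for _ in range(17):                     # 17 tokens need two 16-token blocks
    allocator.append_token("req-a")
print(len(allocator.block_tables["req-a"]), len(allocator.free_blocks))  # 2 2
allocator.free("req-a")
print(len(allocator.free_blocks))       # 4
```

Admitting a request costs nothing up front, and releasing one is a list extend, so the scheduler can change the batch at every step without copying or compacting memory.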
Key idea
Continuous batching adjusts the batch at every token step, letting finished requests leave and new ones join so the GPU stays busy whenever requests are waiting.