Continuous Batching

Adding and removing requests from a running batch every step to keep the GPU busy.

Why static batching wastes time

The simplest server groups requests into a fixed batch and runs them together until all of them finish. Because requests generate different numbers of tokens, the short ones finish early and their batch slots sit idle until the longest one completes. The batch lasts as long as its slowest member, and the GPU capacity behind the idle slots is wasted.
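To put a number on that waste, here is a minimal sketch that counts how many slot-steps in a static batch do useful work. The function name and the output lengths are hypothetical, chosen only for illustration, and the model ignores prefill and memory limits.

```python
def static_batch_utilization(output_lengths):
    """Fraction of slot-steps that produce a token when the whole batch
    runs until its longest request finishes."""
    steps = max(output_lengths)                 # batch lasts as long as the longest request
    useful = sum(output_lengths)                # slot-steps that actually emit a token
    total = steps * len(output_lengths)         # slot-steps the batch occupies
    return useful / total

# Hypothetical output lengths (in tokens) for one batch of four requests.
lengths = [10, 20, 40, 200]
utilization = static_batch_utilization(lengths)
print(f"useful slot-steps: {utilization:.1%}")  # 270 of 800 slot-steps
```

In this made-up batch, roughly two thirds of the slot-steps are spent waiting on the 200-token request.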

Batching at the token level

Continuous batching, also called in-flight batching, reschedules the batch at every generation step rather than once per group of requests:

When a request finishes, it leaves the batch immediately.
A waiting request can join on the very next step.
The batch composition changes constantly to stay full.

Because each decode step processes one token per sequence, requests of different lengths mix freely. There is no need to wait for the whole batch to complete.
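The scheduling rule above can be sketched as a small simulation. This is a simplified model under stated assumptions, not how any particular serving engine is implemented: every request needs at least one token, each decode step costs the same regardless of batch content, and prefill and KV cache limits are ignored. The function names and the workload are hypothetical.

```python
from collections import deque

def static_batching_steps(output_lengths, max_batch_size):
    """Decode steps needed when fixed batches each run until their longest request ends."""
    return sum(
        max(output_lengths[i:i + max_batch_size])
        for i in range(0, len(output_lengths), max_batch_size)
    )

def continuous_batching_steps(output_lengths, max_batch_size):
    """Decode steps needed when slots are refilled at every step.
    Returns (total_steps, finish_step_by_request_id)."""
    waiting = deque(enumerate(output_lengths))  # (request id, tokens to generate)
    running = {}                                # request id -> tokens remaining
    finished_at = {}
    step = 0
    while waiting or running:
        # Admit waiting requests into any free slots before the next step.
        while waiting and len(running) < max_batch_size:
            request_id, tokens = waiting.popleft()
            running[request_id] = tokens
        step += 1
        # One decode step: every running sequence emits exactly one token.
        for request_id in list(running):
            running[request_id] -= 1
            if running[request_id] == 0:
                del running[request_id]         # finished requests leave immediately
                finished_at[request_id] = step
    return step, finished_at

# Hypothetical workload: four requests, two batch slots.
lengths = [2, 5, 3, 1]
print(static_batching_steps(lengths, 2))        # 8
print(continuous_batching_steps(lengths, 2)[0]) # 6
```

In this toy workload the third request takes over the slot freed after step 2 instead of waiting for the five-token request to finish, which is where the two saved steps come from.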

Impact

By keeping the batch packed, continuous batching raises throughput, often substantially when output lengths vary widely, and shortens the time new requests wait before starting. It pairs naturally with paged attention, which makes allocating and freeing KV cache memory cheap as requests join and leave.

Key idea

Continuous batching adjusts the batch at every token step, letting finished requests leave and new ones join so the GPU stays busy.
