The Prefill and Decode Phases

Why processing the prompt and generating tokens have very different performance profiles.

Two phases of generation

Serving a request splits into two phases. Prefill processes the whole input prompt at once to build the KV cache and produce the first token. Decode then generates the rest, one token per step, each step reading the growing cache.
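The two phases can be sketched with a toy single-head attention layer. This is a minimal illustration, not how a real serving engine is written: the names `attend`, `prefill`, `decode_step`, and `generate` are hypothetical, the query, key, and value projections are assumed to be the identity, and each attention output is fed back as the next input in place of sampling a token.

```python
import math

def attend(query, keys, values):
    """Softmax attention of one query vector over all cached keys and values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [
        sum(w * value[d] for w, value in zip(weights, values)) / total
        for d in range(len(values[0]))
    ]

def prefill(prompt_vecs):
    """Process every prompt position, building the KV cache.

    A real model handles these positions in parallel; the loop here only
    keeps the toy readable. Identity projections are an assumption.
    """
    cache = {"k": [], "v": []}
    out = None
    for x in prompt_vecs:
        cache["k"].append(x)
        cache["v"].append(x)
        out = attend(x, cache["k"], cache["v"])  # causal: sees positions <= own
    return cache, out  # the last output stands in for the first token

def decode_step(cache, x):
    """Append one new position to the cache, then attend over all of it."""
    cache["k"].append(x)
    cache["v"].append(x)
    return attend(x, cache["k"], cache["v"])

def generate(prompt_vecs, num_new_tokens):
    """Run prefill once, then one decode step per remaining token."""
    cache, out = prefill(prompt_vecs)
    outputs = [out]
    reads_per_step = []  # cache entries read by each decode step
    for _ in range(num_new_tokens - 1):
        out = decode_step(cache, out)
        reads_per_step.append(len(cache["k"]))
        outputs.append(out)
    return outputs, reads_per_step

outputs, reads = generate([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], num_new_tokens=4)
print(reads)  # cache entries read grows by one each decode step
```

Note that `prefill` runs once per request, while `decode_step` runs once per generated token and touches one more cache entry each time.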

Different bottlenecks

The phases stress hardware differently:

Prefill handles all prompt tokens in parallel, so it runs large matrix multiplications and is usually compute-bound, keeping the GPU's arithmetic units busy.
Decode processes a single token per sequence per step but must read the model weights and the full KV cache each time, so it is usually memory-bandwidth-bound and leaves compute underused.
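A rough roofline-style estimate makes the contrast concrete. The sketch below counts only the two attention matrix products (query times keys, weights times values), ignores softmax and causal masking, and assumes each operand is moved through memory exactly once. The function names and the hardware figures (300 TFLOP/s peak, 2 TB/s bandwidth) are illustrative assumptions, not the specification of any particular GPU.

```python
def attention_intensity(num_queries, context_len, hidden_dim, bytes_per_elem=2):
    """FLOPs per byte for attention of num_queries over context_len cached positions."""
    # Two matrix products, each costing 2 * num_queries * context_len * hidden_dim FLOPs.
    flops = 4 * num_queries * context_len * hidden_dim
    # Read K and V for the whole context; read Q and write the output per query.
    elems_moved = 2 * context_len * hidden_dim + 2 * num_queries * hidden_dim
    return flops / (elems_moved * bytes_per_elem)

def classify(intensity, peak_flops, mem_bandwidth):
    """Compare arithmetic intensity with the hardware's FLOP-per-byte ridge point."""
    ridge = peak_flops / mem_bandwidth
    return "compute-bound" if intensity >= ridge else "memory-bound"

# Hypothetical accelerator, chosen only for illustration.
PEAK_FLOPS = 300e12      # FLOP/s
MEM_BANDWIDTH = 2e12     # bytes/s, so the ridge point is 150 FLOPs per byte

prefill_i = attention_intensity(num_queries=2048, context_len=2048, hidden_dim=4096)
decode_i = attention_intensity(num_queries=1, context_len=2048, hidden_dim=4096)

print("prefill:", prefill_i, classify(prefill_i, PEAK_FLOPS, MEM_BANDWIDTH))
print("decode: ", decode_i, classify(decode_i, PEAK_FLOPS, MEM_BANDWIDTH))
```

Under these assumptions prefill performs on the order of a thousand FLOPs per byte moved, while a decode step performs about one, which is why the first sits above the ridge point and the second far below it.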

Because their profiles differ, naively mixing them in one batch can hurt latency. Decode steps are short and frequent, while a long prefill occupies the GPU for much longer, so every decoding request scheduled behind it stalls until the prefill finishes.

Serving implications

Time to first token (TTFT) is dominated by prefill, so it grows with prompt length.
Per-token speed comes from decode and is set mainly by how much memory each step must read and by batch size.
Systems may separate prefill and decode onto different workers, or chunk long prefills, so a big prompt does not block ongoing decoding.
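Chunking a long prefill can be sketched as a per-step token budget. This is a simplified illustration under stated assumptions: the function `chunked_prefill_schedule` and its parameters are hypothetical, every in-flight decode request contributes exactly one token per step, and whatever budget remains goes to the next slice of the prompt. Real schedulers handle many prompts, priorities, and memory limits.

```python
def chunked_prefill_schedule(prompt_len, num_decoding, token_budget):
    """Split one prompt into chunks that share each step with ongoing decodes.

    prompt_len:   number of prompt tokens still to prefill
    num_decoding: requests already decoding (one token each per step)
    token_budget: maximum tokens processed in a single step
    """
    if token_budget <= num_decoding:
        raise ValueError("token_budget must leave room for prefill tokens")
    steps = []
    remaining = prompt_len
    while remaining > 0:
        # Decodes are served first; the prefill chunk takes what is left.
        chunk = min(remaining, token_budget - num_decoding)
        steps.append({"decode_tokens": num_decoding, "prefill_tokens": chunk})
        remaining -= chunk
    return steps

for step in chunked_prefill_schedule(prompt_len=10, num_decoding=1, token_budget=4):
    print(step)
```

In this sketch the decoding request advances by one token in every step instead of waiting for all ten prompt tokens, at the cost of the new request's prefill being spread over several steps.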

Key idea

Prefill processes the prompt in parallel and is compute-bound, while decode generates token by token and is memory-bandwidth-bound, so serving systems often treat the two phases separately.
