Two phases of generation
Serving a request splits into two phases. Prefill processes the whole input prompt at once to build the KV cache and produce the first token. Decode then generates the rest, one token per step, each step reading the growing cache.
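The two phases can be sketched with a toy single-head attention loop. This is a minimal illustration, not a real model: `embed`, `pick_token`, and the cache layout are hypothetical stand-ins, and the prompt is looped over token by token where a real system would process it in one parallel pass.

```python
import math

def embed(token, dim=4):
    """Hypothetical deterministic embedding; stands in for a real model."""
    return [math.sin(token * (i + 1)) for i in range(dim)]

def attend(q, keys, values):
    """Softmax attention of one query vector over every cached key/value."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]

def pick_token(out, vocab=50):
    """Hypothetical sampler: maps an output vector to a token id."""
    return int(abs(sum(out)) * 1000) % vocab

def prefill(prompt):
    """Process the whole prompt and build the KV cache."""
    cache = {"k": [], "v": []}
    for tok in prompt:  # a real system does this in one parallel pass
        x = embed(tok)
        cache["k"].append(x)
        cache["v"].append(x)
    out = attend(embed(prompt[-1]), cache["k"], cache["v"])
    return cache, out

def decode_step(cache, token):
    """Append one entry to the cache, then attend over the whole cache."""
    x = embed(token)
    cache["k"].append(x)
    cache["v"].append(x)
    return attend(x, cache["k"], cache["v"])

def generate(prompt, max_new_tokens):
    cache, out = prefill(prompt)
    tokens = [pick_token(out)]          # the first token comes from prefill
    cache_sizes = [len(cache["k"])]
    while len(tokens) < max_new_tokens:
        out = decode_step(cache, tokens[-1])
        cache_sizes.append(len(cache["k"]))
        tokens.append(pick_token(out))
    return tokens, cache_sizes

tokens, cache_sizes = generate([3, 1, 4, 1, 5], max_new_tokens=4)
print(tokens, cache_sizes)
```

The cache grows by exactly one entry per decode step, and each step reads all of it, which is the access pattern the rest of this section builds on.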
Different bottlenecks
The phases stress hardware differently:
- Prefill handles all prompt tokens in parallel, so it does a lot of arithmetic for every byte of weights it loads and is usually compute-bound, keeping the GPU's arithmetic units busy.
- Decode processes a single token per sequence per step but must still read the model weights and the full KV cache each time, so it is usually memory-bandwidth-bound and leaves compute underused.
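The contrast can be put in rough numbers with an arithmetic-intensity estimate (FLOPs per byte moved). The sketch below is a back-of-envelope model under stated assumptions: about 2 FLOPs per parameter per token, weights read once per pass, attention FLOPs and activation traffic ignored. The hardware figures are hypothetical placeholders, not the specification of any real GPU.

```python
def arithmetic_intensity(tokens_per_pass, n_params,
                         bytes_per_param=2, kv_cache_bytes=0):
    """Approximate FLOPs per byte for one forward pass.

    Assumes ~2 FLOPs per parameter per token and that the weights
    (plus any KV cache) are read from memory once per pass.
    """
    flops = 2 * tokens_per_pass * n_params
    bytes_moved = bytes_per_param * n_params + kv_cache_bytes
    return flops / bytes_moved

def bound(intensity, peak_flops, mem_bandwidth):
    """Classify against the ridge point (peak FLOP/s divided by bytes/s)."""
    ridge = peak_flops / mem_bandwidth
    return "compute-bound" if intensity >= ridge else "memory-bound"

# Hypothetical accelerator and model size, chosen only for illustration.
PEAK_FLOPS = 300e12        # FLOP/s
MEM_BANDWIDTH = 2e12       # bytes/s
N_PARAMS = 7_000_000_000   # parameters, stored in 2 bytes each

prefill_ai = arithmetic_intensity(tokens_per_pass=2048, n_params=N_PARAMS)
decode_ai = arithmetic_intensity(tokens_per_pass=1, n_params=N_PARAMS,
                                 kv_cache_bytes=1_000_000_000)

print("prefill:", prefill_ai, bound(prefill_ai, PEAK_FLOPS, MEM_BANDWIDTH))
print("decode: ", decode_ai, bound(decode_ai, PEAK_FLOPS, MEM_BANDWIDTH))
```

Under these assumptions, prefill's intensity scales with the number of prompt tokens, while a single-sequence decode step stays near one FLOP per byte. Batching more sequences into a decode step raises `tokens_per_pass` and is the usual way to recover some of the unused compute.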
Because their profiles differ, mixing them in one batch can hurt. Decode steps are short and frequent, so a long prefill scheduled alongside them stalls every decoding request until it finishes.
Serving implications
- Time to first token (TTFT) is dominated by prefill, so it grows with prompt length.
- Per-token speed comes from decode and is set mainly by how much cache each step reads and by batch size.
- Systems may separate prefill and decode onto different workers, or chunk long prefills, so a big prompt does not block ongoing decoding.
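The effect of chunking a long prefill can be illustrated with a toy scheduler simulation. The cost model is an assumption made up for illustration (a fixed per-iteration cost plus a per-token cost, with hypothetical constants), not a measurement of any real system. It tracks one ongoing decode request that shares each iteration with a newly arrived prompt.

```python
def simulate(prompt_len, chunk_size=None, ms_per_token=0.05, ms_fixed=10.0):
    """Return (time to first token for the new request,
               worst gap between tokens for the ongoing decode request).

    chunk_size=None means the whole prompt is prefilled in one iteration.
    """
    chunk = prompt_len if chunk_size is None else chunk_size
    remaining = prompt_len
    clock_ms = 0.0
    worst_gap_ms = 0.0
    while remaining > 0:
        step_tokens = min(chunk, remaining)
        # One iteration: a slice of the prefill plus one decode token
        # for the request that is already generating.
        step_ms = ms_fixed + ms_per_token * (step_tokens + 1)
        clock_ms += step_ms
        worst_gap_ms = max(worst_gap_ms, step_ms)
        remaining -= step_tokens
    return clock_ms, worst_gap_ms

ttft_whole, gap_whole = simulate(prompt_len=8000)
ttft_chunked, gap_chunked = simulate(prompt_len=8000, chunk_size=512)

print(f"unchunked: TTFT {ttft_whole:.1f} ms, worst decode gap {gap_whole:.1f} ms")
print(f"chunked:   TTFT {ttft_chunked:.1f} ms, worst decode gap {gap_chunked:.1f} ms")
```

In this model, chunking shrinks the worst pause seen by the ongoing decode request, at the price of a somewhat later first token for the new prompt because the fixed per-iteration cost is paid once per chunk. Placing prefill and decode on separate workers avoids the interference altogether but requires moving the KV cache between them.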
Key idea
Prefill processes the prompt in parallel and is compute-bound, while decode generates token by token and is memory-bound, so serving systems often treat the two phases separately.