Machine Learning

The Prefill and Decode Phases

Why processing the prompt and generating tokens have very different performance profiles.

Two phases of generation

Serving a request splits into two phases. Prefill processes the whole input prompt at once to build the KV cache and produce the first token. Decode then generates the rest, one token per step, each step reading the growing cache.
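The two phases can be sketched with a toy single-head attention layer in NumPy. This is an illustrative sketch under stated assumptions, not how any particular serving system is implemented: the width D, the random weights Wq, Wk, Wv, and the helpers prefill and decode_step are all hypothetical, and the "token" fed back at each step is simply the previous output vector rather than a sampled vocabulary entry.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # hypothetical model width for this toy example

# Hypothetical projection weights for one attention head.
Wq, Wk, Wv = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))

def attend(q, K, V, causal=False):
    """Scaled dot-product attention of queries q over keys K and values V."""
    scores = q @ K.T / np.sqrt(D)
    if causal:
        # Position i may only attend to positions <= i.
        mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores = np.where(mask, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

def prefill(prompt):
    """Process every prompt position in one pass and build the KV cache."""
    K, V = prompt @ Wk, prompt @ Wv
    out = attend(prompt @ Wq, K, V, causal=True)
    return out[-1], K, V  # last position's output stands in for the first token

def decode_step(x, K, V):
    """Process one new position: append to the cache, then read all of it."""
    K = np.vstack([K, x @ Wk])
    V = np.vstack([V, x @ Wv])
    out = attend((x @ Wq)[None, :], K, V)[0]
    return out, K, V

prompt = rng.standard_normal((5, D))
x, K, V = prefill(prompt)          # one parallel pass over 5 positions
cache_sizes = [K.shape[0]]
for _ in range(3):                 # three sequential decode steps
    x, K, V = decode_step(x, K, V)
    cache_sizes.append(K.shape[0])
print(cache_sizes)  # [5, 6, 7, 8]
```

Prefill does one large matrix multiplication over all five positions, while each decode step handles a single row but touches every cached key and value, and the cache grows by one row per step.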

Different bottlenecks

The phases stress hardware differently:

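  • Prefill handles all prompt tokens in parallel, so each weight loaded from memory is reused across many tokens. The work is dominated by large matrix multiplications, which makes prefill usually compute bound and keeps the GPU's arithmetic units busy.
  • Decode processes a single token per step for each request but must read the model weights and the full KV cache each time, so it does very little math per byte loaded. It is usually memory bandwidth bound and leaves compute underused.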
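A back-of-the-envelope way to see the difference is arithmetic intensity: floating-point operations per byte moved from memory. The sketch below is a rough estimate, not a profiler. The hardware numbers (PEAK_FLOPS, PEAK_BW), the model width of 4096, and the 2-byte element size are assumptions chosen for illustration, and the formulas ignore softmax, caches, and kernel overheads.

```python
def matmul_intensity(n_tokens, d_model, bytes_per_elem=2):
    """FLOPs per byte for multiplying n_tokens activations by a d x d weight."""
    flops = 2 * n_tokens * d_model * d_model              # multiply + add
    weight_bytes = bytes_per_elem * d_model * d_model     # read weights once
    act_bytes = bytes_per_elem * 2 * n_tokens * d_model   # read input, write output
    return flops / (weight_bytes + act_bytes)

def decode_attention_intensity(cache_len, d_model, bytes_per_elem=2):
    """FLOPs per byte for one decode token attending over the KV cache."""
    flops = 4 * cache_len * d_model                       # q.K^T plus weights.V
    cache_bytes = bytes_per_elem * 2 * cache_len * d_model  # read K and V
    return flops / cache_bytes

# Hypothetical accelerator: 300 TFLOP/s peak compute, 2 TB/s memory bandwidth.
PEAK_FLOPS = 300e12
PEAK_BW = 2e12
RIDGE = PEAK_FLOPS / PEAK_BW  # intensity above which compute is the limit

def bound(n_tokens, d_model=4096):
    """Classify a weight matmul as compute or memory bound on this device."""
    return "compute" if matmul_intensity(n_tokens, d_model) > RIDGE else "memory"

print(matmul_intensity(2048, 4096), bound(2048))  # about 1024 FLOPs/byte, compute
print(matmul_intensity(1, 4096), bound(1))        # just under 1 FLOP/byte, memory
print(decode_attention_intensity(1000, 4096))     # 1.0
```

Under these assumptions a 2048-token prefill sits far above the ridge point of 150 FLOPs per byte, while a single-token decode step sits far below it. The attention read over the KV cache stays near 1 FLOP per byte no matter how long the cache gets.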

Because their profiles differ, naively mixing them in one batch can hurt latency. Decode steps are short and frequent, while a long prefill occupies the GPU for much longer and stalls every decode request scheduled behind it.

Serving implications

  • Time to first token comes from prefill, so prompt length drives it.
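  • Per token speed comes from decode and is set mainly by how much memory each step must read (weights and KV cache) and by batch size, since a larger batch shares each weight read across more requests.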
  • Systems may separate prefill and decode onto different workers, or chunk long prefills, so a big prompt does not block ongoing decoding.
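The effect of chunking a long prefill can be sketched with a toy scheduler. This is a simplified model rather than any real system's scheduling policy: the function iteration_sizes is hypothetical, every iteration is assumed to advance each ongoing decode request by one token, and iteration time is assumed to grow with the number of tokens processed.

```python
def iteration_sizes(prompt_len, n_decoding, chunk=None):
    """Tokens processed per scheduler iteration until the prefill finishes.

    Each iteration runs one slice of the prompt plus one token for every
    ongoing decode request. With chunk=None the prompt runs in one slice.
    """
    chunk = chunk or prompt_len
    sizes = []
    remaining = prompt_len
    while remaining > 0:
        step = min(chunk, remaining)
        sizes.append(step + n_decoding)
        remaining -= step
    return sizes

unchunked = iteration_sizes(4096, n_decoding=8)
chunked = iteration_sizes(4096, n_decoding=8, chunk=512)

# The largest iteration is a proxy for how long a decode request waits
# between two of its own tokens while the prefill is in flight.
print(max(unchunked))             # 4104
print(max(chunked), len(chunked)) # 520 8
```

In this model chunking trades one long stall for several short iterations: decode requests keep emitting tokens throughout, at the cost of the new request's first token arriving after eight iterations instead of one.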

Key idea

Prefill processes the prompt in parallel and is compute bound, while decode generates token by token and is memory bound, so serving systems often treat the two phases separately.

Check yourself

1. Which phase is usually compute bound?

2. Why is decode often memory bandwidth bound?

3. What does prefill mainly determine for the user?