The latency problem
Autoregressive decoding generates one token per forward pass of the large model, so each token pays full model latency. Speculative decoding breaks this serial bottleneck without changing the output distribution.
The draft and verify loop
- A small fast draft model proposes several tokens ahead.
- The large target model scores all proposed tokens in a single parallel pass.
- A verification rule accepts a prefix of the draft and rejects the rest.
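The loop above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: it assumes greedy decoding (a draft token is accepted only if it equals the target's top choice), and `toy_draft`, `toy_target`, `speculative_step`, and `generate` are hypothetical names introduced for the example. The toy models are plain functions over integer tokens, and the target "pass" is written as a loop even though a real model would score all positions in one batched call.

```python
from typing import Callable, List

Token = int
# Hypothetical interface: a model maps a context to its greedy next token.
NextToken = Callable[[List[Token]], Token]


def toy_target(ctx: List[Token]) -> Token:
    """Toy 'large' model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    """Toy 'small' model: agrees with the target except right after token 4."""
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


def speculative_step(prefix: List[Token], draft: NextToken,
                     target: NextToken, k: int) -> List[Token]:
    """Run one draft-and-verify round and return the tokens it appends."""
    # 1. The draft proposes k tokens autoregressively (cheap).
    proposed: List[Token] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # 2. The target scores every position. A real model would do this in a
    #    single parallel forward pass; this loop only mimics the result.
    target_choice = [target(prefix + proposed[:i]) for i in range(k + 1)]

    # 3. Accept the longest prefix on which draft and target agree.
    accepted: List[Token] = []
    for i in range(k):
        if proposed[i] != target_choice[i]:
            break
        accepted.append(proposed[i])

    # The target always contributes one token: the correction at the first
    # mismatch, or an extra token if all k drafts were accepted.
    accepted.append(target_choice[len(accepted)])
    return accepted


def generate(prompt: List[Token], draft: NextToken, target: NextToken,
             k: int, n_tokens: int) -> List[Token]:
    """Repeat draft-and-verify rounds until n_tokens have been generated."""
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        out.extend(speculative_step(out, draft, target, k))
    return out[: len(prompt) + n_tokens]
```

Under these assumptions, `generate([0], toy_draft, toy_target, 3, 12)` produces the same sequence as decoding with `toy_target` alone, while calling the target for verification in fewer rounds than there are tokens.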
Why the output stays correct
The acceptance test compares target and draft probabilities so that every emitted token is distributed exactly as if it had been sampled from the target model.
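- Each draft token x is accepted with probability min(1, p(x)/q(x)), where p is the target probability and q is the draft probability.
- On the first rejection, a corrected token is sampled from the residual distribution proportional to max(0, p - q), and the remaining draft tokens are discarded.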
- This guarantees the final samples match plain target sampling.
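A small simulation can make the guarantee concrete for a single position. The sketch below assumes the standard rule of accepting a draft token x with probability min(1, p(x)/q(x)) and otherwise resampling from the normalized residual max(0, p - q). The function names `verify_token` and `empirical_distribution`, and the example distributions, are illustrative choices rather than anything fixed by the text.

```python
import random
from typing import List, Sequence, Tuple


def verify_token(x: int, p: Sequence[float], q: Sequence[float],
                 rng: random.Random) -> Tuple[bool, int]:
    """Verify one draft token x that was sampled from q.

    Returns (accepted, token). If x is rejected, token is a corrected sample
    drawn from the residual distribution proportional to max(0, p - q).
    """
    # q[x] > 0 because x was sampled from q.
    if rng.random() < min(1.0, p[x] / q[x]):
        return True, x
    # A rejection implies p[x] < q[x], so p != q and the residual has
    # positive total mass.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    token = rng.choices(range(len(p)), weights=residual, k=1)[0]
    return False, token


def empirical_distribution(p: Sequence[float], q: Sequence[float],
                           n: int, seed: int = 0) -> List[float]:
    """Draft from q, verify against p, and return observed token frequencies."""
    rng = random.Random(seed)
    counts = [0] * len(p)
    for _ in range(n):
        x = rng.choices(range(len(q)), weights=q, k=1)[0]  # draft proposal
        _, token = verify_token(x, p, q, rng)
        counts[token] += 1
    return [c / n for c in counts]
```

With, say, `p = [0.6, 0.3, 0.1]` and `q = [0.2, 0.5, 0.3]`, the frequencies returned by `empirical_distribution(p, q, 100000)` should come out close to `p`, not `q`, even though every proposal came from the draft.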
What governs the speedup
- The gain grows with the acceptance rate, which is higher when draft and target agree.
- A draft model that is too large costs enough per proposal to erase the savings; one that is too weak gets few of its tokens accepted.
- Variants draft with the target model itself, attach extra prediction heads, or verify a tree of candidate drafts to raise the number of tokens accepted per pass.
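The trade-off in these bullets can be explored with a back-of-the-envelope model. The sketch below rests on simplifying assumptions that are not in the text: each draft token is accepted independently with the same probability `alpha`, the draft proposes `k` tokens per round, and one draft step costs a fraction `c` of one target pass. The function names are hypothetical.

```python
def expected_tokens_per_round(alpha: float, k: int) -> float:
    """Expected tokens emitted per round: accepted drafts plus one target token.

    Assumes i.i.d. acceptance with rate alpha, which gives the geometric sum
    1 + alpha + ... + alpha**k.
    """
    if alpha == 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, c: float) -> float:
    """Estimated speedup over plain decoding, in units of target-pass time.

    One round costs k draft steps (k * c) plus one target pass (1), whereas
    plain decoding pays one target pass per token.
    """
    return expected_tokens_per_round(alpha, k) / (k * c + 1.0)
```

Under this model, `estimated_speedup(0.8, 4, 0.05)` is about 2.8, while a draft as expensive as the target (`c = 1.0`) drops it to about 0.67 and a weak draft (`alpha = 0.1`, `c = 0.05`) to about 0.93, both slower than plain decoding.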
Key idea
Speculative decoding lets a small draft model propose tokens that the large model verifies in one parallel pass, accepting a prefix so output matches target sampling while cutting latency.