The bottleneck
Large language models generate one token per forward pass, and each pass through a large model is slow. Yet the next token is often easy to predict, and a much smaller model would have guessed it correctly. Speculative decoding exploits this gap.
How it works
A small, fast draft model proposes several tokens ahead. The large target model then checks all of the proposed tokens in a single forward pass:
- The draft model writes a short guess of several tokens
- The target model scores that whole guess in one pass
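- Drafted tokens are accepted from left to right as long as they agree with the target (an exact match under greedy decoding, a probabilistic acceptance test under sampling)
- At the first rejected token, the target supplies its own token instead, the rest of the guess is discarded, and drafting restarts from there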
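The steps above can be sketched for the greedy case, where "match" means the drafted token equals the target's top choice. This is a minimal illustration under stated assumptions, not a production implementation. `draft` and `target` are hypothetical stand-in functions that map a token prefix to the next token, `k` is the number of drafted tokens, and the list comprehension over target calls stands in for what a real system would compute in one batched forward pass. The sketch also emits one extra target token when every drafted token is accepted, which is common practice but an addition to the steps listed above.

```python
from typing import Callable, List

# Assumption: a "model" is any function from a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]

def speculative_step(prefix: List[int], draft: NextToken,
                     target: NextToken, k: int) -> List[int]:
    """Run one draft-then-verify round and return the tokens it emits."""
    # 1. The draft model writes a guess of k tokens, one at a time.
    guess: List[int] = []
    for _ in range(k):
        guess.append(draft(prefix + guess))

    # 2. The target's choice at each of the k + 1 positions. A real system
    #    would get all of these from a single batched forward pass.
    target_choice = [target(prefix + guess[:i]) for i in range(k + 1)]

    # 3. Accept drafted tokens until the first mismatch, then emit the
    #    target's token at that position and stop.
    emitted: List[int] = []
    for i in range(k):
        if guess[i] != target_choice[i]:
            emitted.append(target_choice[i])
            return emitted
        emitted.append(guess[i])

    # Every drafted token matched, so the target's last prediction is a bonus.
    emitted.append(target_choice[k])
    return emitted

def generate(prefix: List[int], draft: NextToken, target: NextToken,
             k: int, n_tokens: int) -> List[int]:
    """Repeat speculative rounds until n_tokens new tokens exist."""
    tokens = list(prefix)
    while len(tokens) - len(prefix) < n_tokens:
        tokens.extend(speculative_step(tokens, draft, target, k))
    return tokens[: len(prefix) + n_tokens]

def greedy_decode(prefix: List[int], model: NextToken, n_tokens: int) -> List[int]:
    """Plain one-token-per-call decoding, used as the reference."""
    tokens = list(prefix)
    for _ in range(n_tokens):
        tokens.append(model(tokens))
    return tokens

# Hypothetical toy models: the draft agrees with the target except when the
# prefix length is a multiple of 4.
def toy_target(prefix: List[int]) -> int:
    return (7 * sum(prefix) + len(prefix)) % 50

def toy_draft(prefix: List[int]) -> int:
    token = toy_target(prefix)
    return token if len(prefix) % 4 else (token + 1) % 50
```

Calling `generate([1, 2, 3], toy_draft, toy_target, k=4, n_tokens=20)` yields the same token list as `greedy_decode([1, 2, 3], toy_target, 20)`, because every emitted token is one the target chose for the prefix before it.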
Why it is exact
The acceptance test is designed so that the final output has exactly the same distribution as decoding from the target model alone. Speculative decoding is therefore a speed optimization, not an approximation: output quality is identical to that of the target model running by itself.
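The text does not say which acceptance test is meant, so the sketch below assumes the rule commonly used for speculative sampling: accept a drafted token `x` with probability `min(1, p[x] / q[x])`, and on rejection resample from the normalized residual `max(0, p - q)`. Here `p` and `q` are hypothetical next-token distributions from the target and the draft, and the function names are illustrative. `output_distribution` computes the emitted token's distribution exactly, which makes it easy to check that it equals `p`.

```python
import random
from typing import List

def verify_token(x: int, p: List[float], q: List[float],
                 rng: random.Random) -> int:
    """Return the token to emit, given draft token x that was sampled from q."""
    # Accept x with probability min(1, p[x] / q[x]); q[x] > 0 since x came from q.
    if rng.random() < min(1.0, p[x] / q[x]):
        return x
    # Rejected: resample from the part of p that q under-covers.
    # rng.choices normalizes the weights itself.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    return rng.choices(range(len(p)), weights=residual)[0]

def output_distribution(p: List[float], q: List[float]) -> List[float]:
    """Exact distribution of the token emitted by one accept/reject step."""
    # P(draft proposes x and x is accepted) = q[x] * min(1, p[x]/q[x]) = min(p[x], q[x]).
    accepted = [min(pi, qi) for pi, qi in zip(p, q)]
    reject_prob = 1.0 - sum(accepted)
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(residual)
    if total == 0.0:
        # p and q are identical, so nothing is ever rejected.
        return accepted
    return [a + reject_prob * r / total for a, r in zip(accepted, residual)]
```

For example, with `p = [0.5, 0.3, 0.2]` and `q = [0.2, 0.2, 0.6]`, `output_distribution(p, q)` reproduces `p` up to floating-point rounding, even though the draft distribution is quite different from the target's.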
The payoff
When the draft model is accurate, many tokens are accepted per target pass, so the same text takes far fewer target passes. When it guesses poorly, fewer are accepted and much of the drafting work is wasted, so the speedup shrinks, but correctness is never compromised because the target always has the final say.
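A rough way to quantify this, under a simplifying assumption that is not from the text: suppose each drafted token is accepted independently with the same probability `alpha`, and `k` tokens are drafted per round. Each round then emits one token for certain (the target's own), plus one more for every drafted token accepted before the first rejection. The function name and parameters below are illustrative.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per target pass under an i.i.d. acceptance model."""
    # The (i + 1)-th token of a round is emitted only if the first i drafted
    # tokens were all accepted, which happens with probability alpha ** i.
    return sum(alpha ** i for i in range(k + 1))
```

With `alpha = 0.8` and `k = 4` this comes to about 3.4 tokens per target pass, while `alpha = 0` gives exactly 1, the same as ordinary decoding. The actual wall-clock speedup is smaller than this ratio because the draft model's own running time is not free.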
Key idea
Speculative decoding lets a small model guess ahead while the big model verifies in one pass, speeding generation with identical output quality.