Machine Learning

The Speculative Decoding Deep Dive

Using a small draft model to propose tokens that a big model verifies in parallel.

6 min read · advanced

The latency problem

Autoregressive decoding generates one token per forward pass of the large model, so each token pays full model latency. Speculative decoding breaks this serial bottleneck without changing the output distribution.

The draft and verify loop

  • A small fast draft model proposes several tokens ahead.
  • The large target model scores all proposed tokens in a single parallel pass.
  • A verification rule accepts a prefix of the draft and rejects the rest.
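
The three steps above can be sketched as a short loop. This is a minimal illustration, not a production implementation: it uses the greedy variant of verification (a draft token is accepted when it equals the target's top choice), and the `toy_target` and `toy_draft` callables are hypothetical stand-ins for real models. In a real system the target would score all prefixes in one batched forward pass; here the toy callable is simply invoked once per prefix.

```python
from typing import Callable, List

Token = int
# Hypothetical model interface: context -> greedy next token.
NextToken = Callable[[List[Token]], Token]


def speculative_greedy(target: NextToken, draft: NextToken,
                       prompt: List[Token], num_new: int, k: int = 4) -> List[Token]:
    """Greedy draft-and-verify; output equals plain greedy decoding with `target`."""
    out = list(prompt)
    goal = len(prompt) + num_new
    while len(out) < goal:
        # 1. Draft: the small model proposes k tokens autoregressively.
        proposal: List[Token] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))

        # 2. Verify: the target gives its own choice after every prefix.
        #    A real model would do this in a single parallel pass.
        checks = [target(out + proposal[:i]) for i in range(k + 1)]

        # 3. Accept the longest prefix on which draft and target agree.
        n = 0
        while n < k and proposal[n] == checks[n]:
            n += 1

        # The target's token at position n is always kept, so every
        # iteration emits at least one token.
        out.extend(proposal[:n])
        out.append(checks[n])
    return out[:goal]


# Toy models (assumptions for illustration): the target counts upward mod 10;
# the draft agrees except right after the token 4.
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


def plain_greedy(model: NextToken, prompt: List[Token], num_new: int) -> List[Token]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(num_new):
        out.append(model(out))
    return out
```

Comparing `speculative_greedy(toy_target, toy_draft, [0], 12)` with `plain_greedy(toy_target, [0], 12)` gives the same sequence, while the speculative version needs fewer verification rounds than the baseline needs target calls.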

Why the output stays correct

The acceptance test compares target and draft probabilities so that every emitted token is distributed exactly as if it had been sampled from the target model alone.

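  • Each draft token x is accepted with probability min(1, p(x) / q(x)), where p is the target probability and q the draft probability of that token.
  • On the first rejection, the remaining draft tokens are discarded and a replacement token is sampled from the residual distribution, proportional to max(0, p − q).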
  • This guarantees the final samples match plain target sampling.
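
A minimal sketch of this acceptance rule follows, assuming each distribution is given as a plain dictionary from token to probability. The names `verify`, `sample`, `p`, and `q` are chosen here for illustration and do not come from any particular library. The sketch also emits one extra token from the target when every draft token is accepted, which is a common convention.

```python
import random
from typing import Dict, List, Sequence

Dist = Dict[str, float]  # token -> probability


def sample(weights: Dist, rng: random.Random) -> str:
    """Draw one token with probability proportional to its weight."""
    tokens = list(weights)
    return rng.choices(tokens, weights=[weights[t] for t in tokens], k=1)[0]


def verify(drafted: Sequence[str], q: Sequence[Dist], p: Sequence[Dist],
           rng: random.Random) -> List[str]:
    """Accept a prefix of `drafted`, then emit exactly one more token.

    q[i] is the draft distribution and p[i] the target distribution at
    position i. p needs one extra entry for the case where all drafts pass.
    """
    out: List[str] = []
    for i, tok in enumerate(drafted):
        # Accept with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p[i].get(tok, 0.0) / q[i][tok]):
            out.append(tok)
            continue
        # First rejection: sample from the residual max(0, p - q).
        # rng.choices normalises the weights, so no explicit division is needed.
        residual = {t: max(0.0, pt - q[i].get(t, 0.0)) for t, pt in p[i].items()}
        out.append(sample(residual, rng))
        return out
    # Every draft token was accepted: take a bonus token from the target.
    out.append(sample(p[len(drafted)], rng))
    return out
```

As a sanity check, take q = {"a": 0.9, "b": 0.1} and p = {"a": 0.5, "b": 0.5}. A drafted "a" is accepted with probability 0.5 / 0.9, so "a" is emitted with probability 0.9 × 0.5 / 0.9 = 0.5. Every other path (an accepted "b", or a rejected "a" replaced from the residual, which puts all its mass on "b") emits "b", also with total probability 0.5. The output matches p even though the draft strongly prefers "a".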

What governs the speedup

  • The gain grows with the acceptance rate, which is higher when draft and target agree.
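  • A draft model that is too large erases the savings through its own latency; one that is too weak gets few of its tokens accepted.
  • Variants draft with the target model itself, add multiple prediction heads, or verify a tree of candidate drafts to raise the number of tokens accepted per pass.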
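
The trade-off above can be made concrete with a simplified model. It assumes each draft token is accepted independently with the same probability `alpha`, that `gamma` tokens are drafted per round, and that one draft step costs `cost_ratio` times one target pass. These are idealisations (real acceptance rates vary by position and context), so the numbers are rough estimates rather than measurements. Under those assumptions a round emits 1 + alpha + ... + alpha^gamma tokens on average and costs gamma × cost_ratio + 1 target-pass equivalents.

```python
def expected_tokens_per_round(alpha: float, gamma: int) -> float:
    """Mean tokens emitted per target pass: sum of alpha**i for i in 0..gamma."""
    if alpha >= 1.0:
        return float(gamma + 1)  # every draft accepted, plus the bonus token
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Tokens per round divided by the round's cost in target-pass units."""
    return expected_tokens_per_round(alpha, gamma) / (gamma * cost_ratio + 1.0)


if __name__ == "__main__":
    # Illustrative settings only; alpha and cost_ratio are assumed values.
    for label, alpha, cost in [("good small draft", 0.8, 0.05),
                               ("draft too large", 0.8, 0.50),
                               ("draft too weak", 0.2, 0.05)]:
        print(f"{label}: {estimated_speedup(alpha, 4, cost):.2f}x")
```

With `gamma = 4`, the well-matched small draft comes out well ahead of both the oversized draft and the weak draft, which is the pattern the bullets describe.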

Key idea

Speculative decoding lets a small draft model propose tokens that the large model verifies in one parallel pass, accepting a prefix so output matches target sampling while cutting latency.

Check yourself

1. What does the small draft model do in speculative decoding?

2. Why does speculative decoding preserve the target output distribution?

3. What most governs the speedup from speculative decoding?