The Speculative Decoding Deep Dive

Using a small draft model to propose tokens that a big model verifies in parallel.

The latency problem

Autoregressive decoding generates one token per forward pass of the large model, so each token pays full model latency. Speculative decoding breaks this serial bottleneck without changing the output distribution.

The draft and verify loop

A small fast draft model proposes several tokens ahead.
The large target model scores all proposed tokens in a single parallel pass.
A verification rule accepts a prefix of the draft and rejects the rest.
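
The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: it assumes greedy decoding (so "verify" reduces to checking whether the target would have picked the same token), and the `NextToken` interface, the function name `speculative_greedy`, and the parameter `k` are hypothetical names chosen for the example.

```python
from typing import Callable, List

# Hypothetical interface: a "model" maps a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_greedy(target: NextToken, draft: NextToken,
                       prompt: List[int], n_new: int, k: int = 4) -> List[int]:
    """Generate n_new tokens, drafting k tokens per verification step."""
    tokens = list(prompt)
    goal = len(prompt) + n_new
    while len(tokens) < goal:
        # 1. Draft: propose k tokens autoregressively with the cheap model.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))

        # 2. Verify: the target's choice at each of the k + 1 positions.
        #    A real target model computes these in one batched forward pass;
        #    this toy version calls it once per position for clarity.
        checks = [target(tokens + proposal[:i]) for i in range(k + 1)]

        # 3. Accept the longest prefix on which draft and target agree.
        n_ok = 0
        while n_ok < k and proposal[n_ok] == checks[n_ok]:
            n_ok += 1

        # The target's own token at position n_ok comes for free: it is the
        # correction after a mismatch, or a bonus token if all k matched.
        tokens += proposal[:n_ok] + [checks[n_ok]]
    return tokens[:goal]
```

Every iteration emits at least one token chosen by the target, so the loop always makes progress, and under greedy decoding the result is identical to decoding with the target alone.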

Why the output stays correct

The acceptance test uses the target model's probabilities in such a way that every emitted token is distributed exactly according to the target model, not just approximately or on average.

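Each draft token x is accepted with probability min(1, p(x) / q(x)), where p(x) and q(x) are the probabilities the target and the draft assign to x.
On the first rejection, a corrected token is sampled from the residual distribution proportional to max(0, p - q), and the remaining draft tokens are discarded.
Together these two steps guarantee that the final samples have the same distribution as plain sampling from the target.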
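
The accept-or-resample step for a single position can be written out directly. The sketch below assumes the target and draft distributions are given as plain lists of probabilities over the same vocabulary, and that the draft token was sampled from the draft distribution (so its draft probability is nonzero). The function name `accept_or_resample` is a hypothetical name for this example.

```python
import random
from typing import Sequence


def accept_or_resample(p: Sequence[float], q: Sequence[float],
                       x: int, rng: random.Random) -> int:
    """Return the token to emit at one position.

    p: target probabilities, q: draft probabilities,
    x: the draft token, assumed to have been sampled from q.
    """
    # Accept the draft token with probability min(1, p[x] / q[x]).
    if rng.random() < min(1.0, p[x] / q[x]):
        return x
    # Rejected, so p[x] < q[x] and p differs from q: the residual
    # max(0, p - q) has positive mass. random.choices normalises weights.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    return rng.choices(range(len(p)), weights=residual)[0]
```

As a quick check by hand, take p = [0.6, 0.3, 0.1] and q = [0.2, 0.5, 0.3]. Token 0 is always accepted when drafted (probability 0.2), tokens 1 and 2 are accepted with total probability 0.3 and 0.1, and the remaining 0.4 of rejected mass is all resampled as token 0, giving 0.6, 0.3, 0.1 overall, which is exactly p.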

What governs the speedup

The gain grows with the acceptance rate, which is higher when draft and target agree.
A draft model that is too large costs so much per proposal that it erases the savings; one that is too weak has few of its tokens accepted.
Variants draft with the target model itself, with multiple extra prediction heads, or with a tree of candidate drafts, all aiming to raise the number of tokens accepted per target pass.
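
The trade-off between acceptance rate and draft cost can be made concrete with a simplified model. The sketch below assumes each draft token is accepted independently with the same probability `alpha`, that verifying `k` draft tokens costs the same as one ordinary target pass, and that `cost_ratio` is the time of one draft pass divided by the time of one target pass. These are modelling assumptions for illustration; real acceptance rates vary by position and content, and the function names are hypothetical.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per target pass when k tokens are drafted.

    Sums alpha**0 + alpha**1 + ... + alpha**k: the accepted prefix plus the
    one token the target always contributes (correction or bonus).
    """
    if alpha == 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)


def estimated_speedup(alpha: float, k: int, cost_ratio: float) -> float:
    """Estimated speedup over plain decoding under the assumptions above.

    One step costs k draft passes plus one target pass, measured in units
    of a single target pass.
    """
    return expected_tokens_per_pass(alpha, k) / (k * cost_ratio + 1)
```

Under this model, an acceptance rate of 0.8 with k = 4 and a draft that costs 5% of the target gives an estimate above 1 (a speedup), while the same acceptance rate with a draft as expensive as the target gives an estimate below 1 (a slowdown), which is the "too large erases the savings" case.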

Key idea