The latency bottleneck
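Autoregressive large language models generate one token per forward pass, and each pass through a large model is slow. Because every token depends on the tokens before it, the passes must run one after another, so latency is dominated by the number of sequential passes through the big model. Speculative decoding attacks exactly this count.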
The draft and verify trick
A small, fast draft model guesses several tokens ahead. The large target model then checks all of those guesses in a single forward pass. Guesses that match what the target would have chosen are accepted; at the first mismatch the target's own token is used instead, and the remaining guesses are discarded.
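The draft-and-verify loop described above can be sketched in a few lines. This is a minimal illustration for greedy decoding only, not a reference implementation: the "models" are hypothetical stand-in functions that map a token sequence to the next token, and the names speculative_decode, greedy_decode, toy_target, toy_draft, and the parameter k are all assumptions made for this example. A real target model would score all k + 1 positions in one batched forward pass; the list comprehension below only simulates that.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for a model: given the tokens so far, return the
# single next token it would pick greedily.
NextToken = Callable[[List[int]], int]


def greedy_decode(target: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Baseline: one target pass per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target(out))
    return out


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       max_new: int, k: int = 4) -> Tuple[List[int], int]:
    """Greedy speculative decoding. Returns (tokens, simulated target passes)."""
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < max_new:
        # 1. The draft proposes k tokens, one after another (cheap).
        guesses: List[int] = []
        for _ in range(k):
            guesses.append(draft(out + guesses))
        # 2. The target picks its own token after each guessed prefix.
        #    Counted as ONE pass, since a real model scores these together.
        passes += 1
        choices = [target(out + guesses[:i]) for i in range(k + 1)]
        # 3. Accept guesses up to the first mismatch.
        n = 0
        while n < k and guesses[n] == choices[n]:
            n += 1
        # 4. Keep the accepted guesses plus the target's token at position n:
        #    the correction, or an extra token if all k guesses matched.
        out.extend(guesses[:n] + [choices[n]])
    # The last round may overshoot; trim to the requested length.
    return out[:len(prompt) + max_new], passes


def toy_target(seq: List[int]) -> int:
    """Toy deterministic 'target model' (an assumption for this sketch)."""
    return (seq[-1] * 3 + 1) % 11


def toy_draft(seq: List[int]) -> int:
    """Toy 'draft model': agrees with the target except at every third position."""
    guess = toy_target(seq)
    return (guess + 1) % 11 if len(seq) % 3 == 0 else guess


if __name__ == "__main__":
    tokens, passes = speculative_decode(toy_target, toy_draft, [2], max_new=20, k=4)
    print("same as target-only decoding:", tokens == greedy_decode(toy_target, [2], 20))
    print("target passes used:", passes, "instead of 20")
```

In this toy setup the speculative output is identical to plain greedy decoding with the target, while the target is invoked fewer times. How much that saves in practice depends on the relative cost of the draft and target, which the sketch does not model.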
Why it is correct
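- The target verifies every guessed token, so the final output matches what the target alone would produce: token for token under greedy decoding, and in distribution when sampling with the standard acceptance rule.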
- Quality is preserved exactly; only speed changes.
- Several tokens can be accepted from one expensive target pass.
When it wins
Speculation pays off when the draft model agrees with the target often and costs much less to run. Easy, predictable text yields long accepted runs and big speedups. Hard text yields more rejections and smaller gains, possibly none once the cost of drafting is counted, but never wrong output.
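To make the link between agreement and speedup concrete, here is a rough back-of-the-envelope model. It rests on a simplifying assumption that is not part of the text above: each drafted token is accepted independently with the same probability alpha. Under that assumption, with k drafted tokens per round, the expected number of tokens produced per target pass is 1 + alpha + alpha^2 + ... + alpha^k (the accepted guesses plus the one token the target always contributes). The function name and parameters are illustrative, and the model ignores the cost of running the draft.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per target pass.

    Assumes (for illustration only) that each of the k drafted tokens is
    accepted independently with probability alpha. Guess i survives only if
    all guesses before it did, which happens with probability alpha**i; the
    i = 0 term is the token the target always contributes itself.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    return sum(alpha ** i for i in range(k + 1))


if __name__ == "__main__":
    # Compare a draft that rarely agrees with one that agrees most of the time.
    for alpha in (0.3, 0.6, 0.9):
        print(f"alpha={alpha}: {expected_tokens_per_pass(alpha, k=4):.2f} tokens per target pass")
```

When alpha is 0 the result is 1 token per pass, which is ordinary decoding; when alpha is 1 it is k + 1. Real acceptance rates vary with context, so this is a guide to the trend rather than a prediction.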
Key idea
Speculative decoding uses a cheap draft model to propose tokens and an expensive target model to verify many of them at once. It cuts latency while preserving the target's output exactly, with gains that grow the more often the draft agrees with the target.