Speculative Decoding

The bottleneck

Large language models generate text one token per forward pass, and each token depends on the ones before it, so the passes cannot run in parallel. Each pass over a big model is slow. Yet most of the time the next token is easy, and even a much smaller model would have guessed it. Speculative decoding exploits this gap.

How it works

A small fast draft model proposes several tokens ahead. The large target model then checks all of those proposed tokens in a single forward pass.

The draft model writes a short guess of several tokens
The target model scores that whole guess in one pass
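Drafted tokens are accepted in order for as long as they pass the target's acceptance test (in the simplest, greedy case: they match what the target would have chosen)
At the first rejection, the rest of the draft is discarded, the target supplies the correct token, and drafting restarts from there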
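
The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production implementation: it assumes greedy decoding, and the names `speculative_step`, `generate`, `toy_target`, and `toy_draft` are hypothetical. The toy "models" are plain functions over integer tokens, and the target is called in a loop for clarity where a real system would score all positions in one batched forward pass.

```python
from typing import Callable, List

# Hypothetical interface: a "model" maps a token sequence to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft_next: NextToken,
                     target_next: NextToken, k: int) -> List[int]:
    """Run one draft-and-verify round and return the newly produced tokens."""
    # 1. The draft model writes a short guess of k tokens.
    draft: List[int] = []
    for _ in range(k):
        draft.append(draft_next(prefix + draft))

    # 2. The target's own choice at each of the k + 1 positions.
    #    (Assumption: a real target computes these in a single forward pass.)
    target_choices = [target_next(prefix + draft[:i]) for i in range(k + 1)]

    # 3. Accept drafted tokens for as long as they match the target's choice.
    accepted: List[int] = []
    for i in range(k):
        if draft[i] != target_choices[i]:
            break
        accepted.append(draft[i])

    # 4. The target supplies the token at the first mismatch, or one extra
    #    token after the draft if every drafted token was accepted.
    accepted.append(target_choices[len(accepted)])
    return accepted


def generate(prefix: List[int], draft_next: NextToken,
             target_next: NextToken, k: int, n_new: int) -> List[int]:
    """Repeat draft-and-verify rounds until n_new tokens have been produced."""
    out = list(prefix)
    while len(out) - len(prefix) < n_new:
        out.extend(speculative_step(out, draft_next, target_next, k))
    return out[:len(prefix) + n_new]


def toy_target(seq: List[int]) -> int:
    """Toy target: counts upward modulo 10."""
    return (seq[-1] + 1) % 10


def toy_draft(seq: List[int]) -> int:
    """Toy draft: agrees with the target except right after the token 4."""
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10


if __name__ == "__main__":
    print(speculative_step([1], toy_draft, toy_target, k=4))
    print(generate([0], toy_draft, toy_target, k=3, n_new=12))
```

Every round yields at least one token, because the target's own choice is always appended. In this greedy setting the result of `generate` is the same sequence the target would produce on its own, whatever the draft proposes.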

Why it is exact

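The acceptance test is designed so the final output has the same distribution as decoding from the target model alone. Speculative decoding is a speed trick, not an approximation. With greedy decoding the accepted tokens are exactly the ones the target would have produced; with sampling, the distribution over outputs is the same. Either way, the quality is identical to the big model running by itself.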
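
The text above does not spell out the acceptance test. As an assumption for illustration, the sketch below uses the rule commonly described for speculative sampling: a drafted token x is accepted with probability min(1, p(x)/q(x)), where p is the target's distribution and q is the draft's, and on rejection a replacement is drawn from the normalized residual max(0, p - q). The function names `verify_token` and `output_distribution` are hypothetical. The second function computes the distribution of the emitted token analytically, which makes it easy to check that it equals p.

```python
import random
from typing import List


def verify_token(x: int, p: List[float], q: List[float],
                 rng: random.Random) -> int:
    """Accept drafted token x, or resample a replacement from the residual.

    p: target probabilities, q: draft probabilities over the same vocabulary.
    x must have been drawn from q, so q[x] > 0.
    """
    if rng.random() < min(1.0, p[x] / q[x]):
        return x  # accepted
    # Rejected: draw from max(0, p - q); choices() normalizes the weights.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    return rng.choices(range(len(p)), weights=residual)[0]


def output_distribution(p: List[float], q: List[float]) -> List[float]:
    """Exact distribution of the token emitted by draft-then-verify."""
    # P(x drafted and accepted) = q(x) * min(1, p(x)/q(x)) = min(p(x), q(x)).
    accepted = [min(pi, qi) for pi, qi in zip(p, q)]
    reject_prob = 1.0 - sum(accepted)
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(residual)
    if total == 0.0:  # p and q are identical, so nothing is ever rejected
        return accepted
    return [a + reject_prob * r / total for a, r in zip(accepted, residual)]


if __name__ == "__main__":
    p = [0.5, 0.3, 0.2]  # target distribution (made-up numbers)
    q = [0.2, 0.2, 0.6]  # draft distribution (made-up numbers)
    print(output_distribution(p, q))  # matches p up to rounding
```

The reason this works: the rejection probability 1 - sum(min(p, q)) equals the total residual mass sum(max(0, p - q)), so the two terms add up to min(p, q) + max(0, p - q) = p for every token.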

The payoff

When the draft model is accurate, several tokens are accepted per target pass, so the expensive model runs far fewer times per generated token. When it guesses poorly, fewer are accepted and the drafting work is wasted, so the speedup shrinks, but correctness is never compromised because the target always has the final say.
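
To put rough numbers on this, here is a small sketch under a simplifying assumption that is not stated in the text: each drafted token is accepted independently with the same probability `alpha`. With a draft of `k` tokens, the expected number of tokens produced per target pass is then 1 + alpha + alpha^2 + ... + alpha^k. The function name `expected_tokens_per_pass` is hypothetical.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens produced per target pass with a draft of k tokens.

    Assumes each drafted token is accepted independently with probability
    alpha. The count includes the one token the target always supplies.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        return float(k + 1)  # every drafted token accepted, plus one extra
    # Closed form of the geometric sum 1 + alpha + ... + alpha**k.
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


if __name__ == "__main__":
    for alpha in (0.0, 0.5, 0.8, 1.0):
        print(alpha, expected_tokens_per_pass(alpha, k=4))
```

This counts tokens per target pass only. The wall-clock speedup is smaller, because the draft model's own passes also take time.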

Key idea

Speculative decoding lets a small model guess ahead while the big model verifies in one pass, speeding generation with identical output quality.
