Machine Learning

Speculative Decoding For Latency

Let a small model guess ahead and a big model verify in one pass.

The latency bottleneck

Large language models generate text autoregressively: each new token needs its own forward pass, and that pass cannot start until the previous token is known. Latency is therefore dominated by the number of sequential passes through the big model. Speculative decoding attacks exactly this count.

The draft and verify trick

A small, fast draft model guesses several tokens ahead. The large target model then scores all of those guesses in a single forward pass. Guesses that match what the target would have chosen are accepted in order; at the first mismatch, the target's own token replaces the guess and the remaining guesses are discarded. Every verification pass therefore yields at least one token.
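The draft-and-verify loop can be sketched for the greedy case as follows. This is a minimal illustration under stated assumptions, not a production implementation: `toy_target` and `toy_draft` are hypothetical stand-ins for real models, and the single batched target pass is simulated by calling the target once per position while counting it as one pass.

```python
from typing import Callable, List, Tuple

# A "model" here maps a token prefix to its greedy next token (illustrative type).
Model = Callable[[List[int]], int]


def toy_target(prefix: List[int]) -> int:
    # Hypothetical target model: an arbitrary deterministic rule.
    return (3 * sum(prefix) + len(prefix)) % 11


def toy_draft(prefix: List[int]) -> int:
    # Hypothetical draft model: agrees with the target except at every 5th position.
    guess = toy_target(prefix)
    return (guess + 1) % 11 if len(prefix) % 5 == 0 else guess


def greedy_decode(target: Model, prompt: List[int], num_new: int) -> List[int]:
    # Baseline: one target pass per generated token.
    tokens = list(prompt)
    for _ in range(num_new):
        tokens.append(target(tokens))
    return tokens


def speculative_decode(
    target: Model, draft: Model, prompt: List[int], num_new: int, k: int
) -> Tuple[List[int], int]:
    tokens = list(prompt)
    goal = len(prompt) + num_new
    target_passes = 0
    while len(tokens) < goal:
        # 1. The draft proposes k tokens, one after another.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))
        # 2. The target scores all k + 1 positions. A real system does this in
        #    one batched forward pass; here it is simulated and counted as one.
        target_passes += 1
        verified = [target(tokens + proposal[:i]) for i in range(k + 1)]
        # 3. Accept the longest matching run, then append the target's own token
        #    (the correction at the first mismatch, or a bonus token if all matched).
        n = 0
        while n < k and proposal[n] == verified[n]:
            n += 1
        tokens.extend(proposal[:n])
        tokens.append(verified[n])
    return tokens[:goal], target_passes


if __name__ == "__main__":
    out, passes = speculative_decode(toy_target, toy_draft, [1, 2, 3], num_new=20, k=4)
    print(out == greedy_decode(toy_target, [1, 2, 3], 20), passes)
```

Because step 3 always appends a token chosen by the target, each pass makes progress even when every guess is rejected.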

Why it is correct

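  • The target verifies every guessed token, so with greedy decoding the final output is identical to what the target alone would produce. With sampling, an accept/reject rule keeps the output distribution the same as the target's.
  • Quality is preserved; only speed changes.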
  • Several tokens can be accepted from one expensive target pass.
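The bullets above describe the greedy case, where "matching" means equality. When tokens are sampled, one commonly described rule accepts a drafted token x with probability min(1, p(x)/q(x)), where p is the target's distribution and q is the draft's, and on rejection resamples from the normalised positive part of p - q. The sketch below is an illustration of that rule under these assumptions (the function name and example distributions are hypothetical); it computes the resulting output distribution exactly so it can be compared with p.

```python
from typing import List


def speculative_output_distribution(p: List[float], q: List[float]) -> List[float]:
    """Exact distribution of one token emitted by the accept/reject rule.

    p: target probabilities, q: draft probabilities (same vocabulary, each sums to 1).
    """
    # Probability of drafting x AND accepting it: q(x) * min(1, p(x)/q(x)) = min(p(x), q(x)).
    accepted = [min(pi, qi) for pi, qi in zip(p, q)]
    reject_mass = 1.0 - sum(accepted)
    # On rejection, resample from the normalised positive part of p - q.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(residual)
    if total == 0.0:
        # Draft and target agree exactly, so nothing is ever rejected.
        return accepted
    return [a + reject_mass * r / total for a, r in zip(accepted, residual)]


if __name__ == "__main__":
    p = [0.5, 0.3, 0.2]  # illustrative target distribution
    q = [0.2, 0.2, 0.6]  # illustrative draft distribution
    print(speculative_output_distribution(p, q))
```

For any pair of distributions the result equals p up to floating-point error, which is why the draft changes speed but not what the target would have sampled.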

When it wins

Speculation pays off when the draft model agrees with the target often. Easy, predictable text yields long accepted runs and big speedups. Hard text yields more rejections and smaller gains, and if rejections are frequent enough the cost of running the draft can cancel the gain, but the output is never wrong.
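To make the link between agreement and speedup concrete, here is a rough back-of-the-envelope sketch. It assumes, as a simplification, that each drafted token is accepted independently with the same probability `alpha`, and that the draft proposes `k` tokens per round; both names are illustrative. Under that assumption the expected number of tokens produced per target pass is the geometric sum 1 + alpha + ... + alpha^k.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per target pass.

    Assumes each of the k drafted tokens is accepted independently with
    probability alpha (a simplifying assumption, not a measured property).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        return float(k + 1)  # every guess accepted, plus one bonus token
    # Closed form of the geometric sum 1 + alpha + ... + alpha**k.
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


if __name__ == "__main__":
    for alpha in (0.3, 0.6, 0.9):
        print(alpha, round(expected_tokens_per_pass(alpha, k=4), 3))
```

This counts target passes only. It ignores the time spent running the draft model, so it is an upper bound on the wall-clock speedup rather than a prediction of it.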

Key idea

Speculative decoding uses a cheap draft model to propose tokens and an expensive target model to verify many at once. It cuts latency while preserving the target output exactly, with gains that grow when the draft agrees often.

Check yourself

1. How does speculative decoding preserve output quality?

2. When does speculative decoding give the biggest speedup?