Beam Search

The decoding problem

When a model generates text, greedily picking the single best token at each step can miss a better overall sequence. Beam search keeps several candidate sequences alive at once to search more thoroughly.

How it works

The beam width sets how many partial sequences to keep
At each step every candidate is extended by all possible next tokens
The combined scores are ranked and only the top width sequences survive
This repeats until sequences end

Because it tracks multiple paths, beam search often finds a sequence with higher total probability than greedy decoding.

Trade offs

Beam search is strong for tasks with one clearly correct answer, like translation. For open ended generation it tends to be bland and repetitive, since high probability text is often generic. It can also favor short outputs unless a length penalty is applied. Modern chat models often prefer sampling for variety, reserving beam search for precision tasks.

Key idea

Beam search keeps the top few candidate sequences at each step, trading more computation for higher probability outputs than greedy decoding.

The decoding problem

How it works

Trade offs

Key idea

Check yourself