The decoding problem
When a model generates text, greedily picking the single best token at each step can miss a better overall sequence. Beam search keeps several candidate sequences alive at once to search more thoroughly.
How it works
- The beam width sets how many partial sequences to keep
- At each step every candidate is extended by all possible next tokens
- The combined scores are ranked and only the top width sequences survive
- This repeats until sequences end
Because it tracks multiple paths, beam search often finds a sequence with higher total probability than greedy decoding.
Trade offs
Beam search is strong for tasks with one clearly correct answer, like translation. For open ended generation it tends to be bland and repetitive, since high probability text is often generic. It can also favor short outputs unless a length penalty is applied. Modern chat models often prefer sampling for variety, reserving beam search for precision tasks.
Key idea
Beam search keeps the top few candidate sequences at each step, trading more computation for higher probability outputs than greedy decoding.