Nucleus Sampling

The decoding choice

When a language model generates text, at each step it produces a probability distribution over the whole vocabulary. Always picking the most likely token gives dull, repetitive text, while sampling from the full distribution lets rare nonsense slip in. Nucleus sampling, also called top p, finds a middle path.

How top p works

Instead of a fixed number of candidates, nucleus sampling keeps a dynamic set:

Sort tokens from most to least likely
Add tokens to the nucleus until their cumulative probability reaches a threshold p, such as 0.9
Sample only from that nucleus, ignoring the long tail
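
The three steps above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not any particular library's implementation; the function names nucleus and sample_top_p are hypothetical, and probs is assumed to be a list of token probabilities that already sums to 1.

```python
import random

def nucleus(probs, p=0.9):
    """Return indices of the smallest top-ranked set whose total probability reaches p."""
    # Step 1: rank token indices from most to least likely.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, total = [], 0.0
    # Step 2: add tokens until the cumulative probability reaches p.
    for i in order:
        kept.append(i)
        total += probs[i]
        if total >= p:
            break
    return kept

def sample_top_p(probs, p=0.9, rng=random):
    """Draw one token index from the nucleus only."""
    kept = nucleus(probs, p)
    # Step 3: sample within the nucleus; random.choices rescales the
    # weights itself, so explicit renormalization is not needed here.
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]

probs = [0.5, 0.3, 0.15, 0.05]
print(nucleus(probs, p=0.9))       # [0, 1, 2]
print(sample_top_p(probs, p=0.9))  # one of 0, 1, 2
```

With these example probabilities the first two tokens cover only 0.8, so the third is needed to reach 0.9, and the last token (the tail) can never be sampled.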

When the model is confident, the nucleus is tiny and sampling is almost greedy. When it is uncertain, the nucleus grows to allow creative variety. A fixed top k keeps the same number of candidates in both cases, and this adaptivity is why top p often beats top k sampling.
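
To make the adaptivity concrete, here is a small sketch that counts how many tokens land in the nucleus for two made-up distributions. The helper nucleus_size and both probability lists are illustrative assumptions, not output from a real model.

```python
def nucleus_size(probs, p=0.9):
    """Count how many top-ranked tokens are needed to reach cumulative probability p."""
    total = 0.0
    for count, prob in enumerate(sorted(probs, reverse=True), start=1):
        total += prob
        if total >= p:
            return count
    return len(probs)  # p was never reached (e.g. rounding), so keep everything

confident = [0.92, 0.04, 0.02, 0.01, 0.01]  # one token dominates
uncertain = [0.25, 0.22, 0.20, 0.18, 0.15]  # probability is spread out

print(nucleus_size(confident))  # 1
print(nucleus_size(uncertain))  # 5
```

A top k setting of 3 would keep three candidates in both cases: two unlikely extras for the confident distribution, and two plausible tokens cut from the uncertain one.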

Tuning

A higher p means more diversity and more risk; a lower p means safer but blander text. Top p pairs naturally with temperature, which reshapes the distribution before truncation.
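
The interaction with temperature can be sketched as follows, assuming the common convention that logits are divided by the temperature before the softmax. The function names and the example logits are hypothetical and chosen only for illustration.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def nucleus_size(probs, p=0.9):
    """Count how many top-ranked tokens are needed to reach cumulative probability p."""
    total = 0.0
    for count, prob in enumerate(sorted(probs, reverse=True), start=1):
        total += prob
        if total >= p:
            return count
    return len(probs)

logits = [4.0, 2.0, 1.0, 0.0]  # made-up scores for a four-token vocabulary
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, nucleus_size(probs, p=0.9))
```

For these logits the nucleus at p = 0.9 holds one token at temperature 0.5, two at 1.0, and three at 2.0, so raising the temperature widens the nucleus even though p itself is unchanged.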

Key idea

Nucleus sampling draws from the smallest set of top-ranked tokens whose cumulative probability reaches a threshold p, adapting diversity to the model's confidence.
