The decoding choice
When a language model generates text, each step yields a probability distribution over the whole vocabulary. Always picking the most likely token (greedy decoding) tends to give dull, repetitive text, while sampling from the full distribution lets rare nonsense slip in. Nucleus sampling, also called top p, finds a middle path.
How top p works
Instead of a fixed number of candidates, nucleus sampling keeps a dynamic set:
- Sort tokens from most to least likely
- Add tokens to the nucleus until their cumulative probability reaches a threshold p, such as 0.9
- Renormalize the probabilities inside the nucleus and sample only from it, ignoring the long tail
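The steps above can be sketched in a few lines of Python. This is a minimal illustration, not any particular library's implementation: the function names `nucleus` and `sample_top_p` and the toy vocabulary are assumptions made up for the example, and the distribution is passed in as a plain dict of token probabilities.

```python
import random

def nucleus(probs, p=0.9):
    """Return the (token, prob) pairs of the smallest top-ranked set
    whose cumulative probability reaches p."""
    # Step 1: sort tokens from most to least likely.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, total = [], 0.0
    # Step 2: add tokens until the cumulative probability reaches p.
    for token, prob in ranked:
        kept.append((token, prob))
        total += prob
        if total >= p:
            break
    return kept

def sample_top_p(probs, p=0.9, rng=random):
    """Sample one token from the nucleus only."""
    kept = nucleus(probs, p)
    tokens = [token for token, _ in kept]
    weights = [prob for _, prob in kept]
    # Step 3: random.choices treats the weights as relative, which
    # amounts to renormalizing the probabilities inside the nucleus.
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical next-token distribution, for illustration only.
toy_probs = {"the": 0.5, "a": 0.3, "cat": 0.15, "zebra": 0.04, "qux": 0.01}
nucleus_tokens = [token for token, _ in nucleus(toy_probs, p=0.9)]
print(nucleus_tokens)                  # the three most likely tokens
print(sample_top_p(toy_probs, p=0.9))  # always one of those three
```

With p set to 0.9, the cumulative sums are 0.5, 0.8 and 0.95, so the nucleus closes after the third token and the two rarest tokens can never be drawn.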
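When the model is confident, the nucleus is tiny and sampling is almost greedy. When it is uncertain, the nucleus grows to allow creative variety. This adaptivity is why top p is often preferred over top k sampling, which keeps a fixed number of candidates regardless of the distribution's shape.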
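To make the adaptivity concrete, the sketch below counts how many tokens fall inside the nucleus for a peaked and a flat distribution. The helper name `count_nucleus_tokens` and both distributions are invented for illustration.

```python
def count_nucleus_tokens(probs, p=0.9):
    """Count the tokens in the smallest top-ranked set whose
    cumulative probability reaches p."""
    total = 0.0
    for count, prob in enumerate(sorted(probs, reverse=True), start=1):
        total += prob
        if total >= p:
            return count
    # Only reached if the probabilities sum to less than p.
    return len(probs)

confident = [0.97, 0.01, 0.01, 0.01]  # one token dominates
uncertain = [0.25, 0.25, 0.25, 0.25]  # every token equally likely

confident_size = count_nucleus_tokens(confident, p=0.9)
uncertain_size = count_nucleus_tokens(uncertain, p=0.9)
print(confident_size)  # 1
print(uncertain_size)  # 4
```

A top k rule with k set to 2 would keep two candidates in both cases: one more than needed for the confident distribution, and two fewer than top p allows for the uncertain one.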
Tuning
A higher p means more diversity and more risk; a lower p means safer but blander text. Top p pairs naturally with a temperature, which reshapes the distribution before truncation: a temperature below 1 sharpens it and a temperature above 1 flattens it.
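As a rough illustration of how the two knobs interact, the sketch below applies a temperature to a set of logits, converts them to probabilities with a softmax, and then measures the nucleus at p of 0.9. The logits, the helper names `softmax` and `nucleus_size`, and the chosen temperatures are all assumptions for the example.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nucleus_size(probs, p):
    """Count the tokens needed for the cumulative probability to reach p."""
    total = 0.0
    for count, prob in enumerate(sorted(probs, reverse=True), start=1):
        total += prob
        if total >= p:
            return count
    return len(probs)

logits = [4.0, 2.0, 1.0, 0.0]  # hypothetical model outputs
sizes = {t: nucleus_size(softmax(logits, t), p=0.9) for t in (0.5, 1.0, 2.0)}
print(sizes)  # {0.5: 1, 1.0: 2, 2.0: 3}
```

With the threshold held fixed, a lower temperature concentrates mass on the top token and shrinks the nucleus, while a higher temperature spreads mass out and enlarges it, so the two settings are best tuned together rather than in isolation.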
Key idea
Nucleus sampling draws from the smallest set of top-ranked tokens whose cumulative probability reaches a threshold p, adapting diversity to the model's confidence.