Machine Learning

Nucleus Sampling

Sampling from the smallest set of likely tokens that covers most probability.

The decoding choice

When a language model generates text, at each step it produces a probability distribution over the whole vocabulary. Always picking the most likely token gives dull, repetitive text, while sampling from the full distribution lets rare nonsense slip in. Nucleus sampling, also called top-p sampling, finds a middle path.

How top p works

Instead of a fixed number of candidates, nucleus sampling keeps a dynamic set:

  • Sort tokens from most to least likely
  • Add tokens to the nucleus until their cumulative probability reaches a threshold p, such as 0.9
  • Renormalize the probabilities inside the nucleus and sample only from it, ignoring the long tail
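
The steps above can be sketched in plain Python. This is a minimal illustration rather than any particular library's implementation: the names `top_p_filter` and `sample_top_p` are hypothetical, `probs` is assumed to be a list of probabilities that already sums to 1, and the cutoff is treated as inclusive (the nucleus closes as soon as the running total reaches p), a convention on which implementations may differ.

```python
import random

def top_p_filter(probs, p=0.9):
    """Return {token_index: renormalized_probability} for the nucleus."""
    # Step 1: sort token indices from most to least likely.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # Step 2: add tokens until the cumulative probability reaches p.
    nucleus, total = [], 0.0
    for i in order:
        nucleus.append(i)
        total += probs[i]
        if total >= p:
            break

    # Renormalize so the kept probabilities sum to 1 again.
    return {i: probs[i] / total for i in nucleus}

def sample_top_p(probs, p=0.9, rng=random):
    """Step 3: draw one token index from the nucleus only."""
    nucleus = top_p_filter(probs, p)
    tokens = list(nucleus)
    weights = [nucleus[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Toy four-token vocabulary (made-up numbers for illustration).
probs = [0.5, 0.3, 0.15, 0.05]
print(sorted(top_p_filter(probs, p=0.9)))  # indices kept in the nucleus
print(sample_top_p(probs, p=0.9))          # one sampled token index
```

With these toy numbers the running total is 0.5, then 0.8, then 0.95, so the first three tokens form the nucleus and the last token can never be sampled.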

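When the model is confident, the nucleus is tiny and sampling is almost greedy. When it is uncertain, the nucleus grows to allow creative variety. This adaptivity is why top-p often works better than top-k sampling, which keeps the same number of candidates no matter how the probability is spread.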
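
To make the adaptivity concrete, the short sketch below counts how many tokens land in the nucleus for a peaked distribution and for a flat one. The helper `nucleus_size` and both distributions are made up for illustration, and the cutoff is again assumed to be inclusive.

```python
def nucleus_size(probs, p=0.9):
    """Count the tokens needed for the cumulative probability to reach p."""
    total = 0.0
    for count, prob in enumerate(sorted(probs, reverse=True), start=1):
        total += prob
        if total >= p:
            return count
    return len(probs)

confident = [0.95, 0.02, 0.01, 0.01, 0.01]  # one token dominates
uncertain = [0.25, 0.20, 0.20, 0.20, 0.15]  # probability spread out

print(nucleus_size(confident))  # 1: behaves like greedy decoding
print(nucleus_size(uncertain))  # 5: every token stays in play
```

A fixed top-k with k = 3 would keep three candidates in both cases: two unlikely extras for the confident distribution, and two plausible tokens dropped for the uncertain one.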

Tuning

A higher p means more diversity and more risk; a lower p means safer but blander text. Top-p pairs naturally with a temperature, which sharpens or flattens the distribution before the tail is truncated.
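
The interaction with temperature can be illustrated with a small sketch. It assumes the common ordering in which logits are divided by the temperature, passed through a softmax, and only then truncated; the logits and the helper names `softmax` and `nucleus_size` are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    norm = sum(exps)
    return [e / norm for e in exps]

def nucleus_size(probs, p=0.9):
    """Count the tokens needed for the cumulative probability to reach p."""
    total = 0.0
    for count, prob in enumerate(sorted(probs, reverse=True), start=1):
        total += prob
        if total >= p:
            return count
    return len(probs)

logits = [4.0, 2.0, 1.0, 0.5, 0.0]  # made-up scores for five tokens

for temperature in (0.5, 1.0, 2.0):
    probs = softmax(logits, temperature)
    print(temperature, nucleus_size(probs, p=0.9))
```

With these logits and p = 0.9, the nucleus holds 1 token at temperature 0.5, 2 tokens at 1.0, and 4 tokens at 2.0: a lower temperature sharpens the distribution so the same p covers fewer tokens, while a higher one flattens it and widens the nucleus.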

Key idea

Nucleus sampling draws from the smallest set of top-ranked tokens whose cumulative probability reaches a threshold p, adapting diversity to the model's confidence.

Check yourself

1. How does nucleus sampling choose its candidate set?

2. When the model is very confident, how large is the nucleus?