Autoregressive Generation
Autoregressive models generate data one element at a time. Each new element is predicted from all the elements produced so far, factoring the joint distribution into a chain of conditionals.
The chain rule of probability
- The probability of a sequence is the product of the conditional probabilities of its elements, each conditioned on the elements before it.
- The model learns these conditional distributions and applies them in order.
- This is how language models predict the next token, and how pixel models predict the next pixel.
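Written out, the factorization described above looks like this. The symbols are notation assumed here for illustration: x_t is the element at position t and T is the sequence length.

```latex
p(x_1, \dots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1})
\qquad\Longleftrightarrow\qquad
\log p(x_1, \dots, x_T) = \sum_{t=1}^{T} \log p(x_t \mid x_1, \dots, x_{t-1})
```

For t = 1 the conditioning set is empty, so the first factor is simply p(x_1).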
Generating step by step
- Sample the first element from the model's distribution, or pick the most likely one.
- Feed it back in and predict the second.
- Continue until the sequence is complete, for example when an end marker is produced or a length limit is reached. Feeding each output back in as input is what makes the sampling autoregressive.
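The loop above can be sketched in a few lines of Python. This is a minimal illustration, not a real model: the table BIGRAM, the function next_token_probs, and the markers "<s>" and "</s>" are all hypothetical, and the toy conditional looks only at the last token, whereas a real autoregressive model conditions on the whole prefix.

```python
import random

# Hypothetical toy model: the next-token distribution depends only on the
# last token. A real model would condition on the entire prefix.
BIGRAM = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"a": 0.1, "b": 0.6, "</s>": 0.3},
    "b": {"a": 0.5, "b": 0.1, "</s>": 0.4},
}


def next_token_probs(prefix):
    """Return p(next token | prefix) as a dict (toy stand-in for a model)."""
    return BIGRAM[prefix[-1]]


def generate(rng, max_len=20):
    """Sample one sequence autoregressively, one token per step."""
    prefix = ["<s>"]  # start marker, so the first step has something to condition on
    for _ in range(max_len):
        probs = next_token_probs(prefix)
        tokens = list(probs)
        weights = [probs[t] for t in tokens]
        token = rng.choices(tokens, weights=weights, k=1)[0]
        if token == "</s>":  # end marker: the sequence is complete
            break
        prefix.append(token)  # feed the sampled token back in as input
    return prefix[1:]  # drop the start marker


sample = generate(random.Random(0))
print(sample)
```

Each iteration needs the token chosen in the previous one, which is why the loop cannot be run in parallel across positions. Replacing the random draw with the highest-probability token would give greedy decoding instead of sampling.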
Strengths and costs
- Training is stable and uses an exact likelihood objective: the same next-element prediction at every position.
- Generation is sequential: each step depends on the outputs before it, so steps cannot run in parallel and long outputs are slow to produce.
- Output quality is typically high, which is one reason this family underpins modern large language models.
Key idea
Autoregressive models factor a sequence into a product of conditionals and generate one element at a time, giving exact likelihoods and high quality at the cost of slow sequential sampling.