Machine Learning

Autoregressive Generation

Generate data one element at a time by predicting each piece from those before it.

Autoregressive models generate data one element at a time. Each new element is predicted from all the elements produced so far, which amounts to factoring the joint distribution of the sequence into a chain of conditional distributions.

The chain rule of probability

  • The probability of a sequence is the product of the probabilities of each element given the elements before it.
  • The model learns these conditional distributions and applies them in order.
  • This is how language models predict the next token, and how pixel models predict the next pixel.
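
Written out for a sequence of T elements, the factorization described above reads as follows. The symbols x_1, ..., x_T and T are notation chosen here for illustration, not taken from the lesson.

```latex
p(x_1, \dots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1})
```

For T = 3 this expands to p(x_1) · p(x_2 | x_1) · p(x_3 | x_1, x_2). The first factor has an empty context, so it is just the marginal probability of the first element.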

Generating step by step

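  • Sample the first element from the model's distribution, or pick the most likely one.
  • Feed it back in as input and predict the second.
  • Continue until the sequence is complete, for example when an end marker is produced or a length limit is reached. Feeding each output back in as input is called sampling autoregressively.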
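
The loop above can be sketched in a few lines of Python. This is a minimal illustration, not a real model: `TOY_MODEL`, `next_token_probs`, and `generate` are hypothetical names, and the toy table conditions only on the previous token, whereas a real autoregressive model would condition on the whole prefix.

```python
import random

# Hypothetical toy model (an assumption for illustration): next-token
# probabilities that depend only on the previous token.
TOY_MODEL = {
    "<start>": {"a": 0.6, "b": 0.4},
    "a": {"a": 0.1, "b": 0.6, "<end>": 0.3},
    "b": {"a": 0.5, "b": 0.1, "<end>": 0.4},
}


def next_token_probs(prefix):
    """Return the distribution over the next token given the prefix so far."""
    last = prefix[-1] if prefix else "<start>"
    return TOY_MODEL[last]


def generate(max_len=10, seed=0):
    """Sample one sequence, one token at a time."""
    rng = random.Random(seed)
    sequence = []
    for _ in range(max_len):
        probs = next_token_probs(sequence)
        tokens = list(probs)
        weights = [probs[t] for t in tokens]
        token = rng.choices(tokens, weights=weights, k=1)[0]
        if token == "<end>":
            break  # the sequence is complete
        sequence.append(token)  # feed the sample back in for the next step
    return sequence


print(generate())
```

Each pass through the loop needs the token produced by the previous pass, which is why the steps run one after another rather than all at once.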

Strengths and costs

  • Training is stable and uses an exact likelihood objective: the same next-element prediction task at every position.
  • Generation is sequential. Each step depends on the output of the previous one, so the steps cannot be run in parallel and long outputs are slow to produce.
  • Output quality is high, which is one reason this family underpins modern large language models.
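
As a rough sketch of the exact likelihood objective mentioned in the first bullet, the log-likelihood of a sequence is the sum of the log-probabilities of each element given its prefix. The table `TOY_MODEL` and the helpers `cond_prob` and `sequence_log_likelihood` are hypothetical, and the toy model again looks only at the previous token to keep the example short.

```python
import math

# Hypothetical toy conditional table (an assumption for illustration).
TOY_MODEL = {
    "<start>": {"a": 0.6, "b": 0.4},
    "a": {"a": 0.1, "b": 0.6, "<end>": 0.3},
    "b": {"a": 0.5, "b": 0.1, "<end>": 0.4},
}


def cond_prob(token, prefix):
    """Probability of `token` given the observed prefix."""
    last = prefix[-1] if prefix else "<start>"
    return TOY_MODEL[last].get(token, 0.0)


def sequence_log_likelihood(tokens):
    """Sum of log p(x_t | x_1..x_{t-1}) over all positions (chain rule)."""
    total = 0.0
    for t, token in enumerate(tokens):
        # Each term uses the true prefix from the data, not model samples.
        total += math.log(cond_prob(token, tokens[:t]))
    return total


seq = ["a", "b", "<end>"]
log_lik = sequence_log_likelihood(seq)
# Training would minimize the average negative log-likelihood per position.
print(-log_lik / len(seq))
```

For the sequence above the product of conditionals is 0.6 × 0.6 × 0.4 = 0.144, so `math.exp(log_lik)` recovers that value up to floating-point error.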

Key idea

Autoregressive models factor a sequence into a product of conditionals and generate one element at a time, giving exact likelihoods and high quality at the cost of slow sequential sampling.

Check yourself

1. How does an autoregressive model generate data?

2. What is the main cost of autoregressive generation?

3. What rule factors the joint distribution into conditionals?