Autoregressive Generation

Generate data one element at a time by predicting each piece from those before it.

Autoregressive models generate data one element at a time. Each new element is predicted from all the elements produced so far, which factors the joint distribution of the sequence into a chain of conditional distributions.

The chain rule of probability

The probability of a sequence is the product of the probabilities of each element given the elements before it.
The model learns these conditional distributions and applies them in order.
This is how language models predict the next token, and how pixel models predict the next pixel.
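
Written out, the factorization described above looks as follows. The symbols are notation chosen here for illustration: x_1, ..., x_T are the elements of a sequence of length T, and p is the distribution the model learns.

```latex
p(x_1, \dots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1})

% Worked case for T = 3:
p(x_1, x_2, x_3) = p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_1, x_2)
```

The first factor has nothing to condition on, so it is simply the distribution over the first element.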

Generating step by step

Sample or pick the first element.
Feed it back in and predict the second.
Continue until the sequence is complete. Feeding each output back in as input is called autoregressive sampling.
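
The loop above can be sketched in a few lines of Python. This is a minimal illustration, not a real model: `next_token_probs` is a hypothetical stand-in for a trained network, the three-token vocabulary and the probabilities are made up, and the `<eos>` token is an assumed way of marking that the sequence is complete.

```python
import random

VOCAB = ["a", "b", "<eos>"]

def next_token_probs(prefix):
    """Hypothetical stand-in for a trained model: returns p(next | prefix)."""
    if not prefix:
        return [0.5, 0.5, 0.0]
    # Made-up rule: favor switching tokens, and make stopping more
    # likely as the prefix grows (certain once it has 10 elements).
    p_stop = min(0.1 * len(prefix), 1.0)
    rest = 1.0 - p_stop
    if prefix[-1] == "a":
        return [0.2 * rest, 0.8 * rest, p_stop]
    return [0.8 * rest, 0.2 * rest, p_stop]

def generate(max_len=20, seed=None):
    rng = random.Random(seed)
    prefix = []
    for _ in range(max_len):
        # Predict the next element from everything generated so far.
        probs = next_token_probs(prefix)
        token = rng.choices(VOCAB, weights=probs, k=1)[0]
        if token == "<eos>":
            break
        # Feed the sampled element back in for the next step.
        prefix.append(token)
    return prefix

sequence = generate(seed=0)
```

Each pass through the loop needs the element sampled on the previous pass, which is why the steps have to run one after another. Replacing the sampling call with an argmax over `probs` would give the "pick" variant instead of the "sample" one.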

Strengths and costs

Training is stable and uses an exact likelihood objective: the same next-element prediction at every position.
Generation is sequential, so producing long outputs is slow. Each step depends on the elements generated before it, so the steps cannot run in parallel.
Sample quality is high, which is why this family underpins modern large language models.
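
To make the exact likelihood objective concrete, here is a small sketch that scores a known sequence by summing the log-probabilities of its elements, each conditioned on the elements before it. The two-token model `next_token_probs` and its probabilities are hypothetical, chosen only so the arithmetic is easy to follow.

```python
import math

def next_token_probs(prefix):
    """Hypothetical toy model: p(next | prefix) over the tokens "a" and "b"."""
    if not prefix:
        return {"a": 0.5, "b": 0.5}
    if prefix[-1] == "a":
        return {"a": 0.2, "b": 0.8}
    return {"a": 0.8, "b": 0.2}

def negative_log_likelihood(sequence):
    # The full sequence is known when scoring, so the conditional at each
    # position can be evaluated without waiting for the other positions.
    log_probs = [
        math.log(next_token_probs(sequence[:t])[sequence[t]])
        for t in range(len(sequence))
    ]
    return -sum(log_probs)

# p("a", "b", "a") = 0.5 * 0.8 * 0.8 = 0.32 under the toy model
nll = negative_log_likelihood(["a", "b", "a"])
```

Minimizing this quantity over a dataset is the next-element prediction objective described above. Unlike generation, scoring does not have to wait on sampled outputs, since every prefix is already available.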

Key idea

Autoregressive models factor a sequence into a product of conditionals and generate one element at a time, giving exact likelihoods and high quality at the cost of slow sequential sampling.
