The GPT Architecture

What GPT is

GPT is a decoder-only transformer trained to predict the next token in a sequence, which makes it a natural text generator.

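It uses causal (masked) self-attention, so each position attends only to itself and earlier positions, never to later ones.
This left-to-right constraint means no prediction can depend on tokens that have not been generated yet, which is what makes autoregressive generation valid.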
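
As a minimal sketch of the masking idea, the snippet below applies a causal mask to a square matrix of raw attention scores and normalizes each row with a softmax. The function names (`causal_mask`, `causal_attention_weights`) and the all-zero toy scores are illustrative assumptions, not part of any particular GPT implementation, and the sketch omits queries, keys, values, and multiple heads.

```python
import math


def causal_mask(n):
    """Return an n x n matrix; entry [i][j] is True if position i may attend to j."""
    return [[j <= i for j in range(n)] for i in range(n)]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention_weights(scores):
    """Mask out future positions with -inf, then softmax each row."""
    n = len(scores)
    mask = causal_mask(n)
    weights = []
    for i in range(n):
        masked = [scores[i][j] if mask[i][j] else float("-inf") for j in range(n)]
        weights.append(softmax(masked))
    return weights


# Toy input (assumption): equal raw scores for a 4-token sequence.
scores = [[0.0] * 4 for _ in range(4)]
weights = causal_attention_weights(scores)
for row in weights:
    print([round(w, 3) for w in row])
```

With equal scores, each row spreads its weight evenly over the positions it is allowed to see and gives exactly zero weight to later positions.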

Pretraining

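The objective is simple next-token prediction: maximize the probability of the true next token given everything before it, or equivalently minimize the cross-entropy between the predicted distribution and the true token. Trained at scale on large text corpora, this yields broad language ability.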
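
The objective can be sketched as an average negative log-likelihood over the positions of a sequence. In this illustration the token ids and the per-step probability tables are made-up toy values, and `next_token_loss` is a hypothetical helper name; a real model would produce the distributions itself.

```python
import math


def next_token_loss(step_probs, token_ids):
    """Average negative log-likelihood of each true next token.

    step_probs[t] is the predicted distribution over the vocabulary after
    seeing token_ids[: t + 1]; the target for that step is token_ids[t + 1].
    """
    targets = token_ids[1:]
    nll = 0.0
    for probs, target in zip(step_probs, targets):
        nll -= math.log(probs[target])
    return nll / len(targets)


# Toy example (assumption): vocabulary of 3 tokens, sequence 0 -> 2 -> 1.
token_ids = [0, 2, 1]
step_probs = [
    [0.1, 0.2, 0.7],    # after token 0; the true next token is 2
    [0.25, 0.5, 0.25],  # after tokens 0, 2; the true next token is 1
]
loss = next_token_loss(step_probs, token_ids)
print(loss)
```

Raising the probability assigned to each true next token lowers this loss, which is the sense in which training maximizes next-token probability.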

Generation

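At inference the model samples one token from its predicted next-token distribution, appends it to the context, and repeats. Sampling controls such as temperature trade off diversity against coherence: lower temperatures concentrate probability on the most likely tokens, and higher temperatures flatten the distribution.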
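
The sample, append, repeat loop can be sketched as follows. Everything model-specific here is an assumption made for illustration: `toy_logits` is a hypothetical stand-in for a trained model, the 5-token vocabulary is arbitrary, and `eos_id` is an optional stop token. Temperature is applied by dividing the logits before the softmax and must be greater than zero.

```python
import math
import random


def sample_with_temperature(logits, temperature, rng):
    """Sample an index from softmax(logits / temperature); temperature must be > 0."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    weights = [math.exp(x - m) for x in scaled]  # unnormalized softmax weights
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def generate(next_logits, prompt, max_new_tokens, temperature=1.0, eos_id=None, seed=0):
    """Autoregressive loop: sample a token, append it, feed the longer context back in."""
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        token = sample_with_temperature(next_logits(tokens), temperature, rng)
        tokens.append(token)
        if token == eos_id:
            break
    return tokens


def toy_logits(tokens):
    """Hypothetical stand-in model: strongly prefers (last token + 1) mod 5."""
    preferred = (tokens[-1] + 1) % 5
    return [5.0 if i == preferred else 0.0 for i in range(5)]


sample = generate(toy_logits, prompt=[0], max_new_tokens=8, temperature=0.1)
print(sample)
```

At a low temperature such as 0.1 the preferred token is chosen almost every time, while a higher value such as 2.0 makes the other tokens noticeably more likely.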

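Unlike BERT, an encoder-only model whose bidirectional attention is not designed for left-to-right generation, GPT can generate text as well as interpret it. This underpins modern chat assistants after further instruction tuning.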

Key idea

GPT is a causal decoder trained on next-token prediction, so it generates text autoregressively by feeding each predicted token back as new context.
