The BERT Architecture

What BERT is

BERT (Bidirectional Encoder Representations from Transformers) is a stack of Transformer encoder layers, pretrained to build representations of text by reading the whole sequence at once rather than left to right.

It is bidirectional: in every layer, each token attends to the tokens on both its left and its right.
This makes it well suited to understanding tasks such as classification and question answering.
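
To make "attends to tokens on both sides" concrete, here is a minimal sketch of the attention visibility pattern, contrasted with the left-to-right pattern a generative decoder would use. The function names and the 0/1 list-of-lists representation are illustrative assumptions, not part of any library.

```python
def bidirectional_mask(n):
    """Encoder-style visibility: every position may attend to every position."""
    return [[1] * n for _ in range(n)]


def causal_mask(n):
    """Decoder-style visibility: position i may attend only to positions j <= i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


tokens = ["the", "cat", "sat"]
bi = bidirectional_mask(len(tokens))
causal = causal_mask(len(tokens))

# Row 1 is "cat", column 2 is "sat": the token to its right is visible
# only under the bidirectional mask.
for name, mask in (("bidirectional", bi), ("causal", causal)):
    print(name)
    for tok, row in zip(tokens, mask):
        print(f"  {tok:>4} {row}")
```

In a real model the mask is applied to the attention scores before the softmax; the sketch only shows which positions are allowed to see each other.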

Pretraining

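BERT is pretrained with masked language modeling (MLM). A fraction of the input tokens (15% in the original recipe) is selected, most of the selected tokens are replaced with a [MASK] symbol, and the model predicts the originals from the surrounding context on both sides. The original model also used an auxiliary next sentence prediction (NSP) task: given two text segments, predict whether the second follows the first in the source text.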
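
The masking step can be sketched in a few lines. This is an illustration under stated assumptions, not the original training code: the 15% selection rate and the 80/10/10 split (mask symbol, random token, unchanged) follow the original BERT recipe, while the function name `mask_tokens`, the toy vocabulary, and the per-token coin flip are simplifications chosen here.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (inputs, labels) for masked language modeling.

    labels[i] holds the original token where a prediction is required,
    and None where the position is ignored by the loss.
    """
    if rng is None:
        rng = random.Random()
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must recover the original token
            r = rng.random()
            if r < 0.8:
                inputs.append(MASK)               # 80%: replace with the mask symbol
            elif r < 0.9:
                inputs.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                inputs.append(tok)                # 10%: keep the token unchanged
        else:
            inputs.append(tok)
            labels.append(None)  # not selected: no prediction at this position
    return inputs, labels


# Hypothetical toy data, for illustration only.
vocab = ["the", "cat", "sat", "on", "mat"]
tokens = ["the", "cat", "sat", "on", "the", "mat"]
inputs, labels = mask_tokens(tokens, vocab, rng=random.Random(0))
print(inputs)
print(labels)
```

Training then minimizes the prediction loss only at positions where a label is present, so the model cannot simply copy its input.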

Fine-tuning

After pretraining, you add a small task-specific head and fine-tune on labeled data. For sentence-level tasks, the final hidden state of a special classification token, [CLS], placed at the start of the input, serves as a summary of the sequence.
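
As a rough sketch of what such a head does, the example below takes the encoder's output vector for the classification token (assumed to sit at position 0) and passes it through a single linear layer and a softmax. The hidden states, weights, and the names `classify` and `softmax` are made up for illustration; a real setup would take the hidden states from the pretrained encoder and learn the weights during fine-tuning.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def classify(hidden_states, weights, bias):
    """Linear layer plus softmax over the first token's hidden vector."""
    cls_vec = hidden_states[0]  # assumption: the classification token is first
    logits = [
        sum(w * h for w, h in zip(row, cls_vec)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


# Hypothetical encoder output: 2 tokens, hidden size 2.
hidden_states = [[1.0, 0.0], [0.5, 0.5]]
# Hypothetical head for 2 classes: one weight row and one bias per class.
weights = [[2.0, 0.0], [0.0, 2.0]]
bias = [0.0, 0.0]

probs = classify(hidden_states, weights, bias)
print(probs)
```

Only the first position feeds the head here, which is why the rest of the sequence must already be summarized in that vector by the encoder's attention layers.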

Because it is encoder-only, BERT is not designed for free-running text generation.

Key idea

BERT pretrains a bidirectional encoder by predicting masked tokens, producing deep, context-aware representations that fine-tune well for understanding tasks.
