The BERT Architecture

A bidirectional encoder pretrained by masked language modeling.

What BERT is

BERT is a transformer encoder stack pretrained to understand text by reading the whole sequence at once in both directions.

  • It is bidirectional: every token attends to tokens on both sides.
  • This makes it strong for understanding tasks like classification and question answering.
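To make "bidirectional" concrete, the sketch below builds the attention visibility pattern for a short sequence and contrasts it with the left-to-right (causal) pattern used by generative decoders. The function name attention_mask and its bidirectional flag are illustrative assumptions, not part of any library.

```python
def attention_mask(seq_len, bidirectional=True):
    """Return a seq_len x seq_len matrix of 0/1 flags.

    Entry [i][j] is 1 if token i may attend to token j.
    Hypothetical helper for illustration only.
    """
    return [
        [1 if bidirectional or j <= i else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]


# Encoder-style (BERT-like): every token sees every other token.
bert_style = attention_mask(4, bidirectional=True)

# Decoder-style (causal): token i sees only positions 0..i.
causal = attention_mask(4, bidirectional=False)

for row in bert_style:
    print(row)
print()
for row in causal:
    print(row)
```

In the first matrix every row is all ones, so each token's representation can draw on context from both its left and its right. In the second, the upper triangle is zero, so no token can look ahead.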

Pretraining

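BERT is pretrained with masked language modeling (MLM). About 15% of input tokens are selected for prediction; most of these are replaced with a [MASK] symbol, and the model predicts the originals from the surrounding context on both sides. The original BERT also used an auxiliary next sentence prediction (NSP) task, in which the model judges whether the second of two sentences actually follows the first.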
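The masking step can be sketched in a few lines. This is a simplified illustration, assuming the recipe from the original BERT setup: roughly 15% of tokens are selected, and of those about 80% become [MASK], 10% become a random token, and 10% are left unchanged. The function name mask_tokens, the toy vocabulary, and the use of None as a "no prediction here" label are assumptions made for this example. A real pipeline works on token IDs and skips special tokens.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (inputs, labels) for masked language modeling.

    labels[i] holds the original token where a prediction is required,
    and None everywhere else. Hypothetical helper for illustration.
    """
    rng = rng or random.Random()
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must recover this token
            roll = rng.random()
            if roll < 0.8:
                inputs.append(MASK)  # usual case: hide the token
            elif roll < 0.9:
                inputs.append(rng.choice(vocab))  # swap in a random token
            else:
                inputs.append(tok)  # leave the token unchanged
        else:
            inputs.append(tok)
            labels.append(None)  # no loss at this position
    return inputs, labels


tokens = "the cat sat on the mat".split()
inputs, labels = mask_tokens(tokens, vocab=tokens, rng=random.Random(0))
print(inputs)
print(labels)
```

The training loss is computed only at positions where the label is not None, so the model is scored on how well it reconstructs the selected tokens from their context.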

Fine-tuning

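After pretraining, you add a small task-specific head on top of the encoder and fine-tune the whole model on labeled data. A special classification token, [CLS], is placed at the start of every input; its final hidden state serves as a summary of the sequence for sentence-level tasks.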
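As a rough sketch of what a task head does, the example below takes the encoder output at the first position (where the classification token sits), applies one linear layer, and turns the scores into class probabilities with a softmax. The function name classify_from_cls and the toy numbers are assumptions for illustration. Real implementations use tensor libraries, learn the weights during fine-tuning, and may add a pooling layer before the classifier.

```python
import math


def classify_from_cls(hidden_states, weights, bias):
    """Map the first token's hidden vector to class probabilities.

    hidden_states: one vector per token, as produced by the encoder.
    weights, bias: parameters of a single linear layer (one row per class).
    Hypothetical helper for illustration.
    """
    cls_vector = hidden_states[0]  # classification token is at position 0
    logits = [
        sum(w * h for w, h in zip(row, cls_vector)) + b
        for row, b in zip(weights, bias)
    ]
    # Numerically stable softmax over the class scores.
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy encoder output: 2 tokens, hidden size 3.
hidden_states = [[0.5, -1.0, 0.25], [0.1, 0.2, 0.3]]
# Toy head for 2 classes.
weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
bias = [0.0, 0.0]

probs = classify_from_cls(hidden_states, weights, bias)
print(probs)
```

Only the small head is new; the encoder underneath starts from the pretrained weights, which is why fine-tuning usually needs far less labeled data than training from scratch.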

Because it is encoder-only and has no autoregressive decoder, BERT is not designed to generate free-running text.

Key idea

BERT pretrains a bidirectional encoder by predicting masked tokens, producing deep, context-aware representations that fine-tune well for understanding tasks.

Check yourself

1. What pretraining objective does BERT primarily use?

2. Why is BERT well suited to understanding tasks?