What BERT is
BERT (Bidirectional Encoder Representations from Transformers) is a stack of transformer encoder layers, pretrained to build representations of text by reading the whole sequence at once rather than left to right.
- It is bidirectional: every token attends to tokens on both sides.
- This makes it strong for understanding tasks like classification and question answering.
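The difference between bidirectional and left-to-right attention can be shown with a toy example. This is a minimal sketch using only the standard library, not BERT's actual implementation: the function name `attention_weights`, the `causal` flag, and the uniform scores are all illustrative assumptions. With no mask, each token spreads its attention over every position. With a causal mask, a token can only see itself and earlier positions.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(scores, causal=False):
    """Row i holds how much token i attends to each token j.

    causal=False: every position is visible (BERT-style, bidirectional).
    causal=True:  positions j > i are hidden (left-to-right style).
    """
    n = len(scores)
    weights = []
    for i in range(n):
        row = []
        for j in range(n):
            hidden = causal and j > i
            row.append(float("-inf") if hidden else scores[i][j])
        weights.append(softmax(row))
    return weights

# Toy input: 4 tokens with identical raw scores (an assumption for clarity).
scores = [[0.0] * 4 for _ in range(4)]
bidirectional = attention_weights(scores)
causal = attention_weights(scores, causal=True)

print(bidirectional[0])  # first token attends equally to all 4 positions
print(causal[0])         # first token can only attend to itself
```

In the bidirectional case the first token already sees the last one, which is what lets an encoder use right-hand context when representing a word.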
Pretraining
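BERT is trained with masked language modeling. A fraction of the input tokens (15% in the original setup) is selected, most of them are replaced with a mask symbol, and the model predicts the originals from the surrounding context on both sides. The original model also used an auxiliary next sentence prediction task, in which the model judges whether the second of two text segments actually follows the first.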
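The masking step can be sketched in a few lines. This is an illustrative sketch, not the reference implementation. The function name `mask_tokens`, the toy vocabulary, and the use of `None` for unscored positions are assumptions. The 15% selection rate and the 80/10/10 split (mask symbol, random token, unchanged) follow the recipe described in the original BERT paper.

```python
import random

MASK = "[MASK]"
SPECIAL = {"[CLS]", "[SEP]"}  # special tokens are never masked

def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (inputs, labels) for masked language modeling.

    labels[i] is the original token at selected positions and None
    elsewhere, so the loss is computed only on selected positions.
    """
    if rng is None:
        rng = random.Random()
    inputs = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok in SPECIAL or rng.random() >= mask_prob:
            continue
        labels[i] = tok                    # the model must recover this token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK               # 80%: replace with the mask symbol
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return inputs, labels

tokens = ["[CLS]", "the", "cat", "sat", "on", "the", "mat", "[SEP]"]
vocab = ["the", "cat", "sat", "on", "mat", "dog"]  # toy vocabulary (assumption)
inputs, labels = mask_tokens(tokens, vocab, rng=random.Random(0))
print(inputs)
print(labels)
```

Keeping some selected tokens unchanged or swapping in random ones means the model cannot assume that a token is correct just because it is not the mask symbol.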
Fine-tuning
After pretraining, you add a small task-specific head and fine-tune on labeled data. For sentence-level tasks, the final hidden state of a special classification token ([CLS]), placed at the start of the input, serves as a summary of the sequence.
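A sentence-level head of this kind can be sketched as a single linear layer plus softmax applied to the first position's vector. This is a simplified illustration under stated assumptions: the name `classify_from_cls`, the two-dimensional hidden states, and the hand-picked weights are hypothetical, and real implementations typically add further layers and learn the weights during fine-tuning.

```python
import math

def classify_from_cls(hidden_states, weights, bias):
    """Map the encoder output at position 0 ([CLS]) to class probabilities.

    hidden_states: one vector per input token, as produced by the encoder.
    weights, bias: parameters of the task head (one row per class).
    """
    cls_vector = hidden_states[0]  # only the [CLS] position is used
    logits = [
        sum(w * x for w, x in zip(row, cls_vector)) + b
        for row, b in zip(weights, bias)
    ]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy encoder output for 3 tokens with hidden size 2 (assumed values).
hidden_states = [[1.0, 0.0], [0.3, 0.7], [0.5, 0.5]]
weights = [[2.0, 0.0], [0.0, 2.0]]  # 2 classes, hypothetical head weights
bias = [0.0, 0.0]

probs = classify_from_cls(hidden_states, weights, bias)
print(probs)  # class 0 gets the larger probability for this toy input
```

The head reads only one vector, so it relies on the encoder having mixed information from the whole sequence into the [CLS] position through attention.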
Because it is encoder-only and trained to fill in masked tokens rather than to predict the next token, BERT is not designed for free-running text generation.
Key idea
BERT pretrains a bidirectional encoder by predicting masked tokens, producing deep, context-aware representations that fine-tune well for understanding tasks.