What it is
A vision transformer, or ViT, applies the transformer architecture to images. Instead of sliding convolutional filters over the image, it cuts the image into a grid of fixed-size patches, flattens each patch into a vector, and treats the resulting vectors as a sequence of tokens.
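The patch-cutting step can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `patchify`, the 224x224 RGB input, and the 16-pixel patch size are assumptions chosen for the example.

```python
import numpy as np

def patchify(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C),
    with patches ordered row by row across the grid.
    """
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    gh, gw = h // patch_size, w // patch_size
    # (gh, p, gw, p, c) -> (gh, gw, p, p, c) -> (gh * gw, p * p * c)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

# Assumed example input: a 224x224 RGB image cut into 16x16 patches.
image = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)
tokens = patchify(image, patch_size=16)
print(tokens.shape)  # (196, 768): a 14x14 grid, each patch 16*16*3 values
```

Each row of `tokens` is one patch, so the image has become a sequence of 196 tokens that a transformer can consume.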
The pipeline
- Patch embedding: split the image into patches, then linearly project each into a vector.
- Position embeddings: add a learned position vector to each patch embedding, since self-attention has no built-in sense of where a patch sits.
- Transformer encoder: stacked self-attention layers let every patch attend to every other patch, starting from the first layer.
- Class token or pooling: either a special learnable token prepended to the sequence or an average over the patch outputs feeds the classifier head.
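The four steps above can be sketched end to end as a forward pass in NumPy. Treat this as a toy under stated assumptions rather than a faithful ViT: the weights are random stand-ins for trained parameters, attention is single-head, layer normalization has no learned scale or shift, the MLP uses ReLU, and all names (`init_params`, `encoder_block`, `vit_forward`) and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)  # random weights stand in for trained ones

def layer_norm(x, eps=1e-6):
    # Normalize each token vector (learned scale/shift omitted for brevity).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def init_params(patch_dim, num_patches, dim, depth, num_classes):
    w = lambda *shape: rng.normal(0.0, 0.02, shape)
    blocks = [
        {"wq": w(dim, dim), "wk": w(dim, dim), "wv": w(dim, dim),
         "wo": w(dim, dim), "w1": w(dim, 4 * dim), "w2": w(4 * dim, dim)}
        for _ in range(depth)
    ]
    return {"embed": w(patch_dim, dim), "cls": w(1, dim),
            "pos": w(num_patches + 1, dim), "blocks": blocks,
            "head": w(dim, num_classes)}

def encoder_block(x, p):
    # Self-attention: every token attends to every other token.
    h = layer_norm(x)
    q, k, v = h @ p["wq"], h @ p["wk"], h @ p["wv"]
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (tokens, tokens)
    x = x + (attn @ v) @ p["wo"]
    # Position-wise MLP (ReLU used here as a simplification).
    h = layer_norm(x)
    return x + np.maximum(h @ p["w1"], 0.0) @ p["w2"]

def vit_forward(image, params, patch_size):
    h, w, c = image.shape
    gh, gw = h // patch_size, w // patch_size
    # 1. Patch embedding: cut into patches, flatten, project linearly.
    patches = (image.reshape(gh, patch_size, gw, patch_size, c)
                    .transpose(0, 2, 1, 3, 4)
                    .reshape(gh * gw, -1))
    x = patches @ params["embed"]
    # 2. Prepend the class token, then add position embeddings.
    x = np.concatenate([params["cls"], x], axis=0)
    x = x + params["pos"]
    # 3. Transformer encoder: a stack of self-attention blocks.
    for block in params["blocks"]:
        x = encoder_block(x, block)
    # 4. The class token's final state feeds the classifier head.
    return layer_norm(x[0]) @ params["head"]

# Assumed toy sizes: 32x32 RGB image, 8x8 patches, 64-dim tokens, 10 classes.
image = rng.random((32, 32, 3))
params = init_params(patch_dim=8 * 8 * 3, num_patches=16, dim=64,
                     depth=2, num_classes=10)
logits = vit_forward(image, params, patch_size=8)
print(logits.shape)  # (10,)
```

Swapping `x[0]` for `x[1:].mean(axis=0)` in the last step would give the pooled variant instead of the class-token variant.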
Why it matters
Self-attention gives every patch a global receptive field from the first layer, while a convolution sees only a local window and must stack layers to widen its view.
- ViTs scale well with model and dataset size, and tend to beat convolutional networks once there is enough training data.
- On small datasets they lack the built-in locality bias of convolutions, so they need strong augmentation, pretraining, or extra regularization.
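To put a number on the receptive-field contrast, here is a small arithmetic sketch. It assumes the simplest case of stride-1, undilated convolutions with no pooling, where each layer widens the receptive field by `kernel_size - 1` pixels; real convolutional networks use strides and pooling and so grow their view much faster. The helper names are hypothetical.

```python
import math

def conv_receptive_field(num_layers: int, kernel_size: int = 3) -> int:
    """Receptive field width (pixels) of stacked stride-1, undilated convs."""
    return 1 + num_layers * (kernel_size - 1)

def layers_to_cover(image_size: int, kernel_size: int = 3) -> int:
    """Fewest stride-1 conv layers whose receptive field spans image_size."""
    return math.ceil((image_size - 1) / (kernel_size - 1))

# Assumed example: a 224-pixel-wide image and 3x3 kernels.
print(conv_receptive_field(1))   # 3
print(conv_receptive_field(10))  # 21
print(layers_to_cover(224))      # 112
```

Under these assumptions a plain 3x3 stack needs 112 layers before one output position can see the full width of a 224-pixel image, whereas a single self-attention layer already connects every patch to every other patch.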
Key idea
A vision transformer treats image patches as tokens, so global self-attention replaces convolution; it scales strongly with data but needs more of it.