Vision Transformers

What it is

A vision transformer, or ViT, applies the transformer architecture to images. Instead of convolutions, it cuts an image into a grid of fixed size patches, flattens each into a vector, and treats the patches as a sequence of tokens.

The pipeline

Patch embedding: split the image into patches, then linearly project each into a vector.
Position embeddings: add learned positions, since attention has no built in sense of where a patch sits.
Transformer encoder: stacked self attention layers let every patch attend to every other patch globally from the first layer.
Class token or pooling: a special token or a pooled summary feeds the classifier head.

Why it matters

Self attention gives a global receptive field immediately, while a convolution sees only a local window and must stack layers to widen its view.

ViTs scale very well and tend to beat convolutional networks given enough data.
With small datasets they lack the built in locality bias of convolutions, so they need strong augmentation, pretraining, or extra regularization.

Key idea

A vision transformer treats image patches as tokens so global self attention replaces convolution, scaling strongly with data but needing more of it.

Vision Transformers

What it is

The pipeline

Why it matters

Key idea

Check yourself