Images as sequences
The vision transformer drops convolution and instead treats an image as a sequence of tokens. It splits the image into fixed-size patches, flattens each patch into a vector, and linearly projects that vector to an embedding, so patches play the role that word tokens play in text.
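The patch step can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function names `patchify` and `embed_patches`, the 32x32 image, the patch size of 8, the embedding width of 64, and the random projection matrix are all assumptions chosen for the example. In a trained model the projection would be learned.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0, "image must tile evenly"
    gh, gw = h // patch_size, w // patch_size
    # (gh, p, gw, p, c) -> (gh, gw, p, p, c) -> (gh * gw, p * p * c)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

def embed_patches(image, patch_size, projection):
    """Linearly project each flattened patch to an embedding vector."""
    return patchify(image, patch_size) @ projection

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))                       # hypothetical input image
projection = rng.normal(scale=0.02, size=(8 * 8 * 3, 64))  # stands in for learned weights
tokens = embed_patches(image, 8, projection)
print(tokens.shape)  # (16, 64): a 4x4 grid of patches, 64 dimensions each
```

Patches are read in row-major order, so token 1 here is the patch covering rows 0 to 7 and columns 8 to 15.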
Position and class tokens
Transformers have no built-in sense of order, so a position embedding is added to each patch embedding to tell the model where that patch sits in the image. A special learnable class token is prepended to the sequence, and its final state summarizes the image for classification.
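A rough sketch of how the input sequence might be assembled is shown here. The helper name `add_class_and_position`, the sizes, and the random values are assumptions for illustration. The sketch also assumes the class token receives its own position embedding, so the position table has one more row than there are patches.

```python
import numpy as np

def add_class_and_position(patch_tokens, class_token, position_embeddings):
    """Prepend the class token, then add one position embedding per token."""
    sequence = np.concatenate([class_token[None, :], patch_tokens], axis=0)
    assert sequence.shape == position_embeddings.shape, "one position per token"
    return sequence + position_embeddings

rng = np.random.default_rng(0)
num_patches, dim = 16, 64                              # hypothetical sizes
patch_tokens = rng.normal(size=(num_patches, dim))     # stands in for patch embeddings
class_token = np.zeros(dim)                            # a learned parameter in a real model
position_embeddings = rng.normal(scale=0.02, size=(num_patches + 1, dim))  # also learned

sequence = add_class_and_position(patch_tokens, class_token, position_embeddings)
print(sequence.shape)  # (17, 64): one class token followed by 16 patch tokens
```

After the encoder runs, row 0 of its output would be the class token's final state, which is what a classification layer reads.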
Self-attention over patches
The encoder is a standard transformer. Every patch attends to every other patch, so the model can relate distant regions of the image within a single layer. A convolution, by contrast, sees only a local neighborhood, and its receptive field grows gradually as layers are stacked.
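To make the all-to-all pattern concrete, here is a minimal single-head scaled dot-product self-attention in NumPy. It is a sketch under simplifying assumptions: one head, no layer normalization, no residual connection, no feed-forward block, and random matrices `w_q`, `w_k`, `w_v` standing in for learned weights. A full encoder layer contains all of those pieces.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a token sequence."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (n_tokens, n_tokens) similarities
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
n_tokens, dim = 17, 64                        # hypothetical: class token + 16 patches
tokens = rng.normal(size=(n_tokens, dim))
w_q, w_k, w_v = (rng.normal(scale=dim ** -0.5, size=(dim, dim)) for _ in range(3))

output, weights = self_attention(tokens, w_q, w_k, w_v)
print(output.shape, weights.shape)  # (17, 64) (17, 17)
```

The weight matrix has one row and one column per token, and every entry is positive, so each token draws on every other token regardless of how far apart the patches are in the image.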
The data appetite
Because they lack the built-in locality bias of convolutions, vision transformers typically need large training sets or strong pretraining to match convolutional networks. With enough data and model scale they match or exceed them, and hybrid or distilled variants reduce the data requirement.
Why it matters
A single architecture now spans text and images, simplifying multimodal systems and enabling shared pretraining across domains.
Key idea
A vision transformer splits an image into patch tokens, adds position embeddings, and runs a standard self-attention encoder, gaining global context at the cost of needing large datasets or pretraining.