The Vision Transformer Patches
The vision transformer, or ViT, applies the transformer architecture to images by turning a picture into a sequence of patches. It shows that attention, not convolution, can drive strong vision models.
From image to tokens
A ViT first cuts the image into a grid of fixed-size, non-overlapping patches, often 16 by 16 pixels.
- Each patch is flattened into a vector.
- A linear layer projects it into an embedding, producing one token per patch.
- A position embedding is added so the model knows where each patch sat.
The image is now a sequence of tokens, just like words in a sentence.
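The steps above can be sketched in NumPy. This is a minimal illustration, not a reference implementation: the 224 by 224 RGB input, the embedding width of 64, the random weights standing in for learned parameters, and the names `patchify`, `W_embed`, `b_embed`, and `pos_embed` are all assumptions made for the example.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0, "image must tile evenly"
    gh, gw = h // patch_size, w // patch_size
    # (gh, p, gw, p, c) -> (gh, gw, p, p, c) -> (gh * gw, p * p * c)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))   # assumed input size
patch_size, embed_dim = 16, 64      # embed_dim is an illustrative choice

patches = patchify(image, patch_size)   # one row per patch

# Random stand-ins for parameters that would be learned in a real model.
W_embed = rng.normal(0.0, 0.02, (patches.shape[1], embed_dim))
b_embed = np.zeros(embed_dim)
pos_embed = rng.normal(0.0, 0.02, (patches.shape[0], embed_dim))

# Linear projection plus position embedding: one token per patch.
tokens = patches @ W_embed + b_embed + pos_embed
print(patches.shape, tokens.shape)
```

With these assumed sizes the image yields a 14 by 14 grid, so the sequence has 196 tokens, each built from 16 × 16 × 3 = 768 pixel values.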
Attention over patches
A standard transformer encoder then processes the tokens. Self-attention lets every patch attend to every other patch directly, so the model can capture global relationships from the very first layer. A convolutional layer with a small kernel, by contrast, sees only a local neighborhood and widens its receptive field only as layers are stacked.
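To make the "every patch looks at every other patch" idea concrete, here is a hedged sketch of single-head scaled dot-product self-attention in NumPy. A real encoder layer also uses multiple heads, residual connections, normalization, and an MLP, all omitted here; the sizes, the random weights, and the names `self_attention`, `W_q`, `W_k`, and `W_v` are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over a token sequence."""
    q, k, v = tokens @ W_q, tokens @ W_k, tokens @ W_v
    # scores[i, j] measures how strongly token i attends to token j.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

rng = np.random.default_rng(0)
n_tokens, dim = 196, 64   # assumed: 14 x 14 patches, 64-dim embeddings
tokens = rng.normal(size=(n_tokens, dim))
# Random stand-ins for learned projection matrices.
W_q, W_k, W_v = (rng.normal(0.0, dim ** -0.5, (dim, dim)) for _ in range(3))

out, weights = self_attention(tokens, W_q, W_k, W_v)
print(out.shape, weights.shape)
```

The attention matrix has one row and one column per patch, so each output token is a weighted mix of all patches in the image, regardless of how far apart they are.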
A special learnable classification token is often prepended to the sequence, and its final embedding feeds a head that predicts the image class.
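The classification-token mechanism can be sketched as follows. The encoder is replaced by a hypothetical identity function `encoder` so that the example stays short, and the ten-class head, the sizes, and the random weights are assumptions, not details from any particular model.

```python
import numpy as np

rng = np.random.default_rng(0)
n_patches, dim, n_classes = 196, 64, 10   # illustrative sizes

patch_tokens = rng.normal(size=(n_patches, dim))
cls_token = rng.normal(0.0, 0.02, (1, dim))   # learned in a real model

# Prepend the classification token so it sits at index 0 of the sequence.
sequence = np.concatenate([cls_token, patch_tokens], axis=0)

def encoder(x):
    """Hypothetical stand-in for the transformer encoder (identity here)."""
    return x

encoded = encoder(sequence)

# The head reads only the final embedding of the classification token.
W_head = rng.normal(0.0, 0.02, (dim, n_classes))
b_head = np.zeros(n_classes)
logits = encoded[0] @ W_head + b_head
predicted_class = int(np.argmax(logits))
print(sequence.shape, logits.shape, predicted_class)
```

Because the classification token takes part in attention like any other token, a real encoder lets it gather information from every patch before the head reads it.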
Trade-offs
ViTs lack the built-in locality and translation-equivariance biases of CNNs, so they typically need large datasets or strong augmentation to train well. Given enough data, they match or beat convolutional networks, which is why patch-based attention now sits at the core of many vision systems.
Key idea
A vision transformer splits an image into patch tokens with position embeddings and uses self-attention for global reasoning, but it needs large datasets or strong augmentation to perform well.