Machine Learning

Vision Transformers

Applying the transformer to images by treating patches as tokens.

What it is

A vision transformer, or ViT, applies the transformer architecture to images. Instead of using convolutions, it cuts an image into a grid of fixed-size patches, flattens each patch into a vector, and treats the resulting vectors as a sequence of tokens.
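The patch step can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name patchify, the 224 by 224 RGB input, and the patch size of 16 are assumptions chosen for the example.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    rows, cols = h // patch_size, w // patch_size
    # (rows, P, cols, P, C) -> (rows, cols, P, P, C) -> (rows * cols, P * P * C)
    patches = image.reshape(rows, patch_size, cols, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(rows * cols, patch_size * patch_size * c)

# Hypothetical input: a 224 x 224 RGB image cut into 16 x 16 patches.
image = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)
tokens = patchify(image, patch_size=16)
print(tokens.shape)  # (196, 768): a 14 x 14 grid, each patch 16 * 16 * 3 values
```

Each row of tokens is one patch, read left to right and top to bottom, so the image becomes a sequence of 196 vectors.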

The pipeline

  • Patch embedding: split the image into patches, then linearly project each flattened patch into an embedding vector.
  • Position embeddings: add a learned position vector to each token, since attention has no built-in sense of where a patch sits.
  • Transformer encoder: stacked self-attention layers let every patch attend to every other patch, globally, from the first layer.
  • Class token or pooling: a special learned token, or a pooled summary of the patch outputs, feeds the classifier head.
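The four steps above can be traced end to end in a rough NumPy sketch. Everything here is an assumption made for illustration: the sizes, the random weights standing in for learned parameters, and the single-head self_attention helper. A real encoder also uses multiple heads, layer normalization, and an MLP block in every layer, all omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention over a (tokens, dim) sequence."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)  # row i: how token i attends to all tokens
    return weights @ v, weights

# Illustrative sizes (assumptions, not values from the lesson).
image_size, patch_size, channels, dim, num_classes = 32, 8, 3, 64, 10
grid = image_size // patch_size
num_patches = grid * grid  # 16
image = rng.standard_normal((image_size, image_size, channels))

# 1. Patch embedding: cut into patches, flatten, project linearly.
patches = image.reshape(grid, patch_size, grid, patch_size, channels)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(num_patches, -1)
w_embed = rng.standard_normal((patches.shape[1], dim)) * 0.02
tokens = patches @ w_embed  # (16, 64)

# 2. Prepend a class token, then add position embeddings (random stand-ins).
cls_token = np.zeros((1, dim))
tokens = np.concatenate([cls_token, tokens], axis=0)  # (17, 64)
pos_embed = rng.standard_normal((num_patches + 1, dim)) * 0.02
tokens = tokens + pos_embed

# 3. One self-attention layer with a residual connection.
w_q, w_k, w_v = (rng.standard_normal((dim, dim)) * 0.02 for _ in range(3))
attended, weights = self_attention(tokens, w_q, w_k, w_v)
tokens = tokens + attended

# 4. Classifier head reads the class token at position 0.
w_head = rng.standard_normal((dim, num_classes)) * 0.02
logits = tokens[0] @ w_head

print(tokens.shape, weights.shape, logits.shape)  # (17, 64) (17, 17) (10,)
```

The attention weights form a 17 by 17 matrix with no zero entries, which is the "every patch attends to every other patch" property in concrete form.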

Why it matters

Self-attention gives every patch a global receptive field in the very first layer, while a convolution sees only a local window and must stack layers to widen its view.
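A little arithmetic makes the contrast concrete. The sketch below uses the standard receptive field recurrence for a plain stack of identical convolutions; the helper names, the 3 by 3 kernel, and the 224 pixel width are assumptions for illustration. Real convolutional networks use strides and pooling to grow their view much faster than the stride 1 case.

```python
def conv_receptive_field(num_layers, kernel_size=3, stride=1):
    """Receptive field width (in input pixels) after stacking identical convs."""
    rf, jump = 1, 1
    for _ in range(num_layers):
        rf += (kernel_size - 1) * jump  # each layer widens the view
        jump *= stride                  # strides make later layers widen it faster
    return rf

def layers_to_cover(width, kernel_size=3, stride=1):
    """Smallest number of layers whose receptive field spans `width` pixels."""
    n = 0
    while conv_receptive_field(n, kernel_size, stride) < width:
        n += 1
    return n

print(conv_receptive_field(3))        # 7
print(layers_to_cover(224))           # 112 layers of 3x3, stride 1
print(layers_to_cover(224, stride=2)) # 7 layers of 3x3, stride 2
```

A single self-attention layer needs no such stacking, because each token's output already mixes information from the whole sequence.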

  • ViTs scale well with model and dataset size, and tend to match or beat comparable convolutional networks when pretrained on enough data.
  • On small datasets they lack the built-in locality bias of convolutions, so they need strong augmentation, pretraining, or extra regularization to compete.

Key idea

A vision transformer treats image patches as tokens, so global self-attention replaces convolution. It scales strongly with data but needs more of it.

Check yourself

1. How does a vision transformer turn an image into a sequence?

2. Why do vision transformers often need more data than convolutional networks?