Machine Learning

The Vision Transformer Deep

Treating image patches as tokens for a pure transformer.

6 min read · advanced

Images as sequences

The vision transformer drops convolution and treats an image as a sequence of tokens. It splits the image into fixed-size patches, flattens each patch into a vector, and linearly projects that vector to an embedding, which plays the same role as a word embedding in a text transformer.
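The patch step can be sketched in a few lines. This is a minimal illustration in plain Python lists, not a reference implementation: the function names `patchify` and `project`, the 4 x 4 single-channel image, and the hand-written `weight` matrix are all assumptions made for the example. In a real model the projection weights are learned.

```python
def patchify(image, patch):
    """Split an H x W x C image (nested lists) into flattened patch vectors."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be divisible by the patch size")
    tokens = []
    for top in range(0, height, patch):        # patches in row-major order
        for left in range(0, width, patch):
            flat = []
            for r in range(top, top + patch):
                for c in range(left, left + patch):
                    flat.extend(image[r][c])   # append every channel value
            tokens.append(flat)
    return tokens


def project(tokens, weight):
    """Linear projection: one dot product per output dimension, per patch."""
    return [[sum(x * w for x, w in zip(tok, row)) for row in weight]
            for tok in tokens]


# Toy 4 x 4 image with one channel; pixel value = row * 4 + column.
image = [[[r * 4 + c] for c in range(4)] for r in range(4)]
tokens = patchify(image, patch=2)

# Stand-in for learned weights: output 0 copies the first pixel,
# output 1 averages the four pixels of the patch.
weight = [[1.0, 0.0, 0.0, 0.0], [0.25, 0.25, 0.25, 0.25]]
embeddings = project(tokens, weight)

print(len(tokens), tokens[0], embeddings[0])
# prints: 4 [0, 1, 4, 5] [0.0, 2.5]
```

A 4 x 4 image with 2 x 2 patches yields (4 / 2) * (4 / 2) = 4 tokens, so the sequence length is set by the image size and the patch size alone.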

Position and class tokens

Transformers have no built-in sense of order, so a position embedding is added to each patch embedding to encode where the patch sits in the image. A special learnable class token is prepended to the sequence, and its final state summarizes the image for classification.
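A small sketch of how the input sequence might be assembled. The function name `add_class_and_position` and all numeric values are made up for illustration, and the sketch assumes the class token gets its own position slot, so there is one position embedding per token in the sequence. In a trained model both the class token and the position embeddings are learned parameters.

```python
def add_class_and_position(patch_embeddings, class_token, position_embeddings):
    """Prepend the class token, then add one position embedding per slot."""
    sequence = [class_token] + patch_embeddings
    if len(position_embeddings) != len(sequence):
        raise ValueError("need one position embedding per token, class token included")
    return [[x + p for x, p in zip(tok, pos)]
            for tok, pos in zip(sequence, position_embeddings)]


# Two patches with identical content at different image locations.
patch_embeddings = [[1.0, 2.0], [1.0, 2.0]]
class_token = [0.0, 0.0]                                  # illustrative values
position_embeddings = [[0.0, 0.0], [0.1, 0.0], [0.2, 0.0]]

sequence = add_class_and_position(patch_embeddings, class_token, position_embeddings)

print(len(sequence))               # class token + 2 patches
print(sequence[1] != sequence[2])  # identical patches are now distinguishable
```

Before the position embeddings are added, the two patch vectors are identical and the encoder could not tell them apart. Afterwards each carries its location.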

Self-attention over patches

The encoder is a standard transformer. Every patch attends to every other patch, so the model can relate distant regions in a single layer, unlike a convolutional network, whose receptive field grows only gradually with depth.
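The all-pairs behaviour can be seen in a stripped-down sketch of single-head scaled dot-product self-attention. To stay short it uses the token vectors directly as queries, keys, and values, which is a simplifying assumption. A real encoder layer applies learned query, key, and value projections, uses several heads, and wraps attention with residual connections, normalization, and an MLP. The token values are arbitrary.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head attention with identity Q/K/V projections (simplification)."""
    dim = len(tokens[0])
    outputs, weights = [], []
    for query in tokens:
        # Score this token against every token in the sequence, itself included.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        row = softmax(scores)
        weights.append(row)
        # Output is the attention-weighted average of all token vectors.
        outputs.append([sum(w * value[d] for w, value in zip(row, tokens))
                        for d in range(dim)])
    return outputs, weights


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
outputs, weights = self_attention(tokens)

for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` has one strictly positive entry per token and sums to 1, so every output mixes information from the whole sequence after a single layer, however far apart the patches are in the image.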

The data appetite

Lacking the built-in locality bias of convolutions, vision transformers need large training sets or strong pretraining to match convolutional networks. Given enough scale they match or exceed them, and hybrid or distilled variants reduce the data requirement.

Why it matters

A single architecture now spans text and images, simplifying multimodal systems and enabling shared pretraining across domains.

Key idea

A vision transformer splits an image into patch tokens, adds position embeddings, and runs a standard self-attention encoder, gaining global context at the cost of needing large datasets or pretraining.

Check yourself

1. What are the tokens in a vision transformer?

2. Why are position embeddings needed?

3. Why do vision transformers often need more data than convolutional networks?