The GPU Architecture for ML

A different kind of processor

A CPU has a handful of powerful cores, each tuned to finish a single thread of work as quickly as possible. A GPU takes the opposite bet: thousands of simpler cores running the same instruction across many data elements. Deep learning is dominated by large matrix multiplications, in which every output element can be computed independently of the others, so this massively parallel design suits the workload well.
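A small sketch in plain Python can show why matrix multiplication parallelizes so well. It is an illustration only, not how a GPU library computes the product: the helper names `output_element` and `matmul_any_order` are made up for this example. Each output element is its own dot product, so the elements can be filled in any order (here, a shuffled one) and the result is unchanged.

```python
import random


def output_element(A, B, i, j):
    """Dot product of row i of A with column j of B: one independent unit of work."""
    return sum(A[i][k] * B[k][j] for k in range(len(B)))


def matmul_any_order(A, B, seed=0):
    """Compute C = A @ B, visiting output positions in a shuffled order.

    The shuffle stands in for "whichever core gets there first": no output
    element depends on any other, so the order does not matter.
    """
    rows, cols = len(A), len(B[0])
    coords = [(i, j) for i in range(rows) for j in range(cols)]
    random.Random(seed).shuffle(coords)
    C = [[0] * cols for _ in range(rows)]
    for i, j in coords:
        C[i][j] = output_element(A, B, i, j)
    return C


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = matmul_any_order(A, B)
print(C)
```

On a GPU, each of these independent dot products (or a tile of them) would be handled by a different thread rather than by a loop.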

Streaming multiprocessors

GPU cores are grouped into streaming multiprocessors (SMs). Each SM schedules its threads in lockstep groups called warps; on NVIDIA GPUs a warp is 32 threads wide.

Threads in a warp execute the same instruction on different data.
When a warp stalls waiting on memory, the SM switches to another warp that is ready, which hides the memory latency.
This single instruction, multiple threads (SIMT) model keeps the arithmetic units busy.
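The latency-hiding idea can be made concrete with a toy simulation in plain Python. This is a sketch under strong simplifying assumptions, not a model of real hardware: the SM issues at most one warp instruction per cycle, and every instruction is followed by a fixed memory wait. The names `split_into_warps` and `simulate_sm` and the parameters `instrs_per_warp` and `mem_latency` are invented for illustration.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def split_into_warps(num_threads, warp_size=WARP_SIZE):
    """Group consecutive thread IDs into warps; the last warp may be partial."""
    return [list(range(start, min(start + warp_size, num_threads)))
            for start in range(0, num_threads, warp_size)]


def simulate_sm(num_warps, instrs_per_warp=100, mem_latency=10):
    """Return the fraction of cycles in which the SM issued an instruction.

    Toy assumptions: one issue slot per cycle, and after a warp issues it
    must wait mem_latency cycles before it is ready again.
    """
    ready_at = [0] * num_warps                 # cycle when each warp is ready
    remaining = [instrs_per_warp] * num_warps  # instructions left per warp
    cycle = 0
    busy_cycles = 0
    while any(remaining):
        # Issue from the first warp that is ready and still has work.
        for w in range(num_warps):
            if remaining[w] and ready_at[w] <= cycle:
                remaining[w] -= 1
                ready_at[w] = cycle + 1 + mem_latency
                busy_cycles += 1
                break
        cycle += 1
    return busy_cycles / cycle


warps = split_into_warps(70)
print([len(w) for w in warps])
print(simulate_sm(num_warps=1))
print(simulate_sm(num_warps=11))
```

In this toy model a single warp leaves the issue slot idle for most cycles, while 11 warps (one more than the 10-cycle wait) give the scheduler a ready warp on every cycle.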

The execution flow

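Work is launched as a grid of thread blocks. The hardware scheduler assigns each block to an SM, where it stays until it finishes; blocks that do not fit right away wait until an SM has free resources.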
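The launch hierarchy can be emulated sequentially in plain Python. This is a hedged sketch, not real GPU code: `launch`, `vector_add`, and `assign_blocks_to_sms` are hypothetical helpers, and the round-robin placement is an assumption made for illustration (a real scheduler dispatches blocks as SM resources become free). What it does show accurately is how a thread derives its global index from its block index, block size, and thread index.

```python
def launch(grid_dim, block_dim, kernel, *args):
    """Emulate a 1D kernel launch by calling the kernel once per thread."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, block_dim, thread_idx, *args)


def vector_add(block_idx, block_dim, thread_idx, a, b, out):
    """Per-thread work: each thread handles one element."""
    i = block_idx * block_dim + thread_idx  # global thread index
    if i < len(out):                        # the grid may overshoot the data
        out[i] = a[i] + b[i]


def assign_blocks_to_sms(grid_dim, num_sms):
    """Hypothetical round-robin placement of blocks onto SMs."""
    placement = {sm: [] for sm in range(num_sms)}
    for block_idx in range(grid_dim):
        placement[block_idx % num_sms].append(block_idx)
    return placement


n = 1000
block_dim = 256
grid_dim = (n + block_dim - 1) // block_dim  # round up so every element is covered

a = list(range(n))
b = [2 * x for x in a]
out = [0] * n
launch(grid_dim, block_dim, vector_add, a, b, out)

print(grid_dim)
print(assign_blocks_to_sms(grid_dim, num_sms=2))
```

The bounds check in `vector_add` matters because the grid is sized in whole blocks: here the launch creates more threads than there are elements, and the extra threads must do nothing.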

Why it matters for ML

Training and inference repeat the same operation over huge tensors. The GPU turns that regularity into throughput, often delivering an order of magnitude more useful work per second than a CPU on these workloads.

Key idea

A GPU trades single-thread speed for thousands of parallel cores grouped into SMs, a design that matches the dense, parallel math of deep learning.

