← Lessons

quiz vs the machine

Silver1060

Machine Learning

The GPU Architecture for ML

Why thousands of simple cores make GPUs the workhorse of deep learning.

4 min read · intro · beat Silver to climb

A different kind of processor

A CPU has a handful of powerful cores tuned for low latency on one task at a time. A GPU takes the opposite bet: thousands of simpler cores running the same instruction across many data elements. Deep learning is full of large matrix multiplications, so this massively parallel design fits perfectly.

Streaming multiprocessors

GPU cores are grouped into streaming multiprocessors (SMs). Each SM runs many threads in lockstep groups called warps, typically 32 threads wide.

  • Threads in a warp execute the same instruction on different data.
  • The hardware hides memory latency by swapping in another ready warp.
  • This SIMT model keeps the arithmetic units busy.

The execution flow

Work is launched as a grid of thread blocks, mapped onto SMs by the scheduler.

Why it matters for ML

Training and inference repeat the same operation over huge tensors. The GPU turns that regularity into throughput, often delivering an order of magnitude more useful work per second than a CPU on these workloads.

Key idea

A GPU trades single thread speed for thousands of parallel cores grouped into SMs, which matches the dense parallel math of deep learning.

Check yourself

Answer to earn rating on the learn ladder.

1. What design choice defines a GPU compared to a CPU?

2. What is a warp?