A different kind of processor
A CPU has a handful of powerful cores tuned for low latency on one task at a time. A GPU takes the opposite bet: thousands of simpler cores running the same instruction across many data elements. Deep learning is full of large matrix multiplications, so this massively parallel design fits perfectly.
Streaming multiprocessors
GPU cores are grouped into streaming multiprocessors (SMs). Each SM runs many threads in lockstep groups called warps, typically 32 threads wide.
- Threads in a warp execute the same instruction on different data.
- The hardware hides memory latency by swapping in another ready warp.
- This SIMT model keeps the arithmetic units busy.
The execution flow
Work is launched as a grid of thread blocks, mapped onto SMs by the scheduler.
Why it matters for ML
Training and inference repeat the same operation over huge tensors. The GPU turns that regularity into throughput, often delivering an order of magnitude more useful work per second than a CPU on these workloads.
Key idea
A GPU trades single thread speed for thousands of parallel cores grouped into SMs, which matches the dense parallel math of deep learning.