The GPU Thread Model Basics
A GPU runs the same program across thousands of lightweight threads at once. Unlike a CPU, which has a few powerful cores, a GPU has many simple cores designed for massive data parallelism. To organize all those threads, the GPU uses a hierarchy.
Threads are grouped into blocks, and blocks form a grid. Threads within a block can share fast on-chip memory and synchronize with each other, while threads in different blocks generally cannot. The grid lets a single launch cover a huge problem by spreading blocks across the hardware.
- Thread: The smallest unit; runs one instance of the program.
- Block: A group of threads that share memory and can synchronize.
- Grid: All the blocks for one launch, mapped onto the device.
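The hierarchy above can be sketched on the CPU to show how each thread finds its own piece of the data. This is a minimal Python model, not GPU code: the names `blocks_needed`, `global_index`, `launch`, and `add_one` are hypothetical, the block size of 256 is an arbitrary choice, and the "launch" runs threads one after another rather than in parallel.

```python
def blocks_needed(n_items, threads_per_block):
    """Blocks required so every item gets a thread (ceiling division)."""
    return (n_items + threads_per_block - 1) // threads_per_block


def global_index(block_id, thread_id, threads_per_block):
    """Position of one thread within the whole grid."""
    return block_id * threads_per_block + thread_id


def launch(kernel, n_blocks, threads_per_block, *args):
    """Sequential stand-in for a launch: run the kernel once per thread."""
    for block_id in range(n_blocks):
        for thread_id in range(threads_per_block):
            kernel(block_id, thread_id, threads_per_block, *args)


def add_one(block_id, thread_id, threads_per_block, data):
    """Example kernel: each thread increments at most one element."""
    i = global_index(block_id, thread_id, threads_per_block)
    if i < len(data):  # the last block may have threads with no work
        data[i] += 1


data = list(range(1000))
n_blocks = blocks_needed(len(data), 256)
launch(add_one, n_blocks, 256, data)
```

Because 1000 is not a multiple of 256, the grid is rounded up to whole blocks, and the bounds check in the kernel keeps the spare threads in the last block from touching memory past the end of the data.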
Inside the hardware, threads execute in fixed-size groups called warps, typically 32 threads that run in lockstep. All threads in a warp execute the same instruction at the same time. If they take different branches, the warp executes each path in turn, with the threads that are not on the current path disabled. The resulting slowdown is called divergence.
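To make the cost of divergence concrete, the following Python sketch models one warp running a two-way branch. It is an illustration only: `run_warp` and the branch bodies are hypothetical, the warp size of 32 is assumed, and a "pass" here means one branch body issued for the whole warp, not a real hardware cycle count.

```python
WARP_SIZE = 32  # assumed warp width


def run_warp(values):
    """Model a warp executing: if v is even, v // 2, else 3 * v + 1.

    Returns (results, passes), where passes counts how many branch
    bodies the warp had to issue.
    """
    assert len(values) == WARP_SIZE
    takes_if = [v % 2 == 0 for v in values]  # per-lane branch decision
    results = list(values)
    passes = 0

    if any(takes_if):  # at least one lane needs the "if" path
        passes += 1
        for lane in range(WARP_SIZE):
            if takes_if[lane]:  # other lanes are masked off
                results[lane] = values[lane] // 2

    if not all(takes_if):  # at least one lane needs the "else" path
        passes += 1
        for lane in range(WARP_SIZE):
            if not takes_if[lane]:  # other lanes are masked off
                results[lane] = 3 * values[lane] + 1

    return results, passes


_, uniform_passes = run_warp([2 * i for i in range(WARP_SIZE)])  # all even
_, diverged_passes = run_warp(list(range(WARP_SIZE)))  # even and odd mixed
```

When every lane agrees, the warp issues a single branch body. When the lanes disagree, it issues both, even though each lane only keeps the result of one of them.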
Performance comes from keeping the hardware busy. Because memory latency is high, the GPU hides it by switching to other warps that are ready to run, so launching far more threads than there are cores is normal and desirable.
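A toy scheduler shows why oversubscription helps. This Python sketch assumes a single issue slot, warps that alternate between a few cycles of compute and one long memory wait, and illustrative numbers (4 compute cycles, 400 cycles of latency). The function `utilization` and all of its parameters are hypothetical, and real schedulers are considerably more complex.

```python
def utilization(n_warps, compute_cycles, memory_latency, total_cycles=10_000):
    """Fraction of cycles in which some warp was ready to issue work."""
    ready_at = [0] * n_warps  # cycle at which each warp can issue again
    remaining = [compute_cycles] * n_warps  # compute left before next load
    busy = 0
    for cycle in range(total_cycles):
        for w in range(n_warps):
            if ready_at[w] <= cycle:  # first ready warp gets the slot
                busy += 1
                remaining[w] -= 1
                if remaining[w] == 0:  # warp issues a load and stalls
                    ready_at[w] = cycle + 1 + memory_latency
                    remaining[w] = compute_cycles
                break  # one issue per cycle
    return busy / total_cycles


few = utilization(1, 4, 400)     # one warp: the slot is idle almost always
some = utilization(32, 4, 400)   # more warps: idle time shrinks
many = utilization(101, 4, 400)  # enough warps to cover the whole wait
```

In this model the issue slot stays fully occupied once there are about `memory_latency / compute_cycles + 1` warps, since by the time the last warp stalls the first one has its data back.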
Key idea
A GPU organizes thousands of threads into warps, blocks, and a grid, running warps in lockstep and hiding memory latency by switching among many ready warps.