A grid of threads
A GPU kernel is a function launched across a large grid of threads. Each thread is identified by an index, which it uses to pick its slice of the data. The programming model is single instruction, multiple threads (SIMT): every thread runs the same code on different data.
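To make the index-to-data mapping concrete, here is a minimal CPU-side sketch in Python. It is a model, not GPU code: `launch`, `global_index`, and `scale_kernel` are hypothetical names chosen for illustration, the grid is assumed to be one-dimensional, and the threads run one after another rather than in parallel.

```python
def global_index(block_idx, block_dim, thread_idx):
    """Flat index of a thread in a 1-D grid of equally sized blocks."""
    return block_idx * block_dim + thread_idx


def launch(kernel, grid_dim, block_dim, *args):
    """Call `kernel` once per thread. A real GPU runs these in parallel;
    this model runs them sequentially."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, block_dim, thread_idx, *args)


def scale_kernel(block_idx, block_dim, thread_idx, data, out, factor):
    """Same code for every thread; only the index differs."""
    i = global_index(block_idx, block_dim, thread_idx)
    if i < len(data):  # guard: the grid may be larger than the data
        out[i] = data[i] * factor


data = list(range(10))
out = [0] * len(data)
block_dim = 4
# Round up so every element is covered by some thread.
grid_dim = (len(data) + block_dim - 1) // block_dim
launch(scale_kernel, grid_dim, block_dim, data, out, 2)
```

The bounds check matters because the grid is sized in whole blocks: with 10 elements and 4 threads per block, 12 threads are launched and the last two must do nothing.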
Warps, blocks, and grids
Threads are organized in a hierarchy:
- A warp is a small group of threads (32 on NVIDIA GPUs) that execute the same instruction in lockstep.
- A block holds one or more warps whose threads share a fast on-chip shared memory.
- A grid holds all the blocks of one kernel launch.
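The hierarchy can be read as a decomposition of a thread's flat index. The sketch below is an illustrative model in Python, assuming a one-dimensional grid and a warp width of 32 (the value on NVIDIA hardware; other GPUs may differ). `locate` is a hypothetical helper, not part of any GPU API.

```python
WARP_SIZE = 32  # assumption: warp width on NVIDIA GPUs


def locate(global_idx, block_dim, warp_size=WARP_SIZE):
    """Split a flat thread index into (block, warp within block, lane within warp)."""
    block, thread_in_block = divmod(global_idx, block_dim)
    warp_in_block, lane = divmod(thread_in_block, warp_size)
    return block, warp_in_block, lane


# Thread 300 in a launch with 128 threads per block:
# block 2, thread 44 of that block, which is warp 1, lane 12.
position = locate(300, 128)
```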
Threads in a block can cooperate through shared memory and a barrier that makes them wait for each other; threads in different blocks generally cannot.
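As a rough illustration of block-level cooperation, the following Python sketch sums a list with one CPU thread per element. A plain list stands in for shared memory and `threading.Barrier` stands in for the block-wide barrier. `block_sum` is a hypothetical name, and the input length is assumed to be a power of two; this models the pattern only and says nothing about GPU performance.

```python
import threading


def block_sum(values):
    """Tree reduction inside one simulated block.
    Assumes len(values) is a power of two."""
    n = len(values)
    shared = list(values)           # stands in for the block's shared memory
    barrier = threading.Barrier(n)  # stands in for the block-wide barrier

    def thread_body(tid):
        stride = n // 2
        while stride > 0:
            if tid < stride:
                # Reads come from the upper half, writes go to the lower half,
                # so threads do not conflict within a round.
                shared[tid] += shared[tid + stride]
            # Every thread waits here, so no one starts the next round
            # before all partial sums of this round are written.
            barrier.wait()
            stride //= 2

    threads = [threading.Thread(target=thread_body, args=(t,)) for t in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared[0]


total = block_sum([1, 2, 3, 4, 5, 6, 7, 8])
```

Note that every thread reaches the barrier in every round, including threads that have no work left. A barrier that only some threads reach would leave the others waiting forever.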
Two performance killers
Divergence happens when threads in a warp take different branches: the warp executes each path in turn while the lanes not on that path sit idle, wasting work. Uncoalesced memory access happens when neighboring threads read scattered addresses instead of contiguous ones, so a single load from the warp turns into many memory transactions. Fast kernels keep warps converged and accesses contiguous.
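Both costs can be approximated with a simple counting model. The Python sketch below is deliberately simplified: `warp_passes` assumes a warp needs one serialized pass per distinct branch its lanes take, and `transactions` assumes memory is fetched in aligned 128-byte segments. The function names, the 32-lane warp, and the segment size are assumptions for illustration; real hardware details vary by architecture.

```python
WARP = 32  # assumption: lanes per warp


def warp_passes(lane_branches):
    """Serialized passes for one warp: one per distinct branch taken by its lanes."""
    return len(set(lane_branches))


def transactions(addresses, segment_bytes=128):
    """Distinct aligned segments touched by one warp-wide load."""
    return len({addr // segment_bytes for addr in addresses})


# Divergence: all lanes agree vs. lanes alternating between two branches.
converged = warp_passes([0] * WARP)
divergent = warp_passes([lane % 2 for lane in range(WARP)])

# Coalescing: 4-byte elements read contiguously vs. with a stride of 32 elements.
contiguous = transactions([4 * lane for lane in range(WARP)])
strided = transactions([4 * 32 * lane for lane in range(WARP)])
```

Under these assumptions the alternating branch doubles the number of passes, and the strided load touches a separate segment for every lane where the contiguous load touches only one.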
Key idea
A GPU kernel runs the same code across a grid of threads grouped into lockstep warps and cooperating blocks. Avoiding branch divergence and uncoalesced memory access is what unlocks the hardware's throughput.