Layers of memory
A GPU does not have one kind of memory. It has a hierarchy where each level trades capacity for speed. Knowing this hierarchy is the key to writing fast kernels.
The levels
From fastest and smallest to slowest and largest:
- Registers are private to each thread and are the fastest storage.
- Shared memory is on-chip and shared by the threads of one block, which makes it ideal for cooperative tiling.
- L2 cache is on-chip and shared across the whole device; it sits between the on-chip memories above and DRAM.
- Global memory is the large off-chip DRAM with the highest latency.
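Each step down adds latency, often by a large factor: a value already in a register is available almost immediately, while a global memory load that misses the caches typically costs on the order of hundreds of clock cycles. A value kept in registers therefore costs far less than one fetched from global memory.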
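The capacity side of this trade can be read off a real device. The following is a minimal sketch, assuming a CUDA-capable device at index 0; the `MemoryLevels` struct and the two helper functions are hypothetical names introduced here for illustration, while `cudaGetDeviceProperties` and the `cudaDeviceProp` fields are part of the CUDA runtime. Note that `regsPerBlock` counts 32-bit registers available to a whole block, so a single thread only ever sees a small slice of that figure.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper struct: capacity of each level in bytes.
struct MemoryLevels {
    size_t registers_per_block_bytes;  // 32-bit registers available to one block
    size_t shared_per_block_bytes;     // shared memory available to one block
    size_t l2_bytes;                   // device-wide L2 cache
    size_t global_bytes;               // off-chip DRAM
};

// Fills `out` from the runtime's device properties; returns false on error.
bool query_memory_levels(int device, MemoryLevels* out) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return false;
    out->registers_per_block_bytes = static_cast<size_t>(prop.regsPerBlock) * 4;
    out->shared_per_block_bytes = prop.sharedMemPerBlock;
    out->l2_bytes = static_cast<size_t>(prop.l2CacheSize);
    out->global_bytes = prop.totalGlobalMem;
    return true;
}

// Prints the levels from fastest to slowest.
void print_memory_levels(const MemoryLevels& m) {
    printf("registers per block: %zu bytes\n", m.registers_per_block_bytes);
    printf("shared per block:    %zu bytes\n", m.shared_per_block_bytes);
    printf("L2 cache:            %zu bytes\n", m.l2_bytes);
    printf("global memory:       %zu bytes\n", m.global_bytes);
}
```

Calling `query_memory_levels(0, &m)` followed by `print_memory_levels(m)` from a host program shows the gap between the on-chip levels and global memory on whatever device is installed. The exact numbers vary by GPU generation.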
Data flow during a kernel
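A well-written kernel stages data through the levels: it loads a block of input from global memory into shared memory, pulls operands from shared memory into registers for the actual computation, and writes only the final results back to global memory.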
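One way to picture this staging is a per-block sum. The sketch below is illustrative rather than tuned: the kernel name `block_sum`, the host wrapper `sum_on_gpu`, and the block size of 256 are assumptions made for this example, and error checking on the runtime calls is left out for brevity. Each input value travels from global memory to a register, then to shared memory, and only one partial result per block goes back to global memory.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int BLOCK = 256;  // threads per block (assumed; must be a power of two)

__global__ void block_sum(const float* in, float* block_sums, int n) {
    __shared__ float tile[BLOCK];                // on-chip, visible to the whole block
    int i = blockIdx.x * BLOCK + threadIdx.x;
    float v = (i < n) ? in[i] : 0.0f;            // global memory -> register
    tile[threadIdx.x] = v;                       // register -> shared memory
    __syncthreads();                             // wait until the tile is fully loaded

    // Tree reduction performed entirely in shared memory.
    for (int stride = BLOCK / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        block_sums[blockIdx.x] = tile[0];        // one global write per block
}

// Hypothetical host wrapper: sums a vector using the kernel above.
// Error checking is omitted to keep the sketch short.
float sum_on_gpu(const std::vector<float>& host) {
    int n = static_cast<int>(host.size());
    if (n == 0) return 0.0f;
    int blocks = (n + BLOCK - 1) / BLOCK;
    float* d_in = nullptr;
    float* d_sums = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_sums, blocks * sizeof(float));
    cudaMemcpy(d_in, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    block_sum<<<blocks, BLOCK>>>(d_in, d_sums, n);
    std::vector<float> sums(blocks);
    cudaMemcpy(sums.data(), d_sums, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_sums);
    float total = 0.0f;
    for (float s : sums) total += s;             // finish the small remainder on the host
    return total;
}
```

The reduction loop touches only shared memory, so the slow level is read once per element and written once per block.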
The optimization rule
The whole game is reuse: load a tile from global memory once into shared memory, then have many threads read it repeatedly. Maximizing reuse at the fast levels is how kernels approach peak performance.
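Tiled matrix multiplication is a common illustration of this rule. The following is a minimal sketch, not an optimized implementation: it assumes square row-major matrices, and the names `matmul_tiled`, `matmul_on_gpu`, and the tile width of 16 are choices made for this example. Each value loaded into a shared tile is read by 16 different threads, so global memory traffic drops by roughly that factor compared with every thread fetching its own operands.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int TILE = 16;  // tile width (assumed); each block is TILE x TILE threads

// Computes C = A * B for square n x n row-major matrices.
__global__ void matmul_tiled(const float* A, const float* B, float* C, int n) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;  // running dot product, kept in a register

    for (int t = 0; t < (n + TILE - 1) / TILE; ++t) {
        int a_col = t * TILE + threadIdx.x;
        int b_row = t * TILE + threadIdx.y;
        // Each thread loads one element of each tile from global memory, once.
        As[threadIdx.y][threadIdx.x] = (row < n && a_col < n) ? A[row * n + a_col] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (b_row < n && col < n) ? B[b_row * n + col] : 0.0f;
        __syncthreads();

        // Every loaded element is now reused by TILE threads from shared memory.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // do not overwrite tiles while others still read them
    }
    if (row < n && col < n) C[row * n + col] = acc;
}

// Hypothetical host wrapper; error checking omitted to keep the sketch short.
std::vector<float> matmul_on_gpu(const std::vector<float>& A,
                                 const std::vector<float>& B, int n) {
    size_t bytes = static_cast<size_t>(n) * n * sizeof(float);
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, A.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), bytes, cudaMemcpyHostToDevice);
    dim3 block(TILE, TILE);
    dim3 grid((n + TILE - 1) / TILE, (n + TILE - 1) / TILE);
    matmul_tiled<<<grid, block>>>(dA, dB, dC, n);
    std::vector<float> C(static_cast<size_t>(n) * n);
    cudaMemcpy(C.data(), dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return C;
}
```

A larger tile raises the reuse factor but consumes more shared memory per block, so the best width depends on the device.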
Key idea
GPU memory forms a speed versus size hierarchy, and fast kernels stage data into shared memory and registers to reuse it instead of repeatedly hitting slow global memory.