The GPU Memory Hierarchy

Layers of memory

A GPU does not have one kind of memory. It has a hierarchy where each level trades capacity for speed. Knowing this hierarchy is the key to writing fast kernels.

The levels

From fastest and smallest to slowest and largest:

Registers are private to each thread and are the fastest storage.
Shared memory is on-chip and shared by the threads of one thread block, which makes it ideal for cooperative tiling.
The L2 cache is shared across the whole device and sits in front of global memory.
Global memory is the large off-chip DRAM and has the highest latency.

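Each step down adds latency. As a rough guide, register operands are available immediately, a shared memory access costs on the order of tens of cycles, and a global memory access that misses the caches costs on the order of hundreds of cycles; exact figures vary by architecture. A value kept in registers therefore costs far less than one fetched from global memory.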
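To make the levels concrete, here is a minimal sketch of where each one shows up in source code, assuming CUDA C++. The kernel name reverse_chunks, the helper reverse_chunks_ok, and the block size of 256 are illustrative choices, not something the text prescribes. Error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int BLOCK = 256;  // threads per block (assumed); n must be a multiple

// Reverses each BLOCK-sized chunk of `in` into `out`.
__global__ void reverse_chunks(const float* in, float* out) {
    // Shared memory: on-chip, one copy per thread block.
    __shared__ float tile[BLOCK];

    // Registers: ordinary local variables, private to each thread.
    int local = threadIdx.x;
    int global = blockIdx.x * BLOCK + local;

    // Global memory: reached through the pointers passed to the kernel.
    // The L2 cache sits in front of it and is managed by hardware, so it
    // never appears in the source.
    tile[local] = in[global];
    __syncthreads();  // make every thread's write to `tile` visible

    out[global] = tile[BLOCK - 1 - local];
}

// Hypothetical host helper: runs the kernel on 0..n-1 and checks the result.
// CUDA error checking is left out to keep the sketch short.
bool reverse_chunks_ok(int n) {
    std::vector<float> h_in(n), h_out(n);
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i);

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    reverse_chunks<<<n / BLOCK, BLOCK>>>(d_in, d_out);
    cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);

    for (int i = 0; i < n; ++i) {
        int base = (i / BLOCK) * BLOCK;
        if (h_out[i] != h_in[base + BLOCK - 1 - (i - base)]) return false;
    }
    return true;
}

int main() {
    std::printf("%s\n", reverse_chunks_ok(4 * BLOCK) ? "ok" : "mismatch");
    return 0;
}
```

Only registers, shared memory, and global memory are named by the programmer here; the L2 cache works transparently on every global access.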

Data flow during a kernel

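A well-written kernel stages data through the levels. Threads load a block of input from global memory into shared memory, synchronize, do their arithmetic on values held in shared memory and registers, and write only the final results back to global memory.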
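One way to picture this staging is a per-block sum, sketched below in CUDA C++. The names block_sums and total_via_block_sums, the block size, and the all-ones test input are assumptions made for illustration. Error checking is again omitted.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int BLOCK = 256;  // must be a power of two for the halving loop

// Writes the sum of each BLOCK-sized chunk of `in` to out[blockIdx.x].
__global__ void block_sums(const float* in, float* out, int n) {
    __shared__ float partial[BLOCK];
    int i = blockIdx.x * BLOCK + threadIdx.x;

    // Stage 1: global memory -> register (out-of-range threads contribute 0).
    float x = (i < n) ? in[i] : 0.0f;

    // Stage 2: register -> shared memory, so other threads can see the value.
    partial[threadIdx.x] = x;
    __syncthreads();

    // Stage 3: combine in shared memory, halving the active threads each step.
    for (int stride = BLOCK / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();  // reached by every thread in the block
    }

    // Stage 4: a single result per block goes back to global memory.
    if (threadIdx.x == 0) out[blockIdx.x] = partial[0];
}

// Hypothetical host helper: sums n ones on the GPU, then adds the per-block
// results on the CPU. CUDA error checking is omitted.
float total_via_block_sums(int n) {
    int blocks = (n + BLOCK - 1) / BLOCK;
    std::vector<float> h_in(n, 1.0f), h_out(blocks);

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, blocks * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    block_sums<<<blocks, BLOCK>>>(d_in, d_out, n);
    cudaMemcpy(h_out.data(), d_out, blocks * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);

    float total = 0.0f;
    for (float s : h_out) total += s;
    return total;
}

int main() {
    std::printf("total = %.1f\n", total_via_block_sums(1000));
    return 0;
}
```

Each input element crosses the global memory boundary once on the way in, and only one value per block crosses it on the way out; all intermediate traffic stays on-chip.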

The optimization rule

The whole game is reuse: load a tile from global memory once into shared memory, then have many threads read it repeatedly. Maximizing reuse at the fast levels is how kernels approach peak performance.
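The reuse rule can be sketched with a tiled matrix multiply, a common illustration of this pattern. This is a simplified CUDA C++ version under stated assumptions: square row-major matrices, a size that is a multiple of the tile width, and a tile width of 16. The names matmul_tiled and matmul_tiled_ok are hypothetical, and error checking is omitted.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int TILE = 16;  // tile width (assumed); N must be a multiple of it

// C = A * B for square N x N row-major matrices.
__global__ void matmul_tiled(const float* A, const float* B, float* C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;  // the accumulator stays in a register

    for (int t = 0; t < N / TILE; ++t) {
        // Each thread loads one element of each tile from global memory, once.
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();

        // Every loaded element is then read by TILE different threads
        // from shared memory instead of from global memory.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // finish reading before the next tile overwrites
    }
    C[row * N + col] = acc;
}

// Hypothetical host helper: compares the kernel against a CPU reference.
// Inputs are small integers so float sums are exact. Error checks omitted.
bool matmul_tiled_ok(int N) {
    std::vector<float> A(N * N), B(N * N), C(N * N);
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) {
            A[i * N + j] = static_cast<float>((i + j) % 3);
            B[i * N + j] = static_cast<float>((2 * i + j) % 5);
        }

    size_t bytes = static_cast<size_t>(N) * N * sizeof(float);
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, A.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE), grid(N / TILE, N / TILE);
    matmul_tiled<<<grid, block>>>(dA, dB, dC, N);
    cudaMemcpy(C.data(), dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);

    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) {
            float ref = 0.0f;
            for (int k = 0; k < N; ++k) ref += A[i * N + k] * B[k * N + j];
            if (C[i * N + j] != ref) return false;
        }
    return true;
}

int main() {
    std::printf("%s\n", matmul_tiled_ok(64) ? "ok" : "mismatch");
    return 0;
}
```

In a naive kernel each thread would read 2N values from global memory; here each thread issues 2N / TILE global loads, so a larger tile means more reuse, up to the limit of available shared memory.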

Key idea