Machine Learning

The GPU Memory Hierarchy

Registers, shared memory, and global memory, and why the gaps between them are huge.

Layers of memory

A GPU does not have one kind of memory. It has a hierarchy where each level trades capacity for speed. Knowing this hierarchy is the key to writing fast kernels.

The levels

From fastest and smallest to slowest and largest:

  • Registers are private to each thread and are the fastest storage.
  • Shared memory is on-chip and shared within a thread block, which makes it ideal for cooperative tiling.
  • L2 cache is shared across the whole device.
  • Global memory is the large off-chip DRAM with the highest latency.

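Each step down can be many times slower: a register read is effectively free, while a global memory access typically takes hundreds of clock cycles. A value kept in registers therefore costs far less than one fetched from global memory.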
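
To make the gap concrete, here is a rough cost model in Python. The latency figures are assumed round numbers chosen for illustration, not measurements of any particular GPU, and the model ignores the latency hiding that real GPUs get from running many threads at once. The names `ASSUMED_LATENCY_CYCLES` and `total_cycles` are hypothetical.

```python
# Illustrative latencies in clock cycles. These are ASSUMED round numbers,
# not measurements from any particular GPU.
ASSUMED_LATENCY_CYCLES = {
    "registers": 1,
    "shared": 30,
    "l2": 200,
    "global": 500,
}


def total_cycles(access_counts):
    """Sum latency over all accesses, ignoring overlap and latency hiding."""
    return sum(
        ASSUMED_LATENCY_CYCLES[level] * count
        for level, count in access_counts.items()
    )


# One value used 100 times, fetched from global memory on every use...
every_time = total_cycles({"global": 100})

# ...versus fetched from global memory once, then reused from a register.
fetch_once = total_cycles({"global": 1, "registers": 99})

print(every_time, fetch_once)
```

Under these assumed figures the second pattern is cheaper by roughly two orders of magnitude. The exact ratio depends entirely on the numbers plugged in, so treat it as a way to reason about the shape of the problem rather than a prediction.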

Data flow during a kernel

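A well-written kernel stages data through the levels: it loads a block of data from global memory into shared memory, works on it in registers, and writes the results back to global memory.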
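
The staging pattern can be sketched in plain Python, with ordinary lists standing in for the memory levels. This is an emulation under assumptions, not GPU code: the outer loop plays the role of thread blocks, the inner loop plays the role of threads, and the function name `staged_moving_sum` and the parameter `tile_size` are made up for this example. It computes a 3-point moving sum, treating out-of-range neighbors as zero.

```python
def staged_moving_sum(global_in, tile_size=4):
    """3-point moving sum, emulating how a kernel stages data per block."""
    n = len(global_in)
    global_out = [0] * n

    for start in range(0, n, tile_size):      # one iteration ~ one thread block
        end = min(start + tile_size, n)

        # Stage 1: global -> shared. Load the tile plus one extra
        # neighbor element on each side, once per block.
        lo = max(start - 1, 0)
        hi = min(end + 1, n)
        shared = global_in[lo:hi]

        for i in range(start, end):           # one iteration ~ one thread
            # Stage 2: shared -> registers (plain local variables here).
            j = i - lo
            left = shared[j - 1] if i > 0 else 0
            mid = shared[j]
            right = shared[j + 1] if i < n - 1 else 0
            acc = left + mid + right

            # Stage 3: registers -> global. Each result is written once.
            global_out[i] = acc

    return global_out


print(staged_moving_sum([1, 2, 3, 4, 5, 6]))
```

Each input element is needed by up to three neighboring threads, but within a block it is copied out of `global_in` only once; the repeated reads all hit the `shared` list.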

The optimization rule

The whole game is reuse: load a tile from global memory once into shared memory, then have many threads read it repeatedly. Maximizing reuse at the fast levels is how kernels approach peak performance.
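
One way to see the payoff of reuse is to count global memory reads for a matrix multiply done two ways. The sketch below is a pure-Python emulation, not a real kernel: `naive_matmul` and `tiled_matmul` are hypothetical names, the nested lists `sa` and `sb` stand in for shared memory, `acc` stands in for per-thread registers, and the matrix size is assumed to be a multiple of the tile size.

```python
def naive_matmul(a, b):
    """Every multiply-add reads both operands straight from 'global memory'."""
    n = len(a)
    reads = 0
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = 0
            for k in range(n):
                acc += a[i][k] * b[k][j]
                reads += 2                    # one read of a, one read of b
            c[i][j] = acc
    return c, reads


def tiled_matmul(a, b, tile):
    """Each block loads tiles into 'shared memory' once, then reuses them."""
    n = len(a)                                # assumes n % tile == 0
    reads = 0
    c = [[0] * n for _ in range(n)]
    for bi in range(0, n, tile):
        for bj in range(0, n, tile):          # one (bi, bj) pair ~ one thread block
            # Per-thread accumulators, standing in for registers.
            acc = [[0] * tile for _ in range(tile)]
            for bk in range(0, n, tile):
                # Global -> shared: every tile element is read exactly once.
                sa = [[a[bi + i][bk + k] for k in range(tile)] for i in range(tile)]
                sb = [[b[bk + k][bj + j] for j in range(tile)] for k in range(tile)]
                reads += 2 * tile * tile
                # All further reads hit the shared tiles, not global memory.
                for i in range(tile):
                    for j in range(tile):
                        for k in range(tile):
                            acc[i][j] += sa[i][k] * sb[k][j]
            # Registers -> global: write each output element once.
            for i in range(tile):
                for j in range(tile):
                    c[bi + i][bj + j] = acc[i][j]
    return c, reads


n = 8
a = [[i + j for j in range(n)] for i in range(n)]
b = [[(i * j) % 5 for j in range(n)] for i in range(n)]

c_naive, naive_reads = naive_matmul(a, b)
c_tiled, tiled_reads = tiled_matmul(a, b, tile=4)
print(naive_reads, tiled_reads)
```

In this model the naive version performs 2n³ global reads and the tiled version 2n³ / tile, so a tile width of 4 cuts global traffic by a factor of 4 while producing the same result. Real kernels are constrained by how much shared memory a block can use, which this sketch does not model.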

Key idea

GPU memory forms a speed versus size hierarchy, and fast kernels stage data into shared memory and registers to reuse it instead of repeatedly hitting slow global memory.

Check yourself

1. Which storage is fastest on a GPU?

2. What is the main strategy for fast kernels given the hierarchy?