Two limits on speed
A GPU kernel's speed is limited by one of two resources: how fast the hardware can do arithmetic (compute), or how fast it can move data to and from memory (memory bandwidth). Knowing which limit binds a given kernel tells you what to optimize.
Arithmetic intensity
The deciding quantity is arithmetic intensity: the number of arithmetic operations performed per byte moved to or from memory, usually expressed in FLOPs per byte.
- Low-intensity kernels, like an elementwise add, move many bytes per FLOP and are memory-bound.
- High-intensity kernels, like a large matrix multiply, reuse data heavily and are compute-bound.
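As a rough illustration of the two cases above, the sketch below counts FLOPs and bytes for a float32 elementwise add and for a square matrix multiply. The function names are hypothetical, and the matrix multiply figure assumes an idealized kernel in which each matrix crosses the memory bus exactly once; real kernels move more data than that.

```python
def elementwise_add_intensity(bytes_per_element: int = 4) -> float:
    """FLOPs per byte for c[i] = a[i] + b[i] (float32 by default)."""
    flops = 1                            # one add per output element
    bytes_moved = 3 * bytes_per_element  # read a[i], read b[i], write c[i]
    return flops / bytes_moved


def matmul_intensity(n: int, bytes_per_element: int = 4) -> float:
    """FLOPs per byte for an n x n by n x n matrix multiply.

    Assumption: ideal data reuse, so A and B are each read once
    and C is written once.
    """
    flops = 2 * n**3                             # n^3 multiply-add pairs
    bytes_moved = 3 * n * n * bytes_per_element  # A, B, and C, once each
    return flops / bytes_moved


if __name__ == "__main__":
    print(f"elementwise add: {elementwise_add_intensity():.3f} FLOP/byte")
    print(f"matmul, n=4096:  {matmul_intensity(4096):.1f} FLOP/byte")
```

Under these assumptions the add stays at 1/12 FLOP per byte no matter how large the arrays are, while the matrix multiply's intensity grows as n/6, which is why large multiplies end up compute-bound.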
The roofline
Plot attainable performance against arithmetic intensity, conventionally on log-log axes. Two limits form a roof:
- A sloped line set by memory bandwidth bounds low-intensity kernels.
- A flat ceiling set by peak compute bounds high-intensity kernels.
- The ridge point, where the two lines meet, marks the minimum intensity needed to saturate compute.
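The roof described above can be written as a single expression: attainable performance is the smaller of peak compute and bandwidth times intensity. The sketch below is a minimal version of that formula. The hardware numbers are made up for illustration and do not describe any particular GPU, and the function names are hypothetical.

```python
# Illustrative, made-up hardware numbers (not any specific GPU).
PEAK_FLOPS = 10e12     # peak compute, FLOP/s
MEM_BANDWIDTH = 1e12   # memory bandwidth, bytes/s


def ridge_point(peak_flops: float = PEAK_FLOPS,
                bandwidth: float = MEM_BANDWIDTH) -> float:
    """Intensity (FLOP/byte) at which the sloped line meets the flat ceiling."""
    return peak_flops / bandwidth


def attainable_flops(intensity: float,
                     peak_flops: float = PEAK_FLOPS,
                     bandwidth: float = MEM_BANDWIDTH) -> float:
    """Roofline bound: the lower of the bandwidth line and the compute ceiling."""
    return min(peak_flops, bandwidth * intensity)


def classify(intensity: float,
             peak_flops: float = PEAK_FLOPS,
             bandwidth: float = MEM_BANDWIDTH) -> str:
    """Label a kernel by which side of the ridge point it falls on."""
    if intensity < ridge_point(peak_flops, bandwidth):
        return "memory-bound"
    return "compute-bound"
```

With these assumed numbers the ridge point is 10 FLOP per byte. A float32 elementwise add, at 1 FLOP per 12 bytes, is capped at roughly 0.083 TFLOP/s, under 1% of the assumed peak, regardless of how fast the math units are.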
A kernel that sits under the sloped part of the roof is memory-bound, so improving data reuse or fusing operations helps more than faster math units would. This is why techniques that cut memory traffic, such as kernel fusion and FlashAttention, give large speedups on memory-bound work.
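To see why fusion helps, consider a back-of-the-envelope traffic count for a chain of unary elementwise operations (for example scale, then shift, then ReLU). The sketch below assumes each unfused kernel reads its input from memory and writes its output back, while a fused kernel keeps intermediates in registers. The function names are hypothetical, and the model ignores caches and launch overhead.

```python
def chain_traffic_bytes(n_elements: int, n_ops: int,
                        bytes_per_element: int = 4,
                        fused: bool = False) -> int:
    """Bytes moved by a chain of n_ops unary elementwise ops."""
    # One pass over the data: read every element once, write it once.
    per_pass = 2 * n_elements * bytes_per_element
    # Fused: a single pass. Unfused: one full pass per op.
    return per_pass if fused else n_ops * per_pass


def estimated_speedup(n_elements: int, n_ops: int) -> float:
    """Speedup from fusion, assuming runtime is proportional to bytes moved."""
    unfused = chain_traffic_bytes(n_elements, n_ops, fused=False)
    fused = chain_traffic_bytes(n_elements, n_ops, fused=True)
    return unfused / fused


if __name__ == "__main__":
    n = 1_000_000
    print(chain_traffic_bytes(n, 3))              # unfused traffic in bytes
    print(chain_traffic_bytes(n, 3, fused=True))  # fused traffic in bytes
    print(estimated_speedup(n, 3))
```

In this simplified model, fusing k memory-bound elementwise kernels cuts traffic by a factor of k even though the number of arithmetic operations is unchanged.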
Key idea
The roofline model uses arithmetic intensity to classify a kernel as memory-bound or compute-bound; memory-bound kernels benefit most from reducing data movement, not from faster arithmetic.