Two ways to be slow
A GPU operation can be limited by how fast it computes or by how fast it moves data. A memory-bandwidth-bound kernel spends most of its time waiting on reads and writes to memory, leaving the math units largely idle.
Arithmetic intensity
The key metric is arithmetic intensity: the number of floating-point operations performed per byte moved to or from memory.
- Low-intensity operations, such as adding two large vectors, do little math per byte and are bandwidth bound.
- High-intensity operations, such as large matrix multiplies, reuse each loaded value many times and are compute bound.
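As a rough illustration of the two cases above, the sketch below counts operations and bytes for a vector add and a square matrix multiply. The function names are hypothetical, and the byte counts assume 4-byte elements and ideal traffic in which every input and output element crosses the memory bus exactly once. Real kernels usually move more.

```python
def vector_add_intensity(n, bytes_per_element=4):
    """FLOPs per byte for c = a + b over n elements (idealized traffic)."""
    flops = n                                   # one add per element
    bytes_moved = 3 * n * bytes_per_element     # read a, read b, write c
    return flops / bytes_moved


def matmul_intensity(n, bytes_per_element=4):
    """FLOPs per byte for an n x n by n x n matrix multiply (idealized traffic)."""
    flops = 2 * n**3                            # n multiply-adds per output element
    bytes_moved = 3 * n * n * bytes_per_element # read A, read B, write C once each
    return flops / bytes_moved


if __name__ == "__main__":
    print("vector add:", vector_add_intensity(1_000_000))
    print("matmul n=4096:", matmul_intensity(4096))
```

Under these assumptions the vector add stays at 1/12 FLOP per byte no matter how large the vectors are, while the matrix multiply's intensity grows as n/6, which is why size helps one and not the other.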
The roofline view
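Plotting attainable performance against arithmetic intensity gives a roofline: on the left, a sloped line set by memory bandwidth; on the right, a flat ceiling at peak compute. The two meet at the ridge point, the intensity equal to peak compute divided by peak bandwidth. Kernels to the left of the ridge are bandwidth bound, and kernels to the right are compute bound.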
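The roofline can be written as a one-line formula: attainable performance is the smaller of peak compute and bandwidth times intensity. The sketch below assumes a hypothetical device with 10 TFLOP/s of peak compute and 1 TB/s of memory bandwidth. These figures are illustrative and do not describe any specific GPU.

```python
def attainable_flops(intensity, peak_flops, peak_bandwidth):
    """Roofline bound in FLOP/s for a kernel with the given FLOPs-per-byte."""
    return min(peak_flops, peak_bandwidth * intensity)


def ridge_point(peak_flops, peak_bandwidth):
    """Intensity (FLOPs per byte) at which the two limits meet."""
    return peak_flops / peak_bandwidth


if __name__ == "__main__":
    PEAK_FLOPS = 10e12      # assumed: 10 TFLOP/s
    PEAK_BANDWIDTH = 1e12   # assumed: 1 TB/s

    print("ridge point:", ridge_point(PEAK_FLOPS, PEAK_BANDWIDTH))
    for intensity in (1 / 12, 1.0, 10.0, 100.0):
        bound = attainable_flops(intensity, PEAK_FLOPS, PEAK_BANDWIDTH)
        print(f"intensity {intensity:8.3f} FLOP/B -> at most {bound:.3e} FLOP/s")
```

On this assumed device, any kernel below 10 FLOPs per byte sits on the sloped part of the roof, so extra math units would not make it faster.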
What to do about it
To speed up a bandwidth-bound kernel, reduce data movement rather than add math:
- Fuse operations so intermediate results stay in fast registers instead of round-tripping through memory.
- Use lower precision, such as 16-bit instead of 32-bit floats, so each element moves fewer bytes.
- Improve locality so reused data stays in cache.
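To make the first two techniques concrete, the sketch below estimates memory traffic for a chain of elementwise operations. It is a simplified model under stated assumptions: each unfused operation reads one array and writes one array, a fused kernel reads the input once and writes the final output once, and caches are ignored. The function name and parameters are hypothetical.

```python
def elementwise_chain_bytes(n, num_ops, bytes_per_element=4, fused=False):
    """Estimated bytes moved by num_ops chained elementwise ops over n elements."""
    # Unfused: every op makes its own pass over memory (one read, one write).
    # Fused: a single pass, with intermediates held in registers.
    passes = 1 if fused else num_ops
    return passes * 2 * n * bytes_per_element


if __name__ == "__main__":
    n = 1_000_000
    unfused_fp32 = elementwise_chain_bytes(n, num_ops=3)
    fused_fp32 = elementwise_chain_bytes(n, num_ops=3, fused=True)
    fused_fp16 = elementwise_chain_bytes(n, num_ops=3, bytes_per_element=2, fused=True)

    print("unfused fp32 bytes:", unfused_fp32)
    print("fused   fp32 bytes:", fused_fp32)
    print("fused   fp16 bytes:", fused_fp16)
```

In this model, fusing three operations cuts traffic by a factor of three and halving the element size cuts it by another factor of two. For a bandwidth-bound kernel, runtime should shrink roughly in proportion.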
Key idea
A memory-bandwidth-bound kernel is limited by data movement, so operations with low arithmetic intensity speed up by moving fewer bytes, not by adding compute.