Compute-Bound Kernels

The other side of the roofline

A compute-bound kernel is limited by the rate at which the arithmetic units can do floating-point work, not by memory bandwidth. Its working data is staged in fast on-chip storage and reused many times, so the multiply-add pipelines stay full.

Where this happens

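Large dense matrix multiplications are the classic example. A tile loaded into shared memory is multiplied against many other tiles before being evicted, so each byte fetched from off-chip memory feeds many floating-point operations. That ratio of FLOPs to bytes moved is the arithmetic intensity, and here it is high.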

Big batched matmuls and convolutions tend to be compute-bound.
Tensor cores raise the compute ceiling well above that of the general-purpose floating-point units.
Reaching that ceiling requires keeping the units busy, with no stalls on data or dependencies.
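
To make the effect of tile reuse concrete, here is a minimal sketch that estimates arithmetic intensity for a square matrix multiplication under a simplified traffic model. The function name, the square-tile assumption, and the model itself (each output tile streams one tile-row of A and one tile-column of B from off-chip memory and writes its results once) are illustrative assumptions, not a description of any particular library or GPU.

```python
def tiled_matmul_intensity(n: int, tile: int, bytes_per_element: int = 4) -> float:
    """Estimate FLOPs per byte of off-chip traffic for an n x n x n matmul.

    Assumed model (hypothetical, simplified): the output is split into
    tile x tile blocks; each block reads a tile x n strip of A and an
    n x tile strip of B, then writes tile x tile results.
    """
    if tile <= 0 or n % tile != 0:
        raise ValueError("this sketch assumes a positive tile size that divides n")

    flops = 2 * n**3                      # one multiply and one add per inner-product term
    blocks = (n // tile) ** 2             # number of output tiles
    elements_per_block = 2 * n * tile + tile * tile
    bytes_moved = blocks * elements_per_block * bytes_per_element
    return flops / bytes_moved


if __name__ == "__main__":
    for tile in (1, 16, 64, 128):
        intensity = tiled_matmul_intensity(4096, tile)
        print(f"tile={tile:4d}  intensity={intensity:8.2f} FLOP/byte")
```

Under this model the intensity grows roughly in proportion to the tile size, which is why larger tiles held on chip move a kernel toward the compute-bound regime.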

Reaching peak

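On the roofline plot, attainable performance climbs linearly with arithmetic intensity along the memory-bandwidth slope until it hits the flat compute roof at peak FLOP/s. Compute-bound kernels sit to the right of that ridge point.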
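
The roofline relationship can be sketched in a few lines: attainable performance is the smaller of the compute peak and bandwidth times intensity. The device numbers below (100 TFLOP/s peak, 2 TB/s bandwidth) are hypothetical values chosen only to make the arithmetic easy, and the function names are illustrative.

```python
def attainable_flops(intensity: float, peak_flops: float, bandwidth: float) -> float:
    """Roofline model: min(compute roof, memory slope).

    intensity  : FLOPs per byte moved
    peak_flops : peak compute rate in FLOP/s
    bandwidth  : memory bandwidth in bytes/s
    """
    return min(peak_flops, bandwidth * intensity)


def ridge_point(peak_flops: float, bandwidth: float) -> float:
    """Intensity (FLOP/byte) at which the memory slope meets the compute roof."""
    return peak_flops / bandwidth


if __name__ == "__main__":
    PEAK = 100e12       # hypothetical: 100 TFLOP/s
    BANDWIDTH = 2e12    # hypothetical: 2 TB/s

    print("ridge point:", ridge_point(PEAK, BANDWIDTH), "FLOP/byte")
    for intensity in (10, 50, 200):
        rate = attainable_flops(intensity, PEAK, BANDWIDTH)
        regime = "compute-bound" if rate >= PEAK else "memory-bound"
        print(f"intensity={intensity:4d}  attainable={rate:.2e} FLOP/s  ({regime})")
```

A kernel whose intensity exceeds the ridge point gains nothing from more bandwidth; only a higher compute roof or better use of the existing one helps.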

Tuning compute-bound work

Since you cannot exceed peak FLOP/s, you instead chase efficiency, the fraction of that peak you actually achieve:

Pick tile sizes that map cleanly onto tensor cores, so little work is wasted on padding.
Ensure enough occupancy to hide pipeline latency.
Use lower precision, where accuracy allows, to raise the peak rate itself.
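
As a rough illustration of the first point, the sketch below estimates how much of the computed output is useful when a problem is padded up to whole tiles. The 128 x 128 tile shape and the function name are assumptions made for the example; real kernels pick tile shapes per architecture and data type.

```python
def tile_efficiency(m: int, n: int, tile_m: int, tile_n: int) -> float:
    """Fraction of computed output elements that are real (not padding).

    Assumes the m x n output is covered by whole tile_m x tile_n tiles,
    so partial tiles at the edges still cost a full tile of work.
    """
    tiles_m = -(-m // tile_m)   # ceiling division
    tiles_n = -(-n // tile_n)
    padded_area = (tiles_m * tile_m) * (tiles_n * tile_n)
    return (m * n) / padded_area


if __name__ == "__main__":
    TILE = 128  # hypothetical tile edge
    for m, n in ((4096, 4096), (4097, 4096), (129, 129)):
        eff = tile_efficiency(m, n, TILE, TILE)
        print(f"{m} x {n}: {eff:.1%} of tile work is useful")
```

Sizes that are exact multiples of the tile waste nothing, while a dimension that overshoots a tile boundary by one element pays for a whole extra row or column of tiles.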

Key idea