Machine Learning

Compute-Bound Kernels

When the math units are saturated and bandwidth has room to spare.

The other side of the roofline

A compute-bound kernel is limited by the rate at which the arithmetic units can do floating-point work, not by memory bandwidth. Its working set is staged in fast on-chip storage and each loaded value is reused many times, so the multiply-add pipelines stay full.

Where this happens

Large dense matrix multiplications are the classic example. A tile loaded into shared memory is multiplied against many other tiles before being evicted, giving high arithmetic intensity.

  • Big batched matmuls and convolutions tend to be compute-bound.
  • Tensor cores raise the compute ceiling well above what the general-purpose floating-point units reach.
  • Reaching that ceiling requires keeping units busy with no stalls.
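
The reuse described above can be put into numbers as arithmetic intensity: FLOPs performed per byte moved to or from main memory. The sketch below is a rough estimate only. It assumes an idealized matmul in which each of the three matrices crosses the memory bus exactly once, and the function name `matmul_intensity` is made up for illustration.

```python
def matmul_intensity(m: int, n: int, k: int, bytes_per_element: int = 4) -> float:
    """FLOPs per byte for C = A @ B with A (m x k) and B (k x n).

    Assumption: A and B are each read once and C is written once,
    which is the best case that perfect tiling could achieve.
    """
    flops = 2 * m * n * k  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return flops / bytes_moved


# Intensity grows linearly with the matrix edge for square fp32 matmuls.
for size in (64, 512, 4096):
    print(size, round(matmul_intensity(size, size, size), 1))
```

For square matrices the estimate reduces to n / 6 FLOPs per byte in 4-byte precision, which is why large matmuls sit far to the right on the roofline while small ones may not.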

Reaching peak

Performance climbs with intensity until it hits the compute roof.
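
A minimal sketch of that relationship, using the standard roofline formula: attainable throughput is the smaller of the compute peak and bandwidth times intensity. The hardware numbers below are invented round figures, not the specification of any real device, and the function names are illustrative.

```python
def attainable_flops(intensity: float, peak_flops: float, bandwidth: float) -> float:
    """Roofline bound in FLOP/s for a kernel with the given FLOPs-per-byte intensity."""
    return min(peak_flops, bandwidth * intensity)


def ridge_point(peak_flops: float, bandwidth: float) -> float:
    """Intensity at which the memory slope meets the compute roof."""
    return peak_flops / bandwidth


# Hypothetical accelerator: 100 TFLOP/s peak, 1 TB/s memory bandwidth.
PEAK = 100e12
BANDWIDTH = 1e12

ridge = ridge_point(PEAK, BANDWIDTH)
for intensity in (10, 100, 1000):
    bound = attainable_flops(intensity, PEAK, BANDWIDTH)
    regime = "compute-bound" if intensity >= ridge else "memory-bound"
    print(f"intensity={intensity:>5}  bound={bound / 1e12:6.1f} TFLOP/s  {regime}")
```

Below the ridge point, adding intensity buys throughput. Above it, the bound is flat, and only a higher roof or better efficiency helps.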

Tuning compute-bound work

Since you cannot exceed the hardware's peak FLOP/s at a given precision, you chase efficiency instead:

  • Pick tile sizes that map cleanly onto tensor cores.
  • Ensure enough occupancy to hide pipeline latency.
  • Use lower precision to raise the peak rate itself.
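
To illustrate the first point, here is a rough sketch of how much issued work is useful when matrix dimensions are padded up to a multiple of the hardware tile. The tile edge of 16 is an assumption chosen for illustration; the real shape depends on the device and data type, and `useful_fraction` is a hypothetical helper.

```python
import math


def padded(dim: int, tile: int) -> int:
    """Round dim up to the next multiple of tile."""
    return math.ceil(dim / tile) * tile


def useful_fraction(m: int, n: int, k: int, tile: int = 16) -> float:
    """Share of multiply-adds that contribute to the result after padding.

    Assumption: the hardware always processes whole tile x tile x tile
    blocks, so partial blocks cost as much as full ones.
    """
    useful = m * n * k
    issued = padded(m, tile) * padded(n, tile) * padded(k, tile)
    return useful / issued


print(useful_fraction(1024, 1024, 1024))  # dimensions divide evenly
print(useful_fraction(1025, 1024, 1024))  # one extra row forces a padded tile row
```

Under this model a dimension that is one past a tile multiple wastes almost a full tile row of work, which is one reason shapes are often rounded to tile-friendly sizes.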

Key idea

A compute-bound kernel saturates the arithmetic units, so gains come from raising the peak with lower precision and from getting closer to it with good tiling and occupancy, rather than from cutting memory traffic.

Check yourself

1. What limits a compute-bound kernel?

2. Which workload is most likely to be compute-bound?