Machine Learning

Kernel Fusion

Merging operations into one kernel to avoid round trips to memory.

The cost of separate kernels

Each GPU kernel reads its inputs from global memory and writes its outputs back. When a model runs a chain of small operations such as a matmul, a bias add, and an activation as separate kernels, every intermediate result is written to global memory by one kernel and read back by the next. That costs memory bandwidth and adds launch overhead for each kernel.
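To put rough numbers on that cost, here is a back-of-the-envelope sketch in Python. It counts only the global-memory traffic of the two elementwise stages (bias add and activation) applied to a matmul output. The tensor size, the float32 element width, and the function names are illustrative assumptions, and the small bias vector read is ignored.

```python
# Rough global-memory traffic for an elementwise chain applied to n elements.
# All sizes here are illustrative assumptions, not measurements.

BYTES_PER_ELEMENT = 4  # assuming float32


def unfused_traffic_bytes(n: int, num_elementwise_ops: int) -> int:
    """Each separate elementwise kernel reads n elements and writes n elements."""
    return num_elementwise_ops * 2 * n * BYTES_PER_ELEMENT


def fused_traffic_bytes(n: int) -> int:
    """One fused kernel reads the n inputs once and writes the n outputs once."""
    return 2 * n * BYTES_PER_ELEMENT


n = 4096 * 4096  # hypothetical matmul output size
unfused = unfused_traffic_bytes(n, num_elementwise_ops=2)  # bias add + activation
fused = fused_traffic_bytes(n)

print(unfused // 2**20, "MiB moved by two separate kernels")  # 256 MiB
print(fused // 2**20, "MiB moved by one fused kernel")        # 128 MiB
```

Under these assumptions the two-kernel version moves twice the data for the same arithmetic, and the gap widens with every extra elementwise stage in the chain.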

What fusion does

Kernel fusion merges several operations into a single kernel. Intermediate values stay in fast registers or shared memory instead of being written out.

  • Fewer memory round trips for bandwidth-bound chains.
  • Fewer kernel launches, so less launch overhead.
  • Better cache and register reuse.

Before and after

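In the unfused version, the matmul, the bias add, and the activation each run as their own kernel, and each intermediate result is written to global memory and read back by the next stage. In the fused version those three stages collapse into one pass that keeps the intermediate in registers, eliminating the writes and reads between them.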
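The contrast can be modeled in plain Python. This is only a CPU sketch of the data flow, not GPU code: full intermediate lists stand in for global memory, and the local variable `acc` stands in for a register. ReLU is assumed as the activation, and all function names are hypothetical.

```python
def dot(xs, ys):
    """Dot product with a simple running accumulator."""
    acc = 0.0
    for x, y in zip(xs, ys):
        acc += x * y
    return acc


def matmul(a, b):
    cols = list(zip(*b))
    return [[dot(row, col) for col in cols] for row in a]


def add_bias(m, bias):
    return [[v + bj for v, bj in zip(row, bias)] for row in m]


def relu(m):
    return [[max(v, 0.0) for v in row] for row in m]


def unfused(a, b, bias):
    """Three separate passes; t1 and t2 are fully materialized intermediates."""
    t1 = matmul(a, b)
    t2 = add_bias(t1, bias)
    return relu(t2)


def fused(a, b, bias):
    """One pass; each output element goes from accumulator to result directly."""
    cols = list(zip(*b))
    out = []
    for row in a:
        out_row = []
        for j, col in enumerate(cols):
            acc = dot(row, col)                      # stays "in a register"
            out_row.append(max(acc + bias[j], 0.0))  # bias add + ReLU applied in place
        out.append(out_row)
    return out


a = [[1.0, -2.0], [3.0, 4.0]]
b = [[1.0, 0.0], [0.0, 1.0]]
bias = [0.5, -10.0]
print(unfused(a, b, bias) == fused(a, b, bias))  # True
```

Both versions compute the same result; the fused one simply never builds `t1` or `t2`.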

Where it shines

Fusion is most valuable for sequences of cheap elementwise operations that are memory-bound, and for attention, where fusing the softmax with the surrounding matmuls avoids materializing the score matrix, which grows with the square of the sequence length. Compilers and libraries apply many fusions automatically.
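For the attention case, one common way to fuse the softmax with the matmuls is a streaming ("online") softmax, which rescales a running sum as each new score arrives. The sketch below shows the idea for a single query in plain Python. It is an illustration under simplifying assumptions (no 1/sqrt(d) scaling, no batching or tiling, hypothetical function names), not how any particular library implements it.

```python
import math


def dot(xs, ys):
    return sum(x * y for x, y in zip(xs, ys))


def attention_unfused(q, keys, values):
    """Materializes every score and weight before the weighted sum."""
    scores = [dot(q, k) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def attention_fused(q, keys, values):
    """Single pass over keys and values; no list of scores is ever stored."""
    running_max = float("-inf")
    denom = 0.0
    acc = [0.0] * len(values[0])
    for k, v in zip(keys, values):
        s = dot(q, k)
        new_max = max(running_max, s)
        # Rescale earlier contributions to the new maximum.
        # On the first step this is exp(-inf), which is 0.0.
        scale = math.exp(running_max - new_max)
        p = math.exp(s - new_max)
        denom = denom * scale + p
        acc = [a * scale + p * vi for a, vi in zip(acc, v)]
        running_max = new_max
    return [a / denom for a in acc]


q = [1.0, 0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
ref = attention_unfused(q, keys, values)
out = attention_fused(q, keys, values)
print(all(math.isclose(r, o, rel_tol=1e-9) for r, o in zip(ref, out)))
```

The fused loop keeps only a running maximum, a running denominator, and one output accumulator, so its working set does not grow with the number of keys.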

Key idea

Kernel fusion combines several operations into one kernel so intermediates stay in fast memory, cutting bandwidth use and launch overhead for memory-bound chains.

Check yourself

1. What is the main benefit of kernel fusion?

2. Which workloads benefit most from fusion?