Kernel Fusion

The cost of separate kernels

Each GPU kernel launch reads its inputs from global memory and writes its outputs back. When a model runs a chain of small operations such as a matmul, a bias add, and an activation as separate kernels, each intermediate result is written to global memory and then read back by the next kernel. That wastes memory bandwidth and pays launch overhead once per operation.

What fusion does

Kernel fusion merges several operations into a single kernel. Intermediate values stay in fast registers or shared memory instead of being written out.

This has three main benefits:
Fewer memory round trips for bandwidth-bound chains.
Fewer kernel launches, so less launch overhead.
Better cache and register reuse.
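
The first two benefits can be put into rough numbers with a back-of-the-envelope model. The sketch below assumes every elementwise kernel reads each element once and writes it once, and that run time is launch overhead plus bytes moved divided by bandwidth. The function name `estimate_chain_time`, the 5 microsecond launch overhead, and the 1 TB/s bandwidth are illustrative assumptions, not measurements of any particular GPU.

```python
def estimate_chain_time(n_elements, n_ops, fused,
                        bytes_per_element=4,
                        launch_overhead_s=5e-6,        # assumed, illustrative
                        bandwidth_bytes_per_s=1e12):   # assumed, illustrative
    """Rough time estimate for a chain of n_ops elementwise operations.

    Model: every kernel reads each element once and writes it once.
    Unfused runs one kernel per operation; fused runs a single kernel.
    Compute time is ignored, which only makes sense for memory-bound chains.
    """
    kernels = 1 if fused else n_ops
    bytes_moved = kernels * 2 * n_elements * bytes_per_element
    seconds = kernels * launch_overhead_s + bytes_moved / bandwidth_bytes_per_s
    return kernels, bytes_moved, seconds


n = 1 << 24  # about 16.8 million float32 values
unfused = estimate_chain_time(n, n_ops=3, fused=False)
fused = estimate_chain_time(n, n_ops=3, fused=True)
print("unfused (kernels, bytes, seconds):", unfused)
print("fused   (kernels, bytes, seconds):", fused)
```

Under this simplified model a three-operation chain pays three launches and moves three times the bytes of its fused equivalent. Real speedups depend on the hardware and on how much of the chain is actually memory bound.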

Before and after

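In the unfused version, the matmul, the bias add, and the activation each run as a separate kernel, and every intermediate result is written to global memory and read back by the next stage. In the fused version those three stages collapse into one pass that keeps the intermediate in registers, eliminating the writes and reads between them.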
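
A toy model can make the before and after concrete. The sketch below is plain Python running on the CPU, not GPU code: the `GlobalMemory` class is a hypothetical stand-in that only counts element accesses, and an elementwise scale stands in for the matmul to keep the example short. Each loop plays the role of one kernel.

```python
class GlobalMemory:
    """Toy stand-in for GPU global memory that counts element accesses."""

    def __init__(self, x):
        n = len(x)
        self.buffers = {"x": list(x), "t1": [0.0] * n,
                        "t2": [0.0] * n, "y": [0.0] * n}
        self.reads = 0
        self.writes = 0

    def load(self, name, i):
        self.reads += 1
        return self.buffers[name][i]

    def store(self, name, i, value):
        self.writes += 1
        self.buffers[name][i] = value


def run_unfused(mem, n, scale, bias):
    # Three "kernels": each one is a full pass over global memory.
    for i in range(n):  # kernel 1: scale (stand-in for the matmul)
        mem.store("t1", i, mem.load("x", i) * scale)
    for i in range(n):  # kernel 2: bias add
        mem.store("t2", i, mem.load("t1", i) + bias)
    for i in range(n):  # kernel 3: ReLU activation
        mem.store("y", i, max(0.0, mem.load("t2", i)))


def run_fused(mem, n, scale, bias):
    # One "kernel": the intermediate lives in a local variable,
    # which plays the role of a register.
    for i in range(n):
        v = mem.load("x", i) * scale
        v = v + bias
        mem.store("y", i, max(0.0, v))


x = [-2.0, -0.5, 0.0, 1.0, 3.0]
unfused_mem, fused_mem = GlobalMemory(x), GlobalMemory(x)
run_unfused(unfused_mem, len(x), scale=2.0, bias=1.0)
run_fused(fused_mem, len(x), scale=2.0, bias=1.0)

print("same result:", unfused_mem.buffers["y"] == fused_mem.buffers["y"])
print("unfused reads/writes:", unfused_mem.reads, unfused_mem.writes)
print("fused reads/writes:  ", fused_mem.reads, fused_mem.writes)
```

Both versions produce the same output, but the unfused one touches global memory three times as often because the buffers `t1` and `t2` exist only to hand data from one kernel to the next.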

Where it shines

Fusion is most valuable for sequences of cheap elementwise operations that are memory-bound, and for attention, where fusing the softmax with the surrounding matmuls avoids writing large intermediate score matrices to global memory. Compilers and libraries apply many fusions automatically.
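
The attention case can be sketched for a single query vector. The code below is a minimal CPU illustration of the streaming ("online") softmax idea that fused attention kernels build on. It is not a GPU kernel, and the function names `attention_naive` and `attention_fused` are hypothetical. The naive version stores every score before normalizing. The streaming version keeps only a running maximum, a running denominator, and a running weighted sum.

```python
import math


def attention_naive(q, keys, values):
    """Reference: materialize all scores, then softmax, then weighted sum."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def attention_fused(q, keys, values):
    """Streaming version: one pass over keys and values, no stored score list.

    Keeps a running max, a running softmax denominator, and a running
    weighted sum, rescaling the last two whenever the max increases.
    """
    running_max = -math.inf
    denom = 0.0
    acc = [0.0] * len(values[0])
    for k, v in zip(keys, values):
        s = sum(a * b for a, b in zip(q, k))       # one score at a time
        new_max = max(running_max, s)
        rescale = math.exp(running_max - new_max)  # 0.0 on the first step
        w = math.exp(s - new_max)
        denom = denom * rescale + w
        acc = [a * rescale + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / denom for a in acc]


q = [1.0, 0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

naive_out = attention_naive(q, keys, values)
fused_out = attention_fused(q, keys, values)
print(all(abs(a - b) < 1e-9 for a, b in zip(naive_out, fused_out)))
```

The two functions agree up to floating-point rounding. The difference is in what must be held at once: the naive version needs storage proportional to the number of keys, while the streaming version needs only a fixed amount of state per query.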

Key idea

Kernel fusion combines several operations into one kernel so intermediates stay in fast memory, cutting bandwidth use and launch overhead for memory-bound chains.

