The cost of separate kernels
Each GPU kernel reads its inputs from global memory and writes its outputs back. When a model runs a chain of small operations such as a matmul, a bias add, and an activation, each as its own kernel, every intermediate result is written to global memory and read back by the next kernel. That wastes memory bandwidth and pays launch overhead once per operation.
What fusion does
Kernel fusion merges several operations into a single kernel. Intermediate values stay in registers or shared memory instead of being written out to global memory. The benefits:
- Fewer global memory round trips for bandwidth-bound chains.
- Fewer kernel launches and less launch overhead.
- Better cache and register reuse.
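The first two benefits can be estimated with a back-of-envelope model. The sketch below is an assumption-laden illustration, not a profiler: the function name `elementwise_chain_cost`, the tensor size, and the rule that every kernel reads and writes the whole tensor exactly once are all hypothetical simplifications, and registers and caches are treated as free.

```python
def elementwise_chain_cost(num_elements, num_ops, bytes_per_element=4, fused=False):
    """Return (bytes_moved, kernel_launches) for a chain of unary elementwise ops.

    Simplified model (an assumption, not a measurement): each kernel reads
    every element once from global memory and writes every element once.
    """
    kernels = 1 if fused else num_ops
    # One full read plus one full write of the tensor per kernel.
    bytes_moved = kernels * 2 * num_elements * bytes_per_element
    return bytes_moved, kernels


n = 1 << 20  # 2**20 float32 elements, a hypothetical tensor size
unfused_bytes, unfused_launches = elementwise_chain_cost(n, num_ops=3)
fused_bytes, fused_launches = elementwise_chain_cost(n, num_ops=3, fused=True)

print(unfused_bytes // 2**20, "MiB in", unfused_launches, "launches")
print(fused_bytes // 2**20, "MiB in", fused_launches, "launch")
```

Under these assumptions a three-operation chain moves 24 MiB across 3 launches when unfused and 8 MiB in 1 launch when fused. Real savings depend on the hardware, the tensor shapes, and whether the chain is actually limited by bandwidth.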
Before and after
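In the unfused version, the matmul, the bias add, and the activation each run as a separate kernel, and each one writes its full result to global memory for the next one to read. In the fused version those three stages collapse into one pass that keeps the intermediate in registers, eliminating the writes and reads between them.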
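A CPU-side sketch can make the difference concrete for the matmul, bias add, and activation chain. This is plain Python standing in for GPU kernels, so it only illustrates the data flow: each function in the unfused path plays the role of one kernel that materializes a full intermediate list, while the fused path keeps the intermediate in a local variable. All function names, the choice of ReLU as the activation, and the sample data are hypothetical.

```python
def matvec(weights, x):
    # Stage 1: one dot product per weight row, stored as a full intermediate list.
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def add_bias(values, bias):
    # Stage 2: reads the whole intermediate and writes another one.
    return [v + b for v, b in zip(values, bias)]


def relu(values):
    # Stage 3: reads the second intermediate and writes the final output.
    return [max(v, 0.0) for v in values]


def unfused(weights, x, bias):
    y = matvec(weights, x)   # intermediate 1, fully materialized
    z = add_bias(y, bias)    # intermediate 2, fully materialized
    return relu(z)


def fused(weights, x, bias):
    out = []
    for row, b in zip(weights, bias):
        # The dot product result lives only in `acc`, the analogue of a register.
        acc = sum(w * xi for w, xi in zip(row, x))
        out.append(max(acc + b, 0.0))  # bias add and ReLU applied before any write
    return out


weights = [[1.0, -2.0], [0.5, 0.5], [-1.0, -1.0]]
x = [2.0, 1.0]
bias = [0.5, -1.0, 1.0]

print(unfused(weights, x, bias))  # [0.5, 0.5, 0.0]
print(fused(weights, x, bias))    # [0.5, 0.5, 0.0]
```

Both paths compute the same result; the fused one simply never stores `y` or `z`. On a GPU, the same restructuring is what removes the global memory traffic between stages.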
Where it shines
Fusion is most valuable for sequences of cheap elementwise operations that are memory-bound, and for attention, where fusing the softmax with the surrounding matmuls avoids materializing the score matrix, whose size grows quadratically with sequence length. Compilers and libraries apply many fusions automatically.
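One way the attention case can work is an online softmax, the approach used by FlashAttention-style kernels; the text above does not say which technique it has in mind, so treat this as one possible illustration. The sketch handles a single query in plain Python, omits the usual scaling by the square root of the head dimension, and uses hypothetical function names and data. The unfused version stores every score and every softmax weight; the fused version streams over the keys and keeps only a running maximum, a running normalizer, and a running weighted sum.

```python
import math


def attention_unfused(q, keys, values):
    # Materialize all scores and all softmax weights before the second matmul.
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def attention_fused(q, keys, values):
    # Stream over key/value pairs; no per-key list of scores is ever stored.
    running_max = float("-inf")
    normalizer = 0.0
    acc = [0.0] * len(values[0])
    for k, v in zip(keys, values):
        score = sum(qi * ki for qi, ki in zip(q, k))
        new_max = max(running_max, score)
        # Rescale earlier contributions to the new reference maximum.
        correction = math.exp(running_max - new_max)
        p = math.exp(score - new_max)
        normalizer = normalizer * correction + p
        acc = [a * correction + p * vd for a, vd in zip(acc, v)]
        running_max = new_max
    return [a / normalizer for a in acc]


q = [1.0, 0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

reference = attention_unfused(q, keys, values)
streamed = attention_fused(q, keys, values)
print(all(math.isclose(a, b, rel_tol=1e-9) for a, b in zip(reference, streamed)))
```

The two versions agree up to floating-point rounding. The fused one needs storage proportional to the value dimension rather than to the number of keys, which is why the large intermediate never has to reach global memory.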
Key idea
Kernel fusion combines several operations into one kernel so intermediates stay in fast on-chip memory, cutting global memory traffic and launch overhead for memory-bound chains.