Independent output cells
In a matrix product, every output cell is a dot product of one row and one column. Crucially, no output cell depends on another, so all of them can be computed independently in parallel. The challenge is not correctness but memory traffic.
Blocking for locality
A naive triple loop streams whole rows and columns repeatedly, thrashing the cache. Blocking, or tiling, fixes this:
- Partition the matrices into small square tiles that fit in fast cache.
- Compute the product tile by tile, reusing each loaded tile many times.
- Accumulate partial tile products into the output tile.
Each tile multiply is independent work that a separate core or thread can own, so blocking improves both locality and parallelism at once.
Scaling beyond one machine
For very large matrices, distributed algorithms like Cannon assign tiles to a grid of processors and shift them in lockstep so each processor only ever holds the tiles it needs. The pattern keeps communication bounded while every processor stays busy.
Key idea
Matrix multiply parallelizes naturally because output cells are independent, and tiling the work into cache sized blocks turns it into many reusable, locality friendly chunks of independent compute.