Splitting inside a layer
Tensor parallelism shards the weight matrices of a single layer across devices, so one large matrix multiply becomes several partial multiplies that combine. Each device holds only a column or row slice of the weights.
The two cut directions
- A column split of the first matrix lets each device compute part of the hidden activations independently.
- A row split of the second matrix then needs an all reduce to sum partial outputs.
- Pairing a column split with a following row split limits communication to one all reduce per block.
Attention and MLP
- In an MLP, the two linear layers use the column then row pattern.
- In attention, heads are split across devices, since heads are already independent.
- A final all reduce gathers the combined result for the residual stream.
Why it is tricky
- It demands very fast interconnect because communication sits on the critical path of every layer.
- It scales well only up to the devices sharing high bandwidth links, often within one node.
- Beyond that, pipeline or data parallelism takes over.
Key idea
Tensor parallelism shards each layer matrix, using a column then row split so attention heads and MLPs need just one all reduce per block, but only across fast interconnect.