The Tensor Parallelism Deep

Splitting inside a layer

Tensor parallelism shards the weight matrices of a single layer across devices, so one large matrix multiply becomes several partial multiplies that combine. Each device holds only a column or row slice of the weights.

The two cut directions

A column split of the first matrix lets each device compute part of the hidden activations independently.
A row split of the second matrix then needs an all reduce to sum partial outputs.
Pairing a column split with a following row split limits communication to one all reduce per block.

Attention and MLP

In an MLP, the two linear layers use the column then row pattern.
In attention, heads are split across devices, since heads are already independent.
A final all reduce gathers the combined result for the residual stream.

Why it is tricky

It demands very fast interconnect because communication sits on the critical path of every layer.
It scales well only up to the devices sharing high bandwidth links, often within one node.
Beyond that, pipeline or data parallelism takes over.

Key idea

Tensor parallelism shards each layer matrix, using a column then row split so attention heads and MLPs need just one all reduce per block, but only across fast interconnect.

The Tensor Parallelism Deep

Splitting inside a layer

The two cut directions

Attention and MLP

Why it is tricky

Key idea

Check yourself