Machine Learning

The Tensor Parallelism Deep

Sharding the matrices inside a layer so one big multiply spans many devices.

Splitting inside a layer

Tensor parallelism shards the weight matrices of a single layer across devices, so one large matrix multiply becomes several smaller multiplies whose results are then combined. Each device holds only a column slice or a row slice of the weights.

The two cut directions

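  • A column split of the first matrix lets each device compute its own slice of the hidden activations independently, and an elementwise nonlinearity can be applied to that slice locally.
  • A row split of the second matrix consumes that slice directly, so each device produces a full-size partial output, and an all-reduce then sums the partial outputs.
  • Pairing a column split with a following row split limits communication to one all-reduce per block in the forward pass (the backward pass adds a matching one).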
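
The column-then-row pattern can be checked on a single machine. The sketch below is a minimal simulation in plain Python, not a real multi-device implementation: the "devices" are loop iterations, `all_reduce_sum` is a hypothetical stand-in for the collective, and ReLU is an assumed choice of elementwise nonlinearity.

```python
def matmul(a, b):
    """Plain-Python matrix product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def relu(m):
    """Elementwise nonlinearity, so it can run on each shard without communication."""
    return [[max(0.0, v) for v in row] for row in m]

def all_reduce_sum(parts):
    """Stand-in for an all-reduce: elementwise sum of every device's partial output."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*parts)]

def mlp_reference(x, w1, w2):
    """Unsharded two-layer MLP: relu(x @ w1) @ w2."""
    return matmul(relu(matmul(x, w1)), w2)

def mlp_tensor_parallel(x, w1, w2, n_devices):
    """Same MLP with w1 split by columns and w2 split by rows."""
    hidden = len(w2)
    if hidden % n_devices:
        raise ValueError("hidden size must divide evenly across devices")
    shard = hidden // n_devices
    partials = []
    for d in range(n_devices):  # each iteration plays the role of one device
        lo, hi = d * shard, (d + 1) * shard
        w1_cols = [row[lo:hi] for row in w1]   # column slice of the first matrix
        w2_rows = w2[lo:hi]                    # matching row slice of the second
        h_local = relu(matmul(x, w1_cols))     # local slice of hidden activations
        partials.append(matmul(h_local, w2_rows))  # full-size partial output
    return all_reduce_sum(partials)            # the only communication step
```

For any device count that divides the hidden size, `mlp_tensor_parallel` should match `mlp_reference` up to floating-point rounding, because the sum over the hidden dimension is simply regrouped by shard.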

Attention and MLP

  • In an MLP, the two linear layers use the column-then-row pattern.
  • In attention, heads are split across devices, since each head computes its attention independently of the others.
  • The output projection is row-split, so a final all-reduce sums the partial results before they are added to the residual stream.
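
The head split can be simulated in the same way. This is a sketch under stated assumptions: each head is represented as a hypothetical `(wq, wk, wv)` tuple of projection matrices, there is no masking or bias, the output projection `wo` is split by rows to match the heads each device owns, and `all_reduce_sum` again stands in for the real collective.

```python
import math

def matmul(a, b):
    """Plain-Python matrix product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax_rows(m):
    """Numerically stable softmax applied to each row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def head_output(x, wq, wk, wv):
    """Scaled dot-product attention for one head (no mask, no bias)."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    scale = math.sqrt(len(wq[0]))
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / scale for s in row] for row in matmul(q, k_t)]
    return matmul(softmax_rows(scores), v)

def concat_columns(mats):
    """Concatenate matrices side by side (along the feature dimension)."""
    return [sum((m[i] for m in mats), []) for i in range(len(mats[0]))]

def all_reduce_sum(parts):
    """Stand-in for an all-reduce: elementwise sum of every device's partial output."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*parts)]

def attention_reference(x, heads, wo):
    """Unsharded multi-head attention: concatenate all heads, then project."""
    return matmul(concat_columns([head_output(x, *h) for h in heads]), wo)

def attention_tensor_parallel(x, heads, wo, n_devices):
    """Each device owns a contiguous group of heads and the matching rows of wo."""
    if len(heads) % n_devices:
        raise ValueError("head count must divide evenly across devices")
    per_dev = len(heads) // n_devices
    d_head = len(heads[0][0][0])
    partials = []
    for d in range(n_devices):  # each iteration plays the role of one device
        lo, hi = d * per_dev, (d + 1) * per_dev
        local = concat_columns([head_output(x, *h) for h in heads[lo:hi]])
        wo_rows = wo[lo * d_head:hi * d_head]  # row slice of the output projection
        partials.append(matmul(local, wo_rows))
    return all_reduce_sum(partials)  # single sum before the residual add
```

No communication happens until the final sum, which mirrors the MLP case: the per-head projections act as the column split and the output projection acts as the row split.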

Why it is tricky

  • It demands a very fast interconnect, because communication sits on the critical path of every layer.
  • It scales well only across devices that share high-bandwidth links, often those within one node.
  • Beyond that, pipeline or data parallelism takes over.
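
A back-of-envelope estimate shows why link speed matters so much. The sketch below assumes a ring all-reduce, in which each device sends roughly 2(N-1)/N times the message size, and it ignores latency and any overlap with compute. The function names, the default of two all-reduces per layer (one for attention, one for the MLP, forward pass only), and any bandwidth figures you plug in are illustrative assumptions, not measurements.

```python
def ring_all_reduce_bytes_per_device(message_bytes, n_devices):
    """Bytes each device sends in a ring all-reduce of one message."""
    return 2 * (n_devices - 1) / n_devices * message_bytes

def all_reduce_seconds(message_bytes, n_devices, link_bytes_per_s):
    """Bandwidth-only time estimate; latency and compute overlap are ignored."""
    return ring_all_reduce_bytes_per_device(message_bytes, n_devices) / link_bytes_per_s

def per_layer_comm_seconds(tokens, d_model, bytes_per_value, n_devices,
                           link_bytes_per_s, all_reduces_per_layer=2):
    """Forward-pass communication time for one transformer layer.

    The all-reduced message is the block output, tokens x d_model values.
    """
    message_bytes = tokens * d_model * bytes_per_value
    return all_reduces_per_layer * all_reduce_seconds(
        message_bytes, n_devices, link_bytes_per_s)
```

Calling `per_layer_comm_seconds` twice with the same shapes but two different hypothetical link speeds gives times in inverse proportion to the bandwidth, and that cost is paid again in every layer.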

Key idea

Tensor parallelism shards each layer's weight matrices, using a column-then-row split so that the attention and MLP blocks each need just one all-reduce in the forward pass. It pays off only across a fast interconnect.

Check yourself

1. Why pair a column split with a following row split in tensor parallelism?

2. How is multi-head attention split in tensor parallelism?

3. What limits how far tensor parallelism scales?