When one device is not enough
Data parallelism replicates the whole model on each device. When the model itself exceeds a single device's memory, replication is no longer possible and the model must be split across devices. This is model parallelism.
Two main axes for splitting
- Tensor parallelism splits individual layers, sharding each matrix multiply across devices.
- Pipeline parallelism splits the stack by layers, with each device owning a contiguous set of layers.
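The two axes above can be sketched in plain Python, with nested lists standing in for device-resident tensors. This is an illustrative toy, not any framework's API: the helper names (`split_columns`, `tensor_parallel_matmul`, `partition_layers`) are hypothetical, and the tensor-parallel part assumes a column split, where per-device outputs are concatenated. Splitting along the inner dimension instead would require summing partial results across devices.

```python
# Toy illustration: lists stand in for tensors, list indices for devices.

def matmul(x, w):
    """Multiply x (n x k) by w (k x m), both given as lists of rows."""
    cols = list(zip(*w))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in x]


def split_columns(w, num_devices):
    """Tensor parallelism: give each device a contiguous block of w's columns."""
    m = len(w[0])
    assert m % num_devices == 0, "sketch assumes an even split"
    width = m // num_devices
    return [[row[d * width:(d + 1) * width] for row in w]
            for d in range(num_devices)]


def tensor_parallel_matmul(x, w, num_devices):
    """Each device multiplies x by its column shard; outputs are concatenated."""
    partials = [matmul(x, shard) for shard in split_columns(w, num_devices)]
    return [sum((p[i] for p in partials), []) for i in range(len(x))]


def partition_layers(num_layers, num_stages):
    """Pipeline parallelism: assign a contiguous range of layers to each stage."""
    base, extra = divmod(num_layers, num_stages)
    stages, start = [], 0
    for s in range(num_stages):
        size = base + (1 if s < extra else 0)  # earlier stages absorb the remainder
        stages.append(list(range(start, start + size)))
        start += size
    return stages


x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]
# The sharded result matches the single-device result.
assert tensor_parallel_matmul(x, w, 2) == matmul(x, w)
print(partition_layers(10, 4))
```

In the sketch each device holds only half of `w`, while every stage in `partition_layers` owns whole layers. That is the difference between the two axes.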
The cost structure
- Tensor parallelism needs frequent all-reduce communication inside every layer, so it wants a fast interconnect.
- Pipeline parallelism communicates only at stage boundaries, but can leave devices idle while they wait for work from neighboring stages (pipeline bubbles).
- Both reduce per-device memory, because each device holds only a shard of the weights.
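These trade-offs can be estimated with back-of-the-envelope formulas. The sketch below rests on stated assumptions rather than on anything in the text above: a ring all-reduce (each device transfers 2(n-1)/n times the message size), a simple synchronous pipeline schedule with equal stage times (idle fraction (p-1)/(m+p-1) for p stages and m microbatches), and weights sharded evenly. The function names and the 70-billion-parameter example are hypothetical.

```python
def ring_allreduce_bytes_per_device(message_bytes, num_devices):
    """Bytes each device sends in a ring all-reduce (assumed algorithm)."""
    return 2 * (num_devices - 1) / num_devices * message_bytes


def bubble_fraction(num_stages, num_microbatches):
    """Idle fraction of a simple synchronous pipeline with equal stage times."""
    return (num_stages - 1) / (num_microbatches + num_stages - 1)


def weight_bytes_per_device(num_params, bytes_per_param, tp_size, pp_size):
    """Weight memory per device, assuming an even split over tensor and pipeline ranks.

    Counts weights only: activations, gradients, and optimizer state are ignored.
    """
    return num_params * bytes_per_param / (tp_size * pp_size)


# More microbatches shrink the bubble for a fixed number of stages.
for m in (4, 16, 64):
    print(f"stages=4 microbatches={m} bubble={bubble_fraction(4, m):.3f}")

# Hypothetical 70e9-parameter model in 2-byte precision, 8-way tensor x 4-way pipeline.
gib = weight_bytes_per_device(70e9, 2, tp_size=8, pp_size=4) / 2**30
print(f"weights per device: {gib:.2f} GiB")
```

Under these assumptions, the pipeline bubble shrinks as the number of microbatches grows, while the all-reduce volume per device stays close to twice the message size regardless of how many devices share the layer.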
Combining strategies
Large training runs often combine all three approaches into 3D parallelism: data parallelism across replicas, tensor parallelism inside a node where links are fast, and pipeline parallelism across nodes. The goal is to keep every device busy while still fitting the model in memory.
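One way to picture a 3D layout is as a mapping from a flat device rank to a (data, pipeline, tensor) coordinate. The sketch below is one possible layout, not the ordering used by any particular framework: it makes the tensor-parallel index vary fastest, so that when the tensor-parallel size equals the number of devices per node, each tensor-parallel group stays inside a single node. The names `rank_to_coords` and `node_of` are hypothetical.

```python
def rank_to_coords(rank, tp_size, pp_size, dp_size):
    """Map a flat rank to (dp, pp, tp) coordinates, tensor index varying fastest."""
    assert 0 <= rank < tp_size * pp_size * dp_size, "rank out of range"
    tp = rank % tp_size
    pp = (rank // tp_size) % pp_size
    dp = rank // (tp_size * pp_size)
    return {"dp": dp, "pp": pp, "tp": tp}


def node_of(rank, devices_per_node):
    """Physical node hosting a rank, assuming ranks are numbered node by node."""
    return rank // devices_per_node


# Hypothetical cluster: 2 devices per node, 2-way tensor, 2-way pipeline, 2-way data.
TP, PP, DP, PER_NODE = 2, 2, 2, 2
for rank in range(TP * PP * DP):
    print(rank, rank_to_coords(rank, TP, PP, DP), "node", node_of(rank, PER_NODE))
```

With this layout, ranks that differ only in their tensor index share a node, while neighboring pipeline stages and data-parallel replicas sit on different nodes and communicate over the slower links.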
Key idea
When a model exceeds one device, split it: tensor parallelism shards each layer and pipeline parallelism slices the stack, often combined with data parallelism into a 3D strategy.