The Model Parallelism Deep

When one device is not enough

Data parallelism replicates the whole model on each device. But when the model itself exceeds a single device's memory, replication is no longer possible and you must split the model across devices. This is model parallelism.
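As a rough illustration of when a model stops fitting, the sketch below compares the memory needed for the weights with the memory of one device. The parameter count, the 2 bytes per parameter, and the 80 GB device size are illustrative assumptions, and the function names are hypothetical. The estimate ignores gradients, optimizer state, and activations, so the real footprint during training is larger.

```python
def weight_memory_gb(num_params: int, bytes_per_param: int = 2) -> float:
    """Memory needed just to hold the weights, in GB (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


def fits_on_device(num_params: int, device_memory_gb: float,
                   bytes_per_param: int = 2) -> bool:
    """True if the weights alone fit in one device's memory."""
    return weight_memory_gb(num_params, bytes_per_param) <= device_memory_gb


# Assumed numbers: 70 billion parameters at 16-bit precision, one 80 GB device.
params = 70_000_000_000
print(weight_memory_gb(params))      # 140.0
print(fits_on_device(params, 80.0))  # False
```

Under these assumptions the weights alone need 140 GB, so no amount of data parallelism helps: every replica would still have to hold the full model.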

Two main axes to split

Tensor parallelism splits individual layers, sharing each matrix multiply across devices.
Pipeline parallelism splits the stack by layers, with each device owning a contiguous set of layers.
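The two axes can be sketched in plain Python, with lists standing in for tensors and list indices standing in for devices. This is a toy illustration under stated assumptions, not a real distributed implementation: the function names are hypothetical, the sizes are assumed to divide evenly, and the column-wise split shown is only one way to shard a matrix multiply (a row-wise split would sum partial results instead of concatenating them).

```python
def matmul(x, w):
    """Multiply a vector x (length n) by a matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x)))
            for j in range(len(w[0]))]


def split_columns(w, num_devices):
    """Tensor parallelism: give each device a contiguous block of columns."""
    per = len(w[0]) // num_devices  # assumes the columns divide evenly
    return [[row[d * per:(d + 1) * per] for row in w]
            for d in range(num_devices)]


def tensor_parallel_matmul(x, w, num_devices):
    """Each device multiplies x by its own shard; results are concatenated."""
    shards = split_columns(w, num_devices)
    partials = [matmul(x, shard) for shard in shards]  # one result per device
    # Concatenation stands in for the gather step between devices.
    return [value for part in partials for value in part]


def split_layers(num_layers, num_stages):
    """Pipeline parallelism: each stage owns a contiguous range of layers."""
    per = num_layers // num_stages  # assumes the layers divide evenly
    return [list(range(s * per, (s + 1) * per)) for s in range(num_stages)]


x = [1, 2]
w = [[1, 2, 3, 4],
     [5, 6, 7, 8]]
print(matmul(x, w))                     # [11, 14, 17, 20]
print(tensor_parallel_matmul(x, w, 2))  # [11, 14, 17, 20]
print(split_layers(8, 4))               # [[0, 1], [2, 3], [4, 5], [6, 7]]
```

The sharded multiply gives the same result as the unsharded one, but each device stores only half of the matrix. The layer split leaves every layer whole and instead decides which device owns it.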

The cost structure

Tensor parallelism needs frequent all-reduce communication inside every layer, so it wants a fast interconnect.
Pipeline parallelism communicates only at stage boundaries, but it can leave devices idle in pipeline bubbles while they wait for work from other stages.
Both reduce per-device memory, because each device holds only a shard of the weights.
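The pipeline bubble can be estimated with a short calculation. The sketch below assumes a simple schedule in which each batch is cut into equal micro-batches and every stage takes the same time per micro-batch; micro-batching is an assumption brought in for illustration and is not described above. With p stages and m micro-batches, each device is busy for m of the m + p - 1 time slots, so the idle fraction is (p - 1) / (m + p - 1). The second function assumes the weights shard evenly across the tensor-parallel and pipeline-parallel devices. Both function names are hypothetical.

```python
def bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Idle fraction per device for an idealized pipeline schedule."""
    total_slots = num_microbatches + num_stages - 1
    return (num_stages - 1) / total_slots


def params_per_device(num_params: int, tensor_parallel: int,
                      pipeline_parallel: int) -> float:
    """Weights held by one device, assuming perfectly even sharding."""
    return num_params / (tensor_parallel * pipeline_parallel)


print(bubble_fraction(4, 1))    # 0.75
print(bubble_fraction(4, 12))   # 0.2
print(params_per_device(70_000_000_000, 8, 4))  # 2187500000.0
```

Under these assumptions, a 4-stage pipeline fed one micro-batch at a time idles 75% of the time, while 12 micro-batches cut the idle share to 20%. More micro-batches shrink the bubble but never remove it.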

Combining strategies

Large training runs often combine all three as 3D parallelism: data parallelism across replicas, tensor parallelism inside a node where links are fast, and pipeline parallelism across nodes. The goal is to keep every device busy while still fitting the model in memory.
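One way to picture the 3D layout is to map each device's global rank to a (data, pipeline, tensor) coordinate. The sketch below is an illustrative convention, not the layout of any particular framework: it assumes ranks are numbered node by node, that the tensor-parallel degree equals the number of devices per node, and that the tensor index varies fastest so each tensor-parallel group stays inside one node. The function names and sizes are hypothetical.

```python
def rank_to_coords(rank: int, tp: int, pp: int, dp: int):
    """Map a global rank to (dp_rank, pp_rank, tp_rank), tensor index fastest."""
    if not 0 <= rank < tp * pp * dp:
        raise ValueError("rank outside the device grid")
    tp_rank = rank % tp
    pp_rank = (rank // tp) % pp
    dp_rank = rank // (tp * pp)
    return dp_rank, pp_rank, tp_rank


def tensor_group(rank: int, tp: int):
    """Ranks that share a tensor-parallel group with the given rank."""
    start = (rank // tp) * tp
    return list(range(start, start + tp))


# Assumed grid: 8-way tensor, 4-way pipeline, 2-way data parallel = 64 devices.
TP, PP, DP = 8, 4, 2
DEVICES_PER_NODE = 8  # assumed equal to TP

print(rank_to_coords(13, TP, PP, DP))  # (0, 1, 5)
print(tensor_group(13, TP))            # [8, 9, 10, 11, 12, 13, 14, 15]
print({r // DEVICES_PER_NODE for r in tensor_group(13, TP)})  # {1}
```

Because the tensor index varies fastest, all eight ranks in a tensor-parallel group land on the same node, which keeps the most frequent communication on the fastest links. Pipeline stages and data-parallel replicas then span nodes, where communication is less frequent.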

Key idea

When a model exceeds one device, split it: tensor parallelism shards each layer and pipeline parallelism slices the stack, often combined with data parallelism into a 3D strategy.
