What it is
Model parallelism splits a single model across several devices, typically because the model is too large to fit in one GPU's memory. Instead of holding a copy of the whole network, each GPU holds a different part of it.
Two common splits
- Tensor parallelism splits an individual layer. A large matrix multiply is divided so each GPU computes part of the output, then results are combined.
- Pipeline parallelism splits the model into consecutive groups of layers, called stages. The first GPU runs the first stage and passes its activations to the second GPU, which runs the second stage, and so on.
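The tensor-parallel split can be illustrated with a small sketch. It assumes a column-wise split of the weight matrix, which is only one of several possible ways to divide a matrix multiply, and it uses plain Python lists in place of real GPUs. The names `matmul`, `split_columns`, and `tensor_parallel_matmul` are hypothetical helpers written for this illustration, not part of any library.

```python
def matmul(x, w):
    """Multiply x (n x k) by w (k x m); both are lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)]
            for row in x]


def split_columns(w, num_devices):
    """Cut w into num_devices contiguous column shards of equal width."""
    num_cols = len(w[0])
    assert num_cols % num_devices == 0, "columns must divide evenly"
    width = num_cols // num_devices
    return [[row[d * width:(d + 1) * width] for row in w]
            for d in range(num_devices)]


def tensor_parallel_matmul(x, w, num_devices):
    """Simulate a column-split matmul across num_devices devices."""
    shards = split_columns(w, num_devices)
    # Each "device" multiplies the full input by its own shard of the weights.
    partials = [matmul(x, shard) for shard in shards]
    # Combine step: concatenate the partial outputs along the column axis.
    return [sum((p[i] for p in partials), []) for i in range(len(x))]


x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]
# The sharded result matches the ordinary single-device product.
assert tensor_parallel_matmul(x, w, num_devices=2) == matmul(x, w)
```

On real hardware the final concatenation would be a collective communication step between GPUs, which is where the extra communication cost of this split comes from.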
The pipeline bubble
Naive pipeline parallelism wastes time. While the first GPU works on a batch, the later GPUs sit idle waiting for activations. This idle time is called the bubble. The fix is to split each batch into many small micro-batches so that, once the pipeline fills, every stage stays busy. The idle time while the pipeline fills and drains remains, but it becomes a small fraction of the total.
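To see how micro-batches shrink the bubble, here is a rough simulation. It rests on simplifying assumptions that are not in the text above: only the forward pass is modeled, every stage takes exactly one time step per micro-batch, and communication is free. The function names `pipeline_schedule` and `bubble_fraction` are hypothetical.

```python
def pipeline_schedule(num_stages, num_micro_batches):
    """Return grid[t][s]: the micro-batch stage s runs at step t, or None if idle.

    Assumes forward pass only, one time step per stage per micro-batch.
    """
    total_steps = num_micro_batches + num_stages - 1
    grid = []
    for t in range(total_steps):
        row = []
        for s in range(num_stages):
            # Micro-batch mb reaches stage s at step mb + s.
            mb = t - s
            row.append(mb if 0 <= mb < num_micro_batches else None)
        grid.append(row)
    return grid


def bubble_fraction(num_stages, num_micro_batches):
    """Fraction of (step, stage) slots in which a stage sits idle."""
    grid = pipeline_schedule(num_stages, num_micro_batches)
    slots = sum(len(row) for row in grid)
    idle = sum(cell is None for row in grid for cell in row)
    return idle / slots


one_batch = bubble_fraction(num_stages=4, num_micro_batches=1)
many_micro = bubble_fraction(num_stages=4, num_micro_batches=16)
print(one_batch)   # 0.75: with a single batch, each stage idles 3 steps out of 4
print(many_micro)  # 3/19, roughly 0.16
```

Under these assumptions the idle fraction works out to (stages - 1) / (micro-batches + stages - 1), so it falls as the number of micro-batches grows but never reaches zero.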
Model parallelism puts cross-device communication on the critical path of every forward pass, which limits how far it scales efficiently, so it is usually combined with data parallelism rather than used alone.
Key idea
Model parallelism splits one network across devices to fit a huge model, trading extra cross-device communication and pipeline bubbles for the ability to train at all.