Model Parallel Training

What it is

Model parallelism splits a single model across several devices because the model is too large to fit in one GPU's memory. Instead of each GPU holding a copy of the whole network, each GPU holds a different part of it.

Two common splits

Tensor parallelism splits an individual layer. A large matrix multiply is divided so that each GPU computes part of the output, and the partial results are then combined.
Pipeline parallelism splits the network into groups of consecutive layers, called stages. GPU one runs the first stage, passes its activations to GPU two, which runs the second stage, and so on.
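
The tensor-parallel split can be sketched in plain Python. This is a minimal illustration, not a real multi-GPU implementation: the "devices" are simulated in one process, the weight matrix is split by columns (one of several possible splits), and the helper names matmul, split_columns, and tensor_parallel_matmul are hypothetical.

```python
def matmul(x, w):
    """Multiply x (n x k) by w (k x m); both are lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def split_columns(w, num_devices):
    """Split w column-wise into equal shards, one per simulated device."""
    cols = len(w[0])
    assert cols % num_devices == 0, "columns must divide evenly across devices"
    step = cols // num_devices
    return [[row[d * step:(d + 1) * step] for row in w] for d in range(num_devices)]


def tensor_parallel_matmul(x, w, num_devices):
    """Each simulated device multiplies x by its own shard; outputs are concatenated."""
    shards = split_columns(w, num_devices)
    # On real hardware these partial products would run concurrently, one per GPU.
    partials = [matmul(x, shard) for shard in shards]
    # Combine step: join each device's slice of the output columns, row by row.
    return [sum((p[i] for p in partials), []) for i in range(len(x))]


x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]
full = matmul(x, w)                                    # single-device reference
sharded = tensor_parallel_matmul(x, w, num_devices=2)  # two simulated devices
print(sharded == full)
```

Each simulated device stores only half of the weight columns, yet the combined output matches the single-device product. The combine step is where the cross-device communication happens on real hardware.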

The pipeline bubble

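Naive pipeline parallelism wastes time. While GPU one works on a batch, the later GPUs sit idle waiting for activations, and once GPU one hands the batch on, it sits idle in turn. This idle time is called the bubble. The fix is to split each batch into many small micro-batches so that, once the pipeline fills, every stage stays busy. The idle time while the pipeline fills and drains remains, but it shrinks as a share of the total as the number of micro-batches grows.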
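
The effect of micro-batching on the bubble can be estimated with a small simulation. This is a sketch under simplifying assumptions that are not in the text above: every stage takes one time unit per micro-batch, only the forward pass is modeled, and communication is free. The function names pipeline_schedule and bubble_fraction are hypothetical.

```python
def pipeline_schedule(num_stages, num_microbatches):
    """Return grid[stage][t]: the micro-batch processed at time t, or None if idle."""
    total_steps = num_stages + num_microbatches - 1
    grid = [[None] * total_steps for _ in range(num_stages)]
    for stage in range(num_stages):
        for mb in range(num_microbatches):
            # A stage can start a micro-batch one step after the previous stage did.
            grid[stage][stage + mb] = mb
    return grid


def bubble_fraction(grid):
    """Fraction of (stage, time) slots in which a stage sits idle."""
    slots = [cell for row in grid for cell in row]
    return sum(cell is None for cell in slots) / len(slots)


naive = bubble_fraction(pipeline_schedule(num_stages=4, num_microbatches=1))
filled = bubble_fraction(pipeline_schedule(num_stages=4, num_microbatches=16))
print(naive, round(filled, 3))
```

Under these assumptions, four stages fed one batch at a time are idle 75% of the time, while 16 micro-batches cut the idle share to 3/19, roughly 16%. In general this schedule gives an idle share of (p - 1) / (m + p - 1) for p stages and m micro-batches, so the bubble shrinks but never fully disappears.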

Model parallelism puts communication between devices on the critical path of every forward and backward pass, so adding more model-parallel devices yields diminishing returns. It is therefore usually combined with data parallelism rather than used alone.
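
One way to picture the combination is as a grid of devices: each row is a full pipeline (one model replica), and the rows train on different data. The sketch below assumes a simple rank layout in which consecutive ranks form one pipeline. Both this layout and the function name device_grid are illustrative assumptions, not a scheme any particular framework is known to use.

```python
def device_grid(num_devices, num_stages):
    """Map each device rank to (replica, stage); ranks in one replica form a pipeline."""
    assert num_devices % num_stages == 0, "devices must divide evenly into pipelines"
    return {rank: (rank // num_stages, rank % num_stages) for rank in range(num_devices)}


grid = device_grid(num_devices=8, num_stages=4)
num_replicas = len({replica for replica, _ in grid.values()})
# Devices holding the same stage in different replicas average their gradients together.
stage0_peers = [rank for rank, (_, stage) in grid.items() if stage == 0]
print(num_replicas, stage0_peers)
```

With 8 devices and 4 stages, this gives two pipelines of four devices each. Activations flow along a pipeline, while gradient averaging happens only between devices that hold the same stage.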

Key idea

Model parallelism splits one network across devices to fit a huge model, trading extra cross-device communication and pipeline bubbles for the ability to train at all.
