Machine Learning

Model Parallel Training

Split one model across devices when it is too big to fit on a single GPU.

What it is

Model parallelism splits a single model across several devices when the model is too large to fit in a single GPU's memory. Instead of copying the whole network onto every GPU, each GPU holds a different part of it.

Two common splits

  • Tensor parallelism splits an individual layer. A large matrix multiply is divided so each GPU computes part of the output, then results are combined.
  • Pipeline parallelism splits by layer groups, called stages. GPU one runs the first stage, passes its activations to GPU two, which runs the second stage, and so on.
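The tensor-parallel split can be sketched in plain Python, with no real GPUs involved. This is a minimal illustration that assumes a column-wise split of the weight matrix and an even division of columns across devices; the function names (`matmul`, `split_columns`, `tensor_parallel_matmul`) are made up for this example and do not come from any library.

```python
# Simulated column-wise tensor parallelism using nested lists as matrices.
# Each "device" is just a loop iteration; a real system would place each
# shard on its own GPU and combine results with a collective operation.

def matmul(x, w):
    """Multiply a (rows x k) matrix by a (k x cols) matrix."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)]
            for row in x]

def split_columns(w, num_devices):
    """Split the columns of w into num_devices contiguous shards."""
    cols = len(w[0])
    assert cols % num_devices == 0, "this sketch assumes an even split"
    width = cols // num_devices
    return [[row[d * width:(d + 1) * width] for row in w]
            for d in range(num_devices)]

def tensor_parallel_matmul(x, w, num_devices):
    """Compute x @ w by letting each simulated device handle some columns."""
    shards = split_columns(w, num_devices)
    # Every device sees the full input but only its slice of the weights.
    partials = [matmul(x, shard) for shard in shards]
    # Combine step: concatenate each device's output columns, row by row.
    return [sum((p[i] for p in partials), []) for i in range(len(x))]

x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]
full = matmul(x, w)
sharded = tensor_parallel_matmul(x, w, num_devices=2)
print(full == sharded)
```

Each simulated device stores only half of the weight columns, which is the memory saving the split is after. The concatenation at the end stands in for the cross-device communication a real setup would need.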

The pipeline bubble

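Naive pipeline parallelism wastes time. While GPU one works on the first batch, the later GPUs sit idle waiting for activations, and once GPU one hands off its output it sits idle in turn. This idle time is called the bubble. The fix is to split each batch into many small micro-batches so that, once the pipeline fills, every stage stays busy. The bubble does not disappear, because stages still wait while the pipeline fills and drains, but it becomes a smaller share of the total time as the number of micro-batches grows.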
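To see how micro-batches shrink the bubble, here is a small simulation. It is a simplified sketch under stated assumptions: forward passes only, every stage takes exactly one time step per micro-batch, and communication is free. The names `pipeline_schedule` and `bubble_fraction` are illustrative, not part of any framework.

```python
# Toy pipeline schedule: stage s processes micro-batch (t - s) at time step t.

def pipeline_schedule(num_stages, num_micro_batches):
    """Return grid[t][s]: the micro-batch index stage s runs at step t,
    or None when that stage is idle."""
    total_steps = num_stages + num_micro_batches - 1
    grid = []
    for t in range(total_steps):
        row = []
        for s in range(num_stages):
            mb = t - s
            row.append(mb if 0 <= mb < num_micro_batches else None)
        grid.append(row)
    return grid

def bubble_fraction(num_stages, num_micro_batches):
    """Fraction of all (step, stage) slots in which a stage is idle."""
    grid = pipeline_schedule(num_stages, num_micro_batches)
    slots = len(grid) * num_stages
    idle = sum(cell is None for row in grid for cell in row)
    return idle / slots

for m in (1, 4, 16):
    print(m, round(bubble_fraction(4, m), 3))
```

Under these assumptions the idle share works out to (stages - 1) / (micro-batches + stages - 1): with 4 stages, one batch leaves the devices idle 75% of the time, while 16 micro-batches bring that down to about 16%.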

Model parallelism puts communication between devices on the critical path of every forward and backward pass, so it is usually combined with data parallelism rather than used alone.
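One way to picture the combination with data parallelism is as a grid of devices: each row is one copy of the model split into pipeline stages, and each column holds the same stage across copies. The sketch below assumes a simple row-major numbering of device ids; the helper names are hypothetical, and real frameworks may lay out devices differently.

```python
# Hypothetical device layout for pipeline + data parallelism.

def build_device_grid(num_replicas, num_stages):
    """grid[r][s] is the device id holding stage s of model replica r."""
    return [[r * num_stages + s for s in range(num_stages)]
            for r in range(num_replicas)]

def pipeline_groups(grid):
    """Devices in one row pass activations to each other, stage to stage."""
    return [list(row) for row in grid]

def data_parallel_groups(grid):
    """Devices in one column hold the same stage and average its gradients."""
    return [list(col) for col in zip(*grid)]

grid = build_device_grid(num_replicas=2, num_stages=4)
print(pipeline_groups(grid))       # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(data_parallel_groups(grid))  # [[0, 4], [1, 5], [2, 6], [3, 7]]
```

With 8 devices arranged this way, each replica only needs 4 devices' worth of memory for the model, and the two replicas process different data in parallel.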

Key idea

Model parallelism splits one network across devices to fit a huge model, trading extra cross-device communication and pipeline bubbles for the ability to train at all.

Check yourself

1. Why would you use model parallelism instead of data parallelism?

2. What is the pipeline bubble?