When the model will not fit
Some networks are too large to fit in the memory of a single accelerator. Model parallelism splits the model itself across devices, so each device stores only part of the parameters and runs only that part of the computation.
- Device boundaries cut through the model, not the data.
- Activations must travel between devices during the forward pass.
- Gradients flow back across the same boundaries in the backward pass.
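A rough sketch of the memory arithmetic behind "will not fit" is shown below. The model size, the 2 bytes per parameter, and the 16 GB device are hypothetical numbers chosen for illustration, and the helper names are invented here. The sketch counts weights only; gradients, optimizer state, and activations would add to the real footprint.

```python
GB = 10**9

def weight_bytes(num_params, bytes_per_param=2):
    """Memory needed just to store the weights (assumes 2 bytes per parameter)."""
    return num_params * bytes_per_param

def fits(num_params, device_bytes, num_devices=1, bytes_per_param=2):
    """True if an even split of the weights fits on each device.

    Ignores gradients, optimizer state, and activations, so this is
    an optimistic lower bound, not a real capacity check.
    """
    per_device = weight_bytes(num_params, bytes_per_param) / num_devices
    return per_device <= device_bytes

params = 10 * 10**9   # hypothetical 10-billion-parameter model
device = 16 * GB      # hypothetical 16 GB accelerator

print(fits(params, device, num_devices=1))  # False: 20 GB of weights on one device
print(fits(params, device, num_devices=2))  # True: 10 GB of weights per device
```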
The trade-off
Model parallelism makes it possible to train networks that exceed one device's memory, but it introduces a serial dependency. A later partition cannot start until the earlier one passes its activations forward, so devices can sit idle waiting on each other.
- It solves a memory problem, not always a speed problem.
- Naive splits leave devices underused.
- Pipeline and tensor variants exist to reduce that idle time.
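The idle time can be made concrete with a toy timeline. The sketch below is an idealized model, not a real scheduler: it assumes a forward pass only, that every stage takes one time unit per micro-batch, and that communication is free. It also assumes one common pipeline approach, splitting the batch into micro-batches so a later stage can start on the first piece while the earlier stage works on the next. The function names are invented for illustration.

```python
def utilization(num_stages, num_microbatches=1):
    """Fraction of time each device is busy in the idealized pipeline."""
    total_steps = num_microbatches + num_stages - 1
    return num_microbatches / total_steps

def timeline(num_stages, num_microbatches=1):
    """One string per device: a digit is the micro-batch being processed, '.' is idle.

    Keep num_microbatches at 10 or fewer so each index is a single character.
    """
    total_steps = num_microbatches + num_stages - 1
    rows = []
    for stage in range(num_stages):
        row = ""
        for t in range(total_steps):
            mb = t - stage  # stage s sees micro-batch mb at time mb + s
            row += str(mb) if 0 <= mb < num_microbatches else "."
        rows.append(row)
    return rows

# Naive two-way split: the whole batch moves as one unit.
print(timeline(2, 1), utilization(2, 1))   # ['0.', '.0'] 0.5

# Same split with four micro-batches: the stages overlap.
print(timeline(2, 4), utilization(2, 4))   # ['0123.', '.0123'] 0.8
```

In this toy model the naive split keeps each of two devices busy half the time, and four micro-batches raise that to 80 percent. Real schedules also have to account for the backward pass and for transfer cost.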
A two-way split
In a two-way split, each half of the model lives on its own device, and the activations crossing the boundary are the cost you pay.
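A minimal sketch of such a split, in plain Python with no real hardware involved, might look like the following. The `Part` class, the `transfer` helper, the device labels, and the weights are all hypothetical; `transfer` stands in for whatever device-to-device copy a real framework would perform and simply records how many values crossed the boundary.

```python
class Part:
    """One partition of the model: a linear layer that lives on a single device."""

    def __init__(self, weights, device):
        self.weights = weights  # list of rows, one row per output unit
        self.device = device    # label only; nothing is actually placed on hardware

    def forward(self, x):
        return [sum(w * xi for w, xi in zip(row, x)) for row in self.weights]

def transfer(values, src, dst, log):
    """Stand-in for a device-to-device copy; logs (src, dst, number of values)."""
    log.append((src, dst, len(values)))
    return list(values)

# First half: 3 inputs -> 2 hidden units. Second half: 2 hidden units -> 1 output.
part_a = Part([[1.0, 0.0, 2.0], [0.5, 1.0, 0.0]], device="dev0")
part_b = Part([[1.0, -1.0]], device="dev1")

def forward(x, log):
    hidden = part_a.forward(x)                                    # computed on dev0
    hidden = transfer(hidden, part_a.device, part_b.device, log)  # boundary crossing
    return part_b.forward(hidden)                                 # computed on dev1

log = []
out = forward([1.0, 2.0, 3.0], log)
print(out)  # [4.5]
print(log)  # [('dev0', 'dev1', 2)]
```

Only the two hidden activations cross the boundary here, and `part_b` cannot begin until they arrive. A backward pass would send the corresponding gradients across the same boundary in the opposite direction.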
Key idea
Model parallelism partitions one model across devices to fit large networks, trading extra activation communication and possible idle time for memory headroom.