Model Parallelism

When the model will not fit

Some networks are too large to fit on a single accelerator. Model parallelism splits the model itself across devices, so each device stores and computes only a part of the parameters.

Device boundaries cut through the model, not the data.
Activations must travel between devices during the forward pass.
Gradients flow back across the same boundaries in the backward pass.
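
To make "each device stores only a part of the parameters" concrete, here is a rough sketch of the arithmetic. It assumes an even split, 4 bytes per parameter (fp32), and counts weights only; optimizer state, gradients, and activations are ignored. The function name and the 10-billion-parameter model are hypothetical, chosen for illustration.

```python
def param_bytes_per_device(num_params, num_devices, bytes_per_param=4):
    """Parameter storage per device, assuming an even split (rounded up)."""
    per_device_params = -(-num_params // num_devices)  # ceiling division
    return per_device_params * bytes_per_param


GIB = 1024 ** 3
total_params = 10_000_000_000  # hypothetical model size, not from the text

for devices in (1, 2, 4, 8):
    gib = param_bytes_per_device(total_params, devices) / GIB
    print(f"{devices} device(s): {gib:.2f} GiB of weights each")
```

Real memory use is higher than this estimate, but the point stands: the per-device share shrinks as the model is cut into more parts.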

The trade-off

Model parallelism unlocks training of huge networks, but it introduces a serial dependency. A later partition cannot start until the earlier one passes its activations forward, so devices can sit idle waiting on each other.

It solves a memory problem, not always a speed problem.
Naive splits leave devices underused.
Pipeline and tensor variants exist to reduce that idle time.
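
The idle time can be estimated with a toy schedule. The sketch below assumes every partition takes one time step per micro-batch, looks at the forward pass only, and treats communication as free; these are simplifications, and `pipeline_utilization` is a made-up helper, not a library function. With a single batch it models the naive split; with several micro-batches it models a simple pipeline.

```python
def pipeline_utilization(num_stages, num_microbatches):
    """Fraction of device-time spent computing in a toy forward-only schedule.

    Stage s cannot start micro-batch m before step s + m, because it must
    wait for the previous stage to hand over that micro-batch's activations.
    """
    busy = [0] * num_stages
    finish = 0
    for m in range(num_microbatches):
        for s in range(num_stages):
            start = s + m          # earliest step at which this work can run
            busy[s] += 1           # one unit of compute on stage s
            finish = max(finish, start + 1)
    return sum(busy) / (num_stages * finish)


naive = pipeline_utilization(num_stages=4, num_microbatches=1)
piped = pipeline_utilization(num_stages=4, num_microbatches=8)
print(f"naive split: {naive:.0%} busy, pipelined: {piped:.0%} busy")
```

Under these assumptions the naive four-way split keeps each device busy a quarter of the time, and splitting the batch into micro-batches raises that fraction without removing the idle start-up and drain phases entirely.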

A two-way split

In the simplest case the model is cut into two parts. Each part lives on its own device, and the activations crossing the boundary (and the gradients coming back) are the cost you pay.
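
A minimal sketch of such a split, in plain Python so it runs anywhere. The two "devices" are just objects, and `Link` is a hypothetical stand-in for the interconnect that counts how many values cross the boundary. The tiny two-layer network, its weights, and all class names are assumptions made for illustration.

```python
class Link:
    """Stand-in for the interconnect; counts values crossing the boundary."""

    def __init__(self):
        self.values_sent = 0

    def send(self, vec):
        self.values_sent += len(vec)
        return list(vec)  # copy, as if placed in the other device's memory


class PartA:
    """'Device 0': first layer, h = ReLU(W1 x)."""

    def __init__(self, w1):
        self.w1 = w1

    def forward(self, x):
        self.x = x
        self.pre = [sum(w * xi for w, xi in zip(row, x)) for row in self.w1]
        return [max(0.0, p) for p in self.pre]

    def backward(self, grad_h):
        grad_pre = [g if p > 0 else 0.0 for g, p in zip(grad_h, self.pre)]
        return [[g * xi for xi in self.x] for g in grad_pre]  # dL/dW1


class PartB:
    """'Device 1': second layer, scalar output out = w2 . h."""

    def __init__(self, w2):
        self.w2 = w2

    def forward(self, h):
        self.h = h
        return sum(w * hi for w, hi in zip(self.w2, h))

    def backward(self, grad_out):
        grad_w2 = [grad_out * hi for hi in self.h]
        grad_h = [grad_out * w for w in self.w2]  # goes back across the link
        return grad_w2, grad_h


link = Link()
part_a = PartA(w1=[[1.0, -1.0], [0.5, 2.0]])
part_b = PartB(w2=[2.0, -1.0])

# Forward pass: part B cannot start until part A's activations arrive.
h = link.send(part_a.forward([1.0, 2.0]))
out = part_b.forward(h)

# Backward pass: gradients cross the same boundary in the other direction.
grad_w2, grad_h = part_b.backward(1.0)  # treat the output itself as the loss
grad_w1 = part_a.backward(link.send(grad_h))

print(out, grad_w1, grad_w2, link.values_sent)
```

Neither part ever holds the other's weights; only the hidden activations and their gradients travel through `link`, which is the communication cost described above.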

Key idea

Model parallelism partitions one model across devices to fit large networks, trading extra activation communication and possible idle time for memory headroom.
