Machine Learning

Model Parallelism

Split a model too big for one device across several devices.

When the model will not fit

Some networks are too large to fit on a single accelerator. Model parallelism splits the model itself across devices, so each device stores and computes only a part of the parameters.

  • Device boundaries cut through the model, not the data.
  • Activations must travel between devices during the forward pass.
  • Gradients flow back across the same boundaries in the backward pass.
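To make the partitioning concrete, here is a minimal sketch in plain Python that assigns a model's layers to devices in contiguous blocks and reports how much parameter memory each device would hold. The layer sizes, the `partition_layers` helper, and the 4 bytes per parameter (32-bit floats) are illustrative assumptions, not values from any particular model or framework.

```python
BYTES_PER_PARAM = 4  # assumption: parameters stored as 32-bit floats


def partition_layers(layer_params, num_devices):
    """Split per-layer parameter counts into contiguous blocks, one per device.

    Hypothetical greedy helper: walk the layers in order and close a block
    once it reaches its fair share of the total parameter count.
    """
    target = sum(layer_params) / num_devices
    blocks, current = [], []
    for count in layer_params:
        current.append(count)
        # Leave the final block open so it collects all remaining layers.
        if sum(current) >= target and len(blocks) < num_devices - 1:
            blocks.append(current)
            current = []
    blocks.append(current)
    return blocks


# Hypothetical model: four layers with 4 million parameters each.
layer_params = [4_000_000, 4_000_000, 4_000_000, 4_000_000]
blocks = partition_layers(layer_params, num_devices=2)

for device_id, block in enumerate(blocks):
    memory_bytes = sum(block) * BYTES_PER_PARAM
    print(f"device {device_id}: {len(block)} layers, {memory_bytes / 1e6:.1f} MB of parameters")
```

Each device ends up holding half of the parameters, which is the memory saving the split is meant to deliver. The sketch counts parameters only; activations, gradients, and optimizer state would add to the real footprint.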

The trade-off

Model parallelism unlocks training of huge networks, but it introduces a serial dependency. A later partition cannot start until the earlier one passes its activations forward, so devices can sit idle waiting on each other.

  • It solves a memory problem, not necessarily a speed problem.
  • Naive splits leave devices underused.
  • Pipeline and tensor variants exist to reduce that idle time.
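The idle time can be estimated with a toy schedule. The sketch below assumes a simplified model: forward pass only, every stage takes one time step, and communication is free. It compares a naive split (one batch moving through the stages) with a pipeline variant that cuts the batch into microbatches so that several stages work at once. The `schedule` and `utilization` helpers are hypothetical names for this illustration.

```python
def schedule(num_stages, num_microbatches):
    """Build grid[t][d]: the microbatch device d works on at step t, or None if idle."""
    total_steps = num_microbatches + num_stages - 1
    grid = []
    for t in range(total_steps):
        row = []
        for d in range(num_stages):
            m = t - d  # microbatch m reaches stage d at step m + d
            row.append(m if 0 <= m < num_microbatches else None)
        grid.append(row)
    return grid


def utilization(grid):
    """Fraction of (step, device) slots in which a device is doing work."""
    cells = [cell for row in grid for cell in row]
    busy = sum(cell is not None for cell in cells)
    return busy / len(cells)


naive = utilization(schedule(num_stages=4, num_microbatches=1))
pipelined = utilization(schedule(num_stages=4, num_microbatches=8))
print(f"naive split:      {naive:.0%} busy")
print(f"8 microbatches:   {pipelined:.0%} busy")
```

Under these assumptions a naive four-way split keeps each device busy a quarter of the time, while eight microbatches raise that to 8/11. Real schedules also include the backward pass and transfer costs, so treat the numbers as a rough guide only.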

A two-way split

In a two-way split, each part of the model lives on its own device, and the activations that cross the boundary between them are the communication cost you pay.
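A two-way split can be sketched without any accelerator at all. In the plain-Python example below, the `Device` class and the `send` function are hypothetical stand-ins for real hardware and a real transfer primitive, and the tiny weight matrices are made up for illustration. The point is only to show where the boundary sits and what has to cross it.

```python
class Device:
    """Stand-in for an accelerator that owns the weights of one part of the model."""

    def __init__(self, name, weights):
        self.name = name
        self.weights = weights  # one list of input weights per output unit

    def forward(self, x):
        # Dense layer followed by ReLU, computed only from this device's weights.
        return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in self.weights]


def send(activations, src, dst, log):
    """Simulated transfer: record how many values cross the device boundary."""
    log.append((src.name, dst.name, len(activations)))
    return list(activations)  # a copy, as if it now lived in dst's memory


# First part of the model (2 inputs -> 3 hidden units) lives on dev0.
dev0 = Device("dev0", [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
# Second part (3 hidden units -> 1 output) lives on dev1.
dev1 = Device("dev1", [[1.0, -1.0, 0.5]])

transfers = []
hidden = dev0.forward([2.0, 3.0])             # computed on dev0
hidden = send(hidden, dev0, dev1, transfers)  # activations cross the boundary
output = dev1.forward(hidden)                 # computed on dev1

print(output)     # [1.5]
print(transfers)  # [('dev0', 'dev1', 3)]
```

Note that `dev1` cannot begin until `send` has delivered the hidden activations, which is the serial dependency described earlier. In the backward pass, gradients with respect to those three hidden values would travel in the opposite direction.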

Key idea

Model parallelism partitions one model across devices to fit large networks, trading extra activation communication and possible idle time for memory headroom.

Check yourself

1. What problem does model parallelism primarily solve?

2. What is the main downside of naive model parallelism?