Machine Learning

The Pipeline Parallelism Deep

Slicing the layer stack across devices and streaming microbatches to fill bubbles.

6 min read · advanced

Slicing by depth

Pipeline parallelism gives each device a contiguous block of layers, called a stage. A batch flows forward through the stages and gradients flow back, like an assembly line for the network.
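As a rough illustration of slicing by depth, the sketch below assigns contiguous layer indices to stages as evenly as possible by layer count. The function name `split_into_stages` and the even-count policy are assumptions made for this example; real frameworks may split by measured cost instead.

```python
def split_into_stages(num_layers: int, num_stages: int) -> list[range]:
    """Assign contiguous layer indices to stages, as evenly as possible by count."""
    if not 1 <= num_stages <= num_layers:
        raise ValueError("need 1 <= num_stages <= num_layers")
    base, extra = divmod(num_layers, num_stages)
    stages = []
    start = 0
    for s in range(num_stages):
        # The first `extra` stages take one additional layer each.
        size = base + (1 if s < extra else 0)
        stages.append(range(start, start + size))
        start += size
    return stages


stages = split_into_stages(num_layers=10, num_stages=4)
print([list(r) for r in stages])  # [[0, 1, 2], [3, 4, 5], [6, 7], [8, 9]]
```

Each `range` is one stage: device 0 would hold layers 0 to 2, device 1 layers 3 to 5, and so on.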

The bubble problem

If a whole batch traverses one stage at a time, only one device is working at any moment while the rest sit idle. This wasted time is the pipeline bubble.

  • The bubble fraction grows with the number of stages.
  • It shrinks as the batch is split into more microbatches.
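Both trends can be made concrete under a simplifying assumption: every microbatch takes the same time on every stage, and communication is free. With p stages and m microbatches, each device is then busy for m time slots out of m + p - 1, so the idle fraction is (p - 1) / (m + p - 1). The helper `bubble_fraction` below is a hypothetical name for this idealized model, not a measurement of any real system.

```python
from fractions import Fraction


def bubble_fraction(num_stages: int, num_microbatches: int) -> Fraction:
    """Idle fraction of a device in an idealized pipeline with equal stage times."""
    p, m = num_stages, num_microbatches
    if p < 1 or m < 1:
        raise ValueError("need at least one stage and one microbatch")
    # Each device works for m slots out of a total of m + p - 1.
    return Fraction(p - 1, m + p - 1)


# More microbatches shrink the bubble for a fixed number of stages.
for m in (1, 4, 16, 64):
    print(f"stages=4 microbatches={m:2d} bubble={float(bubble_fraction(4, m)):.3f}")

# More stages grow it for a fixed number of microbatches.
print(bubble_fraction(4, 4), bubble_fraction(8, 4))  # 3/7 7/11
```

With a single microbatch and 4 stages the model gives 3/4: three of every four time slots on a device are idle.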

Scheduling to fill it

  • Splitting a batch into many microbatches keeps several stages busy simultaneously.
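  • The one-forward-one-backward (1F1B) schedule alternates forward and backward passes so each stage holds activations for at most about as many microbatches as there are stages, cutting peak memory. In its basic form the bubble is the same size as when all forward passes run before any backward pass.
  • Interleaved schedules assign each device several non-contiguous layer chunks to shrink the bubble further.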
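To make the one forward one backward (1F1B) schedule more tangible, here is a minimal sketch of the order of operations on a single stage in the basic, non-interleaved variant. It assumes the common warm-up, steady-state, cool-down structure; the names `one_f_one_b` and `peak_in_flight` are made up for this example, and the sketch models only the ordering, not timing or communication.

```python
def one_f_one_b(stage: int, num_stages: int, num_microbatches: int) -> list[str]:
    """Operation order on one stage: 'F3' is the forward pass of microbatch 3."""
    # Earlier stages run more warm-up forwards before their first backward.
    warmup = min(num_stages - stage - 1, num_microbatches)
    steady = num_microbatches - warmup
    ops = []
    f = b = 0
    for _ in range(warmup):          # warm-up: forwards only
        ops.append(f"F{f}")
        f += 1
    for _ in range(steady):          # steady state: one forward, one backward
        ops.append(f"F{f}")
        f += 1
        ops.append(f"B{b}")
        b += 1
    for _ in range(warmup):          # cool-down: drain remaining backwards
        ops.append(f"B{b}")
        b += 1
    return ops


def peak_in_flight(ops: list[str]) -> int:
    """Largest number of microbatches whose activations are held at once."""
    live = peak = 0
    for op in ops:
        live += 1 if op.startswith("F") else -1
        peak = max(peak, live)
    return peak


for stage in range(4):
    ops = one_f_one_b(stage, num_stages=4, num_microbatches=8)
    print(stage, peak_in_flight(ops), " ".join(ops))
```

With 4 stages and 8 microbatches, the first stage peaks at 4 in-flight microbatches and the last stage at 1, whereas running all 8 forwards before any backward would hold 8 on every stage.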

Trade-offs

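  • Communication is small: only activations, and their gradients on the backward pass, cross stage boundaries.
  • But peak memory grows with the number of in-flight microbatches, because each stage keeps their activations until the matching backward pass.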
  • Balancing stage compute matters; an uneven split stalls the whole line.
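The balancing point can be sketched with made-up numbers. In steady state the slowest stage sets the pace, so a good split minimizes the largest per-stage cost. The per-layer costs below are hypothetical, and the brute-force search over cut points is only practical for small examples; it is meant as an illustration, not as how any particular framework partitions a model.

```python
from itertools import combinations


def stage_costs(layer_costs: list[int], cuts: tuple[int, ...]) -> list[int]:
    """Total cost per stage when the layer list is cut before each index in `cuts`."""
    bounds = [0, *cuts, len(layer_costs)]
    return [sum(layer_costs[a:b]) for a, b in zip(bounds, bounds[1:])]


def best_contiguous_split(layer_costs: list[int], num_stages: int) -> tuple[int, ...]:
    """Cut points minimizing the slowest stage (exhaustive search, small inputs only)."""
    n = len(layer_costs)
    return min(
        combinations(range(1, n), num_stages - 1),
        key=lambda cuts: max(stage_costs(layer_costs, cuts)),
    )


costs = [4, 1, 1, 1, 1, 2, 2, 4]   # hypothetical per-layer times
even_by_count = (2, 4, 6)          # two layers per stage
print(stage_costs(costs, even_by_count))   # [5, 2, 3, 6]: bottleneck is 6

best = best_contiguous_split(costs, num_stages=4)
print(best, stage_costs(costs, best))      # (1, 5, 7) [4, 4, 4, 4]
```

Splitting evenly by layer count leaves one stage at cost 6 while another does only 2; the cost-aware split brings every stage to 4, so no stage stalls the line.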

Key idea

Pipeline parallelism splits layers into stages and streams many microbatches through them to shrink the idle bubble, uses a one-forward-one-backward schedule to cap activation memory, and pays only boundary communication.

Check yourself

1. What is the pipeline bubble?

2. What shrinks the pipeline bubble?

3. What does pipeline parallelism communicate between stages?