Machine Learning

Sequence Parallelism

Splitting the sequence dimension to shave activation memory in long-context training.

A memory problem tensor parallelism misses

Tensor parallelism shards the matrix multiplies, but the layer norm and dropout regions still hold a full copy of the activations on every device. With long sequences, those replicated activations can take up a large share of memory.

The idea

Sequence parallelism splits the activations along the sequence length dimension in exactly those replicated regions.

  • Each device holds activations for only its slice of tokens there.
  • It pairs with tensor parallelism, switching between the two layouts inside each block.
  • The switch uses all-gather and reduce-scatter collectives in place of the all-reduce that plain tensor parallelism uses.
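
The layout switch described above can be sketched with a toy simulation. This is only an illustration under stated assumptions: each "device" is an entry in a Python list, each token's activation is a single float instead of a hidden-size vector, and the function names are hypothetical helpers, not any library's API.

```python
# Toy simulation: partials[d][t] is device d's partial sum for token t,
# as produced by a row-parallel matmul under tensor parallelism.

def all_reduce(partials):
    """Sum across devices; every device ends up with the full sequence."""
    total = [sum(vals) for vals in zip(*partials)]
    return [list(total) for _ in partials]

def reduce_scatter(partials):
    """Sum across devices, then leave each device only its slice of tokens."""
    n = len(partials)
    chunk = len(partials[0]) // n  # assumes seq_len divides evenly by device count
    total = [sum(vals) for vals in zip(*partials)]
    return [total[d * chunk:(d + 1) * chunk] for d in range(n)]

def all_gather(shards):
    """Concatenate every device's slice so each device holds the full sequence."""
    full = [x for shard in shards for x in shard]
    return [list(full) for _ in shards]

# Two devices, four tokens.
partials = [[1.0, 2.0, 3.0, 4.0],
            [10.0, 20.0, 30.0, 40.0]]

seq_sharded = reduce_scatter(partials)  # sequence-parallel layout
regathered = all_gather(seq_sharded)    # full sequence again, ready for the matmuls

print(seq_sharded)                         # [[11.0, 22.0], [33.0, 44.0]]
print(regathered == all_reduce(partials))  # True
```

Between the reduce-scatter and the all-gather, each device holds only its own tokens, which is where the layer norm and dropout would run in this picture.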

Why it pays off

  • Activation memory in the norm and dropout regions drops by a factor equal to the tensor-parallel degree.
  • It enables longer context or larger batches without extra devices.
  • Total communication volume stays about the same as plain tensor parallelism, since an all-reduce is equivalent to a reduce-scatter followed by an all-gather.
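
A rough back-of-the-envelope check of the first and last points, as a sketch only. The tensor sizes are illustrative and not from the lesson, the helper names are hypothetical, and the communication formula assumes ring-based collectives, where each device sends (n - 1) / n of the tensor for a reduce-scatter or an all-gather and twice that for an all-reduce.

```python
def replicated_region_elements(seq_len, batch, hidden, tp_degree, sequence_parallel):
    """Elements per device for ONE activation tensor in a norm/dropout region."""
    full = seq_len * batch * hidden
    return full // tp_degree if sequence_parallel else full

def ring_volume_per_device(num_elements, n, collective):
    """Elements each device sends, assuming a ring algorithm over n devices."""
    step = (n - 1) * num_elements / n
    return {"all_reduce": 2 * step,
            "reduce_scatter": step,
            "all_gather": step}[collective]

# Illustrative sizes (assumptions, not from the lesson).
seq_len, batch, hidden, tp = 8192, 1, 4096, 8

without_sp = replicated_region_elements(seq_len, batch, hidden, tp, False)
with_sp = replicated_region_elements(seq_len, batch, hidden, tp, True)

full = seq_len * batch * hidden
all_reduce_volume = ring_volume_per_device(full, tp, "all_reduce")
split_volume = (ring_volume_per_device(full, tp, "reduce_scatter")
                + ring_volume_per_device(full, tp, "all_gather"))

print(without_sp // with_sp)              # memory shrinks by the parallel degree
print(all_reduce_volume == split_volume)  # same traffic either way under this model
```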

Relation to context parallelism

For very long contexts, context parallelism extends the idea to the attention computation itself, sharding keys and values across the sequence so no single device holds the full attention matrix.
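
To make the sharded-attention idea concrete, here is a minimal sketch for a single query vector. It assumes each device computes attention over its own key/value shard and the partial results are merged with a running-max rescaling of the softmax. The lesson does not specify a merge method, so treat this as one possible approach; the function names are hypothetical and the usual 1/sqrt(d) score scaling is omitted for brevity.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def full_attention(q, keys, values):
    """Reference: softmax(q . K) applied to V on a single device."""
    scores = [dot(q, k) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    denom = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / denom for i in range(dim)]

def local_partial(q, keys, values):
    """What one device computes from its own key/value shard."""
    scores = [dot(q, k) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    dim = len(values[0])
    numer = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return m, sum(weights), numer

def merge_partials(partials):
    """Combine per-device (max, denominator, numerator) triples."""
    global_max = max(m for m, _, _ in partials)
    dim = len(partials[0][2])
    denom, numer = 0.0, [0.0] * dim
    for m, d, n in partials:
        scale = math.exp(m - global_max)  # rescale to the shared maximum
        denom += scale * d
        for i in range(dim):
            numer[i] += scale * n[i]
    return [x / denom for x in numer]

q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [-1.0, 3.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

# Two "devices", each holding half of the keys and values.
shards = [(keys[:2], values[:2]), (keys[2:], values[2:])]
merged = merge_partials([local_partial(q, k, v) for k, v in shards])
reference = full_attention(q, keys, values)
```

In this sketch no device ever materialises scores for more than its own shard, yet `merged` agrees with `reference` up to floating-point error.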

Key idea

Sequence parallelism shards activations along the sequence length in the layer norm and dropout regions, cutting activation memory and pairing with tensor parallelism via all-gather and reduce-scatter.

Check yourself

1. Which activations does sequence parallelism target?

2. What collectives connect the tensor and sequence parallel layouts?