Sequence Parallelism

Splitting the sequence dimension to cut activation memory in long-context training.

A memory problem tensor parallelism misses

Tensor parallelism shards the matrix multiplies, but the layer norm and dropout regions still keep a full copy of the activations on every device. With long sequences, those replicated activations come to dominate memory.

The idea

Sequence parallelism splits the activations along the sequence length dimension in exactly those replicated regions.

In those regions, each device holds activations for only its slice of tokens.
It pairs with tensor parallelism, switching between the two layouts inside each block.
The switch uses all-gather and reduce-scatter collectives in place of a full all-reduce.
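
The layout switch described above can be sketched as a toy, single-process simulation. Everything here is an illustrative assumption rather than a real framework API: the "devices" are plain list entries, the helper names (all_gather, reduce_scatter, tp_partial) are hypothetical, and the tensor-parallel region is reduced to two tiny linear layers with made-up weights.

```python
# Toy simulation: each "device" is one entry in a Python list.
# An activation is a list of rows (tokens), each row a list of hidden values.

def all_gather(shards):
    """Sequence-parallel -> tensor-parallel layout:
    every device receives the concatenation of all sequence shards."""
    full = [row for shard in shards for row in shard]
    return [list(full) for _ in shards]

def reduce_scatter(partials):
    """Tensor-parallel -> sequence-parallel layout:
    sum the per-device partial outputs elementwise, then hand device i
    only the i-th slice of the sequence."""
    p = len(partials)
    chunk = len(partials[0]) // p
    summed = [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]
    return [summed[i * chunk:(i + 1) * chunk] for i in range(p)]

def tp_partial(x_full, a_col, b_row):
    """One device's share of X @ A @ B when it owns one column of A
    (column-parallel linear) and one row of B (row-parallel linear)."""
    out = []
    for row in x_full:
        z = sum(x * a for x, a in zip(row, a_col))
        out.append([z * b for b in b_row])
    return out

# Two devices, four tokens, hidden size two (hypothetical values).
x_shards = [[[1.0, 2.0], [3.0, 4.0]],   # device 0 holds tokens 0-1
            [[5.0, 6.0], [7.0, 8.0]]]   # device 1 holds tokens 2-3
a_cols = [[1.0, 0.0], [2.0, 1.0]]       # columns of A, one per device
b_rows = [[1.0, 0.0], [1.0, 1.0]]       # rows of B, one per device

gathered = all_gather(x_shards)                       # enter the TP region
partials = [tp_partial(gathered[d], a_cols[d], b_rows[d]) for d in range(2)]
out_shards = reduce_scatter(partials)                 # back to sequence shards

# An all-reduce would leave the full summed output on every device;
# gathering the scattered result reproduces exactly that.
all_reduced = all_gather(out_shards)
print(out_shards)
```

Each device ends the block holding only its two output tokens, while the full-sequence tensor exists only between the all-gather and the reduce-scatter.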

Why it pays off

Activation memory in the norm and dropout regions shrinks by a factor equal to the tensor-parallel degree.
It enables longer context or larger batches without extra devices.
Total communication volume stays about the same as plain tensor parallelism, since an all-reduce is equivalent to a reduce-scatter followed by an all-gather.
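
A back-of-the-envelope calculation makes the memory and communication claims concrete. The sizes below (32,768 tokens, hidden size 8,192, 16-bit activations, 8 devices) are hypothetical, the function names are made up for this sketch, and the communication count assumes ring-style collectives, where each device sends (p - 1)/p of the tensor for a reduce-scatter or an all-gather and twice that for an all-reduce.

```python
def activation_bytes(seq_len, batch, hidden, bytes_per_elem=2):
    """Size in bytes of one [seq_len, batch, hidden] activation tensor."""
    return seq_len * batch * hidden * bytes_per_elem

def ring_sent_elems(num_elems, p):
    """Elements each device sends per collective, assuming ring algorithms."""
    step = (p - 1) * num_elems // p
    return {"reduce_scatter": step, "all_gather": step, "all_reduce": 2 * step}

# Hypothetical configuration, chosen only for illustration.
seq_len, batch, hidden, tp = 32768, 1, 8192, 8

# One activation tensor in a norm/dropout region, per device.
replicated = activation_bytes(seq_len, batch, hidden)        # tensor parallelism only
sharded = activation_bytes(seq_len // tp, batch, hidden)     # with sequence parallelism

print(replicated // 2**20, "MiB replicated on every device")
print(sharded // 2**20, "MiB per device once the sequence is split")

# One all-reduce versus one reduce-scatter plus one all-gather.
vol = ring_sent_elems(seq_len * batch * hidden, tp)
print(vol["all_reduce"], vol["reduce_scatter"] + vol["all_gather"])
```

Under these assumptions the per-device tensor shrinks by the tensor-parallel degree, and the two collectives together send exactly as many elements as the all-reduce they replace.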

Relation to context parallelism

For very long contexts, context parallelism extends the idea to the attention computation itself, sharding keys and values across the sequence so no single device holds the full attention matrix.
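
One way to see why sharded keys and values suffice is a running-softmax accumulation: a query visits one key/value shard at a time and rescales its partial result, so the full row of attention scores is never held at once. The sketch below is an assumption about how such a scheme can work, not a description of any particular library; the function names and the tiny inputs are hypothetical, and the loop over shards stands in for shards living on different devices.

```python
import math

def scaled_scores(q, keys):
    """Dot-product scores of one query against a set of keys, scaled by 1/sqrt(d)."""
    scale = 1.0 / math.sqrt(len(q))
    return [scale * sum(a * b for a, b in zip(q, k)) for k in keys]

def attention_full(q, keys, values):
    """Reference: softmax over all keys at once, then a weighted sum of values."""
    scores = scaled_scores(q, keys)
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]

def attention_sharded(q, kv_shards):
    """Same output, but only one key/value shard is visible at a time."""
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * len(kv_shards[0][1][0])
    for keys, values in kv_shards:
        scores = scaled_scores(q, keys)
        new_max = max(running_max, max(scores))
        rescale = math.exp(running_max - new_max)   # 0.0 on the first shard
        weights = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * rescale + sum(weights)
        acc = [a * rescale + sum(w * v[j] for w, v in zip(weights, values))
               for j, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]

# Hypothetical data: one query, four key/value pairs split into two shards.
q = [1.0, 0.5]
keys = [[0.2, 0.1], [1.0, -1.0], [0.5, 0.5], [-0.3, 2.0]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0]]
kv_shards = [(keys[:2], values[:2]), (keys[2:], values[2:])]

full = attention_full(q, keys, values)
sharded = attention_sharded(q, kv_shards)
print(full, sharded)
```

The two results agree up to floating-point rounding, which is what lets each device work with its local shard and exchange only keys, values, or small running statistics.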

Key idea