Activation Recomputation

Trading extra forward compute to avoid storing activations for the backward pass.

Why activations dominate memory

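Backpropagation needs the forward activations to compute gradients. Activation memory grows with batch size, sequence length, and depth, while weight memory depends on neither batch size nor sequence length. Storing every layer's activations for a deep model on long sequences can therefore take more memory than the weights themselves.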
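A rough back-of-the-envelope sketch of that comparison, assuming a transformer-style model. The function names, the figure of 16 stored tensors per layer, and the 2-byte value size are illustrative assumptions, not measurements; the 12 * hidden^2 parameter count per layer is the usual approximation for a standard transformer block.

```python
def activation_bytes(layers, batch, seq_len, hidden,
                     tensors_per_layer=16, bytes_per_value=2):
    """Rough activation footprint.

    One [batch, seq_len, hidden] tensor holds batch * seq_len * hidden values.
    tensors_per_layer is an assumed count of such tensors kept per layer.
    """
    return layers * tensors_per_layer * batch * seq_len * hidden * bytes_per_value


def weight_bytes(layers, hidden, bytes_per_value=2):
    """Rough weight footprint: about 12 * hidden^2 parameters per layer."""
    return layers * 12 * hidden * hidden * bytes_per_value


# Hypothetical configuration: 32 layers, batch 8, 4096 tokens, hidden size 4096.
act = activation_bytes(layers=32, batch=8, seq_len=4096, hidden=4096)
wgt = weight_bytes(layers=32, hidden=4096)
print(f"activations: {act / 2**30:.0f} GiB, weights: {wgt / 2**30:.0f} GiB")
```

Under these assumptions the ratio of activations to weights is tensors_per_layer * batch * seq_len / (12 * hidden), so it grows linearly with batch size and sequence length.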

The recomputation trade

Activation recomputation, also called gradient checkpointing, saves only a few activations and recomputes the rest during the backward pass.

You keep activations at chosen checkpoints, often one per layer or block.
During backward you redo the forward work between checkpoints to regenerate what you dropped.
Activation memory falls sharply, at the cost of roughly one extra forward pass per training step.
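The steps above can be sketched on a toy chain of scalar layers. This is a minimal illustration, not how any framework implements it: the layer y = tanh(w * x), the function names, and the `every` parameter are all made up for the example. It compares a baseline that stores every layer input with a version that stores one input per segment and recomputes the rest during backward.

```python
import math


def layer(x, w):
    """One toy layer: y = tanh(w * x)."""
    return math.tanh(w * x)


def layer_grads(x, w):
    """Local derivatives of y = tanh(w * x): (dy/dx, dy/dw)."""
    d = 1.0 - math.tanh(w * x) ** 2
    return d * w, d * x


def backward_full(x0, weights):
    """Baseline: store every layer input, then backpropagate."""
    inputs, x = [], x0
    for w in weights:
        inputs.append(x)
        x = layer(x, w)
    grad, grads_w = 1.0, [0.0] * len(weights)
    for i in reversed(range(len(weights))):
        dx, dw = layer_grads(inputs[i], weights[i])
        grads_w[i] = grad * dw
        grad *= dx
    return grads_w, len(inputs)  # gradients, number of activations stored


def backward_checkpointed(x0, weights, every):
    """Store only the input of every `every`-th layer; recompute the rest."""
    n = len(weights)
    checkpoints, x = {}, x0
    for i, w in enumerate(weights):
        if i % every == 0:
            checkpoints[i] = x  # keep this activation
        x = layer(x, w)         # everything else is dropped
    grad, grads_w, peak = 1.0, [0.0] * n, 0
    for start in reversed(range(0, n, every)):
        end = min(start + every, n)
        # Redo the forward work for this segment from its checkpoint.
        segment, x = [], checkpoints[start]
        for i in range(start, end):
            segment.append(x)
            x = layer(x, weights[i])
        peak = max(peak, len(checkpoints) + len(segment))
        # Backpropagate through the regenerated segment.
        for i in reversed(range(start, end)):
            dx, dw = layer_grads(segment[i - start], weights[i])
            grads_w[i] = grad * dw
            grad *= dx
    return grads_w, peak  # gradients, peak activations held at once


weights = [0.5 + 0.1 * i for i in range(16)]
full, stored_full = backward_full(0.3, weights)
ckpt, stored_ckpt = backward_checkpointed(0.3, weights, every=4)
print(stored_full, stored_ckpt)
```

Both versions return the same gradients. The checkpointed one holds fewer activations at any moment but runs each layer's forward computation a second time.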

Choosing what to keep

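A common scheme checkpoints at block boundaries. With a checkpoint roughly every sqrt(n) layers of an n-layer model, activation memory drops from O(n) to about O(sqrt(n)).
Selective recomputation stores most activations and recomputes only those that are large to store but cheap to regenerate, typically the attention scores and softmax outputs.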
This targets the biggest memory savings for the least extra compute.
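The square-root figure for evenly spaced checkpoints can be checked with a small sweep. The sketch below assumes uniform layers and counts memory in units of one layer's activations; `peak_activations` and `best_segment` are hypothetical helper names. With segments of length k, backward holds about n / k checkpoints plus one recomputed segment of k activations, and n / k + k is smallest near k = sqrt(n).

```python
import math


def peak_activations(n_layers, segment):
    """Stored checkpoints plus the one segment recomputed during backward."""
    n_checkpoints = math.ceil(n_layers / segment)
    return n_checkpoints + segment


def best_segment(n_layers):
    """Segment length that minimises the peak count, found by brute force."""
    return min(range(1, n_layers + 1),
               key=lambda k: peak_activations(n_layers, k))


for n in (16, 64, 100):
    k = best_segment(n)
    print(n, k, peak_activations(n, k))
```

The minimum lands at about 2 * sqrt(n) activations, compared with n when every layer's activations are stored.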

When to use it

Recompute when memory, not compute, is the binding constraint, such as long context or very deep models. If compute is the bottleneck, the extra forward pass may not be worth it.

Key idea