Machine Learning

Activation Recomputation

Trading extra forward compute to avoid storing activations for the backward pass.

Why activations dominate memory

Backpropagation needs the forward activations to compute gradients. Activation memory grows with batch size, sequence length, and depth, so storing every layer's activations for a deep model on long sequences can take more memory than the weights themselves.
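
To make the comparison concrete, here is a back-of-the-envelope sketch. It assumes a deliberately simplified model that is not from the lesson: a stack of square linear layers with no biases, storing one output per layer for the backward pass. The function names and the sizes are hypothetical, chosen only for illustration.

```python
def weight_count(num_layers: int, hidden: int) -> int:
    """Parameters in a stack of square linear layers (biases ignored)."""
    return num_layers * hidden * hidden


def activation_count(num_layers: int, hidden: int, batch: int, seq_len: int) -> int:
    """Values kept for backward if each layer's output is stored once."""
    return num_layers * batch * seq_len * hidden


# Illustrative sizes only (assumptions, not measurements).
layers, hidden, batch, seq_len = 24, 1024, 8, 4096

weights = weight_count(layers, hidden)
activations = activation_count(layers, hidden, batch, seq_len)
ratio = activations / weights

print(f"weights: {weights:,}  activations: {activations:,}  ratio: {ratio:.0f}x")
```

In this toy model the ratio simplifies to batch * seq_len / hidden, so it grows with batch size and sequence length while the weight count stays fixed. Real transformer blocks store several tensors per layer, so the actual ratio will differ.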

The recomputation trade

Activation recomputation, also called gradient checkpointing, saves only a few activations and recomputes the rest during the backward pass.

  • You keep activations at chosen checkpoints, often one per layer or block.
  • During backward you redo the forward work between checkpoints to regenerate what you dropped.
  • Activation memory falls sharply, at the cost of roughly one extra forward pass.
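
The steps above can be sketched in plain Python on a toy network. This is a minimal illustration under stated assumptions, not how any particular framework implements checkpointing: each "layer" is the scalar function tanh(w * x), the loss is simply the final output, and all function names are hypothetical.

```python
import math


def layer(w, x):
    """One toy layer: y = tanh(w * x)."""
    return math.tanh(w * x)


def layer_backward(w, x, grad_y):
    """Given the layer input x and dL/dy, return (dL/dw, dL/dx)."""
    y = math.tanh(w * x)
    local = (1.0 - y * y) * grad_y  # chain rule through tanh
    return local * x, local * w


def grads_store_all(weights, x0):
    """Baseline: keep every layer input for the backward pass."""
    inputs = [x0]
    for w in weights:
        inputs.append(layer(w, inputs[-1]))
    stored = len(weights)  # one saved input per layer
    grad, grads_w = 1.0, [0.0] * len(weights)  # loss = final output
    for i in reversed(range(len(weights))):
        grads_w[i], grad = layer_backward(weights[i], inputs[i], grad)
    return grads_w, stored


def grads_checkpointed(weights, x0, segment):
    """Keep only each segment's input; recompute the rest in backward."""
    n = len(weights)
    checkpoints = {}  # layer index -> saved input
    x = x0
    for i, w in enumerate(weights):
        if i % segment == 0:
            checkpoints[i] = x
        x = layer(w, x)
    peak = len(checkpoints)
    grad, grads_w = 1.0, [0.0] * n
    for start in reversed(range(0, n, segment)):
        end = min(start + segment, n)
        # Redo the forward work for this segment from its checkpoint.
        inputs = [checkpoints[start]]
        for i in range(start, end - 1):
            inputs.append(layer(weights[i], inputs[-1]))
        # The first entry is the checkpoint itself, so do not count it twice.
        peak = max(peak, len(checkpoints) + len(inputs) - 1)
        for i in reversed(range(start, end)):
            grads_w[i], grad = layer_backward(weights[i], inputs[i - start], grad)
    return grads_w, peak


weights = [0.5 + 0.1 * i for i in range(16)]
full, full_mem = grads_store_all(weights, 0.3)
ckpt, ckpt_mem = grads_checkpointed(weights, 0.3, segment=4)
print(full_mem, ckpt_mem)  # saved values: 16 vs 7
```

Both functions return the same gradients. The checkpointed version holds fewer values at once and pays for it by rerunning each segment's forward work during the backward pass.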

Choosing what to keep

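  • A common scheme splits an n-layer network into about √n blocks and checkpoints at block boundaries, cutting activation memory from order n to roughly order √n.
  • Selective recomputation keeps the cheap-to-store activations and recomputes only the attention pieces that take a lot of memory to store but little compute to regenerate.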
  • This targets the biggest memory savings for the least extra compute.
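
The square-root figure falls out of a simple count. The sketch below assumes a simplified cost model: with segments of k layers, you hold about n / k checkpoints for the whole pass plus about k recomputed activations for the one segment being processed in backward. The function name and the choice of n = 64 are illustrative assumptions.

```python
import math


def peak_saved(num_layers: int, segment: int) -> int:
    """Checkpoints kept for the whole pass plus one segment rebuilt in backward."""
    checkpoints = math.ceil(num_layers / segment)
    return checkpoints + segment


num_layers = 64
costs = {k: peak_saved(num_layers, k) for k in range(1, num_layers + 1)}
best = min(costs, key=costs.get)

# Segment size 8 = sqrt(64) is the minimum: 16 values instead of about 64.
print(best, costs[best], costs[1], costs[num_layers])
```

Very small segments mean many checkpoints, and very large segments mean a big recomputed block, so the sum n / k + k is smallest near k = √n, where it is about 2√n.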

When to use it

Recompute when memory, not compute, is the binding constraint, such as long context or very deep models. If compute is the bottleneck, the extra forward pass may not be worth it.
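
A rough way to size the compute cost is sketched below. It assumes the common rule of thumb that a backward pass costs about twice a forward pass. That ratio is an assumption and varies by model and hardware, and the function and parameter names are hypothetical.

```python
def step_cost(forward: float, backward_ratio: float = 2.0,
              recompute_fraction: float = 0.0) -> float:
    """Cost of one training step in units of forward-pass work.

    recompute_fraction is the share of the forward pass redone in backward:
    0.0 for no recomputation, 1.0 for recomputing everything.
    """
    return forward * (1.0 + backward_ratio + recompute_fraction)


baseline = step_cost(1.0)
full_recompute = step_cost(1.0, recompute_fraction=1.0)
overhead = full_recompute / baseline - 1.0

print(f"extra compute per step: {overhead:.0%}")
```

Under that assumption, full recomputation adds about a third to each training step. Selective schemes that redo only part of the forward pass correspond to a smaller recompute_fraction and a proportionally smaller overhead.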

Key idea

Activation recomputation stores only checkpoint activations and recomputes the rest in the backward pass, trading one extra forward pass for a large drop in activation memory.

Check yourself

1. What does activation recomputation trade away to save memory?

2. Block boundary checkpointing reduces activation memory to roughly what?

3. When is recomputation most worthwhile?