Machine Learning

The Gradient Accumulation Practical

Simulating large batches on small memory by summing gradients over steps.

4 min read · advanced

The memory wall

Large batches stabilize training but may not fit in GPU memory. Gradient accumulation lets you simulate a big batch by running several small forward and backward passes and summing their gradients before a single optimizer step.

How it works

Instead of updating the weights after every micro batch, you accumulate gradients across several micro batches, then apply one update. If you accumulate over four micro batches of eight examples each, the effective batch size is 4 × 8 = 32.
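The arithmetic can be written out as a short formula. The symbols here are assumptions introduced for illustration: k is the number of accumulation steps, b is the micro-batch size, and g_i is the mean gradient over micro batch i.

```latex
B_{\text{eff}} = k \cdot b = 4 \cdot 8 = 32,
\qquad
g = \frac{1}{k} \sum_{i=1}^{k} g_i
```

When every micro batch has the same size b, the averaged gradient g equals the mean gradient over all 32 examples, which is what a single large batch would produce for a mean-reduced loss.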

The loop
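A minimal sketch of the loop, written in plain Python rather than any particular framework. The toy model (fit y = w * x with mean squared error) and the names grad_mse, train_with_accumulation, and grad_buffer are hypothetical, chosen only to show the control flow.

```python
# Toy setup (all names hypothetical): fit y = w * x with mean squared error.

def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w over one micro batch."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def train_with_accumulation(w, micro_batches, accum_steps, lr):
    """Run one pass over micro_batches, updating w once every accum_steps."""
    grad_buffer = 0.0  # stands in for the gradient storage a framework keeps
    updates = 0
    for i, (xs, ys) in enumerate(micro_batches, start=1):
        # Scale by 1/accum_steps so the accumulated sum is an average.
        grad_buffer += grad_mse(w, xs, ys) / accum_steps
        if i % accum_steps == 0:
            w -= lr * grad_buffer  # one optimizer step per cycle
            grad_buffer = 0.0      # zero the gradients for the next cycle
            updates += 1
    return w, updates


# Eight micro batches of size 2, accumulated 4 at a time: effective batch of 8.
data = [([1.0, 2.0], [3.0, 6.0])] * 8
w, updates = train_with_accumulation(0.0, data, accum_steps=4, lr=0.05)
print(w, updates)
```

Eight micro batches with accum_steps=4 yield two weight updates, not eight. In a real framework the buffer is usually the parameter's own gradient storage, and the step and reset are calls on the optimizer.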

Getting it right

  • Divide each micro-batch loss by the number of accumulation steps, or equivalently average the accumulated gradients, so the gradient magnitude matches that of a true large batch with a mean-reduced loss.
  • Call the optimizer step and zero the gradients only once per accumulation cycle, after the last micro batch.
  • Batch norm statistics are still computed per micro batch, not over the full effective batch, so results can differ from true large-batch training.
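The scaling rule and the batch norm caveat can both be checked numerically. This is a small sketch under stated assumptions: the toy data, the model y = w * x, and the helper grad_mse are hypothetical, and the micro batches are all the same size, which the equality depends on.

```python
import statistics


def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]
w = 0.5
accum_steps = 4
micro = 2  # micro-batch size; accum_steps * micro covers all 8 examples

# Reference: gradient of the mean loss over the full batch of 8.
full_grad = grad_mse(w, xs, ys)

# Split into equal micro batches and accumulate.
chunks = [(xs[i:i + micro], ys[i:i + micro]) for i in range(0, len(xs), micro)]
scaled_sum = sum(grad_mse(w, cx, cy) / accum_steps for cx, cy in chunks)
unscaled_sum = sum(grad_mse(w, cx, cy) for cx, cy in chunks)

# Batch norm caveat: per-micro-batch statistics differ from full-batch ones.
full_var = statistics.pvariance(xs)
micro_vars = [statistics.pvariance(cx) for cx, _ in chunks]

print(full_grad, scaled_sum, unscaled_sum)
print(full_var, micro_vars)
```

The scaled sum matches the full-batch gradient, while the unscaled sum is accum_steps times larger, which acts like an unintended learning rate increase. The variance comparison shows why normalization layers that rely on batch statistics do not see the effective batch.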

Practical notes

  • It saves memory at the cost of time: each update needs several sequential forward and backward passes instead of one.
  • Pair it with a learning rate tuned for the larger effective batch.

Key idea

Gradient accumulation sums gradients over several micro batches to mimic a large batch on limited memory. Scale the loss correctly and step the optimizer only once per cycle.

Check yourself

1. What does gradient accumulation simulate?

2. What must you be careful about when accumulating gradients?