The Gradient Accumulation Practical

The memory wall

Large batches stabilize training but may not fit in GPU memory. Gradient accumulation lets you simulate a big batch by running several small forward and backward passes and summing their gradients before a single optimizer step.

How it works

Instead of updating the weights after every mini-batch, you accumulate gradients across several micro-batches and then apply one update. If you accumulate over four micro-batches of eight examples each, the effective batch size is 4 × 8 = 32.
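
Written out as a formula, the update looks like the following. The symbols are assumptions introduced here for illustration: K is the number of accumulation steps, b the micro-batch size, g_k the mean gradient of the k-th micro-batch, and \eta the learning rate. Plain gradient descent is assumed; other optimizers consume the same averaged gradient.

```latex
B_{\text{eff}} = K \cdot b = 4 \cdot 8 = 32,
\qquad
\theta \leftarrow \theta - \eta \cdot \frac{1}{K} \sum_{k=1}^{K} g_k
```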

The loop
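
A minimal sketch of such a loop, in plain Python with no framework. It fits a one-parameter model y = w * x by squared error. The names (grad_mean_sq_error, train, grad_buffer) and the synthetic data are illustrative assumptions, and the sketch assumes the number of micro-batches per epoch is a multiple of the accumulation steps.

```python
# Sketch only: a hand-written gradient stands in for a framework's backward pass.

def grad_mean_sq_error(w, batch):
    """Gradient of mean((w*x - y)^2) over the batch, with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def train(data, micro_batch_size=8, accum_steps=4, lr=0.01, epochs=50):
    w = 0.0
    grad_buffer = 0.0  # plays the role of the accumulated gradient
    updates = 0
    for _ in range(epochs):
        micro_batches = [
            data[i:i + micro_batch_size]
            for i in range(0, len(data), micro_batch_size)
        ]
        # Assumption: len(micro_batches) is a multiple of accum_steps,
        # so no partial cycle is left over at the end of an epoch.
        for step, batch in enumerate(micro_batches, start=1):
            # Accumulate the micro-batch gradient, scaled by 1/accum_steps.
            grad_buffer += grad_mean_sq_error(w, batch) / accum_steps
            if step % accum_steps == 0:
                w -= lr * grad_buffer  # one update per accumulation cycle
                grad_buffer = 0.0      # reset the gradient once per cycle
                updates += 1
    return w, updates


# Synthetic data from y = 3x: 64 examples, 8 micro-batches, 2 updates per epoch.
data = [(i / 10.0, 3.0 * i / 10.0) for i in range(64)]
w, updates = train(data)
```

In a deep learning framework the same shape applies: the backward pass adds into the gradient buffers, and the optimizer step and gradient reset sit inside the "every accum_steps" branch.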

Getting it right

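Either divide each micro-batch loss by the number of accumulation steps or divide the summed gradients by that number before the update, so the gradient magnitude matches that of a true large batch. This assumes the loss is averaged over the examples in each micro-batch.
Call the optimizer step and zero the gradients only once per accumulation cycle, after the last micro-batch.
Batch norm statistics are still computed per micro-batch, not over the full effective batch, so accumulation does not exactly reproduce large-batch training for models that use batch norm.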
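
The points above can be checked numerically. The sketch below uses a toy one-parameter model with a hand-written gradient (all names and data are assumptions for illustration) and compares the full-batch gradient against the two scaling options, against the unscaled sum, and then shows how per-micro-batch means differ from the full-batch mean, which is the quantity batch norm would see.

```python
def grad_mean_sq_error(w, batch):
    """Gradient of mean((w*x - y)^2) over the batch, with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


w = 0.5
accum_steps = 4
micro_batch_size = 8
data = [(float(i), 2.0 * i + 1.0) for i in range(32)]
micro_batches = [
    data[i:i + micro_batch_size] for i in range(0, len(data), micro_batch_size)
]

# Reference: the gradient of one true batch of 32 examples.
full_batch_grad = grad_mean_sq_error(w, data)

# Option A: scale each micro-batch loss (and so its gradient) by 1/accum_steps.
scaled_sum = sum(grad_mean_sq_error(w, b) / accum_steps for b in micro_batches)

# Option B: sum unscaled gradients, then average once before the update.
averaged = sum(grad_mean_sq_error(w, b) for b in micro_batches) / accum_steps

# Mistake: summing without scaling is accum_steps times too large.
unscaled_sum = sum(grad_mean_sq_error(w, b) for b in micro_batches)

# Batch norm caveat: statistics per micro-batch differ from the full batch.
micro_means = [sum(x for x, _ in b) / len(b) for b in micro_batches]
full_mean = sum(x for x, _ in data) / len(data)
```

Options A and B agree with the full-batch gradient only because every micro-batch has the same size; with uneven micro-batches the average would need to be weighted by example count.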

Practical notes

It trades time for memory: peak memory is set by the micro-batch, but each update needs several sequential forward and backward passes.
Pair it with a learning rate tuned for the effective batch size, not the micro-batch size.
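
To make the cost concrete, the helper below counts passes and updates per epoch. The function name and the dataset size of 1024 are assumptions for illustration, and it assumes a trailing partial cycle still triggers an update.

```python
import math


def passes_and_updates(num_examples, micro_batch_size, accum_steps):
    """Return (forward/backward passes, optimizer updates) for one epoch."""
    passes = math.ceil(num_examples / micro_batch_size)
    # Assumption: a leftover partial accumulation cycle still ends in an update.
    updates = math.ceil(passes / accum_steps)
    return passes, updates


# Accumulating 4 micro-batches of 8: 128 passes, 32 updates.
print(passes_and_updates(1024, 8, 4))   # (128, 32)
# A true batch of 32 with no accumulation: 32 passes, 32 updates.
print(passes_and_updates(1024, 32, 1))  # (32, 32)
```

Both settings make the same number of updates per epoch at an effective batch size of 32, which is why a learning rate chosen for batch size 32 is the natural starting point in the accumulated run.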

Key idea

Gradient accumulation sums gradients over several micro-batches to mimic a large batch on limited memory. Scale the loss by the number of accumulation steps and step the optimizer only once per cycle.