A big batch on a small device
Sometimes the batch size you want will not fit in memory. Gradient accumulation lets you run several small forward and backward passes, add up their gradients, and only then take one optimizer step.
- Process several micro batches without updating weights.
- Sum or average their gradients into a running buffer.
- Apply the optimizer once, then clear the buffer.
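The three steps above can be sketched in plain Python. This is a minimal illustration and not a reference implementation: the one-weight linear model, the helper names `grad_mse` and `train_step_accumulated`, the toy data, and the learning rate are all assumptions made for the example.

```python
# Minimal sketch: gradient accumulation for a 1-D linear model y ~ w * x.
# Model, data, and helper names are hypothetical, chosen only for illustration.

def grad_mse(w, batch):
    """Gradient of the mean squared error over `batch` with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

def train_step_accumulated(w, micro_batches, lr):
    """Run every micro batch with fixed weights, then apply one SGD update."""
    buffer = 0.0  # running gradient buffer, starts cleared on every call
    for mb in micro_batches:
        # No weight update here: w is the same for all micro batches.
        # Dividing by the number of micro batches makes the buffer an average.
        buffer += grad_mse(w, mb) / len(micro_batches)
    return w - lr * buffer  # single optimizer step

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # generated by y = 2x
micro_batches = [data[0:2], data[2:4]]  # micro batch size 2, 2 accumulation steps

w = 0.0
for _ in range(50):
    w = train_step_accumulated(w, micro_batches, lr=0.05)
```

In this sketch the buffer is a local variable, so "clearing" it happens implicitly each time the function is called. A framework that stores gradients on the parameters would need an explicit reset after the optimizer step.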
Why it equals a big batch
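The loss over a batch is a sum (or mean) of per-example losses, and differentiation is linear, so the gradient of one large batch equals the sum (or mean) of the gradients of its micro batches, as long as the weights stay fixed between them. The match holds up to floating-point rounding, and it breaks for layers whose output depends on batch statistics, such as batch normalization. You trade wall-clock time for memory, since the passes run sequentially instead of together.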
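- Effective batch size equals micro batch size times accumulation steps.
- If the loss is averaged within each micro batch, divide it by the number of accumulation steps so the accumulated gradient is an average rather than a sum.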
- It pairs naturally with mixed precision to stretch memory further.
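A small numerical check may make the equivalence and the loss scaling concrete. The sketch below assumes a mean squared error loss on a one-weight linear model and equally sized micro batches; the data values and the helper name `grad_mse` are hypothetical.

```python
import math

def grad_mse(w, batch):
    """Gradient of the mean squared error over `batch` with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

data = [(1.0, 3.0), (2.0, 3.0), (3.0, 7.0), (4.0, 6.0), (5.0, 11.0), (6.0, 10.0)]
micro_batch_size = 2
accum_steps = 3
effective_batch = micro_batch_size * accum_steps  # covers all 6 examples

w = 0.5  # weights are held fixed while gradients are accumulated

# Gradient from one large batch.
full_grad = grad_mse(w, data)

# Gradient accumulated over micro batches, with and without scaling.
scaled = 0.0
unscaled = 0.0
for i in range(accum_steps):
    mb = data[i * micro_batch_size:(i + 1) * micro_batch_size]
    g = grad_mse(w, mb)
    scaled += g / accum_steps  # scaled so the total is an average
    unscaled += g              # plain sum, too large by a factor of accum_steps

assert math.isclose(full_grad, scaled)
assert math.isclose(unscaled, accum_steps * full_grad)
```

Without the division, the accumulated gradient is `accum_steps` times too large, which acts like an unintended increase in the learning rate. If the micro batches had unequal sizes, a plain average of their mean gradients would no longer match the large batch exactly, and a size-weighted average would be needed instead.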
Accumulate then step
Only after the chosen number of micro batches does the optimizer actually update the weights.
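One common way to decide when to step is a modulo test on the micro batch counter. The helper name `should_step` and the 0-based indexing are assumptions for this sketch.

```python
def should_step(micro_batch_index, accum_steps):
    """True when the micro batch at this 0-based index completes a cycle."""
    return (micro_batch_index + 1) % accum_steps == 0

accum_steps = 4
num_micro_batches = 12

# Indices of the micro batches after which the optimizer would update weights.
step_indices = [i for i in range(num_micro_batches) if should_step(i, accum_steps)]
# step_indices == [3, 7, 11]
```

If the number of micro batches is not a multiple of `accum_steps`, the last partial group never triggers a step under this rule, so a training loop has to either drop it or step once more at the end.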
Key idea
Gradient accumulation sums gradients across several micro batches before one optimizer step, emulating a large batch on limited memory at the cost of time.