Gradient Accumulation

A big batch on a small device

Sometimes the batch size you want will not fit in memory. Gradient accumulation lets you run several small forward and backward passes, add up their gradients, and only then take one optimizer step.

Process several micro batches without updating weights.
Sum or average their gradients into a running buffer.
Apply the optimizer once, then clear the buffer.
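
The three steps above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the one-parameter model y = w * x, the squared-error loss, the plain SGD update, and the names grad_mse and train_with_accumulation are all assumptions made for the example.

```python
def grad_mse(w, batch):
    """Gradient of the mean squared error of y ~ w * x over one batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def train_with_accumulation(data, w, micro_batch_size, accum_steps, lr):
    """One pass over data, updating w once per accum_steps micro batches."""
    buffer = 0.0   # running gradient buffer
    seen = 0       # micro batches accumulated since the last update
    updates = 0    # optimizer steps taken so far
    for start in range(0, len(data), micro_batch_size):
        micro_batch = data[start:start + micro_batch_size]
        # Steps 1 and 2: compute the gradient and add it to the buffer.
        # Dividing by accum_steps keeps the buffer an average, not a sum.
        buffer += grad_mse(w, micro_batch) / accum_steps
        seen += 1
        if seen == accum_steps:
            # Step 3: apply the optimizer once (plain SGD), then clear.
            w -= lr * buffer
            buffer = 0.0
            seen = 0
            updates += 1
    # Micro batches left over at the end are dropped in this sketch.
    return w, updates


# Hypothetical toy data generated by y = 3 * x.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
w, updates = train_with_accumulation(
    data, w=0.0, micro_batch_size=2, accum_steps=4, lr=0.01
)
print(w, updates)
```

With eight examples, a micro batch size of 2, and 4 accumulation steps, the weights are updated exactly once, using a gradient averaged over all eight examples.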

Why it equals a big batch

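Because differentiation is linear, the gradient of a loss summed over a batch is the sum of the per-example gradients. As long as the weights stay fixed between micro batches, summing the micro batch gradients gives the same result as passing one large batch through the network, up to floating-point rounding. The equivalence breaks for layers that depend on batch-wide statistics, such as batch normalization, which sees only one micro batch at a time. You trade wall-clock time for memory, since the passes run sequentially instead of together.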
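
The additivity argument can be written out as a short derivation. The notation is introduced here for illustration only: a batch B of N examples is split into K equal micro batches B_1, ..., B_K of m examples each (so N = Km), the per-example loss is \ell_i, and L_S denotes the mean loss over a set S. All gradients are taken at the same weights \theta, which holds because no update happens between micro batches.

```latex
\nabla_\theta L_B(\theta)
  = \frac{1}{N} \sum_{i \in B} \nabla_\theta \ell_i(\theta)
  = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{m} \sum_{i \in B_k} \nabla_\theta \ell_i(\theta)
  = \frac{1}{K} \sum_{k=1}^{K} \nabla_\theta L_{B_k}(\theta)
```

The large-batch gradient is therefore the average of the K micro batch gradients, which is where the factor 1/K in the loss scaling comes from.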

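Effective batch size equals micro batch size times accumulation steps.
If the loss is a mean over each micro batch, divide it by the number of accumulation steps so the accumulated gradient is an average rather than a sum.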
It pairs naturally with mixed precision to stretch memory further.
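
A small numeric check can make the scaling rule concrete. The sketch below assumes a one-parameter model y = w * x with a mean squared error loss; the data values and the helper name grad_mse are made up for the example. Since scaling the loss by a constant scales its gradient by the same constant, the code scales the gradient directly.

```python
def grad_mse(w, batch):
    """Gradient of the mean squared error of y ~ w * x over one batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


micro_batch_size = 2
accum_steps = 3
effective_batch = micro_batch_size * accum_steps  # 2 * 3 = 6 examples

# Hypothetical data: exactly one effective batch of (x, y) pairs.
data = [(1.0, 2.0), (2.0, 3.5), (3.0, 6.5),
        (4.0, 8.0), (5.0, 9.0), (6.0, 13.0)]
w = 0.5

# Reference: gradient of the mean loss over the whole batch at once.
full_grad = grad_mse(w, data)

scaled = 0.0    # each micro batch gradient divided by accum_steps
unscaled = 0.0  # micro batch gradients added with no scaling
for k in range(accum_steps):
    micro_batch = data[k * micro_batch_size:(k + 1) * micro_batch_size]
    g = grad_mse(w, micro_batch)
    scaled += g / accum_steps
    unscaled += g

print(abs(scaled - full_grad) < 1e-9)                 # scaled sum matches
print(abs(unscaled - accum_steps * full_grad) < 1e-9) # unscaled is K times too large
```

Without the scaling, the accumulated gradient is larger by a factor of accum_steps, which behaves like an unintended increase in the learning rate.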

Accumulate then step

Only after the chosen number of micro batches does the optimizer actually update the weights.
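
One common way to decide when to step is a modulo test on the micro batch counter. The helper below is a hypothetical illustration of that schedule (the name update_points is not from any library); it lists the 0-based micro batch indices after which the weights would be updated.

```python
def update_points(num_micro_batches, accum_steps):
    """0-based indices of the micro batches that trigger an optimizer step."""
    return [i for i in range(num_micro_batches)
            if (i + 1) % accum_steps == 0]


print(update_points(10, 4))  # [3, 7]
```

With 10 micro batches and 4 accumulation steps, the last two micro batches never trigger an update. A real training loop has to choose whether to drop them or take one final step on the partial accumulation.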

Key idea

Gradient accumulation sums gradients across several micro batches before one optimizer step, emulating a large batch on limited memory at the cost of time.
