Gradient Accumulation

Simulate a large batch on small hardware by summing gradients over micro batches.

The problem

You want to train with a large batch, say 256, but only 32 samples fit in GPU memory at once. Buying more memory is not an option today. Gradient accumulation lets you reach the effective large batch anyway.

How it works

Run a forward and backward pass on a small micro batch.
Instead of updating the weights, add the new gradients to a running buffer.
Repeat for several micro batches without stepping the optimizer.
After the chosen number of micro batches, apply one update and clear the buffer.

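If you accumulate over eight micro batches of 32, the optimizer sees gradients equivalent to a batch of 256. When the loss is an average over samples, the accumulated gradient equals the full-batch gradient up to floating-point rounding. The main exception is any layer that computes statistics over the batch, such as batch normalization, which still sees only 32 samples at a time.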
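
The steps above can be sketched in plain Python. This is a minimal illustration, not a framework recipe: the one-variable linear model, the synthetic data, and the helper names grad_mse and accumulated_grad are all assumptions made for the example. It compares the gradient accumulated over eight micro batches of 32 with the gradient of the full batch of 256, then applies a single update.

```python
import random

def grad_mse(w, b, batch):
    """Gradient of the mean squared error of y ~ w*x + b, averaged over the batch."""
    n = len(batch)
    gw = sum(2 * (w * x + b - y) * x for x, y in batch) / n
    gb = sum(2 * (w * x + b - y) for x, y in batch) / n
    return gw, gb

def accumulated_grad(w, b, data, micro_batch_size):
    """Accumulate gradients over micro batches without touching the weights.

    Assumes len(data) is a multiple of micro_batch_size.
    """
    num_micro = len(data) // micro_batch_size
    buf_w, buf_b = 0.0, 0.0                      # running gradient buffer
    for k in range(num_micro):
        micro = data[k * micro_batch_size:(k + 1) * micro_batch_size]
        gw, gb = grad_mse(w, b, micro)           # forward and backward on one micro batch
        buf_w += gw / num_micro                  # scale so the buffer holds an average
        buf_b += gb / num_micro
    return buf_w, buf_b

# Hypothetical data: 256 noiseless samples from y = 3x + 1.
random.seed(0)
data = [(x, 3.0 * x + 1.0) for x in (random.uniform(-1, 1) for _ in range(256))]
w, b, lr = 0.0, 0.0, 0.1

full = grad_mse(w, b, data)                      # one shot over all 256 samples
acc = accumulated_grad(w, b, data, 32)           # eight micro batches of 32

# One optimizer step with the accumulated gradient; the buffer is then discarded.
w -= lr * acc[0]
b -= lr * acc[1]
```

Here full and acc agree up to floating-point rounding, because the model has no batch-dependent layers and every micro batch has the same size.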

Details that matter

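Average, do not just sum: assuming each micro-batch loss is already a mean over its samples, divide the accumulated gradient (or each micro-batch loss) by the number of micro batches so the gradient magnitude matches a true large batch. Without this step, eight micro batches produce a gradient eight times too large.
It trades time for memory. You do the same compute but in serial chunks, so each optimizer step takes longer in wall-clock time than a single large-batch pass would on hardware that could hold it.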
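
The averaging rule above can be written out explicitly. The notation here is introduced only for this sketch: N micro batches B_1, ..., B_N of equal size m, per-sample loss \ell_i, and g_k for the mean gradient of micro batch k.

```latex
g_k = \frac{1}{m} \sum_{i \in B_k} \nabla \ell_i ,
\qquad
\nabla L_{\text{full}}
  = \frac{1}{Nm} \sum_{k=1}^{N} \sum_{i \in B_k} \nabla \ell_i
  = \frac{1}{N} \sum_{k=1}^{N} g_k ,
\qquad
\sum_{k=1}^{N} g_k = N \, \nabla L_{\text{full}} .
```

With N = 8 and m = 32, the plain sum is eight times the full-batch gradient. For plain gradient descent, that has the same effect as multiplying the learning rate by eight.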

Key idea

Gradient accumulation adds up gradients over several micro batches, averages them, and applies a single update, simulating a large effective batch on limited memory at the cost of extra time.
