Machine Learning

Gradient Accumulation

Simulate a large batch on small hardware by summing gradients over micro batches.

The problem

You want to train with a large batch, say 256, but only 32 samples fit in GPU memory at once. Buying more memory is not an option today. Gradient accumulation lets you reach the effective large batch anyway.

How it works

  • Run a forward and backward pass on a small micro batch.
  • Instead of updating, add the new gradients into a running buffer.
  • Repeat for several micro batches without stepping the optimizer.
  • After the chosen number of micro batches, apply one update and clear the buffer.
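The steps above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: it assumes a one-parameter linear model y = w * x, a mean squared error loss, and plain SGD as the optimizer. The names `mean_grad` and `train_with_accumulation` are hypothetical helpers made up for this example.

```python
def mean_grad(w, xs, ys):
    """Gradient of the mean squared error of y = w * x with respect to w."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def train_with_accumulation(xs, ys, micro_batch_size, accum_steps, lr, epochs):
    """Plain SGD that updates once every `accum_steps` micro batches."""
    w = 0.0
    grad_buffer = 0.0  # running buffer for accumulated gradients
    seen = 0           # micro batches accumulated since the last update
    for _ in range(epochs):
        for start in range(0, len(xs), micro_batch_size):
            mx = xs[start:start + micro_batch_size]
            my = ys[start:start + micro_batch_size]
            # Forward and backward on one micro batch. Dividing by
            # accum_steps makes the buffer hold an average, not a raw sum.
            grad_buffer += mean_grad(w, mx, my) / accum_steps
            seen += 1
            if seen == accum_steps:
                w -= lr * grad_buffer  # one update for the whole group
                grad_buffer = 0.0      # clear the buffer
                seen = 0
    return w


# Toy data (assumed for illustration): 256 samples with true slope 3.
xs = [i / 256 for i in range(256)]
ys = [3.0 * x for x in xs]

# Eight micro batches of 32 per update, versus one true batch of 256.
w_accum = train_with_accumulation(xs, ys, micro_batch_size=32,
                                  accum_steps=8, lr=0.5, epochs=50)
w_full = train_with_accumulation(xs, ys, micro_batch_size=256,
                                 accum_steps=1, lr=0.5, epochs=50)
print(w_accum, w_full)
```

Both runs perform the same number of optimizer updates, so the two printed weights should agree up to floating-point rounding. In a real framework the buffer is usually the parameter's gradient storage rather than a separate variable.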

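If you accumulate over eight micro batches of 32, the optimizer sees a gradient equivalent to a batch of 256. With a loss averaged over samples, the update matches the one-shot large batch up to floating-point rounding. Layers that compute statistics per batch, such as batch normalization, still see only 32 samples at a time, so results can differ slightly there.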
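The equivalence can be written out for the 8 × 32 case. The notation is introduced here and is not part of the lesson: B_k is the k-th micro batch, ℓ_i is the loss on sample i, and g_k is the mean gradient over B_k. It assumes the loss is a mean over samples and that every gradient is evaluated at the same parameters θ, which holds because no update happens between micro batches.

```latex
\[
\frac{1}{8}\sum_{k=1}^{8} g_k
= \frac{1}{8}\sum_{k=1}^{8} \frac{1}{32}\sum_{i \in B_k} \nabla_\theta \ell_i(\theta)
= \frac{1}{256}\sum_{i=1}^{256} \nabla_\theta \ell_i(\theta)
\]
```

The right-hand side is exactly the mean gradient of the full batch of 256.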

Details that matter

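  • Average, do not just sum: divide the accumulated gradient (or each micro batch loss) by the number of micro batches so the gradient magnitude matches a true large batch. With eight micro batches, the raw sum is eight times larger than the true batch gradient.
  • It trades time for memory. You do the same total compute but in serial chunks, so each optimizer step takes longer in wall-clock time than it would if the whole batch fit in memory at once.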
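A quick numeric check of the averaging point, as a sketch under assumptions: the model y = w * x, the toy data, and the helper name `mean_grad` are all made up for illustration. It compares the summed and the averaged micro batch gradients against the gradient of the full batch.

```python
def mean_grad(w, xs, ys):
    """Gradient of the mean squared error of y = w * x with respect to w."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


# Toy data (assumed): 256 samples, split into micro batches of 32.
xs = [0.1 * i for i in range(256)]
ys = [2.0 * x + 1.0 for x in xs]
w = 0.5
micro = 32

chunks = [(xs[i:i + micro], ys[i:i + micro]) for i in range(0, len(xs), micro)]

full = mean_grad(w, xs, ys)                              # true batch of 256
summed = sum(mean_grad(w, cx, cy) for cx, cy in chunks)  # raw sum of 8 gradients
averaged = summed / len(chunks)                          # scaled by micro batch count

print("full batch:", full)
print("summed:    ", summed)
print("averaged:  ", averaged)
```

The averaged value should match the full-batch gradient up to rounding, while the raw sum is larger by a factor equal to the number of micro batches. Without the scaling, the same learning rate would take a step that many times bigger under plain SGD.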

Key idea

Gradient accumulation adds up gradients over several micro batches, scaled to their average, before a single update, simulating a large effective batch on limited memory at the cost of extra time.

Check yourself

1. What does gradient accumulation let you simulate?

2. Why should you average gradients over the micro batches?