Machine Learning

Gradient Accumulation

Simulate a large batch on small memory by summing gradients over steps.

A big batch on a small device

Sometimes the batch size you want will not fit in memory. Gradient accumulation lets you run several small forward and backward passes, add up their gradients, and only then take one optimizer step.

  • Process several micro batches without updating weights.
  • Sum or average their gradients into a running buffer.
  • Apply the optimizer once, then clear the buffer.
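The three steps above can be sketched in plain Python. This is a minimal illustration, not a framework recipe: it assumes a one-parameter linear model y ≈ w * x trained with mean squared error and plain SGD, and the names grad_mse, train_epoch, grad_buffer, and accum_steps are hypothetical, chosen only for this example.

```python
# Minimal sketch of gradient accumulation for y ≈ w * x with MSE loss.
# Model, data, and all names are illustrative assumptions.

def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w over one micro batch."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n

def train_epoch(w, micro_batches, accum_steps, lr):
    """One pass over micro_batches, updating w once every accum_steps batches."""
    grad_buffer = 0.0
    updates = 0
    for i, (xs, ys) in enumerate(micro_batches, start=1):
        # Step 1 and 2: compute a micro batch gradient without touching w and
        # add it to the buffer. Dividing by accum_steps makes the buffer an
        # average over the micro batches rather than a raw sum.
        grad_buffer += grad_mse(w, xs, ys) / accum_steps
        if i % accum_steps == 0:
            # Step 3: one optimizer step (plain SGD here), then clear the buffer.
            w -= lr * grad_buffer
            grad_buffer = 0.0
            updates += 1
    # Note: leftover micro batches that do not fill a full window are
    # discarded in this sketch.
    return w, updates

# Four micro batches of two examples each, drawn from y = 2x.
data = [([1.0, 2.0], [2.0, 4.0]), ([3.0, 4.0], [6.0, 8.0]),
        ([5.0, 6.0], [10.0, 12.0]), ([7.0, 8.0], [14.0, 16.0])]
w, updates = train_epoch(0.0, data, accum_steps=2, lr=0.01)
print(w, updates)
```

With four micro batches and accum_steps=2, the weights change only twice, each time using a gradient averaged over four examples.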

Why it equals a big batch

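Differentiation is linear, so the gradient of a loss summed over examples equals the sum of the per-example gradients. Summing gradients over micro batches therefore gives the same gradient as one pass of the full batch through the network, up to floating-point rounding and provided no layer depends on batch statistics (batch normalization is the usual exception). You trade wall-clock time for memory, since the passes run sequentially instead of together.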

  • Effective batch equals micro batch size times accumulation steps.
  • If the loss is averaged over each micro batch, divide it by the number of accumulation steps so the summed gradients match the full-batch average.
  • It pairs naturally with mixed precision to stretch memory further.
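The equivalence and the scaling rule can be checked numerically. The sketch below assumes the same toy setup as a one-parameter linear model with mean squared error; the data values and the names grad_mse, micro_batch_size, and accum_steps are made up for illustration. It compares the gradient of one batch of six examples with the scaled sum of three micro batch gradients of two examples each.

```python
# Sketch: accumulated micro batch gradients match the full-batch gradient.
# Model, data, and names are illustrative assumptions.

def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.5, 3.9, 6.2, 8.1, 9.7, 12.3]
w = 0.5

micro_batch_size = 2
accum_steps = 3
effective_batch = micro_batch_size * accum_steps  # 2 * 3 = 6 examples

# Gradient of the mean loss over all six examples at once.
full_grad = grad_mse(w, xs, ys)

# The same gradient built from three micro batches. Each micro batch
# gradient is already a mean over 2 examples, so dividing by accum_steps
# turns the running sum into a mean over all 6.
accumulated = 0.0
for k in range(accum_steps):
    lo = k * micro_batch_size
    hi = lo + micro_batch_size
    accumulated += grad_mse(w, xs[lo:hi], ys[lo:hi]) / accum_steps

matches = abs(full_grad - accumulated) < 1e-9
print(effective_batch, matches)
```

The match relies on equal-sized micro batches. If sizes differ, each micro batch gradient would need to be weighted by its share of the examples instead of by 1 / accum_steps.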

Accumulate then step

Only after the chosen number of micro batches does the optimizer actually update the weights.

Key idea

Gradient accumulation sums gradients across several micro batches before one optimizer step, emulating a large batch on limited memory at the cost of time.

Check yourself

1. What does gradient accumulation trade for a larger effective batch?

2. What is the effective batch size with accumulation?