Machine Learning

Mini-Batch Gradient Descent

Average gradients over a small batch to balance speed and stability.

The middle ground

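Mini-batch gradient descent computes the gradient over a small group of examples, averages it, and updates the parameters once per batch. It sits between full-batch gradient descent, which uses the whole dataset for every update, and single-example SGD, which updates after every example.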

  • A batch of, say, 32 to 256 examples is common.
  • The averaged gradient is less noisy than a single sample.
  • Updates are still far cheaper than a full pass.
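
As a rough illustration of the loop described above, here is a minimal NumPy sketch that fits a linear regression with mini-batch gradient descent. The function name `minibatch_gd`, the synthetic data, and the hyperparameter values are assumptions made up for this example, not part of the lesson.

```python
import numpy as np

def minibatch_gd(X, y, batch_size=32, lr=0.1, epochs=20, seed=0):
    """Fit linear regression weights with mini-batch gradient descent."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        order = rng.permutation(n)  # reshuffle the examples every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            # Gradient of the mean squared error, averaged over the batch.
            # One matrix product covers every example in the batch.
            grad = 2.0 * Xb.T @ (Xb @ w - yb) / len(idx)
            w -= lr * grad  # one parameter update per batch
    return w

# Synthetic data, invented for illustration only.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=1000)

w = minibatch_gd(X, y)
print(w)  # should land close to true_w
```

With 1000 examples and a batch size of 32, each pass over the data performs 32 updates instead of the single update a full-batch step would give.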

Why it wins

Averaging over a batch smooths the noise while keeping updates frequent. It also maps well onto hardware: matrix operations over a batch run efficiently on a GPU.

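  • Larger batches give smoother gradient estimates, but each update costs more and there are fewer updates per pass over the data.
  • Smaller batches make each update cheaper and more frequent, and the extra noise in the gradient can be useful.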
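
One way to see the smoothing effect is to estimate the same gradient many times from random batches of different sizes and compare how much the estimates scatter. The sketch below does this for a least-squares loss on made-up data; the helper names `batch_grad` and `grad_spread` are hypothetical and chosen only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))
y = X @ np.ones(5) + rng.normal(size=10_000)
w = np.zeros(5)  # evaluate every gradient at the same fixed point

def batch_grad(idx):
    """Mean squared error gradient averaged over the examples in idx."""
    Xb, yb = X[idx], y[idx]
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(idx)

def grad_spread(batch_size, trials=500):
    """Average per-coordinate standard deviation across random batches."""
    grads = np.array([
        batch_grad(rng.integers(0, len(X), size=batch_size))
        for _ in range(trials)
    ])
    return grads.std(axis=0).mean()

spreads = {b: grad_spread(b) for b in (1, 16, 256)}
for b, s in spreads.items():
    print(f"batch size {b:>3}: gradient spread {s:.3f}")
```

For independently sampled examples the spread should shrink roughly in proportion to one over the square root of the batch size, so going from 16 to 256 examples cuts the noise by about a factor of four while costing sixteen times as much per update.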

Choosing a size

Batch size interacts with the learning rate. Bigger batches give less noisy gradients, so they often tolerate a larger learning rate, but too large a batch can hurt how well the model generalizes.
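
A common starting point for this interaction is the linear scaling heuristic: when the batch size grows by some factor, grow the learning rate by the same factor. The lesson does not prescribe this rule, so treat the sketch below as one widely used rule of thumb; the function name `scaled_learning_rate` and the baseline values are assumptions for illustration.

```python
def scaled_learning_rate(base_lr, base_batch_size, batch_size):
    """Linear scaling heuristic: learning rate grows in proportion to batch size."""
    return base_lr * batch_size / base_batch_size

# Hypothetical baseline: a learning rate of 0.1 tuned at batch size 32.
lr_256 = scaled_learning_rate(base_lr=0.1, base_batch_size=32, batch_size=256)
print(lr_256)
```

The result is only a first guess. It tends to break down for very large batches, so the scaled value still needs to be checked against validation performance.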

Mini batches are the default in modern deep learning training loops.

Key idea

Mini-batch gradient descent averages gradients over a small batch, balancing the smoothness of full-batch updates against the speed of single-example SGD, and it maps well onto parallel hardware.

Check yourself

1. What does a mini batch gradient average over?

2. Why are mini batches efficient on a GPU?