Mini-Batch Gradient Descent

The middle ground

Mini-batch gradient descent computes the gradient over a small group of examples and updates the parameters once per batch. It sits between full-batch gradient descent, which uses the whole training set for every update, and single-example SGD.

Batch sizes of 32 to 256 examples are common.
The gradient averaged over a batch is less noisy than the gradient from a single example.
Each update is still far cheaper than a full pass over the training set.
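
The loop described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the linear model, the mean squared error loss, the synthetic data, and the names minibatch_gd, batch_size, lr and epochs are all assumptions chosen for the example.

```python
import numpy as np

def minibatch_gd(X, y, batch_size=32, lr=0.1, epochs=20, seed=0):
    """Fit y ~ X @ w by mini-batch gradient descent on mean squared error."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        order = rng.permutation(n)  # reshuffle the examples every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            # Gradient of the squared error, averaged over the batch
            grad = 2.0 * Xb.T @ (Xb @ w - yb) / len(idx)
            w -= lr * grad  # one parameter update per batch
    return w

# Synthetic data (an assumption for the demo): 1000 examples, 3 features
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=1000)

w_hat = minibatch_gd(X, y)
print(w_hat)  # should land close to true_w
```

Setting batch_size to 1 turns the same loop into single-example SGD, and setting it to len(X) turns it into full-batch gradient descent.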

Why it wins

Averaging over a batch smooths the noise while keeping updates frequent. It also maps well onto hardware: matrix operations over a batch run efficiently on a GPU.

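Larger batches give smoother gradient estimates, but each update costs more and there are fewer updates per pass over the data.
Smaller batches give cheaper, more frequent updates, and their extra noise can be useful.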
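
The smoothing effect can be checked numerically. The sketch below, again assuming a linear model with squared error on synthetic data, measures how far a mini-batch gradient typically lands from the full-batch gradient at a fixed point. The helper names batch_grad and grad_noise are hypothetical, made up for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
X = rng.normal(size=(n, 5))
y = X @ rng.normal(size=5) + rng.normal(size=n)
w = np.zeros(5)  # fixed point at which all gradients are evaluated

def batch_grad(idx):
    """Squared-error gradient averaged over the examples in idx."""
    Xb, yb = X[idx], y[idx]
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(idx)

full_grad = batch_grad(np.arange(n))

def grad_noise(batch_size, trials=500):
    """Mean squared distance between a mini-batch gradient and the full gradient."""
    errs = []
    for _ in range(trials):
        idx = rng.choice(n, size=batch_size, replace=False)
        errs.append(np.sum((batch_grad(idx) - full_grad) ** 2))
    return float(np.mean(errs))

noise = {b: grad_noise(b) for b in (1, 32, 256)}
for b, v in noise.items():
    print(f"batch size {b:>3}: gradient noise {v:.4f}")
```

In this setup the noise should shrink roughly in proportion to 1 / batch_size, which is why moving from 1 to 32 examples helps a lot while moving from 256 to 512 helps comparatively little.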

Choosing a size

Batch size interacts with learning rate. Bigger batches often allow a larger learning rate, but too large a batch can hurt how well the model generalizes.
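
One widely used rule of thumb for this interaction is linear scaling: when the batch size is multiplied by k, multiply the learning rate by k as well. The text above does not prescribe this rule; it is shown only as one possible starting point, and the function name scaled_lr and its parameters are hypothetical.

```python
def scaled_lr(base_lr, base_batch_size, batch_size):
    """Linear scaling heuristic: grow the learning rate in proportion to batch size."""
    return base_lr * batch_size / base_batch_size

# A learning rate tuned at batch size 256, carried over to batch size 1024
print(scaled_lr(0.1, 256, 1024))
```

A rate obtained this way is best treated as an initial guess to validate on held-out data, since the rule tends to break down for very large batches.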

Mini-batches are the default in modern deep learning training loops.

Key idea

Mini-batch gradient descent averages gradients over a small batch, balancing the smoothness of full-batch gradient descent against the speed of single-example SGD, and it maps well onto parallel hardware.
