Machine Learning

SGD Versus Minibatch

Trading off gradient noise, speed, and hardware use in gradient descent.

Three flavors

The variants of gradient descent differ in how many training examples they use to compute each update.

  • Batch gradient descent uses the whole dataset per step. Each step follows the exact gradient of the training loss but is expensive to compute.
  • Stochastic gradient descent, or SGD, uses one example per step. Steps are cheap but noisy.
  • Minibatch gradient descent uses a small group of examples per step. It is the common middle ground.
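
The three variants can share one training loop and differ only in the batch size. The sketch below assumes a toy one-parameter model y ≈ w·x with squared-error loss; the names `gradient` and `train`, the learning rate, and the noise-free data are illustrative choices, not part of the lesson.

```python
import random

def gradient(w, batch):
    """Mean gradient of 0.5 * (w*x - y)**2 over a batch of (x, y) pairs."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)

def train(data, batch_size, lr=0.5, epochs=100, seed=0):
    """Gradient descent on a one-parameter linear model y ~ w * x.

    batch_size == len(data) -> batch gradient descent
    batch_size == 1         -> stochastic gradient descent (SGD)
    anything in between     -> minibatch gradient descent
    """
    rng = random.Random(seed)
    data = list(data)  # copy so the caller's list is not reordered
    w = 0.0
    updates = 0
    for _ in range(epochs):
        rng.shuffle(data)  # new example order every epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            w -= lr * gradient(w, batch)
            updates += 1
    return w, updates

# Toy, noise-free data generated from y = 3x (an assumption for illustration).
data = [(x / 10.0, 3.0 * x / 10.0) for x in range(1, 11)]

results = {
    "batch": train(data, batch_size=len(data)),
    "sgd": train(data, batch_size=1),
    "minibatch": train(data, batch_size=5),
}
```

With the same number of epochs, the batch run makes one update per epoch, SGD makes one per example, and the minibatch run sits in between. On this noise-free toy problem all three approach w = 3; the difference that matters here is how many updates each one makes and how much data each update reads.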

Why noise can help

The noise in SGD is not purely bad.

  • It can help the optimization path escape shallow local minima and saddle points.
  • It acts as a mild regularizer, steering training away from sharp minima that tend to overfit.
  • But pure SGD updates are jittery and underuse modern hardware.

Why minibatch wins in practice

  • A batch of, say, 64 examples uses vectorized hardware efficiently.
  • Averaging over the batch gives smoother gradients than single examples.
  • The batch size becomes a knob balancing noise, speed, and memory.
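
The smoothing effect of averaging can be checked empirically. The sketch below is a rough illustration under assumed conditions: synthetic data from y = 3x plus Gaussian noise, a fixed weight w = 0, and hypothetical helper names `minibatch_gradient` and `gradient_spread`. It draws many random batches of each size and measures how much the resulting gradient estimates vary.

```python
import random
import statistics

def minibatch_gradient(w, batch):
    """Mean gradient of 0.5 * (w*x - y)**2 over a batch of (x, y) pairs."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)

def gradient_spread(data, w, batch_size, trials=2000, seed=0):
    """Standard deviation of the gradient estimate across random batches."""
    rng = random.Random(seed)
    grads = [
        minibatch_gradient(w, rng.sample(data, batch_size))
        for _ in range(trials)
    ]
    return statistics.pstdev(grads)

# Synthetic noisy data (an assumption for illustration): y = 3x + noise.
data_rng = random.Random(1)
data = []
for _ in range(1024):
    x = data_rng.uniform(-1.0, 1.0)
    data.append((x, 3.0 * x + data_rng.gauss(0.0, 0.5)))

# Spread of the gradient estimate at w = 0 for three batch sizes.
spread = {b: gradient_spread(data, w=0.0, batch_size=b) for b in (1, 16, 64)}
```

For independent examples, the standard deviation of the averaged gradient shrinks roughly as one over the square root of the batch size, so a batch of 64 should be about eight times less noisy than a single example. The returns diminish: each further halving of the noise costs four times as many examples per step.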

Practical notes

  • Larger batches usually call for a larger learning rate, often reached through a warmup period.
  • Shuffle the data each epoch so batches stay representative.
  • Very large batches can hurt generalization if not tuned carefully.
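
The first two notes can be sketched in a few lines. The helper names `shuffled_batches` and `warmup_lr`, the default values, and the choice to scale the learning rate in proportion to batch size (a commonly used heuristic, not something the lesson prescribes) are all assumptions made for illustration.

```python
import random

def shuffled_batches(data, batch_size, rng):
    """Yield the minibatches for one epoch from a freshly shuffled copy."""
    order = list(data)
    rng.shuffle(order)
    for start in range(0, len(order), batch_size):
        yield order[start:start + batch_size]

def warmup_lr(step, batch_size, base_lr=0.1, base_batch_size=64,
              warmup_steps=100):
    """Learning rate scaled with batch size, ramped up linearly at the start.

    The target rate is base_lr * batch_size / base_batch_size (a heuristic).
    During the first warmup_steps updates the rate grows linearly to it.
    """
    target = base_lr * batch_size / base_batch_size
    if step < warmup_steps:
        return target * (step + 1) / warmup_steps
    return target

rng = random.Random(0)
examples = list(range(10))

# One epoch: every example appears exactly once, in a shuffled order.
epoch = list(shuffled_batches(examples, batch_size=4, rng=rng))

# Batch size 256 is 4x the assumed base of 64, so the target rate is 4x.
first_lr = warmup_lr(step=0, batch_size=256)
final_lr = warmup_lr(step=100, batch_size=256)
```

Calling `shuffled_batches` again with the same `rng` yields a different order for the next epoch. The last batch of an epoch may be smaller than the rest when the dataset size is not a multiple of the batch size.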

Key idea

Minibatch gradient descent balances the accurate gradients of full-batch descent against the helpful noise of single-example SGD, while using hardware efficiently.

Check yourself

1. How many examples does pure SGD use per update?

2. Why is minibatch usually preferred in practice?