Gradient Descent Intuition
Gradient descent is the workhorse optimization method behind most machine learning. Its idea is wonderfully simple: to minimize a loss, repeatedly step in the direction that decreases it fastest.
The gradient is the vector of partial derivatives of the loss with respect to each parameter. It points in the direction of steepest increase of the loss. To go down, we step in the opposite direction, scaled by the learning rate. Repeat this and, as long as the learning rate is small enough, the loss generally falls toward a local minimum.
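The update described above is often written as w <- w - learning_rate * gradient(w). Here is a minimal sketch of it in Python. The bowl-shaped loss, its minimum at (3, -1), and the names loss, gradient, and gradient_descent are all made up for illustration.

```python
def loss(w):
    # Hypothetical toy loss: a bowl with its minimum at w = (3, -1).
    return (w[0] - 3.0) ** 2 + 2.0 * (w[1] + 1.0) ** 2

def gradient(w):
    # Partial derivatives of the loss with respect to w[0] and w[1].
    return [2.0 * (w[0] - 3.0), 4.0 * (w[1] + 1.0)]

def gradient_descent(w, learning_rate=0.1, steps=100):
    for _ in range(steps):
        g = gradient(w)
        # Step opposite the gradient, scaled by the learning rate.
        w = [wi - learning_rate * gi for wi, gi in zip(w, g)]
    return w

w_final = gradient_descent([0.0, 0.0])
print(w_final, loss(w_final))
```

Starting from (0, 0), the parameters should end up very close to (3, -1), because every step shrinks the remaining distance to the minimum by a constant factor.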
There are three common variants, which trade the accuracy of each gradient estimate against the cost of each step:
- Batch gradient descent uses the whole dataset for each step, giving a precise but expensive gradient.
- Stochastic gradient descent uses one example at a time, giving a noisy but cheap gradient.
- Mini-batch gradient descent uses a small group of examples, the common compromise.
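In code, the three variants can differ only in how many examples feed each gradient estimate. The sketch below assumes a one-parameter linear model and synthetic data generated from y = 2x plus noise. The data, the helper names, and the hyperparameters are hypothetical choices, not a recommended setup.

```python
import random

def make_data(n=200, seed=0):
    # Hypothetical data: y = 2x plus a little Gaussian noise.
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, 2.0 * x + rng.gauss(0.0, 0.1)) for x in xs]

def grad(w, batch):
    # Gradient of mean squared error for the model y_hat = w * x over the batch.
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

def train(data, batch_size, learning_rate=0.1, epochs=200, seed=1):
    rng = random.Random(seed)
    w = 0.0
    for _ in range(epochs):
        shuffled = data[:]
        rng.shuffle(shuffled)
        # One update per batch; batch_size decides which variant this is.
        for start in range(0, len(shuffled), batch_size):
            batch = shuffled[start:start + batch_size]
            w -= learning_rate * grad(w, batch)
    return w

data = make_data()
w_batch = train(data, batch_size=len(data))  # batch: one step per epoch
w_sgd = train(data, batch_size=1)            # stochastic: one example per step
w_mini = train(data, batch_size=32)          # mini-batch: a small group per step
print(w_batch, w_sgd, w_mini)
```

All three runs should land near the true slope of 2. The stochastic run takes many more, much cheaper steps, and its final value jitters slightly because each step sees only one example.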
The noise in stochastic and mini-batch methods is not just a flaw. It can help the optimizer escape shallow traps and saddle points where a perfectly smooth descent might stall. This is part of why mini-batch training is so effective for large neural networks.
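A small experiment can illustrate the saddle point case. The sketch below uses a made-up two-parameter loss with a saddle at (0, 0) and minima at (0, 1) and (0, -1). As a loud assumption, it adds Gaussian noise to the exact gradient as a stand-in for mini-batch sampling noise, which is a simplification of how real mini-batch noise behaves.

```python
import random

def loss(x, y):
    # Hypothetical landscape: saddle point at (0, 0), minima at (0, 1) and (0, -1).
    return x ** 2 + y ** 4 / 4.0 - y ** 2 / 2.0

def gradient(x, y):
    return 2.0 * x, y ** 3 - y

def descend(noise_std, learning_rate=0.1, steps=500, seed=0):
    rng = random.Random(seed)
    x, y = 1.0, 0.0  # start exactly on the line y = 0 that leads into the saddle
    for _ in range(steps):
        gx, gy = gradient(x, y)
        # Assumption: Gaussian noise stands in for mini-batch sampling noise.
        gx += rng.gauss(0.0, noise_std)
        gy += rng.gauss(0.0, noise_std)
        x -= learning_rate * gx
        y -= learning_rate * gy
    return x, y

smooth = descend(noise_std=0.0)
noisy = descend(noise_std=0.01)
print("smooth:", smooth, loss(*smooth))
print("noisy: ", noisy, loss(*noisy))
```

Without noise, the y component of the gradient is exactly zero along the starting line, so the smooth run slides into the saddle and stays there. With a little noise, y is nudged off zero, the nudge grows, and the run settles near one of the two minima at a lower loss.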
Convergence depends on the learning rate and the shape of the loss landscape. Too large a rate overshoots the minimum and can diverge, while too small a rate makes progress painfully slow. With a well-chosen rate, gradient descent usually finds low-loss regions even in spaces with millions of dimensions, which is remarkable given how little it computes at each step.
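The effect of the learning rate is easy to see on a one-parameter toy problem. The sketch below assumes the loss L(w) = w**2, whose gradient is 2 * w, and tries three arbitrarily chosen rates. The function name run and the specific rates are illustrative only.

```python
def run(learning_rate, steps=50, w=1.0):
    # Toy loss L(w) = w**2, whose gradient is 2 * w and whose minimum is at w = 0.
    for _ in range(steps):
        w -= learning_rate * 2.0 * w
    return w

for rate in (0.01, 0.4, 1.1):
    print(rate, run(rate))
```

On this loss each step multiplies w by (1 - 2 * learning_rate). A rate of 0.01 shrinks w slowly, 0.4 drives it to nearly zero within 50 steps, and 1.1 makes the factor larger than 1 in magnitude, so w oscillates and grows instead of converging.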
Key idea
Gradient descent minimizes loss by stepping opposite the gradient, and mini-batch noise helps it escape shallow traps on the way down.