Second Order Methods Overview
First order methods use only the gradient, that is, the slope of the loss. Second order methods also use curvature, the rate at which the slope changes, to choose better step sizes and directions.
The Hessian idea
- The Hessian is the matrix of second derivatives of the loss.
- It tells you how quickly the gradient itself changes as you move in each direction.
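- Newton's method multiplies the gradient by the inverse Hessian and subtracts the result from the current point to jump toward the minimum. On an exactly quadratic loss, one such step lands on the minimum.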
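The Newton step described in the list above can be sketched on a small quadratic. This is a minimal illustration, not a practical optimizer: the loss f(x) = 0.5 * x^T A x - b^T x, the matrix A, the vector b, and the helper names are all assumptions chosen for the example. For this loss the Hessian is simply A, so no derivative code is needed.

```python
def gradient(x, A, b):
    """Gradient of f(x) = 0.5 * x^T A x - b^T x, which is A x - b."""
    return [
        A[0][0] * x[0] + A[0][1] * x[1] - b[0],
        A[1][0] * x[0] + A[1][1] * x[1] - b[1],
    ]


def solve_2x2(H, g):
    """Solve H d = g for d using the closed-form inverse of a 2x2 matrix."""
    det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
    if det == 0:
        raise ValueError("Hessian is singular; the Newton step is undefined")
    return [
        (H[1][1] * g[0] - H[0][1] * g[1]) / det,
        (-H[1][0] * g[0] + H[0][0] * g[1]) / det,
    ]


def newton_step(x, A, b):
    """One Newton update: x_new = x - H^{-1} * gradient, with H = A here."""
    g = gradient(x, A, b)
    d = solve_2x2(A, g)
    return [x[0] - d[0], x[1] - d[1]]


# Hypothetical example problem (symmetric, positive definite A).
A = [[3.0, 1.0], [1.0, 2.0]]
b = [1.0, 1.0]

x0 = [4.0, -3.0]
x1 = newton_step(x0, A, b)
print("after one Newton step:", x1)
print("gradient there:", gradient(x1, A, b))
```

Because the loss is exactly quadratic, the gradient after one step is zero up to rounding. On a real network loss the quadratic model is only local, so a single step would not generally reach a minimum.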
Why it can help
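In a ravine the curvature is large across the walls and small along the floor. A first order method with a single learning rate must keep that rate small enough to stay stable across the walls, which makes progress along the floor slow. Curvature information lets the optimizer take a long step along the floor and a short one across the walls, often reaching the minimum in far fewer steps than first order methods.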
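The ravine picture can be made concrete with a toy comparison. The sketch below assumes a two-variable quadratic f(x, y) = 0.5 * (100 * x^2 + y^2), so the curvature is 100 across the walls and 1 along the floor. The learning rate, tolerance, starting point, and function names are illustrative choices, not values from any particular library.

```python
import math

CURV_WALL = 100.0   # curvature across the ravine walls (x direction)
CURV_FLOOR = 1.0    # curvature along the ravine floor (y direction)


def grad(p):
    """Gradient of f(x, y) = 0.5 * (CURV_WALL * x^2 + CURV_FLOOR * y^2)."""
    return (CURV_WALL * p[0], CURV_FLOOR * p[1])


def gd_step(p, lr=0.015):
    """Plain gradient descent; lr must stay below 2 / CURV_WALL to be stable."""
    g = grad(p)
    return (p[0] - lr * g[0], p[1] - lr * g[1])


def newton_step(p):
    """Newton step; the Hessian is diagonal, so dividing by each curvature inverts it."""
    g = grad(p)
    return (p[0] - g[0] / CURV_WALL, p[1] - g[1] / CURV_FLOOR)


def count_steps(step_fn, start, tol=1e-6, max_steps=10_000):
    """Apply step_fn until the gradient norm drops below tol; return the step count."""
    p = start
    steps = 0
    while math.hypot(*grad(p)) >= tol and steps < max_steps:
        p = step_fn(p)
        steps += 1
    return steps


start = (1.0, 1.0)
gd_steps = count_steps(gd_step, start)
newton_steps = count_steps(newton_step, start)
print("gradient descent steps:", gd_steps)
print("Newton steps:", newton_steps)
```

Gradient descent spends nearly all of its steps creeping along the floor, because the walls cap the learning rate. Newton's method rescales each direction by its curvature and finishes in one step on this quadratic. The gap grows with the ratio between the two curvatures.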
Why it is rare in deep learning
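The Hessian of a network with n parameters has n^2 entries, so for a large network it is too big to store, let alone invert. So practitioners use approximations. L-BFGS estimates curvature from a short history of recent parameter and gradient changes. K-FAC approximates a closely related curvature matrix, the Fisher information matrix, with a block structure in which each layer's block is a Kronecker product of two much smaller matrices. These reduce the cost but still rarely beat well tuned Adam at scale, so first order methods remain dominant while second order ideas inspire better preconditioners.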
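A back-of-the-envelope calculation shows why the full Hessian is out of reach. The sketch below assumes 32-bit floats, a hypothetical network with one billion parameters, and an L-BFGS history of 10 update pairs. All three numbers are illustrative assumptions.

```python
def gradient_bytes(n_params, bytes_per_entry=4):
    """The gradient has one entry per parameter."""
    return n_params * bytes_per_entry


def hessian_bytes(n_params, bytes_per_entry=4):
    """A dense Hessian has n_params * n_params entries."""
    return n_params * n_params * bytes_per_entry


def lbfgs_bytes(n_params, history=10, bytes_per_entry=4):
    """L-BFGS keeps `history` pairs of vectors, each of length n_params."""
    return 2 * history * n_params * bytes_per_entry


n = 10**9  # hypothetical parameter count
print(f"gradient:      {gradient_bytes(n):.3e} bytes")
print(f"dense Hessian: {hessian_bytes(n):.3e} bytes")
print(f"L-BFGS memory: {lbfgs_bytes(n):.3e} bytes")
```

Under these assumptions the gradient takes 4 GB, the dense Hessian takes 4 x 10^18 bytes (about 4 exabytes), and the L-BFGS history takes 80 GB. The approximation is far cheaper than the Hessian but still many times the cost of the gradient alone.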
Key idea
Second order methods use curvature from the Hessian to take smarter steps, but its size forces approximations that rarely beat tuned first order optimizers at scale.