The problem it solves
A neural network has many layers and thousands or millions of weights. To train it with gradient descent, you need the gradient of the loss with respect to every weight. Backpropagation computes all of them efficiently in one backward sweep.
Forward then backward
Training each batch has two phases:
- The forward pass feeds inputs through the layers to produce a prediction and a loss
- The backward pass applies the chain rule from calculus, propagating the error gradient from the output back through each layer
By reusing intermediate results, backpropagation finds every gradient in time comparable to a single forward pass, rather than recomputing from scratch for each weight.
Why the chain rule
Each layer's output depends on the previous layer's output. The chain rule multiplies local derivatives along this path, so the gradient at an early layer is a product of many terms. This is also why very deep networks can suffer from vanishing or exploding gradients.
Key idea
Backpropagation uses the chain rule in a single backward pass to compute every weight's gradient efficiently, enabling deep networks to learn.