The Chain Rule in Backprop
A neural network is a long composition of functions. To train it we need the gradient of the loss with respect to every weight, and the chain rule is the tool that delivers those gradients efficiently.
The core rule
The chain rule says the derivative of a composed function is the product of the derivatives of its parts. If the loss depends on a weight only through several intermediate values, you multiply the local derivatives along that path.
Forward then backward
- The forward pass computes activations layer by layer and caches them.
- The backward pass starts at the loss and multiplies local gradients moving toward the inputs.
- Each layer receives an incoming gradient, scales it, and passes it on.
Why it is efficient
Computing each weight gradient separately would repeat enormous amounts of work. Backpropagation reuses shared intermediate results, so the cost of all gradients is about the same as one forward pass. This reuse is what makes training networks with millions of parameters feasible.
Key idea
Backpropagation applies the chain rule to multiply local gradients backward, reusing cached values to get all weight gradients cheaply.