Backpropagation

Training a neural network by gradient descent requires the gradient of the loss with respect to every weight and bias, often millions of them. Computing each partial derivative separately would be hopeless. Backpropagation computes all of them at once, in a single sweep backward through the layers, at a cost comparable to one evaluation of the network. It is the chain rule organized to reuse work, and it is what makes training deep networks feasible.

The forward pass and the goal

Fix an input $\xv$ and its target $y$ . The forward pass runs the network recurrence and records every intermediate quantity:

\av_0 = \xv, \qquad \zv_k = \Wv_k\av_{k-1} + \bv_k, \qquad \av_k = \sigma(\zv_k), \qquad L = \ell\big(\av_L,\, y\big).

The loss $L$ is a single scalar. We want the partial derivatives $\dfrac{\partial L}{\partial \Wv_k}$ and $\dfrac{\partial L}{\partial \bv_k}$ for every layer $k$ , which together form the gradient that gradient descent steps along.

The quantity to propagate is the sensitivity of the loss to each layer’s pre-activation.

The backpropagation equations

Once the adjoints are known, the weight gradients follow immediately; and the adjoints themselves satisfy a recurrence that runs from the output back toward the input.

Statement
Proof

With the adjoints $\boldsymbol{\delta}_k$ , the gradients of the loss are

\frac{\partial L}{\partial \Wv_k} = \boldsymbol{\delta}_k\,\av_{k-1}^{\rm T}, \qquad \frac{\partial L}{\partial \bv_k} = \boldsymbol{\delta}_k,

and the adjoints obey the backward recurrence

\boldsymbol{\delta}_L = \sigma'(\zv_L)\odot \nabla_{\!\av_L}\ell, \qquad \boldsymbol{\delta}_k = \big(\Wv_{k+1}^{\rm T}\,\boldsymbol{\delta}_{k+1}\big)\odot \sigma'(\zv_k),

where $\odot$ is the entrywise product and $\sigma'$ acts entrywise.

The two passes mirror each other. The forward pass sends activations $\av_k$ forward through the weight matrices $\Wv_k$ ; the backward pass sends adjoints $\boldsymbol{\delta}_k$ backward through their transposes $\Wv_{k+1}^{\rm T}$ , scaling by the local slope $\sigma'(\zv_k)$ at each layer. Each weight gradient is then the outer product of the adjoint arriving from the right with the activation arriving from the left.

Why it is efficient

Backpropagation is the special case of reverse-mode automatic differentiation applied to the network’s computation graph. The forward pass evaluates the graph and stores the intermediate values; the backward pass traverses the graph in reverse, accumulating the derivative of the single scalar output with respect to every node.

A small graph, made explicit

The widget below runs the forward and backward passes on a tiny network with one hidden unit. Black shows the forward pass: the values above each node and the weights on the edges. Accent shows the backward pass: the adjoints $\boldsymbol{\delta}$ below each node and the weight gradients below each edge. Adjust a weight and watch both passes update, then take a gradient step and watch the loss fall as each weight moves opposite its gradient.

w₁ = 1.20

b₁ = -0.40

w₂ = -1.50

b₂ = 0.60

Black: the forward pass values carried left to right. Accent: the adjoints δ and the weight gradients carried right to left. Each weight gradient is the adjoint to its right times the activation to its left. Gradient step moves every weight by −0.25·∂L/∂w, and the loss L = 0.437 drops toward zero.

Every quantity the step uses is read off the graph: the gradient on each weight is the product of the adjoint at the node to its right and the activation at the node to its left, exactly the outer-product rule. Backpropagation is this bookkeeping carried out over the whole network at once.