Skip to content

Neural Networks

A neural network is a function assembled from the two operations this course has studied throughout: a linear map, and a single fixed nonlinearity applied in between. Stacking these in alternation produces a parameterized family of functions rich enough to fit essentially any input-output relationship, yet built entirely from matrix multiplications and one scalar function. Everything that follows, the training by stochastic gradient descent and the gradient computation by backpropagation, rests on this structure.

Each layer first takes a linear combination of the previous layer’s outputs, shifts by a bias, and then bends the result through σ\sigma. The width of layer kk is the number of rows of Wk\Wv_k, the count of neurons in that layer; the depth is the number of layers LL.

The nonlinearity is not optional. Without it, each ak=zk=Wkak1+bk\av_k = \zv_k = \Wv_k\av_{k-1} + \bv_k is affine, and a composition of affine maps is again affine:

F(x)=WL(WL1()+bL1)+bL=Wx+bF(\xv) = \Wv_L(\Wv_{L-1}(\cdots) + \bv_{L-1}) + \bv_L = \Wv\xv + \bv

for a single W=WLW1\Wv = \Wv_L\cdots\Wv_1 and matching b\bv. The whole deep network would collapse to one linear layer. The function σ\sigma between the linear maps is exactly what prevents this collapse and lets depth add expressive power.

The standard choice is the rectified linear unit σ(t)=max(0,t)\sigma(t) = \max(0, t), which passes positives unchanged and zeros out negatives. It is itself piecewise linear, with a single kink at the origin, and this property propagates through the whole network.

A single hidden layer already makes this concrete. With mm hidden ReLU units and a scalar output,

F(x)=j=1mcjmax ⁣(0, wjTx+bj)+d,F(\xv) = \sum_{j=1}^{m} c_j \,\max\!\big(0,\ \wv_j^{\rm T}\xv + b_j\big) + d,

a sum of mm hinge functions. In one dimension each hinge max(0,wjx+bj)\max(0, w_j x + b_j) is flat until its breakpoint x=bj/wjx = -b_j/w_j and linear afterward, so FF is a continuous piecewise-linear curve with up to mm breakpoints whose locations and slopes the parameters control.

Letting the width grow makes the piecewise-linear curve approximate any continuous target.

Width buys approximation, but depth buys it more economically. Composing layers lets the partition of the input space refine multiplicatively: each added layer can fold the regions created by the previous ones, so the number of linear pieces a deep network realizes grows like a product across layers rather than a sum. Functions with that many pieces would require an exponentially wider shallow network to match. This is the quantitative reason depth helps.

Fit a one-hidden-layer ReLU network to the target curve below. The hidden weights set where the hinges bend; the output weights, solved by least squares once the hinges are fixed, set how they combine. Increasing the number of units adds breakpoints and drives the piecewise-linear fit toward the curve.

target  network fit (ticks mark the hinge breakpoints). Root-mean-square error 0.065 with 6 ReLU units. More units add breakpoints and lower the error toward zero.

Training chooses the parameters to fit data. Given examples (xi,yi)(\xv_i, y_i) for i=1,,Ni = 1, \dots, N and a loss \ell measuring the gap between a prediction and its target, the network is fit by minimizing the empirical risk

minW1,b1,,WL,bL 1Ni=1N(F(xi),yi).\min_{\Wv_1, \bv_1, \dots, \Wv_L, \bv_L} \ \frac{1}{N}\sum_{i=1}^{N} \ell\big(F(\xv_i), \, y_i\big).

A feedforward network: an input layer, two hidden layers of neurons joined by weighted edges, and an output, with one affine map and one nonlinearity per layer

Each edge carries a weight in some Wk\Wv_k; each neuron sums its inputs, adds a bias, and applies σ\sigma. The map from input to output is the composition FF, and training adjusts every weight to reduce the loss.

Unlike least squares, this objective is generally nonconvex: FF depends on the weights through repeated products and nonlinearities, so the loss surface has many critical points rather than a single minimum. There is no normal-equation formula for the optimum. Instead the parameters are updated by stochastic gradient descent, which needs the gradient of the loss with respect to every weight. Computing that gradient efficiently, in a single sweep backward through the layers, is backpropagation.