Neural Networks
A neural network is a function assembled from the two operations this course has studied throughout: a linear map, and a single fixed nonlinearity applied in between. Stacking these in alternation produces a parameterized family of functions rich enough to fit essentially any input-output relationship, yet built entirely from matrix multiplications and one scalar function. Everything that follows, the training by stochastic gradient descent and the gradient computation by backpropagation, rests on this structure.
The structure of a deep network
Each layer first takes a linear combination of the previous layer’s outputs, shifts by a bias, and then bends the result through $\sigma$: writing $h_0 = x$ for the input, layer $k$ computes $h_k = \sigma(W_k h_{k-1} + b_k)$. The width of layer $k$ is the number of rows of $W_k$, the count of neurons in that layer; the depth is the number of layers $L$.
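The layer-by-layer computation described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the ReLU nonlinearity, the random weights, and the names `forward`, `weights`, `biases`, and `sizes` are all assumptions made for the example.

```python
import numpy as np

def relu(z):
    """Elementwise nonlinearity; ReLU is assumed here for illustration."""
    return np.maximum(0.0, z)

def forward(x, weights, biases):
    """Apply each layer in turn: linear map, bias shift, then nonlinearity."""
    v = x
    for W, b in zip(weights, biases):
        v = relu(W @ v + b)
    return v

rng = np.random.default_rng(0)
sizes = [3, 5, 4, 2]  # hypothetical: input dimension 3, then layers of width 5, 4, 2
weights = [rng.standard_normal((n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal(n_out) for n_out in sizes[1:]]

y = forward(rng.standard_normal(3), weights, biases)
widths = [W.shape[0] for W in weights]  # width = number of rows of each weight matrix
depth = len(weights)                    # depth = number of layers
```

Here the width of each layer is read off as the row count of its weight matrix, and the depth is simply the length of the weight list.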
The nonlinearity is not optional. Without it, each layer $h \mapsto W_k h + b_k$ is affine, and a composition of affine maps is again affine:
$$W_L\big(\cdots\big(W_2(W_1 x + b_1) + b_2\big)\cdots\big) + b_L = Wx + b$$
for a single $W = W_L \cdots W_2 W_1$ and matching $b$. The whole deep network would collapse to one linear layer. The function $\sigma$ between the linear maps is exactly what prevents this collapse and lets depth add expressive power.
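The collapse can be checked numerically. The sketch below, with made-up layer sizes and random weights, folds a stack of affine layers into one matrix and one bias vector and compares the two on a random input; the names `Ws`, `bs`, and `stacked` are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [3, 4, 4, 2]  # hypothetical layer sizes
Ws = [rng.standard_normal((o, i)) for i, o in zip(sizes[:-1], sizes[1:])]
bs = [rng.standard_normal(o) for o in sizes[1:]]

# Fold the layers into a single affine map x -> W x + b.
W, b = np.eye(sizes[0]), np.zeros(sizes[0])
for Wk, bk in zip(Ws, bs):
    W, b = Wk @ W, Wk @ b + bk

def stacked(x, sigma=lambda z: z):
    """Run the layers one by one; sigma defaults to the identity (no nonlinearity)."""
    v = x
    for Wk, bk in zip(Ws, bs):
        v = sigma(Wk @ v + bk)
    return v

x = rng.standard_normal(3)

# Without a nonlinearity the stack and the single layer agree up to rounding.
linear_gap = np.max(np.abs(stacked(x) - (W @ x + b)))

# With ReLU between the layers, the single affine layer in general no longer matches.
relu_gap = np.max(np.abs(stacked(x, lambda z: np.maximum(0.0, z)) - (W @ x + b)))
```

The update `W, b = Wk @ W, Wk @ b + bk` is the two-layer identity applied repeatedly, which is why the folded bias is the "matching" one.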
ReLU networks are piecewise linear
The standard choice is the rectified linear unit $\sigma(z) = \max(0, z)$, which passes positives unchanged and zeros out negatives. It is itself piecewise linear, with a single kink at the origin, and this property propagates through the whole network.
A single hidden layer already makes this concrete. With $m$ hidden ReLU units and a scalar output,
$$f(x) = \sum_{j=1}^{m} c_j \max(0,\, w_j^\top x + b_j),$$
a sum of $m$ hinge functions. In one dimension each hinge is flat on one side of its breakpoint $x = -b_j / w_j$ and linear on the other, so $f$ is a continuous piecewise-linear curve with up to $m$ breakpoints whose locations and slopes the parameters control.
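A small worked instance may help. The sketch below builds a sum of three hinges on a scalar input, with hand-picked (hypothetical) parameters `w`, `b`, and `c`, locates the breakpoints where each hinge's argument crosses zero, and measures the slope inside each linear piece.

```python
import numpy as np

# Hypothetical parameters for three hidden units on a scalar input.
w = np.array([1.0, 1.0, -2.0])   # hidden weights
b = np.array([0.0, -1.0, 4.0])   # hidden biases
c = np.array([1.0, -2.0, 0.5])   # output weights

def f(x):
    """Sum of hinges c_j * max(0, w_j * x + b_j), vectorized over x."""
    x = np.asarray(x, dtype=float)
    return np.maximum(0.0, np.outer(x, w) + b) @ c

# Each hinge bends where w_j * x + b_j = 0.
breakpoints = np.sort(-b / w)

# One probe point strictly inside each of the four linear pieces.
probes = np.array([-1.0, 0.5, 1.5, 3.0])
h = 1e-3
slopes = (f(probes + h) - f(probes - h)) / (2 * h)
# breakpoints are 0, 1, 2; slopes on the four pieces are -1, 0, -2, -1
```

On each piece the slope is the sum of `c[j] * w[j]` over the hinges that are active there, so moving a bias shifts a breakpoint and changing an output weight changes the slopes on one side of it.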
How much a network can represent
Letting the width grow makes the piecewise-linear curve approximate any continuous target on a bounded interval to any desired accuracy.
Width buys approximation, but depth buys it more economically. Composing layers lets the partition of the input space refine multiplicatively: each added layer can fold the regions created by the previous ones, so the number of linear pieces a deep network realizes grows like a product across layers rather than a sum. Functions with that many pieces would require an exponentially wider shallow network to match. This is the quantitative reason depth helps.
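One standard way to see the multiplicative growth is to compose a "tent" map with itself. This construction is an illustration chosen for this sketch and is not necessarily the one the course has in mind: each layer uses just two ReLU units, and the function names `tent`, `deep`, and `count_linear_pieces` are hypothetical.

```python
def relu(z):
    return max(0.0, z)

def tent(x):
    """One layer with two ReLU units: rises on [0, 0.5], falls on [0.5, 1]."""
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

def deep(x, depth):
    """Compose the tent layer `depth` times."""
    for _ in range(depth):
        x = tent(x)
    return x

def count_linear_pieces(fn, n_grid=4096):
    """Count slope changes on a dyadic grid over [0, 1] (exact for these maps)."""
    xs = [i / n_grid for i in range(n_grid + 1)]
    ys = [fn(x) for x in xs]
    slopes = [(ys[i + 1] - ys[i]) * n_grid for i in range(n_grid)]
    return 1 + sum(1 for s0, s1 in zip(slopes, slopes[1:]) if s0 != s1)

pieces = {d: count_linear_pieces(lambda x, d=d: deep(x, d)) for d in range(1, 6)}
# pieces == {1: 2, 2: 4, 3: 8, 4: 16, 5: 32}
```

Each added layer folds the interval once more and doubles the piece count, so depth `d` with `2 * d` hidden units in total yields `2 ** d` pieces. A single hidden layer on a scalar input has at most one breakpoint per unit, so it would need on the order of `2 ** d` units to match.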
Fit a one-hidden-layer ReLU network to the target curve below. The hidden weights set where the hinges bend; the output weights, solved by least squares once the hinges are fixed, set how they combine. Increasing the number of units adds breakpoints and drives the piecewise-linear fit toward the curve.
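The fitting procedure just described can be sketched as follows. Several choices here are assumptions made for illustration: the target is taken to be $\sin(2\pi x)$ on $[0, 1]$, the hinges are placed at evenly spaced breakpoints (hidden weight 1, bias equal to minus the breakpoint), and a constant column is added to the design matrix to play the role of an output bias.

```python
import numpy as np

def hinge_features(x, breakpoints):
    """Columns: a constant, then max(0, x - t) for each breakpoint t."""
    hinges = np.maximum(0.0, x[:, None] - breakpoints[None, :])
    return np.hstack([np.ones((x.size, 1)), hinges])

def fit_output_weights(x, y, n_units):
    """Fix the hinges at evenly spaced breakpoints, then solve least squares."""
    breakpoints = np.arange(n_units) / n_units
    A = hinge_features(x, breakpoints)
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    rmse = np.sqrt(np.mean((A @ coef - y) ** 2))
    return breakpoints, coef, rmse

x = np.linspace(0.0, 1.0, 401)
y = np.sin(2 * np.pi * x)  # assumed target curve, for illustration only

# Root-mean-square error of the fit as the number of hidden units grows.
errors = {m: fit_output_weights(x, y, m)[2] for m in (4, 8, 16, 32)}
```

Because the hinges are frozen, the model is linear in the output weights and ordinary least squares applies. Doubling the number of units here keeps the old breakpoints and adds new ones between them, so the fit cannot get worse as the width grows.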
The learning problem
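Training chooses the parameters $\theta = (W_1, b_1, \dots, W_L, b_L)$ to fit data. Given examples $(x_i, y_i)$ for $i = 1, \dots, n$ and a loss $\ell(\hat{y}, y)$ measuring the gap between a prediction and its target, the network is fit by minimizing the empirical risk
$$R(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell\big(f_\theta(x_i),\, y_i\big).$$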
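Each edge carries a weight in some $W_k$; each neuron sums its inputs, adds a bias, and applies $\sigma$. The map from input to output is the composition $f_\theta = F_L \circ \cdots \circ F_1$ of the layer maps $F_k(h) = \sigma(W_k h + b_k)$, and training adjusts every weight to reduce the loss.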
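As a concrete instance of the objective being minimized, the sketch below evaluates the average loss of a tiny one-hidden-layer ReLU network over a handful of examples. The squared loss with a factor of one half, the parameter layout `(w, b, c)`, and the function names are assumptions for this example rather than choices fixed by the text.

```python
import numpy as np

def squared_loss(pred, target):
    """Assumed loss: half the squared gap between prediction and target."""
    return 0.5 * (pred - target) ** 2

def predict(theta, x):
    """One-hidden-layer ReLU network with scalar input and scalar output."""
    w, b, c = theta
    return np.maximum(0.0, np.outer(x, w) + b) @ c

def empirical_risk(theta, xs, ys, loss=squared_loss):
    """Average loss over the training examples."""
    return np.mean(loss(predict(theta, xs), ys))

# These weights make the network compute |x| = max(0, x) + max(0, -x).
theta_abs = (np.array([1.0, -1.0]), np.zeros(2), np.ones(2))
xs = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])

risk_abs = empirical_risk(theta_abs, xs, np.abs(xs))  # targets the network fits exactly
risk_sq = empirical_risk(theta_abs, xs, xs ** 2)      # targets it misses at x = -2 and x = 2
```

With targets `|x|` the risk is zero; with targets `x ** 2` only the two outer points contribute, each with loss 2, giving an average of 0.8 over the five examples.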
Unlike least squares, this objective is generally nonconvex: the network output $f_\theta(x)$ depends on the weights through repeated products and nonlinearities, so the loss surface can have many critical points rather than a single minimum. There is no normal-equation formula for the optimum. Instead the parameters are updated by stochastic gradient descent, which needs the gradient of the loss with respect to every weight. Computing that gradient efficiently, in a single sweep backward through the layers, is backpropagation.
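To make the update rule concrete, here is a minimal sketch of stochastic gradient descent on a one-hidden-layer ReLU network with scalar input and output. The gradient is derived by hand with the chain rule for this one small architecture, which is the bookkeeping that backpropagation automates for general depth. The squared loss, step size, epoch count, starting point `theta0`, and data are all assumptions chosen for illustration.

```python
import numpy as np

def loss_and_grad(theta, x, y):
    """Half squared loss on one example and its gradient with respect to (w, b, c)."""
    w, b, c = theta
    z = w * x + b                    # hidden pre-activations
    h = np.maximum(0.0, z)           # hidden outputs after ReLU
    r = h @ c - y                    # residual: prediction minus target
    active = (z > 0).astype(float)   # ReLU derivative: 1 where the unit is on
    grad = (r * c * active * x, r * c * active, r * h)
    return 0.5 * r ** 2, grad

def sgd(theta, xs, ys, step=0.05, epochs=50, seed=0):
    """Visit the examples in random order, stepping against each example's gradient."""
    rng = np.random.default_rng(seed)
    theta = tuple(p.copy() for p in theta)
    for _ in range(epochs):
        for i in rng.permutation(len(xs)):
            _, grad = loss_and_grad(theta, xs[i], ys[i])
            theta = tuple(p - step * g for p, g in zip(theta, grad))
    return theta

def risk(theta, xs, ys):
    """Average loss over all examples."""
    return np.mean([loss_and_grad(theta, x, y)[0] for x, y in zip(xs, ys)])

# Hypothetical starting point and data (targets |x| on a grid).
theta0 = (np.array([1.0, -1.0]), np.array([0.5, 0.5]), np.array([1.0, 1.0]))
xs = np.linspace(-1.0, 1.0, 21)
ys = np.abs(xs)

risk_before = risk(theta0, xs, ys)
theta_trained = sgd(theta0, xs, ys)
risk_after = risk(theta_trained, xs, ys)
```

Each update uses a single example, so one step costs the same regardless of the size of the data set. Because the objective is nonconvex, where the iterates end up can depend on the starting point and the step size, so comparing `risk_before` with `risk_after` is a sanity check rather than a guarantee.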