Neural Networks

A neural network is a function assembled from the two operations this course has studied throughout: a linear map, and a single fixed nonlinearity applied in between. Stacking these in alternation produces a parameterized family of functions rich enough to fit essentially any input-output relationship, yet built entirely from matrix multiplications and one scalar function. Everything that follows, the training by stochastic gradient descent and the gradient computation by backpropagation, rests on this structure.

The structure of a deep network

Each layer first takes a linear combination of the previous layer’s outputs, shifts by a bias, and then bends the result through $\sigma$ . The width of layer $k$ is the number of rows of $\Wv_k$ , the count of neurons in that layer; the depth is the number of layers $L$ .

The nonlinearity is not optional. Without it, each $\av_k = \zv_k = \Wv_k\av_{k-1} + \bv_k$ is affine, and a composition of affine maps is again affine:

F(\xv) = \Wv_L(\Wv_{L-1}(\cdots) + \bv_{L-1}) + \bv_L = \Wv\xv + \bv

for a single $\Wv = \Wv_L\cdots\Wv_1$ and matching $\bv$ . The whole deep network would collapse to one linear layer. The function $\sigma$ between the linear maps is exactly what prevents this collapse and lets depth add expressive power.

ReLU networks are piecewise linear

The standard choice is the rectified linear unit $\sigma(t) = \max(0, t)$ , which passes positives unchanged and zeros out negatives. It is itself piecewise linear, with a single kink at the origin, and this property propagates through the whole network.

Statement
Proof

If every $\sigma$ is the ReLU, then $F : \R^{n_0} \to \R^{n_L}$ is continuous and piecewise linear: the input space $\R^{n_0}$ splits into finitely many polyhedral regions, and $F$ is an affine map on each region.

A single hidden layer already makes this concrete. With $m$ hidden ReLU units and a scalar output,

F(\xv) = \sum_{j=1}^{m} c_j \,\max\!\big(0,\ \wv_j^{\rm T}\xv + b_j\big) + d,

a sum of $m$ hinge functions. In one dimension each hinge $\max(0, w_j x + b_j)$ is flat until its breakpoint $x = -b_j/w_j$ and linear afterward, so $F$ is a continuous piecewise-linear curve with up to $m$ breakpoints whose locations and slopes the parameters control.

How much a network can represent

Letting the width grow makes the piecewise-linear curve approximate any continuous target.

Statement
Proof

Let $f$ be continuous on a compact set $K \subset \R^{n_0}$ and let $\varepsilon > 0$ . There is a one-hidden-layer network $F$ with finitely many ReLU units such that $\lvert F(\xv) - f(\xv)\rvert < \varepsilon$ for all $\xv \in K$ .

Width buys approximation, but depth buys it more economically. Composing layers lets the partition of the input space refine multiplicatively: each added layer can fold the regions created by the previous ones, so the number of linear pieces a deep network realizes grows like a product across layers rather than a sum. Functions with that many pieces would require an exponentially wider shallow network to match. This is the quantitative reason depth helps.

Fit a one-hidden-layer ReLU network to the target curve below. The hidden weights set where the hinges bend; the output weights, solved by least squares once the hinges are fixed, set how they combine. Increasing the number of units adds breakpoints and drives the piecewise-linear fit toward the curve.

hidden units m = 6

target network fit (ticks mark the hinge breakpoints). Root-mean-square error 0.065 with 6 ReLU units. More units add breakpoints and lower the error toward zero.

The learning problem

Training chooses the parameters to fit data. Given examples $(\xv_i, y_i)$ for $i = 1, \dots, N$ and a loss $\ell$ measuring the gap between a prediction and its target, the network is fit by minimizing the empirical risk

\min_{\Wv_1, \bv_1, \dots, \Wv_L, \bv_L} \ \frac{1}{N}\sum_{i=1}^{N} \ell\big(F(\xv_i), \, y_i\big).

A feedforward network: an input layer, two hidden layers of neurons joined by weighted edges, and an output, with one affine map and one nonlinearity per layer

Each edge carries a weight in some $\Wv_k$ ; each neuron sums its inputs, adds a bias, and applies $\sigma$ . The map from input to output is the composition $F$ , and training adjusts every weight to reduce the loss.

Unlike least squares, this objective is generally nonconvex: $F$ depends on the weights through repeated products and nonlinearities, so the loss surface has many critical points rather than a single minimum. There is no normal-equation formula for the optimum. Instead the parameters are updated by stochastic gradient descent, which needs the gradient of the loss with respect to every weight. Computing that gradient efficiently, in a single sweep backward through the layers, is backpropagation.