Least Squares

When there are more equations than unknowns, $\Av\xv = \bv$ usually has no solution: $\bv$ does not lie in the column space $C(\Av)$. Rather than give up, we ask for the $\xv$ that comes closest, minimizing the residual in the Euclidean norm. This single problem has four standard solutions, and seeing that they agree ties together projection, orthogonality, $\Av = \Qv\Rv$, and the SVD.

1. The normal equations

The objective $f(\xv) = \lVert \Av\xv - \bv \rVert^2 = (\Av\xv - \bv)^{\rm T}(\Av\xv - \bv)$ is a convex quadratic, so its minimizers are exactly its stationary points. Setting the gradient to zero, $\nabla f = 2\Av^{\rm T}(\Av\xv - \bv) = 0$, gives the normal equations.
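As a short worked step (a sketch using only the symbols already defined here), expanding the quadratic shows where the gradient comes from:

$$
f(\xv) = \xv^{\rm T}\Av^{\rm T}\Av\,\xv - 2\,\bv^{\rm T}\Av\,\xv + \bv^{\rm T}\bv,
\qquad
\nabla f(\xv) = 2\Av^{\rm T}\Av\,\xv - 2\Av^{\rm T}\bv = 2\Av^{\rm T}(\Av\xv - \bv).
$$

The Hessian $2\Av^{\rm T}\Av$ is positive semidefinite, which is the convexity used above.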


$\hat\xv$ minimizes $\lVert \Av\xv - \bv \rVert^2$ if and only if

$$\Av^{\rm T}\Av\,\hat\xv = \Av^{\rm T}\bv.$$

If the columns of $\Av$ are independent, $\Av^{\rm T}\Av$ is invertible and the solution is unique:

$$\hat\xv = (\Av^{\rm T}\Av)^{-1}\Av^{\rm T}\bv.$$
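A minimal NumPy sketch of the normal equations, assuming a small made-up data set (the three points below are hypothetical, chosen so the arithmetic can be checked by hand). It fits a line $C + Dt$ by solving $\Av^{\rm T}\Av\,\hat\xv = \Av^{\rm T}\bv$ directly.

```python
import numpy as np

# Hypothetical data: fit y ~ C + D*t through (0, 6), (1, 0), (2, 0).
t = np.array([0.0, 1.0, 2.0])
b = np.array([6.0, 0.0, 0.0])

# Each data point contributes a row (1, t_i) to A.
A = np.column_stack([np.ones_like(t), t])

# Normal equations: (A^T A) x_hat = A^T b.
AtA = A.T @ A          # [[3, 3], [3, 5]] for this data
Atb = A.T @ b          # [6, 0] for this data
x_hat = np.linalg.solve(AtA, Atb)

# Error vector e = b - A x_hat; it should be orthogonal to the columns of A.
residual = b - A @ x_hat

print("x_hat    =", x_hat)
print("residual =", residual)
print("A^T e    =", A.T @ residual)
```

Solving the $2 \times 2$ system by hand gives $C = 5$, $D = -3$, so the printed `x_hat` should agree with that up to round-off.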

2. The geometry: projection onto the column space

The normal equations say $\Av^{\rm T}(\bv - \Av\hat\xv) = 0$, i.e. the error $\ev = \bv - \Av\hat\xv$ is orthogonal to every column of $\Av$. So $\ev \perp C(\Av)$, and the fitted vector $\pv = \Av\hat\xv$ is the orthogonal projection of $\bv$ onto the column space: the closest point of $C(\Av)$ to $\bv$.

The data vector b, its orthogonal projection p onto the plane C(A), and the perpendicular error e = b − p

Least squares splits $\bv$ into a part $\pv = \Av\hat\xv$ inside $C(\Av)$ and a part $\ev = \bv - \pv$ orthogonal to it. Minimizing $\lVert \Av\xv - \bv\rVert$ is choosing the point of the plane nearest $\bv$, the foot of the perpendicular.

When $\Av^{\rm T}\Av$ is invertible, substituting $\hat\xv$ gives $\pv = \Pv\bv$ with the projection matrix

$$\Pv = \Av(\Av^{\rm T}\Av)^{-1}\Av^{\rm T},$$

which satisfies $\Pv = \Pv^{\rm T}$ and $\Pv^2 = \Pv$, the algebraic signature of an orthogonal projection.
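The two identities can be checked numerically. The sketch below assumes the same kind of small hypothetical example (a $3 \times 2$ matrix with independent columns) and forms $\Pv$ with an explicit inverse purely for illustration; that is not how one would compute a projection in practice.

```python
import numpy as np

# Hypothetical 3x2 example with independent columns.
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([6.0, 0.0, 0.0])

# P = A (A^T A)^{-1} A^T. The explicit inverse is for illustration only.
P = A @ np.linalg.inv(A.T @ A) @ A.T

p = P @ b      # projection of b onto C(A)
e = b - p      # error, orthogonal to C(A)

is_symmetric = np.allclose(P, P.T)               # P = P^T
is_idempotent = np.allclose(P @ P, P)            # P^2 = P
error_is_orthogonal = np.allclose(A.T @ e, 0.0)  # e is perpendicular to every column

print(is_symmetric, is_idempotent, error_is_orthogonal)
print("p =", p, " e =", e)
```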

Fitting a line or parabola to data is exactly this. Each data point $(t_i, y_i)$ contributes a row $\begin{pmatrix} 1 & t_i & \cdots \end{pmatrix}$ to $\Av$ and an entry $y_i$ to $\bv$; the least-squares coefficients place the curve so that the sum of the squared vertical residuals is as small as possible.

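Interactive figure: a line fitted to draggable data points. The line minimizes the total squared residual ‖e‖² (the display reads ‖e‖ = 4.99 for the points shown); the red segments are the errors eᵢ, the components of the residual e = b − p left after projecting the data onto the column space of A.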
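The row-per-data-point construction can be sketched in NumPy for a parabola fit. The data values are invented for illustration, and `np.linalg.lstsq` is used as a convenient library solver rather than as the method this page prescribes.

```python
import numpy as np

# Hypothetical data points (t_i, y_i), invented for illustration.
t = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = np.array([4.1, 0.9, 0.2, 1.1, 3.8])

# Design matrix with columns 1, t, t^2: one row (1, t_i, t_i^2) per point.
A = np.vander(t, 3, increasing=True)

# Least-squares coefficients (c0, c1, c2) of c0 + c1*t + c2*t^2.
coeffs, _, rank, _ = np.linalg.lstsq(A, y, rcond=None)

# Vertical residuals e_i = y_i - fitted value at t_i.
residual = y - A @ coeffs

print("coefficients (c0, c1, c2):", coeffs)
print("residual norm ||e||:", np.linalg.norm(residual))
```

Fitting a line instead only changes the number of columns: `np.vander(t, 2, increasing=True)` gives the rows $(1, t_i)$.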

3. Through QR

Forming $\Av^{\rm T}\Av$ explicitly is the numerically fragile step. With the factorization $\Av = \Qv\Rv$ ($\Qv$ with orthonormal columns, $\Rv$ upper triangular and invertible when the columns of $\Av$ are independent), the normal equations collapse: $\Av^{\rm T}\Av = \Rv^{\rm T}\Qv^{\rm T}\Qv\Rv = \Rv^{\rm T}\Rv$ and $\Av^{\rm T}\bv = \Rv^{\rm T}\Qv^{\rm T}\bv$, so

$$\Rv^{\rm T}\Rv\,\hat\xv = \Rv^{\rm T}\Qv^{\rm T}\bv \;\Longrightarrow\; \Rv\hat\xv = \Qv^{\rm T}\bv.$$

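The implication cancels $\Rv^{\rm T}$, which is invertible because $\Rv$ is. What remains is one triangular solve by back substitution, and it never forms $\Av^{\rm T}\Av$, so it avoids the conditioning penalty below. Here the projection is $\pv = \Qv\Qv^{\rm T}\bv$.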
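A sketch of the QR route, assuming NumPy's reduced factorization `np.linalg.qr` and a hand-written back substitution (the helper `back_substitute` is a hypothetical name introduced here so that the single triangular solve is visible):

```python
import numpy as np

def back_substitute(R, c):
    """Solve R x = c for an invertible upper-triangular R."""
    n = R.shape[0]
    x = np.zeros(n)
    # Work upward from the last row, where only one unknown remains.
    for i in range(n - 1, -1, -1):
        x[i] = (c[i] - R[i, i + 1:] @ x[i + 1:]) / R[i, i]
    return x

# Hypothetical 3x2 example with independent columns.
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([6.0, 0.0, 0.0])

# Reduced QR: Q is 3x2 with orthonormal columns, R is 2x2 upper triangular.
Q, R = np.linalg.qr(A)

# R x_hat = Q^T b, solved without ever forming A^T A.
x_hat = back_substitute(R, Q.T @ b)

# Projection of b onto the column space: p = Q Q^T b.
p = Q @ (Q.T @ b)

print("x_hat =", x_hat)
print("p     =", p)
```

For this data the result should match the normal-equations solution $(5, -3)$ up to round-off, since both routes solve the same problem.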

4. Through the SVD and pseudoinverse

The most general route uses $\Av = \Uv\Sigmav\Vv^{\rm T}$. The least-squares solution of minimum norm is

$$\hat\xv = \Av^{+}\bv = \Vv\Sigmav^{+}\Uv^{\rm T}\bv,$$

where $\Av^{+}$ is the pseudoinverse and $\Sigmav^{+}$ inverts the nonzero singular values of $\Sigmav$, leaving the zeros in place. Unlike the explicit formulas of the first three routes, this needs no independence assumption: when the columns are dependent there are infinitely many least-squares solutions, and $\Av^+\bv$ selects the shortest one. This unifies least squares with the underdetermined case and is the subject of the next page.
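A sketch of the minimum-norm property on a deliberately rank-deficient example. The matrix, the right-hand side, and the relative cutoff `1e-12` for treating a singular value as zero are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical rank-1 matrix: the second column is twice the first.
A = np.array([[1.0, 2.0],
              [1.0, 2.0],
              [1.0, 2.0]])
b = np.array([1.0, 2.0, 3.0])

# Reduced SVD: A = U diag(s) V^T.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Sigma^+: invert singular values above a cutoff, zero out the rest.
# The relative tolerance 1e-12 is an assumed choice, not a standard constant.
tol = 1e-12 * s.max()
s_plus = np.array([1.0 / sigma if sigma > tol else 0.0 for sigma in s])

# Minimum-norm least-squares solution x = V Sigma^+ U^T b.
x_min = Vt.T @ (s_plus * (U.T @ b))

# Another least-squares solution: any x with x1 + 2*x2 = mean(b) = 2.
x_other = np.array([2.0, 0.0])

print("x_min           =", x_min)
print("same fit?        ", np.allclose(A @ x_min, A @ x_other))
print("norms (min, other):", np.linalg.norm(x_min), np.linalg.norm(x_other))
```

Both vectors give the same fitted values $\Av\xv$, hence the same residual. Working by hand, the shortest solution lies along $(1, 2)$ and equals $(0.4, 0.8)$, and `np.linalg.pinv(A) @ b` should return the same vector.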

When the routes disagree numerically

All four give the same $\hat\xv$ in exact arithmetic. They differ in stability. When $\Av$ has independent columns, the normal-equations matrix $\Av^{\rm T}\Av$ has 2-norm condition number

$$\kappa(\Av^{\rm T}\Av) = \kappa(\Av)^2,$$

so when the columns are nearly dependent and $\kappa(\Av)$ is large, squaring it produces a far worse-conditioned problem, and small errors in $\bv$ or in storing $\Av^{\rm T}\Av$ are amplified accordingly.
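The squaring can be observed numerically. The sketch below assumes a small polynomial-fitting matrix (monomials $1, t, \dots, t^4$ on 20 equally spaced points, a hypothetical choice whose columns are noticeably close to dependent) and compares 2-norm condition numbers with `np.linalg.cond`.

```python
import numpy as np

# Hypothetical design matrix: monomials 1, t, ..., t^4 at 20 points in [0, 1].
t = np.linspace(0.0, 1.0, 20)
A = np.vander(t, 5, increasing=True)

# 2-norm condition numbers (ratio of largest to smallest singular value).
kappa_A = np.linalg.cond(A)
kappa_AtA = np.linalg.cond(A.T @ A)

print("kappa(A)     =", kappa_A)
print("kappa(A)^2   =", kappa_A ** 2)
print("kappa(A^T A) =", kappa_AtA)
```

The last two printed values should agree closely, which is the identity above seen in floating point. Raising the polynomial degree makes both grow quickly.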