Skip to content

Least Squares

When there are more equations than unknowns, Ax=b\Av\xv = \bv usually has no solution: b\bv does not lie in the column space C(A)C(\Av). Rather than give up, we ask for the x\xv that comes closest, minimizing the residual in the Euclidean norm. This single problem has four standard solutions, and seeing that they agree ties together projection, orthogonality, A=QR\Av = \Qv\Rv, and the SVD.

The objective f(x)=Axb2=(Axb)T(Axb)f(\xv) = \lVert \Av\xv - \bv \rVert^2 = (\Av\xv - \bv)^{\rm T}(\Av\xv - \bv) is a convex quadratic, so its minimizers are exactly its stationary points. Setting the gradient to zero, f=2AT(Axb)=0\nabla f = 2\Av^{\rm T}(\Av\xv - \bv) = 0, gives the normal equations.

2. The geometry: projection onto the column space

Section titled “2. The geometry: projection onto the column space”

The normal equations say AT(bAx^)=0\Av^{\rm T}(\bv - \Av\hat\xv) = 0, i.e. the error e=bAx^\ev = \bv - \Av\hat\xv is orthogonal to every column of A\Av. So eC(A)\ev \perp C(\Av), and the fitted vector p=Ax^\pv = \Av\hat\xv is the orthogonal projection of b\bv onto the column space: the closest point of C(A)C(\Av) to b\bv.

The data vector b, its orthogonal projection p onto the plane C(A), and the perpendicular error e = b − p

Least squares splits b\bv into a part p=Ax^\pv = \Av\hat\xv inside C(A)C(\Av) and a part e=bp\ev = \bv - \pv orthogonal to it. Minimizing Axb\lVert \Av\xv - \bv\rVert is choosing the point of the plane nearest b\bv, the foot of the perpendicular.

When ATA\Av^{\rm T}\Av is invertible, substituting x^\hat\xv gives p=Pb\pv = \Pv\bv with the projection matrix

P=A(ATA)1AT,\Pv = \Av(\Av^{\rm T}\Av)^{-1}\Av^{\rm T},

which satisfies P=PT\Pv = \Pv^{\rm T} and P2=P\Pv^2 = \Pv, the algebraic signature of an orthogonal projection.

Fitting a line or parabola to data is exactly this. Each data point (ti,yi)(t_i, y_i) contributes a row (1ti)\begin{pmatrix} 1 & t_i & \cdots \end{pmatrix} to A\Av and an entry yiy_i to b\bv; the least-squares coefficients place the curve so the vertical residuals have the smallest total square. Drag the points below and watch the fit and the residual norm respond.

224466881010
Drag the points. The line minimizes the total squared residual ‖e‖ = 4.99; the red segments are the errors eᵢ, the residual of the projection of the data onto the column space of A.

Forming ATA\Av^{\rm T}\Av explicitly is the numerically fragile step. With the factorization A=QR\Av = \Qv\Rv (Q\Qv orthonormal columns, R\Rv upper triangular and invertible when columns are independent), the normal equations collapse: ATA=RTQTQR=RTR\Av^{\rm T}\Av = \Rv^{\rm T}\Qv^{\rm T}\Qv\Rv = \Rv^{\rm T}\Rv and ATb=RTQTb\Av^{\rm T}\bv = \Rv^{\rm T}\Qv^{\rm T}\bv, so

RTRx^=RTQTb    Rx^=QTb.\Rv^{\rm T}\Rv\,\hat\xv = \Rv^{\rm T}\Qv^{\rm T}\bv \;\Longrightarrow\; \Rv\hat\xv = \Qv^{\rm T}\bv.

This is one triangular solve, and it never forms ATA\Av^{\rm T}\Av, so it avoids the conditioning penalty below. Here the projection is p=QQTb\pv = \Qv\Qv^{\rm T}\bv.

The most general route uses A=UΣVT\Av = \Uv\Sigmav\Vv^{\rm T}. The least-squares solution of minimum norm is

x^=A+b=VΣ+UTb,\hat\xv = \Av^{+}\bv = \Vv\Sigmav^{+}\Uv^{\rm T}\bv,

where A+\Av^{+} is the pseudoinverse. Unlike the first three routes, this needs no independence assumption: when the columns are dependent there are infinitely many least-squares solutions, and A+b\Av^+\bv selects the shortest one. This unifies least squares with the underdetermined case and is the subject of the next page.

All four give the same x^\hat\xv in exact arithmetic. They differ in stability. The normal-equations matrix ATA\Av^{\rm T}\Av has condition number

κ(ATA)=κ(A)2,\kappa(\Av^{\rm T}\Av) = \kappa(\Av)^2,

so nearly dependent columns, large κ(A)\kappa(\Av), are squared into a much worse problem, and small errors in b\bv or in storing ATA\Av^{\rm T}\Av are amplified.