Skip to content

Convexity and Saddle Points

Optimization in this section means minimizing a smooth function f:RnRf : \R^n \to \R. Two pieces of geometric structure decide the difficulty completely: whether the function is convex (every local minimum is global), and whether its critical points include saddles (zero gradient but no minimum). Both are characterized by the spectrum of the Hessian, which is why so much of optimization rests on the linear algebra of symmetric matrices.

Half-spaces, balls, ellipsoids, intersections of convex sets, and the positive semidefinite cone are all convex. Discrete sets and the union of two disjoint balls are not.

The convexity condition is exactly Jensen’s inequality restricted to two-point measures. For smooth ff it can be replaced by either of two pointwise tests.

The first-order condition is the geometric content of convexity: the graph lies above its tangent plane.

A convex parabola with a tangent line lying entirely below it, touching at one point

The tangent at any point of a convex graph is a global lower bound; equality holds only at the point of tangency. This is the inequality f(y)f(x)+f(x)T(yx)f(\yv) \ge f(\xv) + \nabla f(\xv)^{\rm T}(\yv - \xv) in one variable.

The single best consequence of convexity.

For a differentiable convex ff, then, the equation f(x)=0\nabla f(\xv^\star) = \mathbf{0} is sufficient for x\xv^\star to be a global minimum, not merely necessary.

Without convexity, a zero gradient says only that x\xv is a critical point. The Hessian distinguishes the three possibilities.

At a saddle, the function decreases along the eigenvectors with negative eigenvalues and increases along those with positive eigenvalues. Gradient descent is therefore attracted to saddles (the gradient vanishes), and small noise or careful initialization is what allows iterative methods to escape them. In high-dimensional non-convex problems (deep networks) saddles are the dominant kind of critical point, far outnumbering minima.

The eigenvalues of a symmetric matrix have a clean variational characterization that links optimization back to spectral theory. Recall the Rayleigh quotient

R(x)  =  xTAxxTx,x0.R(\xv) \;=\; \frac{\xv^{\rm T}\Av\,\xv}{\xv^{\rm T}\xv}, \qquad \xv \ne \mathbf{0}.

The Rayleigh quotient ties the algebra of symmetric matrices to optimization: the eigenvalue λ1\lambda_1 is the maximum value of a quadratic form on the unit sphere, and the eigenvector q1\qv_1 is the maximizer. Power iteration is gradient ascent on RR in disguise.

Outside designed test problems, most ff in machine learning, control, and physics are not convex; deep network losses are aggressively non-convex with saddles in every direction. The practical consequence is that the convergence theorems for gradient descent come in two flavors: tight rates for convex (and especially strongly convex) functions, and weaker, “first-order stationary” guarantees in general. Convexity is the gold-standard regime; understanding it sets the benchmark against which non-convex behavior is measured.