Convexity and Saddle Points
Optimization in this section means minimizing a smooth function . Two pieces of geometric structure decide the difficulty completely: whether the function is convex (every local minimum is global), and whether its critical points include saddles (zero gradient but no minimum). Both are characterized by the spectrum of the Hessian, which is why so much of optimization rests on the linear algebra of symmetric matrices.
Convex sets
Section titled “Convex sets”Half-spaces, balls, ellipsoids, intersections of convex sets, and the positive semidefinite cone are all convex. Discrete sets and the union of two disjoint balls are not.
Convex functions
Section titled “Convex functions”The convexity condition is exactly Jensen’s inequality restricted to two-point measures. For smooth it can be replaced by either of two pointwise tests.
The first-order condition is the geometric content of convexity: the graph lies above its tangent plane.
The tangent at any point of a convex graph is a global lower bound; equality holds only at the point of tangency. This is the inequality in one variable.
Local equals global
Section titled “Local equals global”The single best consequence of convexity.
For a differentiable convex , then, the equation is sufficient for to be a global minimum, not merely necessary.
Saddle points
Section titled “Saddle points”Without convexity, a zero gradient says only that is a critical point. The Hessian distinguishes the three possibilities.
At a saddle, the function decreases along the eigenvectors with negative eigenvalues and increases along those with positive eigenvalues. Gradient descent is therefore attracted to saddles (the gradient vanishes), and small noise or careful initialization is what allows iterative methods to escape them. In high-dimensional non-convex problems (deep networks) saddles are the dominant kind of critical point, far outnumbering minima.
The min-max principle
Section titled “The min-max principle”The eigenvalues of a symmetric matrix have a clean variational characterization that links optimization back to spectral theory. Recall the Rayleigh quotient
The Rayleigh quotient ties the algebra of symmetric matrices to optimization: the eigenvalue is the maximum value of a quadratic form on the unit sphere, and the eigenvector is the maximizer. Power iteration is gradient ascent on in disguise.
Why convexity is rare and structure helps
Section titled “Why convexity is rare and structure helps”Outside designed test problems, most in machine learning, control, and physics are not convex; deep network losses are aggressively non-convex with saddles in every direction. The practical consequence is that the convergence theorems for gradient descent come in two flavors: tight rates for convex (and especially strongly convex) functions, and weaker, “first-order stationary” guarantees in general. Convexity is the gold-standard regime; understanding it sets the benchmark against which non-convex behavior is measured.