Convexity and Saddle Points

Optimization in this section means minimizing a smooth function $f : \R^n \to \R$. Two pieces of geometric structure largely determine how hard this is: whether the function is convex (every local minimum is global), and whether its critical points include saddles (zero gradient, but neither a local minimum nor a local maximum). Both can be read off from the spectrum of the Hessian, which is why so much of optimization rests on the linear algebra of symmetric matrices.

Convex sets

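A set $C \subseteq \R^n$ is convex if the line segment between any two of its points stays in $C$: $(1-t)\xv + t\yv \in C$ for all $\xv, \yv \in C$ and $t \in [0, 1]$. Half-spaces, balls, ellipsoids, intersections of convex sets, and the positive semidefinite cone are all convex. A discrete set with more than one point and the union of two disjoint balls are not.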
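The segment test behind these examples can be sketched numerically. The snippet below is a minimal illustration, not a proof: it samples points on one segment, so it can refute convexity (as for the union of two balls) but only suggest it (as for a single ball). The helper names `in_ball`, `in_two_balls`, and `segment_stays_inside`, and the particular centers and radii, are assumptions made for this example.

```python
import numpy as np

def in_ball(p, center, radius):
    """Membership test for the closed Euclidean ball."""
    return np.linalg.norm(np.asarray(p) - np.asarray(center)) <= radius

def in_two_balls(p):
    """Union of two disjoint unit balls centered at (-2, 0) and (2, 0)."""
    return in_ball(p, (-2.0, 0.0), 1.0) or in_ball(p, (2.0, 0.0), 1.0)

def segment_stays_inside(member, a, b, num=101):
    """Check that sampled points (1 - t) a + t b, t in [0, 1], all satisfy `member`."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return all(member((1 - t) * a + t * b) for t in np.linspace(0.0, 1.0, num))

def unit_ball(p):
    return in_ball(p, (0.0, 0.0), 1.0)

# Both endpoints lie in the set in each case; only the ball contains the whole segment.
ball_ok = segment_stays_inside(unit_ball, (-0.9, 0.0), (0.0, 0.9))
union_ok = segment_stays_inside(in_two_balls, (-2.0, 0.0), (2.0, 0.0))
print(ball_ok, union_ok)
```

The second check fails because the midpoint of the two centers, the origin, belongs to neither ball.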

Convex functions

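A function $f$ on a convex domain is convex if $f((1-t)\xv + t\yv) \le (1-t)\,f(\xv) + t\,f(\yv)$ for all $\xv, \yv$ in the domain and $t \in [0, 1]$. This condition is exactly Jensen’s inequality restricted to two-point measures. For smooth $f$ it can be replaced by either of two pointwise tests.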


Let $f : \R^n \to \R$ be differentiable on an open convex domain. The following are equivalent:

(1) $f$ is convex.

(2) First-order condition. For every $\xv, \yv$,

$$f(\yv) \;\ge\; f(\xv) + \nabla f(\xv)^{\rm T}(\yv - \xv).$$

(3) Second-order condition (when $f$ is twice differentiable). The Hessian $\nabla^2 f(\xv)$ is positive semidefinite at every $\xv$.

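The strict and strong versions sharpen these. Strict convexity corresponds to strict inequality in (2) whenever $\yv \ne \xv$; a positive-definite Hessian in (3) is sufficient for it but not necessary (consider $f(x) = x^4$, whose second derivative vanishes at $x = 0$). $\mu$-strong convexity adds the term $\tfrac{\mu}{2}\lVert \yv - \xv\rVert^2$ to the right side of (2) and, for twice-differentiable $f$, is equivalent to $\nabla^2 f(\xv) \succeq \mu \Iv$ at every $\xv$.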
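As a sanity check, the first-order and second-order tests can be evaluated numerically on a quadratic. The sketch below assumes the test function $f(\xv) = \tfrac12 \xv^{\rm T} Q \xv + b^{\rm T}\xv$ with a randomly generated positive-definite $Q$; the names `Q`, `b`, `f`, and `grad_f` are illustrative choices, not part of the text above.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
Q = M.T @ M + 0.5 * np.eye(4)   # symmetric positive definite by construction
b = rng.standard_normal(4)

def f(x):
    return 0.5 * x @ Q @ x + b @ x

def grad_f(x):
    return Q @ x + b

# Second-order condition: the Hessian of f is the constant matrix Q.
hessian_eigs = np.linalg.eigvalsh(Q)
mu = hessian_eigs.min()         # strong-convexity constant of this quadratic

# First-order condition at random pairs, with and without the strong-convexity term.
gaps, strong_gaps = [], []
for _ in range(1000):
    x, y = rng.standard_normal(4), rng.standard_normal(4)
    tangent = f(x) + grad_f(x) @ (y - x)
    gaps.append(f(y) - tangent)
    strong_gaps.append(f(y) - tangent - 0.5 * mu * np.sum((y - x) ** 2))

min_gap = min(gaps)
min_strong_gap = min(strong_gaps)
print(mu, min_gap, min_strong_gap)
```

For a quadratic the gap $f(\yv) - f(\xv) - \nabla f(\xv)^{\rm T}(\yv - \xv)$ equals $\tfrac12 (\yv - \xv)^{\rm T} Q (\yv - \xv)$, so both minima should be nonnegative up to rounding error.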

The first-order condition is the geometric content of convexity: the graph lies above its tangent plane.

Figure: a convex parabola with a tangent line lying entirely below it, touching at one point.

The tangent at any point of a convex graph is a global lower bound; for a strictly convex function such as the parabola, equality holds only at the point of tangency. This is the inequality $f(\yv) \ge f(\xv) + \nabla f(\xv)^{\rm T}(\yv - \xv)$ in one variable.

Local equals global

The most useful consequence of convexity is that every local minimum is a global minimum.


Let $f$ be convex and let $\xv^\star$ be a local minimum, that is, $f(\xv^\star) \le f(\xv)$ for every $\xv$ in some neighborhood of $\xv^\star$. Then $f(\xv^\star) \le f(\xv)$ for every $\xv$ in the domain.

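For a differentiable convex $f$, then, the equation $\nabla f(\xv^\star) = \mathbf{0}$ is sufficient for $\xv^\star$ to be a global minimum, not merely necessary: substituting a zero gradient into the first-order condition gives $f(\yv) \ge f(\xv^\star)$ for every $\yv$.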
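A small experiment illustrates the point. The sketch below assumes a two-dimensional strictly convex quadratic and a plain fixed-step gradient descent (the matrix `Q`, vector `b`, starting points, and step size are all illustrative choices). Every run ends at the single point where the gradient vanishes.

```python
import numpy as np

Q = np.array([[3.0, 1.0], [1.0, 2.0]])   # symmetric positive definite
b = np.array([1.0, -1.0])

def grad_f(x):
    """Gradient of f(x) = 0.5 x^T Q x + b^T x."""
    return Q @ x + b

def gradient_descent(x0, step, iters=500):
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = x - step * grad_f(x)
    return x

x_star = np.linalg.solve(Q, -b)            # the unique solution of grad f = 0
step = 1.0 / np.linalg.eigvalsh(Q).max()   # step 1/L, with L the largest Hessian eigenvalue
starts = [(5.0, 5.0), (-4.0, 3.0), (0.0, -7.0)]
finals = [gradient_descent(x0, step) for x0 in starts]
print(x_star, finals)
```

Because the function is convex, the critical point found by solving the linear system is the global minimizer, and the iterates from all three starting points agree with it.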

Saddle points

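Without convexity, a zero gradient says only that $\xv$ is a critical point. The eigenvalues of the Hessian $\nabla^2 f(\xv)$ distinguish three possibilities:

Positive definite (all eigenvalues positive): $\xv$ is a strict local minimum.
Negative definite (all eigenvalues negative): $\xv$ is a strict local maximum.
Indefinite (eigenvalues of both signs): $\xv$ is a saddle point.

If the Hessian is semidefinite with a zero eigenvalue, the second-order test is inconclusive.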

At a saddle, the function decreases along the eigenvectors with negative eigenvalues and increases along those with positive eigenvalues. Gradient descent slows down near saddles because the gradient is small there, and it converges to a saddle only from a thin set of initializations (for a quadratic, those with no component along the negative-curvature directions); small noise or a generic initialization is what allows iterative methods to escape. In high-dimensional non-convex problems such as deep networks, saddles are widely believed to be the dominant kind of critical point, far outnumbering minima.
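The eigenvalue test and the escape behavior can both be sketched on the toy function $f(x, y) = x^2 - y^2$, whose only critical point is a saddle at the origin. The function name `classify_critical_point`, the tolerance, and the step size are assumptions for this example; the toy function is unbounded below, so it illustrates only the direction of escape, not convergence to a minimum.

```python
import numpy as np

def classify_critical_point(hessian, tol=1e-10):
    """Classify a critical point from the eigenvalues of its (symmetric) Hessian."""
    eigs = np.linalg.eigvalsh(np.asarray(hessian, dtype=float))
    if np.all(eigs > tol):
        return "local minimum"
    if np.all(eigs < -tol):
        return "local maximum"
    if eigs.min() < -tol and eigs.max() > tol:
        return "saddle"
    return "inconclusive"   # semidefinite with a (near-)zero eigenvalue

def grad(p):
    """Gradient of f(x, y) = x**2 - y**2."""
    return np.array([2.0 * p[0], -2.0 * p[1]])

def descend(p0, step=0.1, iters=100):
    p = np.asarray(p0, dtype=float)
    for _ in range(iters):
        p = p - step * grad(p)
    return p

kind = classify_critical_point([[2.0, 0.0], [0.0, -2.0]])
on_axis = descend((1.0, 0.0))      # no y-component: iterates shrink toward the saddle
perturbed = descend((1.0, 1e-8))   # tiny y-component is multiplied by 1.2 each step
print(kind, on_axis, perturbed)
```

Started exactly on the $x$-axis, the iterates converge to the saddle; a perturbation of $10^{-8}$ in $y$ is enough for the negative-curvature direction to take over within a hundred steps.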

The min-max principle

The eigenvalues of a symmetric matrix have a clean variational characterization that links optimization back to spectral theory. Recall the Rayleigh quotient

$$R(\xv) \;=\; \frac{\xv^{\rm T}\Av\,\xv}{\xv^{\rm T}\xv}, \qquad \xv \ne \mathbf{0}.$$


Let $\Av \in \R^{n \times n}$ be symmetric with eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$. Then

$$\lambda_k \;=\; \max_{\substack{V \subseteq \R^n \\ \dim V = k}} \;\min_{\substack{\xv \in V \\ \xv \ne 0}} \; R(\xv) \;=\; \min_{\substack{V \subseteq \R^n \\ \dim V = n - k + 1}} \;\max_{\substack{\xv \in V \\ \xv \ne 0}} \; R(\xv).$$

In particular $\lambda_1 = \max_\xv R(\xv)$ and $\lambda_n = \min_\xv R(\xv)$.

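The Rayleigh quotient ties the algebra of symmetric matrices to optimization: the eigenvalue $\lambda_1$ is the maximum value of the quadratic form $\xv^{\rm T}\Av\,\xv$ on the unit sphere, and a corresponding unit eigenvector $\qv_1$ is a maximizer. Power iteration can be read as gradient ascent in disguise: the gradient of $\tfrac12 \xv^{\rm T}\Av\,\xv$ is $\Av\xv$, and each iteration replaces $\xv$ by that gradient rescaled back onto the sphere.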
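The following sketch compares power iteration with a direct eigenvalue computation and checks that sampled Rayleigh quotients stay between $\lambda_n$ and $\lambda_1$. The example matrix is an assumption chosen so that its eigenvalues are $3 - \sqrt{3}$, $3$, and $3 + \sqrt{3}$; it is positive definite, which matters because power iteration in general converges to the eigenvalue of largest magnitude, and that coincides with $\lambda_1$ only when $\lambda_1 \ge \lvert\lambda_n\rvert$.

```python
import numpy as np

def rayleigh_quotient(A, x):
    return (x @ A @ x) / (x @ x)

def power_iteration(A, iters=1000, seed=0):
    """Repeatedly apply A and renormalize; return the Rayleigh quotient and the vector."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(A.shape[0])
    x /= np.linalg.norm(x)
    for _ in range(iters):
        y = A @ x
        x = y / np.linalg.norm(y)
    return rayleigh_quotient(A, x), x

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])     # symmetric positive definite

lam1, q1 = power_iteration(A)
eigs = np.linalg.eigvalsh(A)        # eigenvalues in ascending order

# Rayleigh quotients of random vectors lie in [lambda_n, lambda_1].
rng = np.random.default_rng(1)
samples = [rayleigh_quotient(A, rng.standard_normal(3)) for _ in range(1000)]
print(lam1, eigs[-1], min(samples), max(samples))
```

The returned `q1` is determined only up to sign, so it is best checked through the eigenvector equation $\Av\qv_1 = \lambda_1 \qv_1$ rather than by comparing entries.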

Why convexity is rare and structure helps

Outside designed test problems, most objective functions $f$ in machine learning, control, and physics are not convex; deep network losses in particular are highly non-convex, with many saddle points. The practical consequence is that the convergence theorems for gradient descent come in two flavors: tight rates for convex (and especially strongly convex) functions, and weaker “first-order stationary” guarantees in general, which promise only a point with small gradient. Convexity is the gold-standard regime; understanding it sets the benchmark against which non-convex behavior is measured.