Skip to content

Gradient Descent

To minimize ff from a starting point x0\xv_0, take a small step in the direction of steepest descent and repeat:

xk+1  =  xkηf(xk).\xv_{k+1} \;=\; \xv_k - \eta\,\nabla f(\xv_k).

This is gradient descent. The single step is trivial; what makes the analysis non-trivial is choosing η\eta so the iterates actually move toward a minimum, and bounding how fast. Both pieces are controlled by the same scalar: the largest eigenvalue of the Hessian.

L-smoothness gives the descent lemma, a quadratic upper bound on ff:

f(y)    f(x)+f(x)T(yx)+L2yx2.f(\yv) \;\le\; f(\xv) + \nabla f(\xv)^{\rm T}(\yv - \xv) + \tfrac{L}{2}\lVert \yv - \xv \rVert^2.

Plugging in y=xηf(x)\yv = \xv - \eta\,\nabla f(\xv) and choosing η=1/L\eta = 1/L collapses the right side to

f(xk+1)    f(xk)12Lf(xk)2,f(\xv_{k+1}) \;\le\; f(\xv_k) - \tfrac{1}{2L}\lVert \nabla f(\xv_k) \rVert^2,

so every step strictly decreases ff by a positive amount unless the gradient is already zero. This is why GD with η1/L\eta \le 1/L never blows up.

The descent lemma alone gives a sublinear rate.

To get within ϵ\epsilon of the optimum needs O(L/ϵ)O(L/\epsilon) steps. That is slow, and the next assumption fixes it.

If ff is also μ\mu-strongly convex (Hessian μI\succeq \mu \Iv), the rate becomes geometric.

Reaching error ϵ\epsilon now takes only O(κlog(1/ϵ))O(\kappa \log(1/\epsilon)) steps. The condition number κ\kappa is the single quantity that decides whether GD is fast or excruciatingly slow: a well-conditioned bowl (κ=1\kappa = 1) converges in essentially one step, while a long, narrow ellipse with κ=1000\kappa = 1000 converges thousands of times slower.

When the loss has a long, narrow valley, plain GD zig-zags down the steep direction and crawls along the shallow one. Adding a velocity term, the heavy-ball method of Polyak, smooths the zig-zag.

vk+1  =  βvkηf(xk),xk+1  =  xk+vk+1.\vv_{k+1} \;=\; \beta\,\vv_k - \eta\,\nabla f(\xv_k), \qquad \xv_{k+1} \;=\; \xv_k + \vv_{k+1}.

The parameter β[0,1)\beta \in [0, 1) is the momentum coefficient. The mechanical analogy is exact: v\vv is the velocity of a particle with friction 1β1-\beta rolling under the gradient force.

With well-chosen η\eta and β\beta, heavy-ball converges at rate 11/κ1 - 1/\sqrt{\kappa} on quadratic problems, a square-root speedup over plain GD’s 11/κ1 - 1/\kappa. For κ=1000\kappa = 1000 that turns hundreds of slow steps into tens of fast ones.

Nesterov’s accelerated gradient method keeps the κ\sqrt{\kappa} rate but with a guarantee for all smooth convex functions, not just quadratics. The trick is to evaluate the gradient at a look-ahead point rather than the current iterate.

yk=xk+βk(xkxk1),xk+1=ykηf(yk).\begin{aligned} \yv_k &= \xv_k + \beta_k\,(\xv_k - \xv_{k-1}), \\ \xv_{k+1} &= \yv_k - \eta\,\nabla f(\yv_k). \end{aligned}

With a schedule of βk\beta_k tending to a constant near 11/κ1 - 1/\sqrt{\kappa}, Nesterov achieves the optimal rate

f(xk)f(x)    O ⁣((11/κ)k),f(\xv_k) - f(\xv^\star) \;\le\; O\!\left( (1 - 1/\sqrt{\kappa})^{k} \right),

provably matching the lower bound for first-order methods on smooth convex problems. Modern deep-learning optimizers (Adam, RMSprop, SGD-with-momentum) descend from these ideas.

Each preset below has a different conditioning. The Bowl is well-conditioned (κ=1\kappa = 1); plain GD nails it in one step. The Ellipse has κ=25\kappa = 25, the long axis is 5× narrower than the short one: GD zig-zags, momentum sweeps along the valley. The Saddle has zero gradient at the origin but is not a minimum: GD that starts on the unstable axis lingers there until the small perpendicular component grows. The Rosenbrock function combines a curved valley with a shallow tail; even momentum needs careful tuning. Click on the surface to set a new starting point.

low losshigh loss
step 0; current loss f = 3.620. The white ring marks the true minimum where one exists. Click anywhere on the surface to set a new starting point.

A few experiments worth running:

  • Bowl with η=0.05\eta = 0.05 vs. η=0.4\eta = 0.4: small η\eta takes many tiny steps, η\eta near 11 overshoots and converges fast, η>2/L\eta > 2/L diverges.
  • Ellipse: compare GD with default η\eta to Momentum with β=0.9\beta = 0.9. The trajectory transforms from zig-zag to a smooth glide.
  • Saddle: start exactly at y=0y = 0 and watch GD stall; click to add a small perpendicular kick, and it escapes.
  • SGD on the bowl with σ>0\sigma > 0: the iterate keeps wandering near the minimum at a noise floor of order ησ\eta\sigma. Decreasing η\eta shrinks the cloud.