Gradient Descent
To minimize from a starting point , take a small step in the direction of steepest descent and repeat:
This is gradient descent. The single step is trivial; what makes the analysis non-trivial is choosing so the iterates actually move toward a minimum, and bounding how fast. Both pieces are controlled by the same scalar: the largest eigenvalue of the Hessian.
L-smoothness and the descent lemma
Section titled “L-smoothness and the descent lemma”L-smoothness gives the descent lemma, a quadratic upper bound on :
Plugging in and choosing collapses the right side to
so every step strictly decreases by a positive amount unless the gradient is already zero. This is why GD with never blows up.
Convergence for convex L-smooth functions
Section titled “Convergence for convex L-smooth functions”The descent lemma alone gives a sublinear rate.
To get within of the optimum needs steps. That is slow, and the next assumption fixes it.
Linear rate under strong convexity
Section titled “Linear rate under strong convexity”If is also -strongly convex (Hessian ), the rate becomes geometric.
Reaching error now takes only steps. The condition number is the single quantity that decides whether GD is fast or excruciatingly slow: a well-conditioned bowl () converges in essentially one step, while a long, narrow ellipse with converges thousands of times slower.
Heavy-ball momentum
Section titled “Heavy-ball momentum”When the loss has a long, narrow valley, plain GD zig-zags down the steep direction and crawls along the shallow one. Adding a velocity term, the heavy-ball method of Polyak, smooths the zig-zag.
The parameter is the momentum coefficient. The mechanical analogy is exact: is the velocity of a particle with friction rolling under the gradient force.
With well-chosen and , heavy-ball converges at rate on quadratic problems, a square-root speedup over plain GD’s . For that turns hundreds of slow steps into tens of fast ones.
Nesterov acceleration
Section titled “Nesterov acceleration”Nesterov’s accelerated gradient method keeps the rate but with a guarantee for all smooth convex functions, not just quadratics. The trick is to evaluate the gradient at a look-ahead point rather than the current iterate.
With a schedule of tending to a constant near , Nesterov achieves the optimal rate
provably matching the lower bound for first-order methods on smooth convex problems. Modern deep-learning optimizers (Adam, RMSprop, SGD-with-momentum) descend from these ideas.
Try it
Section titled “Try it”Each preset below has a different conditioning. The Bowl is well-conditioned (); plain GD nails it in one step. The Ellipse has , the long axis is 5× narrower than the short one: GD zig-zags, momentum sweeps along the valley. The Saddle has zero gradient at the origin but is not a minimum: GD that starts on the unstable axis lingers there until the small perpendicular component grows. The Rosenbrock function combines a curved valley with a shallow tail; even momentum needs careful tuning. Click on the surface to set a new starting point.
A few experiments worth running:
- Bowl with vs. : small takes many tiny steps, near overshoots and converges fast, diverges.
- Ellipse: compare GD with default to Momentum with . The trajectory transforms from zig-zag to a smooth glide.
- Saddle: start exactly at and watch GD stall; click to add a small perpendicular kick, and it escapes.
- SGD on the bowl with : the iterate keeps wandering near the minimum at a noise floor of order . Decreasing shrinks the cloud.