Stochastic Gradient Descent
When the objective is an average over many examples,
the full gradient costs an entire pass over the data per step. For modern (millions of training points, billions of parameters) that is the bottleneck. Stochastic gradient descent (SGD) estimates by a single random , paying one example per step.
The SGD step
Section titled “The SGD step”The stochastic gradient is unbiased, , but it carries variance. Write
For a mini-batch of size drawn without replacement, the variance shrinks like ; doubling the batch size halves the noise but doubles the cost per step.
What constant-η SGD really does
Section titled “What constant-η SGD really does”With a fixed step size , SGD does not converge to the minimum: it converges to a noise ball around it, whose size scales with and .
Small shrinks the noise floor but slows the initial decrease; large races toward the minimum but bounces around it. This trade-off is fundamental to constant-step SGD.
To reach the minimum: decaying step sizes
Section titled “To reach the minimum: decaying step sizes”To drive the iterate all the way to , send at the right rate. The classical condition is from Robbins and Monro:
The canonical schedule is . Under strong convexity, this yields the sublinear rate
slower than the linear of full-batch GD, but each step is times cheaper, so the total compute to reach a fixed accuracy is often far less.
Why SGD wins for deep learning
Section titled “Why SGD wins for deep learning”For non-convex losses (neural networks, in particular), the variance term in SGD is not just noise to be tolerated; it is often what makes the method work.
- Escaping saddles. A saddle point has but the Hessian is indefinite. Plain GD lingers there; the stochastic component of SGD acts as a kick that pushes the iterate off the unstable axis with high probability. The Brownian-motion analog of SGD spends only time near saddles.
- Implicit regularization. Among the many minima of an over-parameterized model, constant- SGD is biased toward the wide, flat ones, which generalize better than narrow, sharp minima. The same noise that prevents exact convergence selects which approximate minimum the iterate sits in.
- Cheap iterations. A single on a deep network costs one forward and one backward pass over a mini-batch, not the entire dataset. Modern training does millions of cheap steps where a deterministic method could not afford a single full-gradient pass.
The widget on the Gradient Descent page lets you toggle to SGD mode and dial the noise . On the bowl, watch the iterate orbit the minimum at a noise-floor radius proportional to . On the saddle, raise until SGD escapes; on the Rosenbrock, see how mild noise can either help or hurt depending on the step size.
SGD with momentum
Section titled “SGD with momentum”In practice every SGD optimizer combines stochasticity with momentum: heavy-ball SGD, RMSprop, Adam, AdamW. The simplest such method is
matching the heavy-ball update on the previous page but with a stochastic gradient. The momentum term averages the noisy gradients over a window of effective length , which doubles as variance reduction and acceleration. This is the form of SGD that trains the vast majority of modern neural networks.