Skip to content

Stochastic Gradient Descent

When the objective is an average over many examples,

f(x)  =  1ni=1nfi(x),f(\xv) \;=\; \frac{1}{n} \sum_{i=1}^n f_i(\xv),

the full gradient f=1nifi\nabla f = \tfrac{1}{n}\sum_i \nabla f_i costs an entire pass over the data per step. For modern nn (millions of training points, billions of parameters) that is the bottleneck. Stochastic gradient descent (SGD) estimates f\nabla f by a single random fi\nabla f_i, paying one example per step.

The stochastic gradient is unbiased, Eik[fik(x)]=f(x)\mathbb{E}_{i_k}[\nabla f_{i_k}(\xv)] = \nabla f(\xv), but it carries variance. Write

σ2(x)  =  Ei ⁣[fi(x)f(x)2].\sigma^2(\xv) \;=\; \mathbb{E}_i\!\left[\,\lVert \nabla f_i(\xv) - \nabla f(\xv) \rVert^2\,\right].

For a mini-batch of size bb drawn without replacement, the variance shrinks like σ2/b\sigma^2/b; doubling the batch size halves the noise but doubles the cost per step.

With a fixed step size η\eta, SGD does not converge to the minimum: it converges to a noise ball around it, whose size scales with η\eta and σ2\sigma^2.

Small η\eta shrinks the noise floor but slows the initial decrease; large η\eta races toward the minimum but bounces around it. This trade-off is fundamental to constant-step SGD.

To drive the iterate all the way to x\xv^\star, send ηk0\eta_k \to 0 at the right rate. The classical condition is from Robbins and Monro:

The canonical schedule is ηk=c/(k+k0)\eta_k = c/(k + k_0). Under strong convexity, this yields the sublinear rate

Exkx2  =  O(1/k),\mathbb{E}\,\lVert \xv_k - \xv^\star \rVert^2 \;=\; O(1/k),

slower than the linear (11/κ)k(1 - 1/\kappa)^k of full-batch GD, but each step is nn times cheaper, so the total compute to reach a fixed accuracy is often far less.

For non-convex losses (neural networks, in particular), the variance term in SGD is not just noise to be tolerated; it is often what makes the method work.

  • Escaping saddles. A saddle point has f=0\nabla f = \mathbf{0} but the Hessian is indefinite. Plain GD lingers there; the stochastic component of SGD acts as a kick that pushes the iterate off the unstable axis with high probability. The Brownian-motion analog of SGD spends only O(log)O(\log) time near saddles.
  • Implicit regularization. Among the many minima of an over-parameterized model, constant-η\eta SGD is biased toward the wide, flat ones, which generalize better than narrow, sharp minima. The same noise that prevents exact convergence selects which approximate minimum the iterate sits in.
  • Cheap iterations. A single fi\nabla f_i on a deep network costs one forward and one backward pass over a mini-batch, not the entire dataset. Modern training does millions of cheap steps where a deterministic method could not afford a single full-gradient pass.

The widget on the Gradient Descent page lets you toggle to SGD mode and dial the noise σ\sigma. On the bowl, watch the iterate orbit the minimum at a noise-floor radius proportional to ησ\eta\sigma. On the saddle, raise σ\sigma until SGD escapes; on the Rosenbrock, see how mild noise can either help or hurt depending on the step size.

In practice every SGD optimizer combines stochasticity with momentum: heavy-ball SGD, RMSprop, Adam, AdamW. The simplest such method is

vk+1  =  βvkηfik(xk),xk+1  =  xk+vk+1,\vv_{k+1} \;=\; \beta\,\vv_k - \eta\,\nabla f_{i_k}(\xv_k), \qquad \xv_{k+1} \;=\; \xv_k + \vv_{k+1},

matching the heavy-ball update on the previous page but with a stochastic gradient. The momentum term averages the noisy gradients over a window of effective length 1/(1β)1/(1 - \beta), which doubles as variance reduction and acceleration. This is the form of SGD that trains the vast majority of modern neural networks.