Inequalities

We can use integrability to bound the probability of “tail events” (values far from the mean). These inequalities form the basis for proving laws of large numbers.

A non-negative function $g$ dominating the scaled indicator step $L \cdot \mathbb{1}_B$ on the region $B$

The whole argument in one picture: the step $L \cdot \mathbb{1}_B$ never rises above $g(x)$, touching it only at the boundary of $B$. Integrating this pointwise domination against the law of $X$ turns area into probability and gives $\mathbb{P}(X \in B) \le \mathbb{E}[g(X)] / L$. Markov’s inequality is the special case of a non-negative $X$ with $g(x) = x$ and $B = \{x \ge a\}$ for some $a > 0$, where $L = a$.
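Written out as a chain, the integration step is short. This sketch assumes what the picture shows: $g \ge 0$ everywhere, $g \ge L$ on $B$, and $L > 0$.

$$
L \cdot \mathbb{1}_B(X) \le g(X)
\quad\Longrightarrow\quad
L \cdot \mathbb{P}(X \in B) = \mathbb{E}\big[L \cdot \mathbb{1}_B(X)\big] \le \mathbb{E}[g(X)].
$$

Dividing by $L$ gives the bound. The only property of expectation used is monotonicity.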

Alternatively, for the raw second moment:

$$\mathbb{P}(|X| \ge a) \le \frac{\mathbb{E}[X^2]}{a^2}$$

A parabola centered at the mean dominating a two-sided indicator step outside the band μ±a

Chebyshev is the general bound with $g(x) = (x - \mu)^2$, $B = \{|x - \mu| \ge a\}$, and $L = a^2$. The parabola $(x-\mu)^2/a^2$ sits above the indicator of the tail event, meeting it at $\mu \pm a$ (and at $\mu$ itself, where both vanish). Integrating turns the squared deviation into variance: $\mathbb{P}(|X - \mu| \ge a) \le \mathrm{Var}(X)/a^2$.
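As a sanity check, both bounds can be evaluated exactly on a small finite distribution. The sketch below is illustrative only: the fair six-sided die, the thresholds, and the helper names `expectation` and `prob` are assumptions chosen for the example, not part of the text above. Exact rational arithmetic avoids any rounding questions.

```python
from fractions import Fraction

# Hypothetical example distribution: a fair six-sided die, as {value: probability}.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def expectation(f, pmf):
    """E[f(X)] for a finite distribution."""
    return sum(p * f(x) for x, p in pmf.items())

def prob(event, pmf):
    """P(X in B), where B is given as a predicate on values."""
    return sum(p for x, p in pmf.items() if event(x))

mean = expectation(lambda x: x, pmf)
var = expectation(lambda x: (x - mean) ** 2, pmf)

# Markov: g(x) = x, B = {x >= a}, L = a (X is non-negative here).
a_markov = 5
markov_tail = prob(lambda x: x >= a_markov, pmf)
markov_bound = mean / a_markov

# Chebyshev: g(x) = (x - mean)^2, B = {|x - mean| >= a}, L = a^2.
a_cheb = 2
cheb_tail = prob(lambda x: abs(x - mean) >= a_cheb, pmf)
cheb_bound = var / a_cheb**2

print("Markov:   ", markov_tail, "<=", markov_bound)
print("Chebyshev:", cheb_tail, "<=", cheb_bound)
```

In this example both bounds hold with visible slack, which is typical: the inequalities use only one moment of the distribution, so they cannot be sharp for every law at once.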

The previous inequalities bound tail probabilities. The next one is of a different kind: it relates the expectation of a convex transformation to the transformation of the expectation. It is the workhorse behind moment comparisons and the contraction properties of averaging operators.

A convex curve sagging below its chord, with the Jensen gap between φ(E[X]) on the curve and E[φ(X)] on the chord

For a two-point variable taking $x_1$ and $x_2$ with equal weight, $\mathbb{E}[X]$ is the midpoint and $\mathbb{E}[\varphi(X)]$ is the chord’s height there. Convexity keeps the chord above the curve, so $\varphi(\mathbb{E}[X])$ (on the curve) sits below $\mathbb{E}[\varphi(X)]$ (on the chord). The vertical gap is the slack in the inequality, and it closes only when $\varphi$ is affine on the interval between $x_1$ and $x_2$ (or when $x_1 = x_2$).
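The two-point picture is easy to reproduce numerically. In this minimal sketch, the helper `jensen_gap` and the three sample functions are hypothetical names chosen for illustration. The helper returns the chord height at the midpoint minus the curve height there, that is, $\mathbb{E}[\varphi(X)] - \varphi(\mathbb{E}[X])$ for the equal-weight two-point variable.

```python
import math

def jensen_gap(phi, x1, x2):
    """E[phi(X)] - phi(E[X]) for X taking x1 and x2 with probability 1/2 each."""
    midpoint = (x1 + x2) / 2               # E[X]
    chord_height = (phi(x1) + phi(x2)) / 2  # E[phi(X)]
    return chord_height - phi(midpoint)

# Strictly convex examples: the gap is positive.
gap_square = jensen_gap(lambda x: x * x, 0.0, 2.0)
gap_exp = jensen_gap(math.exp, 0.0, 2.0)

# Affine example: chord and curve coincide, so the gap is zero.
gap_affine = jensen_gap(lambda x: 3 * x + 1, 0.0, 2.0)

print(gap_square, gap_exp, gap_affine)
```

For $\varphi(x) = x^2$ the gap equals the variance of $X$, which is why it vanishes only when the two points coincide.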

A single supporting line suffices here because the mean $\mathbb{E}[X]$ is a fixed number, so we only need the line tangent at that one point. When the deterministic mean is replaced by the random variable $\mathbb{E}(X \mid \mathcal{G})$, no single line works for all outcomes at once, and the argument has to invoke a whole countable family of supporting lines simultaneously. That refinement is carried out in the conditional version of Jensen’s inequality.
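For reference, the single-line argument can be sketched as follows. It assumes that $X$ is integrable, writes $m = \mathbb{E}[X]$, and lets $c$ denote the slope of a supporting line of the convex function $\varphi$ at $m$ (the symbols $m$ and $c$ are introduced here for illustration).

$$
\varphi(x) \ge \varphi(m) + c\,(x - m) \quad \text{for all } x
\quad\Longrightarrow\quad
\mathbb{E}[\varphi(X)] \ge \varphi(m) + c\,\big(\mathbb{E}[X] - m\big) = \varphi(\mathbb{E}[X]).
$$

The slope $c$ drops out because it multiplies $\mathbb{E}[X] - m = 0$. In the conditional setting the point of tangency itself depends on the outcome, which is why one fixed line no longer suffices.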