For square-integrable random variables, conditional expectation has a clean geometric interpretation: E(X∣G) is the orthogonal projection of X onto the closed subspace L²(Ω,G,P) inside the Hilbert space L²(Ω,F,P). Two equivalent characterizations make this precise:
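Orthogonality. The residual X−E(X∣G) is orthogonal to every G-measurable square-integrable random variable Z, that is, E[(X−E(X∣G))Z]=0. Because the residual has mean zero, this is the same as being uncorrelated with every such Z.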
Minimal distance. Among all G-measurable square-integrable Z, the choice Z=E(X∣G) minimizes the mean-squared error E[(X−Z)²].
Throughout this page, X has E[X²]<∞, so X∈L²(Ω,F,P). The inner product on L² is ⟨X,Y⟩=E[XY], and the squared norm is ∥X∥₂²=E[X²]. For zero-mean random variables the inner product equals the covariance, so two such variables are uncorrelated iff their inner product is zero.
In the drawings below, I use dotted lines to denote something perpendicular to the plane and dashed lines to represent something within the plane.
L²(G), shorthand for L²(Ω,G,P), is the closed subspace of G-measurable square-integrable random variables. The projection E(X∣G) sits in this subspace; the residual X−E(X∣G) is perpendicular to it.
The reading: covariance is the inner product on the space of mean-zero L² random variables. The proposition says the residual is orthogonal to (that is, uncorrelated with) the whole subspace L²(Ω,G,P). This is exactly the defining property of an orthogonal projection in Hilbert space.
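As a concrete check of the orthogonality property, here is a minimal sketch on a finite probability space. Everything in it is an illustrative assumption rather than part of the text above: the six-point sample space, the partition generating G, the values of X, and the helper names `expect` and `cond_exp`. On such a space E(X∣G) is just the probability-weighted average of X over each block of the partition, and exact fractions avoid rounding noise.

```python
from fractions import Fraction

# Hypothetical finite probability space: six equally likely outcomes.
omega = range(6)
prob = [Fraction(1, 6)] * 6

# G is generated by this partition; G-measurable variables are constant on each block.
partition = [[0, 1], [2, 3, 4], [5]]

# An arbitrary random variable X, listed outcome by outcome.
X = [Fraction(v) for v in (3, 7, 1, 4, 13, 2)]


def expect(values):
    """E[V] for a random variable given as a list of values on omega."""
    return sum(p * v for p, v in zip(prob, values))


def cond_exp(values):
    """E(V | G): on each block, the probability-weighted average of V."""
    out = [None] * len(values)
    for block in partition:
        mass = sum(prob[w] for w in block)
        avg = sum(prob[w] * values[w] for w in block) / mass
        for w in block:
            out[w] = avg
    return out


proj = cond_exp(X)
residual = [x - m for x, m in zip(X, proj)]

# Block indicators span L2(G), so orthogonality to each of them is enough.
inner_products = []
for block in partition:
    indicator = [Fraction(int(w in block)) for w in omega]
    inner_products.append(expect([r * z for r, z in zip(residual, indicator)]))

print(proj)            # block averages of X
print(inner_products)  # each entry is E[(X - E(X|G)) * 1_block]
```

Checking the block indicators suffices because, in this finite setting, every G-measurable variable is a linear combination of them and the inner product is linear.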
Intuition. Any blue line (X−Z) with Z≠E(X∣G) is longer than the red perpendicular (X−E(X∣G)): the projection is the point of the subspace at minimal distance from X.
Any alternative Z∈L²(G) produces a longer line X−Z than the perpendicular X−E(X∣G). The Pythagorean identity ∥X−Z∥₂²=∥X−E(X∣G)∥₂²+∥E(X∣G)−Z∥₂² is the algebraic content of the picture.
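For completeness, the Pythagorean identity follows from orthogonality in one expansion. The sketch below writes X̂ for E(X∣G) as shorthand; the only ingredient is that X̂−Z lies in L²(G) whenever Z does, so the cross term is an inner product of the residual with an element of the subspace and therefore vanishes.

```latex
\begin{align*}
\|X - Z\|_2^2
  &= \|(X - \hat X) + (\hat X - Z)\|_2^2, \qquad \hat X := E(X \mid \mathcal G), \\
  &= \|X - \hat X\|_2^2
     + 2\,E\big[(X - \hat X)(\hat X - Z)\big]
     + \|\hat X - Z\|_2^2 \\
  &= \|X - \hat X\|_2^2 + \|\hat X - Z\|_2^2 .
\end{align*}
```

Since the last term is nonnegative and is zero only when Z equals X̂ almost surely, the minimal-distance property can be read off directly.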
The two characterizations (orthogonality, minimal distance) are equivalent statements of the Hilbert-space projection theorem specialized to the closed subspace L²(Ω,G,P)⊆L²(Ω,F,P). For any closed subspace M of a Hilbert space H and any X∈H:
A unique X̂∈M minimizes ∥X−Z∥ over Z∈M.
This X̂ is characterized by X−X̂⊥M.
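The minimizing property can also be checked by brute force on a small example. The following is a sketch under assumed data: a four-point sample space with uniform probabilities, a two-block partition generating the sub-σ-field, and a grid of competing candidates Z; the names `sq_dist` and `from_block_values` are hypothetical helpers. It confirms that every candidate satisfies the Pythagorean identity exactly and that every candidate other than the block-average projection is strictly farther from X.

```python
from fractions import Fraction
from itertools import product

# Hypothetical example: four equally likely outcomes,
# sub-sigma-field generated by the blocks {0, 1} and {2, 3}.
prob = [Fraction(1, 4)] * 4
blocks = [[0, 1], [2, 3]]
X = [Fraction(v) for v in (1, 5, 2, 8)]


def sq_dist(U, V):
    """Squared L2 distance E[(U - V)^2]."""
    return sum(p * (u - v) ** 2 for p, u, v in zip(prob, U, V))


def from_block_values(vals):
    """Measurable variable taking vals[i] on blocks[i]."""
    out = [None] * len(prob)
    for block, val in zip(blocks, vals):
        for w in block:
            out[w] = Fraction(val)
    return out


# Projection: block averages (weights are uniform within each block here).
X_hat = from_block_values([Fraction(1 + 5, 2), Fraction(2 + 8, 2)])
best = sq_dist(X, X_hat)

# Scan a grid of competing candidates Z that are constant on each block.
grid = [Fraction(k, 2) for k in range(0, 21)]  # 0, 1/2, ..., 10
pythagoras_ok = True
strictly_worse = True
for a, b in product(grid, repeat=2):
    Z = from_block_values([a, b])
    d = sq_dist(X, Z)
    pythagoras_ok &= (d == best + sq_dist(X_hat, Z))
    if Z != X_hat:
        strictly_worse &= (d > best)

print(best, pythagoras_ok, strictly_worse)
```

A grid search is of course not a proof; it only illustrates the theorem on one example.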
Conditional expectation realizes this projection concretely: X̂=E(X∣G). The construction we gave via Radon-Nikodym handles all integrable X (not just X∈L²), but on L² it coincides with the projection, and most intuition transfers from the geometric picture.
A few corollaries that fall out immediately:
Variance decomposition. Taking Z=E[X] (the trivial-σ-field projection) in the Pythagorean identity gives
Var(X)=E[Var(X∣G)]+Var(E(X∣G)),
the law of total variance: unconditional variance splits into the conditional-variance average plus the variance of the conditional mean.
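The decomposition can be verified exactly on a small example. This is a sketch with assumed data: a four-point space with non-uniform probabilities, a two-block partition generating G, and hypothetical helpers `expect`, `cond_exp` and `var`. The conditional variance is computed as E((X−E(X∣G))²∣G).

```python
from fractions import Fraction

# Hypothetical finite space with non-uniform probabilities; G is generated by `blocks`.
prob = [Fraction(1, 10), Fraction(2, 10), Fraction(3, 10), Fraction(4, 10)]
blocks = [[0, 1], [2, 3]]
X = [Fraction(v) for v in (2, 8, 1, 5)]


def expect(V):
    """E[V] for V given as a list of values on the sample space."""
    return sum(p * v for p, v in zip(prob, V))


def cond_exp(V):
    """E(V | G): probability-weighted block averages."""
    out = [None] * len(V)
    for block in blocks:
        mass = sum(prob[w] for w in block)
        avg = sum(prob[w] * V[w] for w in block) / mass
        for w in block:
            out[w] = avg
    return out


def var(V):
    """Var(V) = E[(V - E[V])^2]."""
    m = expect(V)
    return expect([(v - m) ** 2 for v in V])


m = cond_exp(X)                                              # E(X | G)
cond_var = cond_exp([(x - mi) ** 2 for x, mi in zip(X, m)])  # Var(X | G)

total = var(X)
within = expect(cond_var)   # E[Var(X | G)]
between = var(m)            # Var(E(X | G))
print(total, within, between, total == within + between)
```

Here `within` is the squared length of the residual and `between` is the squared distance from the projection to the constant E[X], matching the two legs of the right triangle.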
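Best linear prediction. Take G to be the σ-field generated by a finite collection {Y₁,…,Yₖ} of square-integrable random variables. Projecting X onto L²(G) gives E(X∣Y₁,…,Yₖ), the best predictor among all square-integrable functions of the Yᵢ. Projecting instead onto the smaller subspace of linear combinations of 1,Y₁,…,Yₖ gives the best linear predictor, the population counterpart of ordinary least-squares regression. Because the linear subspace sits inside L²(G), the conditional expectation never has larger mean-squared error than the best linear predictor.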
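To see the gap between the two projections, consider a toy model in which the conditional mean is not linear. The sketch below assumes a single predictor Y uniform on {0, 1, 2}, independent noise e uniform on {−1, +1}, and X = Y² + e; these choices and the names `expect`, `x_of`, `cond_mean` are illustrative only. The best linear predictor a + bY is obtained from the population normal equations, b = Cov(X,Y)/Var(Y) and a = E[X] − b·E[Y].

```python
from fractions import Fraction

# Hypothetical toy model: Y uniform on {0, 1, 2}, independent noise e uniform
# on {-1, +1}, and X = Y**2 + e. Each of the six (y, e) outcomes has probability 1/6.
outcomes = [(y, e) for y in (0, 1, 2) for e in (-1, 1)]
p = Fraction(1, len(outcomes))


def expect(f):
    """E[f(Y, e)] under the uniform distribution on `outcomes`."""
    return sum(p * f(y, e) for y, e in outcomes)


def x_of(y, e):
    """The response X as a function of the outcome."""
    return y * y + e


# Conditional expectation E(X | Y): average X over the noise for each value of Y.
cond_mean = {y: Fraction(sum(x_of(y, e) for e in (-1, 1)), 2) for y in (0, 1, 2)}

# Best linear predictor a + b*Y from the population normal equations.
EY = expect(lambda y, e: y)
EX = expect(x_of)
cov_xy = expect(lambda y, e: x_of(y, e) * y) - EX * EY
var_y = expect(lambda y, e: y * y) - EY ** 2
b = cov_xy / var_y
a = EX - b * EY

mse_cond = expect(lambda y, e: (x_of(y, e) - cond_mean[y]) ** 2)
mse_lin = expect(lambda y, e: (x_of(y, e) - (a + b * y)) ** 2)
print(a, b, mse_cond, mse_lin)
```

In this example the conditional mean equals Y², so the linear predictor pays an extra approximation error on top of the irreducible noise variance.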
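Idempotence. Projection is idempotent: applying E(⋅∣G) twice gives the same answer, E(E(X∣G)∣G)=E(X∣G). This is the special case G′=G of the tower property E(E(X∣G)∣G′)=E(X∣G′) for sub-σ-fields G′⊆G, which on L² says that projecting onto a subspace and then onto a smaller subspace inside it is the same as projecting onto the smaller subspace directly.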
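Both idempotence and the more general nested-projection statement are easy to check on a finite space. In this sketch the six-point space, the two nested partitions (`fine` generating G and `coarse` generating a smaller σ-field G′), and the helper `cond_exp` are all assumptions made for illustration.

```python
from fractions import Fraction

# Hypothetical finite space: six equally likely outcomes.
prob = [Fraction(1, 6)] * 6
fine = [[0, 1], [2, 3], [4, 5]]     # generates G
coarse = [[0, 1, 2, 3], [4, 5]]     # generates G', a sub-sigma-field of G
X = [Fraction(v) for v in (1, 3, 6, 10, 2, 4)]


def cond_exp(V, partition):
    """Conditional expectation given the sigma-field generated by `partition`."""
    out = [None] * len(V)
    for block in partition:
        mass = sum(prob[w] for w in block)
        avg = sum(prob[w] * V[w] for w in block) / mass
        for w in block:
            out[w] = avg
    return out


once = cond_exp(X, fine)          # E(X | G)
twice = cond_exp(once, fine)      # E(E(X | G) | G)
tower = cond_exp(once, coarse)    # E(E(X | G) | G')
direct = cond_exp(X, coarse)      # E(X | G')
print(once, twice == once, tower == direct)
```

The comparison `tower == direct` relies on every block of `coarse` being a union of blocks of `fine`; without that nesting the two projections need not agree.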