Introduction

In a basic probability course, conditional expectation is introduced through two formulas, one for each of the two settings that admit a clean elementary treatment.

For jointly discrete random variables $X, Y$ on a common probability space,

$$\E(Y \mid X = x) = \sum_y y \, \Pr(Y = y \mid X = x), \qquad \Pr(Y = y \mid X = x) = \frac{\Pr(Y = y, \, X = x)}{\Pr(X = x)},$$

valid whenever $\Pr(X = x) > 0$.
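The discrete formula can be evaluated mechanically from a joint pmf table. The following is a minimal sketch: the table `joint_pmf` and the helper `conditional_expectation` are hypothetical names introduced for illustration, and exact fractions are used so the arithmetic can be checked by hand.

```python
from fractions import Fraction

# Hypothetical joint pmf Pr(X = x, Y = y), chosen only for illustration.
joint_pmf = {
    (0, 1): Fraction(1, 8),
    (0, 2): Fraction(3, 8),
    (1, 1): Fraction(1, 4),
    (1, 3): Fraction(1, 4),
}

def conditional_expectation(joint, x):
    """E(Y | X = x) for a joint pmf given as {(x, y): probability}."""
    # Marginal Pr(X = x): sum the joint pmf over all y.
    p_x = sum(p for (xi, _), p in joint.items() if xi == x)
    if p_x == 0:
        raise ValueError("E(Y | X = x) is undefined when Pr(X = x) = 0")
    # sum_y y * Pr(Y = y, X = x) / Pr(X = x)
    return sum(y * p for (xi, y), p in joint.items() if xi == x) / p_x

e0 = conditional_expectation(joint_pmf, 0)  # (1/8 + 6/8) / (1/2) = 7/4
e1 = conditional_expectation(joint_pmf, 1)  # (1/4 + 3/4) / (1/2) = 2
```

The explicit error for `p_x == 0` mirrors the restriction that the formula is valid only where the conditioning event has positive probability.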

For jointly absolutely continuous $X, Y$ with joint density $f_{X, Y}$ and marginal density $f_X$,

$$\E(Y \mid X = x) = \int y \, f_{Y \mid X}(y \mid x) \, dy, \qquad f_{Y \mid X}(y \mid x) = \frac{f_{X, Y}(x, y)}{f_X(x)},$$

valid whenever $f_X(x) > 0$.
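As a worked instance of the continuous formula, take the density $f_{X, Y}(x, y) = x + y$ on the unit square. This density is a hypothetical choice made only to keep the integrals elementary; it does not come from the text above.

```latex
% Hypothetical density, chosen only for illustration:
% f_{X,Y}(x, y) = x + y on the unit square [0,1]^2, and 0 elsewhere.
\begin{align*}
f_X(x) &= \int_0^1 (x + y) \, dy = x + \tfrac{1}{2}, \qquad 0 \le x \le 1, \\
f_{Y \mid X}(y \mid x) &= \frac{x + y}{x + \tfrac{1}{2}}, \qquad 0 \le y \le 1, \\
\E(Y \mid X = x) &= \int_0^1 y \, \frac{x + y}{x + \tfrac{1}{2}} \, dy
  = \frac{\tfrac{x}{2} + \tfrac{1}{3}}{x + \tfrac{1}{2}}
  = \frac{3x + 2}{6x + 3}.
\end{align*}
```

At $x = 0$ the conditional density is $2y$ and the formula gives $2/3$, which matches $\int_0^1 2y^2 \, dy$.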

These are the only two settings the elementary theory handles. They do not cover mixed pairs (e.g. $X$ discrete, $Y$ continuous), pairs with a singular component (mass on a lower-dimensional set like the diagonal $\{X = Y\}$), or conditioning on more general information than the value of a single random variable.

How should conditional expectation be defined when neither of the elementary formulas applies?

Idea 1: build the conditional distribution first. Specify the law of $Y$ given $X = x$ as a probability measure on $\R$, then define $\E(Y \mid X = x)$ as the expectation under that measure. This is the route the elementary formulas suggest. Carrying it out rigorously in general is surprisingly hard: a regular conditional distribution exists under broad hypotheses (Borel-space targets), but constructing it requires substantial machinery and the resulting object is harder to manipulate than the expectation itself.

Idea 2: define conditional expectation directly. Skip the conditional distribution entirely and specify what $\E(Y \mid X)$ should be as a random variable, characterized by an integration identity. This is the approach we take. It is cleaner, more general (it handles any sub-$\sigma$-field, not just one generated by a random variable), and recovers the elementary formulas as special cases.

From a function of $x$ to a random variable

The shift in viewpoint is the following. The elementary quantity $\E(Y \mid X = x)$ is a number for each fixed $x$, so it defines a function $g(x) = \E(Y \mid X = x)$. Composing with $X$ gives a random variable:

$$\E(Y \mid X)(\omega) \;:=\; g(X(\omega)) \;=\; \E\!\big( Y \mid X = X(\omega) \big).$$

The value of $\E(Y \mid X)$ at $\omega$ depends on $\omega$ only through $X(\omega)$. Equivalently, $\E(Y \mid X)$ is a function of $X$, which is to say it is $\sigma(X)$-measurable.
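The composition $\omega \mapsto g(X(\omega))$ can be made concrete on a finite sample space. The sketch below assumes a hypothetical model (six equally likely outcomes, $X$ the parity of $\omega$, $Y$ the outcome itself); none of these choices come from the text.

```python
from fractions import Fraction

# Hypothetical finite model: Omega = {0, ..., 5} with the uniform measure.
omega_space = range(6)
prob = {w: Fraction(1, 6) for w in omega_space}

def X(w):
    return w % 2      # X only records the parity of omega

def Y(w):
    return w          # Y records omega itself

def g(x):
    """Elementary conditional expectation g(x) = E(Y | X = x)."""
    mass = sum(prob[w] for w in omega_space if X(w) == x)
    return sum(Y(w) * prob[w] for w in omega_space if X(w) == x) / mass

# E(Y | X) as a random variable on Omega: omega -> g(X(omega)).
cond_exp = {w: g(X(w)) for w in omega_space}
```

Here `cond_exp` takes the value 2 on even outcomes and 3 on odd ones: two outcomes with the same value of `X` always receive the same value, which is the finite-space form of $\sigma(X)$-measurability.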

This is the structural feature to keep. Instead of conditioning on a specific value $X = x$, the right thing to condition on is the information $X$ carries, namely $\sigma(X)$. And once expressed this way, $\sigma(X)$ can be replaced by any sub-$\sigma$-field $\cG \subseteq \cF$.

Note the change of notation from here on: $X$ is the integrable random variable being conditioned, and $Y$ denotes a candidate for $\E(X \mid \cG)$. The two defining conditions package the two roles $Y$ has to play:

  • (1) Measurability. $Y$ is a function of the information in $\cG$. It cannot resolve anything finer than what $\cG$ records.
  • (2) Integration identity. $Y$ has the same integral as $X$ over every event in $\cG$, that is, $\int_A Y \, d\Pr = \int_A X \, d\Pr$ for all $A \in \cG$. Averages over $\cG$-events therefore agree even though $X$ and $Y$ are different random variables.

The identity in (2) is required only for $A \in \cG$, not for every $A \in \cF$ (that would force $Y = X$ almost surely). This is what makes $Y$ a genuine projection of $X$ onto $\cG$ rather than $X$ itself.
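The remark that the identity cannot be asked of every event in $\cF$ can be spelled out in two lines. The sets $A_+$ and $A_-$ below are names introduced here for the argument; the sketch assumes $X$ and $Y$ are both integrable and $\cF$-measurable.

```latex
% Assumption: identity (2) holds for every A in \cF, and X, Y are integrable.
\begin{align*}
A_+ := \{Y > X\} \in \cF
  &\;\Longrightarrow\; \int_{A_+} (Y - X) \, d\Pr = 0
  \;\Longrightarrow\; \Pr(A_+) = 0, \\
A_- := \{Y < X\} \in \cF
  &\;\Longrightarrow\; \int_{A_-} (X - Y) \, d\Pr = 0
  \;\Longrightarrow\; \Pr(A_-) = 0.
\end{align*}
```

The second implication in each line uses the fact that a strictly positive integrand with zero integral must live on a null set. Together the two lines give $\Pr(Y \neq X) = 0$.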

Suppose $\Omega$ is partitioned into three blocks $A, B, C$, each of positive probability, and $\cG = \sigma(\{A, B, C\}) = \{\emptyset, A, B, C, A \cup B, A \cup C, B \cup C, \Omega\}$. A $\cG$-measurable function is one that cannot distinguish points within the same block, which forces it to be constant on each block. So $\E(X \mid \cG)$ is determined by a single number per block. By condition (1) it takes the form

$$\E(X \mid \cG)(\omega) \;=\; c_A \, \mathbb{1}_A(\omega) + c_B \, \mathbb{1}_B(\omega) + c_C \, \mathbb{1}_C(\omega)$$

for some constants $c_A, c_B, c_C$. Condition (2) applied to $A$ pins down $c_A$:

$$\textcolor{#ef4444}{\int_A X \, d\Pr} \;=\; \textcolor{#3b82f6}{\int_A \E(X \mid \cG) \, d\Pr} \;=\; c_A \cdot \Pr(A) \quad \Longrightarrow \quad c_A = \frac{1}{\Pr(A)} \int_A X \, d\Pr.$$

The same argument with $B$ and $C$ gives $c_B = \frac{1}{\Pr(B)} \int_B X \, d\Pr$ and $c_C = \frac{1}{\Pr(C)} \int_C X \, d\Pr$.

Each constant is the block average of $X$ over the corresponding block. The picture below renders the first two integrals in the chain above as equal-area hatched regions in matching colors.

Figure: Conditional expectation as a block average over a partition into $A$, $B$, $C$.

On block $A$, the red area under $X$ equals the blue rectangle under $\E(X \mid \cG)$. This is the integration identity (2).
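The block-average computation, and the integration identity on all eight events of $\cG$, can be checked numerically. The following is a sketch on a hypothetical six-point sample space; the values of `X`, the block assignment, and the helper names `integral` and `measure` are illustrative assumptions.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical finite model: six equally likely outcomes in three blocks.
prob = {w: Fraction(1, 6) for w in range(6)}
X = {0: 1, 1: 5, 2: 2, 3: 2, 4: 8, 5: 0}
blocks = {"A": {0, 1}, "B": {2, 3, 4}, "C": {5}}

def integral(f, event):
    """Integral of f over the event with respect to prob."""
    return sum(f[w] * prob[w] for w in event)

def measure(event):
    return sum(prob[w] for w in event)

# c_block = (1 / Pr(block)) * integral of X over the block
c = {name: integral(X, b) / measure(b) for name, b in blocks.items()}

# E(X | G) is constant on each block, equal to that block's average.
cond_exp = {w: c[name] for name, b in blocks.items() for w in b}

# Every event in G is a union of blocks; check identity (2) on all eight.
names = list(blocks)
g_events = [set().union(*(blocks[n] for n in combo))
            for r in range(len(names) + 1)
            for combo in combinations(names, r)]
identity_holds = all(integral(X, e) == integral(cond_exp, e) for e in g_events)

# On an event outside G, such as the single point {0}, the identity can fail.
fails_outside_G = integral(X, {0}) != integral(cond_exp, {0})
```

With these values the block averages are 3, 4 and 0. The final check illustrates why the identity is only asked of events in $\cG$: on the finer event `{0}` the two integrals differ.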

$\E(X \mid \cG)$ is the best estimate of $X$ if you observe the universe only through $\cG$.

The precise sense of best, and the sense in which $\E(X \mid \cG)$ is an orthogonal projection, is the content of the projection perspective. Before that, we need to know such a $Y$ exists (existence) and is essentially unique (uniqueness). Both rest on the Radon-Nikodym theorem.