Introduction

In a basic probability course, conditional expectation is introduced through two formulas, one for each of the two settings that admit a clean elementary treatment.

The elementary formulas

For jointly discrete random variables $X, Y$ on a common probability space,

\E(Y \mid X = x) = \sum_y y \, \Pr(Y = y \mid X = x),

\Pr(Y = y \mid X = x) = \frac{\Pr(Y = y, \, X = x)}{\Pr(X = x)},

valid whenever $\Pr(X = x) > 0$.
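As an illustration, the discrete formula can be evaluated directly from a joint pmf table. The sketch below assumes a small made-up pmf (the dictionary `joint` and the helper name `cond_exp_discrete` are hypothetical, not part of the text) and uses exact rational arithmetic.

```python
from fractions import Fraction

# Hypothetical joint pmf Pr(X = x, Y = y), chosen only for illustration.
joint = {
    (0, 1): Fraction(1, 8),
    (0, 2): Fraction(3, 8),
    (1, 1): Fraction(1, 4),
    (1, 3): Fraction(1, 4),
}


def cond_exp_discrete(joint, x):
    """Return E(Y | X = x) from a joint pmf given as {(x, y): probability}."""
    # Marginal Pr(X = x): sum the joint pmf over all y.
    p_x = sum(p for (xi, _), p in joint.items() if xi == x)
    if p_x == 0:
        raise ValueError("E(Y | X = x) is undefined when Pr(X = x) = 0")
    # Sum over y of y * Pr(Y = y | X = x), where
    # Pr(Y = y | X = x) = Pr(Y = y, X = x) / Pr(X = x).
    return sum(y * p / p_x for (xi, y), p in joint.items() if xi == x)


e0 = cond_exp_discrete(joint, 0)
e1 = cond_exp_discrete(joint, 1)
print(e0, e1)  # 7/4 2
```

Asking for a value of $x$ with $\Pr(X = x) = 0$ raises an error, mirroring the restriction in the formula.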

For jointly absolutely continuous $X, Y$ with joint density $f_{X, Y}$ and marginal density $f_X$,

\E(Y \mid X = x) = \int y \, f_{Y \mid X}(y \mid x) \, dy, \qquad f_{Y \mid X}(y \mid x) = \frac{f_{X, Y}(x, y)}{f_X(x)},

valid whenever $f_X(x) > 0$.
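The continuous formula can be checked numerically in the same spirit. The sketch below assumes the hypothetical joint density $f_{X, Y}(x, y) = x + y$ on the unit square, for which $f_X(x) = x + \tfrac12$ and $\E(Y \mid X = x) = (x/2 + 1/3)/(x + 1/2)$, and approximates both integrals with a midpoint Riemann sum. The function names are illustrative only.

```python
def cond_exp_continuous(f_xy, x, y_lo, y_hi, n=10_000):
    """Approximate E(Y | X = x) with a midpoint Riemann sum over [y_lo, y_hi]."""
    h = (y_hi - y_lo) / n
    ys = [y_lo + (k + 0.5) * h for k in range(n)]
    # Marginal density f_X(x): integral of f_{X,Y}(x, y) over y.
    f_x = sum(f_xy(x, y) for y in ys) * h
    if f_x <= 0:
        raise ValueError("E(Y | X = x) is undefined when f_X(x) = 0")
    # Numerator: integral of y * f_{X,Y}(x, y) over y.
    numerator = sum(y * f_xy(x, y) for y in ys) * h
    return numerator / f_x


def f_xy(x, y):
    """Hypothetical joint density: x + y on the unit square, 0 elsewhere."""
    return x + y if 0 <= x <= 1 and 0 <= y <= 1 else 0.0


x = 0.25
approx = cond_exp_continuous(f_xy, x, 0.0, 1.0)
exact = (x / 2 + 1 / 3) / (x + 1 / 2)  # closed form for this density
print(approx, exact)
```

The two printed values should agree to several decimal places; the grid size `n` is an arbitrary choice.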

These are the only two settings the elementary theory handles. They do not cover mixed pairs (e.g. $X$ discrete, $Y$ continuous), pairs with a singular component (mass on a lower-dimensional set like the diagonal $\{X = Y\}$), or conditioning on more general information than the value of a single random variable.

Two routes to a general definition

How should conditional expectation be defined when neither of the elementary formulas applies?

Idea 1: build the conditional distribution first. Specify the law of $Y$ given $X = x$ as a probability measure on $\R$, then define $\E(Y \mid X = x)$ as the expectation under that measure. This is the route the elementary formulas suggest. Carrying it out rigorously in general is surprisingly hard: the regular conditional distribution exists under broad hypotheses (Borel-space targets), but constructing it requires substantial machinery and the resulting object is harder to manipulate than the expectation itself.

Idea 2: define conditional expectation directly. Skip the conditional distribution entirely and specify what $\E(Y \mid X)$ should be as a random variable, characterized by an integration identity. This is the approach we take. It is cleaner, more general (it handles any sub-$\sigma$-field, not just one generated by a random variable), and recovers the elementary formulas as special cases.

From a function of $x$ to a random variable

The shift in viewpoint is the following. The elementary quantity $\E(Y \mid X = x)$ is a number for each fixed $x$, so it defines a function $g(x) = \E(Y \mid X = x)$. Composing with $X$ gives a random variable:

\E(Y \mid X)(\omega) \;:=\; g(X(\omega)) \;=\; \E\!\big( Y \mid X = X(\omega) \big).

The value of $\E(Y \mid X)$ at $\omega$ depends on $\omega$ only through $X(\omega)$. Equivalently, $\E(Y \mid X)$ is a measurable function of $X$, which by the Doob-Dynkin lemma is the same as being $\sigma(X)$-measurable.
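The composition $g \circ X$ can be made concrete on a finite sample space. The sketch below assumes a hypothetical six-point space with uniform probabilities, $X(\omega) = \omega \bmod 2$ and $Y(\omega) = \omega$; all names are illustrative.

```python
from fractions import Fraction

omega = range(6)  # hypothetical sample space {0, ..., 5}
prob = {w: Fraction(1, 6) for w in omega}  # uniform probabilities


def X(w):
    return w % 2


def Y(w):
    return w


def g(x):
    """g(x) = E(Y | X = x), from the elementary discrete formula."""
    p_x = sum(prob[w] for w in omega if X(w) == x)
    return sum(Y(w) * prob[w] for w in omega if X(w) == x) / p_x


# The random variable E(Y | X): compose g with X, point by point.
cond_exp = {w: g(X(w)) for w in omega}
print(cond_exp)

# Its value at w depends on w only through X(w).
depends_only_on_X = all(
    cond_exp[w] == cond_exp[v] for w in omega for v in omega if X(w) == X(v)
)
print(depends_only_on_X)  # True
```

Here `cond_exp` takes the value 2 on the even outcomes and 3 on the odd ones, so it is constant on each level set of $X$.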

This is the structural feature to keep. Instead of conditioning on a specific value $X = x$, the right thing to condition on is the information $X$ carries, namely $\sigma(X)$. And once expressed this way, $\sigma(X)$ can be replaced by any sub-$\sigma$-field $\cG \subseteq \cF$.

The general definition

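Let $X$ be an integrable random variable on $(\Omega, \cF, \Pr)$ and let $\cG \subseteq \cF$ be a sub-$\sigma$-field. An integrable random variable $Y$ is a version of $\E(X \mid \cG)$ if it is $\cG$-measurable and satisfies

\int_A Y \, d\Pr = \int_A X \, d\Pr \qquad \text{for every } A \in \cG.

Note the change of roles: $X$ is now the variable being averaged, and $Y$ denotes its conditional expectation. The two conditions package the two roles $Y$ has to play: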

(1) Measurability. $Y$ is a function of the information in $\cG$. It cannot resolve anything finer than what $\cG$ records.
(2) Integration identity. $Y$ has the same total mass as $X$ on every event in $\cG$, so averages over $\cG$-events agree even though $X$ and $Y$ are different random variables.

The identity in (2) does not hold for every $A \in \cF$ (then $Y = X$ almost surely would be forced); it holds only for $A \in \cG$. This is what makes $Y$ a genuine projection of $X$ onto $\cG$ rather than $X$ itself.
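Both conditions can be checked mechanically when $\Omega$ is finite and $\cG$ is generated by a partition. The sketch below assumes a hypothetical six-point uniform space, the partition $\{0,1\}, \{2,3\}, \{4,5\}$, and $X(\omega) = \omega^2$; the candidate `Y` is written out by hand and the helper names are illustrative.

```python
from fractions import Fraction
from itertools import combinations

omega = range(6)  # hypothetical sample space
prob = {w: Fraction(1, 6) for w in omega}  # uniform probabilities
blocks = [{0, 1}, {2, 3}, {4, 5}]  # hypothetical partition generating G

X = {w: w * w for w in omega}
# Candidate for E(X | G), constant on each block.
Y = {0: Fraction(1, 2), 1: Fraction(1, 2),
     2: Fraction(13, 2), 3: Fraction(13, 2),
     4: Fraction(41, 2), 5: Fraction(41, 2)}


def integral(Z, A):
    """Integral of Z over the event A: sum of Z(w) * Pr({w}) for w in A."""
    return sum(Z[w] * prob[w] for w in A)


def is_measurable(Z):
    """Condition (1): Z is constant on every block of the partition."""
    return all(len({Z[w] for w in b}) == 1 for b in blocks)


def events_of_G():
    """Yield every union of blocks, i.e. every event in the generated sigma-field."""
    for r in range(len(blocks) + 1):
        for chosen in combinations(blocks, r):
            yield set().union(*chosen)


def satisfies_identity(Z):
    """Condition (2): Z and X have equal integrals over every event in G."""
    return all(integral(Z, A) == integral(X, A) for A in events_of_G())


print(is_measurable(Y), satisfies_identity(Y))  # True True
print(is_measurable(X))                         # False: X varies inside blocks
print(integral(Y, {0}), integral(X, {0}))       # 1/12 0: {0} is not in G
```

The last line shows the identity failing on the singleton $\{0\}$, an event of $\cF$ that does not belong to $\cG$.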

A picture

Suppose $\Omega$ is partitioned into three blocks $A, B, C$, and $\cG = \sigma(\{A, B, C\}) = \{\emptyset, A, B, C, A \cup B, A \cup C, B \cup C, \Omega\}$. A $\cG$-measurable function is one that cannot distinguish points within the same block, which forces it to be constant on each block. So $\E(X \mid \cG)$ is determined by a single number per block. By condition (1) it takes the form

\E(X \mid \cG)(\omega) \;=\; c_A \, \mathbb{1}_A(\omega) + c_B \, \mathbb{1}_B(\omega) + c_C \, \mathbb{1}_C(\omega)

for some constants $c_A, c_B, c_C$. Condition (2) applied to $A$ pins down $c_A$, provided $\Pr(A) > 0$:

\textcolor{#ef4444}{\int_A X \, d\Pr} \;=\; \textcolor{#3b82f6}{\int_A \E(X \mid \cG) \, d\Pr} \;=\; c_A \cdot \Pr(A) \quad \Longrightarrow \quad c_A = \frac{1}{\Pr(A)} \int_A X \, d\Pr.

The same argument with $B$ and $C$ gives $c_B = \frac{1}{\Pr(B)} \int_B X \, d\Pr$ and $c_C = \frac{1}{\Pr(C)} \int_C X \, d\Pr$.
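The three constants can be computed directly from these formulas. The sketch below assumes a hypothetical five-point sample space with non-uniform probabilities, split into blocks labelled `A`, `B`, `C`; the data and the helper name `block_averages` are made up for illustration.

```python
from fractions import Fraction

# Hypothetical five-point sample space with non-uniform probabilities.
prob = {"a1": Fraction(1, 10), "a2": Fraction(3, 10),
        "b1": Fraction(1, 5),
        "c1": Fraction(1, 5), "c2": Fraction(1, 5)}
X = {"a1": 4, "a2": 0, "b1": 7, "c1": 1, "c2": 3}
partition = {"A": ["a1", "a2"], "B": ["b1"], "C": ["c1", "c2"]}


def block_averages(X, prob, partition):
    """Return {block name: (1 / Pr(block)) * integral of X over the block}."""
    consts = {}
    for name, block in partition.items():
        p_block = sum(prob[w] for w in block)
        if p_block == 0:
            raise ValueError(f"block {name} has probability zero")
        consts[name] = sum(X[w] * prob[w] for w in block) / p_block
    return consts


consts = block_averages(X, prob, partition)
print(consts)  # c_A = 1, c_B = 7, c_C = 2

# E(X | G) as a random variable: the block constant at each sample point.
cond_exp = {w: consts[name] for name, block in partition.items() for w in block}

# Taking A = Omega in condition (2): both sides have the same expectation.
mean_X = sum(X[w] * prob[w] for w in prob)
mean_cond_exp = sum(cond_exp[w] * prob[w] for w in prob)
print(mean_X, mean_cond_exp)  # 13/5 13/5
```

A block of probability zero is rejected here because the formula divides by $\Pr(\text{block})$; on such a block any constant would satisfy condition (2).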

Each constant is the average of $X$ over the corresponding block. The picture below renders the first two integrals in the display above as equal-area hatched regions in matching colors.

Figure: conditional expectation as a block average over a partition into $A$, $B$, $C$.

On block $A$, the red area under $X$ equals the blue rectangle under $\E(X \mid \cG)$. This is the integration identity (2).

$\E(X \mid \cG)$ is the best estimate of $X$ if you observe the universe only through $\cG$.

The precise sense of "best", and the sense in which $\E(X \mid \cG)$ is an orthogonal projection, is the content of the projection perspective. Before that, we need to know that such a $Y$ exists (existence) and is essentially unique (uniqueness). Both rest on the Radon-Nikodym theorem.