Introduction
In a basic probability course, conditional expectation is introduced through two formulas, one for each of the two settings that admit a clean elementary treatment.
The elementary formulas
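For jointly discrete random variables $X$ and $Y$ on a common probability space,
$$\mathbb{E}[Y \mid X = x] = \sum_{y} y \, P(Y = y \mid X = x) = \sum_{y} y \, \frac{P(X = x,\, Y = y)}{P(X = x)},$$
valid whenever $P(X = x) > 0$.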
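The discrete formula can be evaluated mechanically from a joint probability mass function. The following is a minimal sketch, writing $X$ for the conditioning variable and $Y$ for the variable being averaged; the table `joint_pmf` and the helper `conditional_expectation` are hypothetical names introduced only for this illustration.

```python
from fractions import Fraction

# Hypothetical joint pmf of (X, Y): keys are (x, y), values are P(X = x, Y = y).
joint_pmf = {
    (0, 1): Fraction(1, 4),
    (0, 3): Fraction(1, 4),
    (1, 5): Fraction(1, 2),
}


def conditional_expectation(joint, x):
    """Return E[Y | X = x] = sum_y y * P(X = x, Y = y) / P(X = x)."""
    # Marginal P(X = x): sum the joint pmf over every y paired with this x.
    p_x = sum(p for (xi, _), p in joint.items() if xi == x)
    if p_x == 0:
        # The elementary formula is only valid when P(X = x) > 0.
        raise ValueError("E[Y | X = x] is undefined when P(X = x) = 0")
    return sum(y * p for (xi, y), p in joint.items() if xi == x) / p_x


print(conditional_expectation(joint_pmf, 0))  # 2
print(conditional_expectation(joint_pmf, 1))  # 5
```

Exact rational arithmetic (`Fraction`) is used so that the averages come out without floating-point noise; asking for a value of $x$ with zero probability raises an error, mirroring the validity condition.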
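For jointly absolutely continuous $(X, Y)$ with joint density $f_{X,Y}$ and marginal density $f_X$,
$$\mathbb{E}[Y \mid X = x] = \int_{\mathbb{R}} y \, \frac{f_{X,Y}(x, y)}{f_X(x)} \, dy,$$
valid whenever $f_X(x) > 0$.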
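As a worked illustration of the continuous formula, take the joint density $f_{X,Y}(x, y) = x + y$ on the unit square $[0,1]^2$ and zero elsewhere. This density is an assumption chosen here for the example, not one taken from the text. The marginal density and the conditional expectation are then
$$f_X(x) = \int_0^1 (x + y) \, dy = x + \tfrac{1}{2}, \qquad \mathbb{E}[Y \mid X = x] = \frac{\int_0^1 y \, (x + y) \, dy}{x + \tfrac{1}{2}} = \frac{\tfrac{x}{2} + \tfrac{1}{3}}{x + \tfrac{1}{2}} = \frac{3x + 2}{6x + 3}, \qquad 0 \le x \le 1.$$
Here $f_X(x) > 0$ for every $x \in [0, 1]$, so the formula applies on the whole interval.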
These are the only two settings the elementary theory handles. They do not cover mixed pairs (e.g. $X$ discrete, $Y$ continuous), pairs with a singular component (mass on a lower-dimensional set like the diagonal $\{x = y\}$), or conditioning on more general information than the value of a single random variable.
Two routes to a general definition
How should conditional expectation be defined when neither of the elementary formulas applies?
Idea 1: build the conditional distribution first. Specify the law of $Y$ given $X = x$ as a probability measure on $\mathbb{R}$, then define $\mathbb{E}[Y \mid X = x]$ as the expectation under that measure. This is the route the elementary formulas suggest. Carrying it out rigorously in general is surprisingly hard: the regular conditional distribution exists under broad hypotheses (Borel-space targets), but constructing it requires substantial machinery and the resulting object is harder to manipulate than the expectation itself.
Idea 2: define conditional expectation directly. Skip the conditional distribution entirely and specify what $\mathbb{E}[Y \mid X]$ should be as a random variable, characterized by an integration identity. This is the approach we take. It is cleaner, more general (it handles any sub-$\sigma$-field, not just one generated by a random variable), and recovers the elementary formulas as special cases.
From a function of $x$ to a random variable
The shift in viewpoint is the following. The elementary quantity $\mathbb{E}[Y \mid X = x]$ is a number for each fixed $x$, so it defines a function $g(x) = \mathbb{E}[Y \mid X = x]$. Composing with $X$ gives a random variable:
$$\mathbb{E}[Y \mid X](\omega) := g(X(\omega)).$$
The value of $g(X)$ at $\omega$ depends on $\omega$ only through $X(\omega)$. Equivalently, $g(X)$ is a function of $X$, which is to say it is $\sigma(X)$-measurable.
This is the structural feature to keep. Instead of conditioning on a specific value $X = x$, the right thing to condition on is the information $X$ carries, namely $\sigma(X)$. And once expressed this way, $\sigma(X)$ can be replaced by any sub-$\sigma$-field $\mathcal{G}$ of the underlying $\sigma$-field $\mathcal{F}$.
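On a finite sample space the passage from the function $g$ to the random variable $g(X)$ can be made concrete. The sketch below assumes a fair six-sided die with $X$ recording the parity of the roll and $Y$ the face value. This setup, and the names `omega`, `prob`, `g` and `g_of_X`, are hypothetical choices made for illustration.

```python
from fractions import Fraction

# Hypothetical finite sample space: a fair die, each outcome has mass 1/6.
omega = [1, 2, 3, 4, 5, 6]
prob = {w: Fraction(1, 6) for w in omega}


def X(w):
    """X records only the parity of the roll (1 for odd, 0 for even)."""
    return w % 2


def Y(w):
    """Y is the face value itself."""
    return w


def g(x):
    """g(x) = E[Y | X = x], computed with the elementary discrete formula."""
    level_set = [w for w in omega if X(w) == x]
    p_x = sum(prob[w] for w in level_set)
    return sum(Y(w) * prob[w] for w in level_set) / p_x


# The random variable g(X): each outcome w is sent to g(X(w)).
g_of_X = {w: g(X(w)) for w in omega}

# sigma(X)-measurability on a finite space: outcomes with the same X-value
# must receive the same value of g(X).
for w1 in omega:
    for w2 in omega:
        if X(w1) == X(w2):
            assert g_of_X[w1] == g_of_X[w2]
```

In this example `g_of_X` takes the value 3 on the odd outcomes and 4 on the even ones: it sees an outcome only through its parity, which is exactly the information $X$ carries.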
The general definition
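Let $Y$ be an integrable random variable on $(\Omega, \mathcal{F}, P)$ and let $\mathcal{G} \subseteq \mathcal{F}$ be a sub-$\sigma$-field. A random variable, written $\mathbb{E}[Y \mid \mathcal{G}]$, is a version of the conditional expectation of $Y$ given $\mathcal{G}$ if (1) it is $\mathcal{G}$-measurable and (2) $\int_A \mathbb{E}[Y \mid \mathcal{G}] \, dP = \int_A Y \, dP$ for every $A \in \mathcal{G}$.
The two conditions package the two roles $\mathbb{E}[Y \mid \mathcal{G}]$ has to play:
- (1) Measurability. $\mathbb{E}[Y \mid \mathcal{G}]$ is a function of the information in $\mathcal{G}$. It cannot resolve anything finer than what $\mathcal{G}$ records.
- (2) Integration identity. $\mathbb{E}[Y \mid \mathcal{G}]$ has the same total mass as $Y$ on every event in $\mathcal{G}$, so averages over $\mathcal{G}$-events agree even though $\mathbb{E}[Y \mid \mathcal{G}]$ and $Y$ are different random variables.
The identity in (2) does not hold for every $A \in \mathcal{F}$ (then $\mathbb{E}[Y \mid \mathcal{G}] = Y$ almost surely would be forced); it holds only for $A \in \mathcal{G}$. This is what makes $\mathbb{E}[Y \mid \mathcal{G}]$ a genuine projection of $Y$ onto $\mathcal{G}$ rather than $Y$ itself.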
A picture
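Suppose $\Omega$ is partitioned into three blocks $A_1$, $A_2$, $A_3$, each of positive probability, and $\mathcal{G} = \sigma(A_1, A_2, A_3)$. A $\mathcal{G}$-measurable function is one that cannot distinguish points within the same block, which forces it to be constant on each block. So $\mathbb{E}[Y \mid \mathcal{G}]$ is determined by a single number per block. By condition (1) it takes the form
$$\mathbb{E}[Y \mid \mathcal{G}] = c_1 \mathbf{1}_{A_1} + c_2 \mathbf{1}_{A_2} + c_3 \mathbf{1}_{A_3}$$
for some constants $c_1, c_2, c_3$. Condition (2) applied to $A = A_1$ pins down $c_1$:
$$\int_{A_1} Y \, dP = \int_{A_1} \mathbb{E}[Y \mid \mathcal{G}] \, dP = c_1 P(A_1), \qquad \text{so} \qquad c_1 = \frac{1}{P(A_1)} \int_{A_1} Y \, dP.$$
The same argument with $A_2$ and $A_3$ gives $c_2$ and $c_3$.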
Each constant $c_i$ is the block average of $Y$ over the corresponding block $A_i$. The picture below renders the first two terms above as equal-area hatched regions in matching colors.
On block $A_1$, the red area under $Y$ equals the blue rectangle under $\mathbb{E}[Y \mid \mathcal{G}]$. This is the integration identity (2).
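The block-average computation, and the fact that the integration identity holds on events of the sub-$\sigma$-field but not on arbitrary events, can be checked numerically. The sketch below assumes a uniform six-point sample space, the integrand $Y(\omega) = \omega^2$, and a three-block partition. All of these, along with the names `blocks`, `constants` and `Z` (standing for the conditional expectation), are hypothetical choices made for illustration.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical setup: uniform measure on six points, Y(w) = w^2.
omega = [1, 2, 3, 4, 5, 6]
prob = {w: Fraction(1, 6) for w in omega}
Y = {w: Fraction(w * w) for w in omega}

# A partition of omega into three blocks; G is the sigma-field they generate.
blocks = [{1, 2}, {3, 4, 5}, {6}]


def measure(event):
    """P(event) for a set of outcomes."""
    return sum(prob[w] for w in event)


def integral(f, event):
    """Integral of f over event with respect to P (a finite weighted sum)."""
    return sum(f[w] * prob[w] for w in event)


# Block constants: c_i = (1 / P(A_i)) * integral of Y over A_i.
constants = [integral(Y, A) / measure(A) for A in blocks]

# Z is constant on each block, equal to that block's average of Y.
Z = {w: c for A, c in zip(blocks, constants) for w in A}


def events_of_G(blocks):
    """Yield every event of G: all unions of blocks, including the empty set."""
    for r in range(len(blocks) + 1):
        for chosen in combinations(blocks, r):
            yield set().union(*chosen)


# Condition (2): the integrals of Z and Y agree on every event of G ...
identity_on_G = all(integral(Z, A) == integral(Y, A) for A in events_of_G(blocks))

# ... but not on an event outside G, such as a single point of a larger block.
not_in_G = {1}
identity_off_G = integral(Z, not_in_G) == integral(Y, not_in_G)

print(identity_on_G, identity_off_G)  # True False
```

For a finite partition, the generated $\sigma$-field consists exactly of the unions of blocks, which is why enumerating subsets of `blocks` covers every event on which condition (2) must hold.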
$\mathbb{E}[Y \mid \mathcal{G}]$ is the best estimate of $Y$ if you observe the universe only through $\mathcal{G}$.
The precise sense of best, and the sense in which $\mathbb{E}[Y \mid \mathcal{G}]$ is an orthogonal projection, are the content of the projection perspective. Before that, we need to know that such a random variable exists (existence) and is essentially unique (uniqueness). Both rest on the Radon-Nikodym theorem.