Decomposing Variance

The law of total variance decomposes $\Var(X)$ into two pieces driven by an auxiliary random variable $Y$: the average conditional variance of $X$ given $Y$ and the variance of the conditional mean $\E[X \mid Y]$. It is the second-moment analog of the law of iterated expectations and a direct corollary of the orthogonality of the residual $X - \E[X \mid Y]$ to every function of $Y$.

Conditional variance

The conditional variance $\Var(X \mid Y)$ is the random variable obtained by squaring the residual $X - \E[X \mid Y]$ and taking its conditional expectation given $Y$: $\Var(X \mid Y) = \E\big[ (X - \E[X \mid Y])^2 \mid Y \big]$. Like $\E[X \mid Y]$, it is a function of $Y$ and therefore itself a random variable.

For random variables $X, Y$ with $\E[X^2] < \infty$,

$$\Var(X) \;=\; \E\!\big[ \Var(X \mid Y) \big] + \Var\!\big( \E[X \mid Y] \big).$$
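
A short derivation makes the "orthogonality of the residual" remark concrete. The sketch below assumes $\E[X^2] < \infty$ and introduces the shorthand $m(Y) = \E[X \mid Y]$ and $\mu = \E[X]$, which are notation for this derivation only; `\E` and `\Var` are assumed to be the same macros used in the rest of the text.

```latex
% Sketch of the proof; m(Y) and \mu are shorthand introduced for this derivation.
\begin{align*}
\Var(X)
  &= \E\big[ (X - \mu)^2 \big] \\
  &= \E\big[ \big( (X - m(Y)) + (m(Y) - \mu) \big)^2 \big] \\
  &= \E\big[ (X - m(Y))^2 \big]
     + 2\,\E\big[ (X - m(Y))(m(Y) - \mu) \big]
     + \E\big[ (m(Y) - \mu)^2 \big] \\
  &= \E\big[ \Var(X \mid Y) \big] + 0 + \Var\big( m(Y) \big).
\end{align*}
```

The cross term vanishes because $m(Y) - \mu$ is a function of $Y$ and the residual has conditional mean zero: $\E[(X - m(Y))(m(Y) - \mu)] = \E\big[ (m(Y) - \mu)\, \E[X - m(Y) \mid Y] \big] = 0$. The first term becomes $\E[\Var(X \mid Y)]$ by iterated expectations, and the last is $\Var(m(Y))$ because $\E[m(Y)] = \mu$.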

Intuition: Within-group and between-group

Imagine $X$ is a measurement (height, income, response time) and $Y$ is a grouping variable (school, region, treatment arm). The law of total variance reads:

The total spread of $X$ equals the average spread within each group plus the spread of the group averages.

Figure: Total variance splits into within-group (intra) and between-group (inter) variance. On the left, all data points are pooled together and the spread is $\Var(X)$. On the right, the same points are grouped by $Y$: the orange bars show the within-group spread averaged across groups, $\E[\Var(X \mid Y)]$, and the purple bar shows the spread of the group means, $\Var(\E[X \mid Y])$. The two pieces sum to the total variance on the left.

Concretely:

Within-group (intra) variance = $\E[\Var(X \mid Y)]$. The average of the variances inside each $Y$-block. It captures how much $X$ wiggles around its conditional mean.
Between-group (inter) variance = $\Var(\E[X \mid Y])$. The variance of the conditional means $\E[X \mid Y]$. It captures how much the group means themselves spread out.
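
The within/between split can be checked numerically. The sketch below uses made-up toy data (the group labels and values are hypothetical) and treats the sample as the whole population, so that $P(Y = g) = n_g / n$ and all variances are population variances. Under that assumption the identity holds exactly, not just approximately. The function name `decompose` is illustrative only.

```python
from statistics import fmean, pvariance

# Hypothetical toy data: measurements X grouped by a label Y.
groups = {
    "a": [1.0, 2.0, 3.0],
    "b": [4.0, 6.0],
    "c": [10.0, 10.0, 13.0, 15.0],
}


def decompose(groups):
    """Return (total, within, between) under the empirical distribution."""
    pooled = [x for xs in groups.values() for x in xs]
    n = len(pooled)
    grand_mean = fmean(pooled)

    # Var(X): population variance of the pooled data.
    total = pvariance(pooled)

    # E[Var(X | Y)]: group variances weighted by P(Y = g) = n_g / n.
    within = sum(len(xs) / n * pvariance(xs) for xs in groups.values())

    # Var(E[X | Y]): weighted squared deviation of the group means
    # from the grand mean (which equals the mean of the group means
    # under the same weights).
    between = sum(
        len(xs) / n * (fmean(xs) - grand_mean) ** 2 for xs in groups.values()
    )
    return total, within, between


total, within, between = decompose(groups)
print(f"total={total:.4f}  within={within:.4f}  between={between:.4f}")
print(f"within + between = {within + between:.4f}")
```

With sample variances (dividing by $n - 1$) the two sides no longer match exactly, which is why the sketch sticks to population variances.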

Applications

Analysis of variance (ANOVA). The within/between decomposition is the algebraic core of one-way ANOVA, where $Y$ is a categorical treatment label and the F-statistic compares the two pieces.
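Variance reduction. Since $\Var(\E[X \mid Y]) = \Var(X) - \E[\Var(X \mid Y)] \le \Var(X)$, replacing $X$ by $\E[X \mid Y]$ keeps the mean and never increases the variance. When $\E[X \mid Y]$ is easy to compute, this is the basis for Rao-Blackwellization in statistics. Stratified sampling in Monte Carlo works on the other term: fixing how many samples come from each value of $Y$ removes the between-group contribution, so the estimator has low variance when $\Var(X \mid Y)$ is small.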
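Regression decomposition. With $\E[X \mid Y]$ replaced by a fitted regression $\hat X = f(\mathbf{Z})$, the same identity gives the standard “explained vs. unexplained” variance decomposition $\Var(X) = \Var(\hat X) + \Var(X - \hat X)$, provided the residual $X - \hat X$ is uncorrelated with $\hat X$ (as it is for $\hat X = \E[X \mid \mathbf{Z}]$ and for least squares with an intercept). The $R^2$ statistic is the ratio $\Var(\hat X) / \Var(X)$, with the best linear predictor standing in for $\E[X \mid \mathbf{Z}]$.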
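
To make the ANOVA and $R^2$ items concrete, here is a minimal sketch that builds both quantities from the same two sums of squares. The data and the function name `one_way_anova` are hypothetical. The $R^2$ computed here is for a regression of $X$ on the group indicators, in which case the fitted values are the group means, so it equals the between-group share of the total variance.

```python
from statistics import fmean

# Hypothetical toy data: response X under three treatment labels Y.
groups = {
    "control": [4.0, 5.0, 6.0],
    "dose_1": [6.0, 7.0, 8.0],
    "dose_2": [9.0, 10.0, 11.0],
}


def one_way_anova(groups):
    """Return (F, R^2) for a one-way layout, computed from sums of squares."""
    pooled = [x for xs in groups.values() for x in xs]
    n, k = len(pooled), len(groups)
    grand_mean = fmean(pooled)

    # Between-group sum of squares: n * Var(E[X | Y]) under the
    # empirical distribution.
    ss_between = sum(
        len(xs) * (fmean(xs) - grand_mean) ** 2 for xs in groups.values()
    )

    # Within-group sum of squares: n * E[Var(X | Y)] under the
    # empirical distribution.
    ss_within = 0.0
    for xs in groups.values():
        group_mean = fmean(xs)
        ss_within += sum((x - group_mean) ** 2 for x in xs)

    # F compares the two pieces, each divided by its degrees of freedom.
    f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))

    # R^2 is the explained share; ss_between + ss_within is the total
    # sum of squares by the law of total variance.
    r_squared = ss_between / (ss_between + ss_within)
    return f_stat, r_squared


f_stat, r_squared = one_way_anova(groups)
print(f"F = {f_stat:.3f}, R^2 = {r_squared:.3f}")
```

Turning the F-statistic into a p-value needs the F distribution with $(k - 1, n - k)$ degrees of freedom, which this sketch leaves out.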