Skip to content

Randomized Linear Algebra

When a matrix is too large to factor in full, sampling its rows and columns gives an unbiased estimator of products and an extremely effective approximation of the SVD. The key fact is that, with the right sampling distribution, the variance of the estimator is controlled by the same Frobenius norms that already govern Eckart–Young accuracy.

Write the product AB\Av\Bv as a sum of rank-one terms,

AB  =  i=1naibiT,\Av\Bv \;=\; \sum_{i=1}^n \av_i \, \bv_i^{\rm T},

where ai\av_i is the ii-th column of A\Av and biT\bv_i^{\rm T} is the ii-th row of B\Bv. Replace the full sum by a sample.

Each summand has expectation iqiaibiT/qi=AB\sum_i q_i \cdot \av_i\bv_i^{\rm T}/q_i = \Av\Bv, so AB~\widetilde{\Av\Bv} is unbiased. Its mean-square Frobenius error is the variance of the estimator,

E ⁣[AB~ABF2]  =  1p(i=1nai2bi2qi    ABF2),\E\!\left[\,\lVert \widetilde{\Av\Bv} - \Av\Bv \rVert_F^2\,\right] \;=\; \frac{1}{p}\left( \sum_{i=1}^n \frac{\lVert \av_i \rVert^2\,\lVert \bv_i \rVert^2}{q_i} \;-\; \lVert \Av\Bv \rVert_F^2 \right),

and is minimized over the choice of qiq_i by importance sampling proportional to column–row magnitudes:

qi  =  aibijajbj.q_i^{\star} \;=\; \frac{\lVert \av_i \rVert \, \lVert \bv_i \rVert}{\sum_j \lVert \av_j \rVert \, \lVert \bv_j \rVert}.

With this choice, the bound becomes E ⁣[AB~ABF2]1p(iaibi)2\E\!\left[\lVert \widetilde{\Av\Bv} - \Av\Bv \rVert_F^2\right] \le \tfrac{1}{p}\big(\sum_i \lVert \av_i\rVert\,\lVert \bv_i\rVert\big)^2, depending only on the total Frobenius mass of the factors. Matrices with a few dominant columns concentrate that mass, and a small pp suffices.

Many problems do not need the full product, only a good basis for the column space of A\Av. Hit A\Av with a small random matrix.

The constant 11(k+)min(m,n)11\sqrt{(k+\ell)\min(m,n)} is the simple worst-case bound; the expected error is much smaller, often only k\sqrt{k\ell} above the optimum. A single extra step of power iteration, replacing Y=AΩ\Yv = \Av\Omegav by Y=(AAT)qAΩ\Yv = (\Av\Av^{\rm T})^q \Av\Omegav, polynomially shrinks the gap whenever A\Av has any decay in its singular values.

Combining the two ideas gives the fast approximate SVD of a large matrix.

The expensive step is the matrix multiplication AΩ\Av\Omegav, which touches each entry of A\Av once and can be streamed or done in parallel; the rest works on tall-thin matrices of size m×(k+)m \times (k+\ell) or (k+)×n(k+\ell) \times n. For an m×nm \times n matrix the cost drops from O(mnmin(m,n))O(mn\min(m,n)) for a full SVD to roughly O(mn(k+))O(mn(k+\ell)), with error close to the Eckart–Young optimum σk+1\sigma_{k+1}. This is what lets PCA, latent-semantic indexing, and modern recommendation systems run on matrices that would never fit in memory in factored form.