Skip to content

The Convolution Rule

Multiplying two circulant matrices multiplies their eigenvalues, and since those eigenvalues are the discrete Fourier transforms of the first columns, the columns themselves combine by convolution. Read in the other direction, this is the single most useful identity in signal processing: convolution in the original domain becomes ordinary pointwise multiplication in the Fourier domain. The same pattern, restricted to a short filter, is exactly what a convolutional layer of a neural network computes.

This is precisely the rule for multiplying circulants. If C\Cv and D\Dv are circulants with first columns c\cv and d\dv, then CD\Cv\Dv is again circulant, and its first column is cd\cv \circledast \dv. The reason is that the first column of any circulant is the matrix applied to the first basis vector e0\ev_0, and De0=d\Dv\ev_0 = \dv, so the first column of CD\Cv\Dv is Cd\Cv\dv, whose jj-th entry is mc(jm)modndm\sum_m c_{(j-m) \bmod n}\, d_m.

Because circulants share the Fourier eigenvectors, multiplying them multiplies their eigenvalues entry by entry. Translating that statement about eigenvalues back to the first columns gives the convolution rule.

The computational payoff is immediate. A direct cyclic convolution costs O(n2)O(n^2) multiplications. Through the rule, transform both vectors, multiply the transforms pointwise in O(n)O(n), and transform back. With the fast Fourier transform computing each Fx\Fv\xv in O(nlogn)O(n \log n), convolution of length-nn signals drops from O(n2)O(n^2) to O(nlogn)O(n \log n). Polynomial multiplication, integer multiplication, and the filtering of long signals all run on this identity.

A short filter h=(h0,,hr1)\hv = (h_0, \dots, h_{r-1}) applied to a signal x\xv usually means linear convolution, the coefficient rule for multiplying the polynomials with coefficients h\hv and x\xv:

(hx)j=mhmxjm,(\hv \ast \xv)_j = \sum_{m} h_m\, x_{j - m},

with no wraparound, producing a longer output of length n+r1n + r - 1. Cyclic convolution is the same sum folded modulo nn. The two agree, and the FFT method applies exactly, once both vectors are zero-padded to length at least n+r1n + r - 1 so that the wraparound never reaches real data. Without enough padding the tail of the linear convolution aliases back onto the front, the error that periodic boundary conditions silently introduce.

Visualize the rule directly: edit the signal and the filter, and compare the convolution computed by sliding the filter against the convolution recovered by multiplying transforms and inverting. They coincide, frequency by frequency.

signal
filter
space domain: slide the filter across the signal
signal x
filter h
=
output x ∗ h
frequency domain: the transforms multiply pointwise
frequency k
|F x|
×
frequency k
|F h|
=
frequency k
|F y|
The output spectrum is the input spectrum scaled frequency by frequency by the filter’s response: |F y|ₖ = |F x|ₖ · |F h|ₖ. A low-pass filter keeps the low frequencies and removes the high ones; the edge filter does the reverse. Convolving in space is multiplying in frequency.

Fixing the filter h\hv and letting the signal vary makes convolution a linear map, represented by a matrix with constant diagonals. With periodic boundaries it is the circulant whose first column is h\hv (zero-padded); on a finite signal without wraparound it is a banded Toeplitz matrix, (T)jk=hjk(\Tv)_{jk} = h_{j-k}. Either way the same filter weights repeat down every diagonal.

A short filter slides across the input, and the equivalent matrix is banded with the filter weights repeated on each diagonal; the same few weights are shared across all rows

The convolution matrix is banded with shared weights: a width-rr filter places the same rr numbers on every row, shifted by one. A dense layer would have an independent weight in every position.

This structure is what a convolutional neural network exploits. A convolutional layer is not a full dense matrix WRn×n\Wv \in \R^{n \times n} with n2n^2 free parameters; it is a Toeplitz (or, with periodic boundaries, circulant) matrix built from a small filter of rnr \ll n weights. Three consequences follow from the structure alone.

  • Parameter sharing. A width-rr filter has rr parameters regardless of the input length nn, against n2n^2 for a dense layer. A network can afford many such layers and many filters per layer.
  • Translation equivariance. Convolution commutes with the shift P\Pv, so shifting the input shifts the output identically. A feature detector learned at one location applies at every location, the property that makes these layers effective on images.
  • Locality. Each output depends only on a small window of the input, the bandwidth of the Toeplitz matrix, matching the local correlations in natural images and signals.

Two-dimensional images use a two-dimensional filter, and the corresponding matrix is doubly Toeplitz (a block-Toeplitz matrix of Toeplitz blocks), still diagonalized by the two-dimensional DFT under periodic boundaries. Stacking many such layers, interleaved with pointwise nonlinearities, is the architecture behind the convolutional networks used for image recognition. The linear algebra of one layer is entirely the convolution rule on this page.