Convolutions & Equivariance

Convolution as group action. CNNs are equivariant maps — representation theory inside every convolutional network.

Strip a convolutional neural network down to its mathematical skeleton and you find one operation: convolution. In continuous form, (f ∗ g)(x) = ∫ f(t) g(x − t) dt. Discretized, it is the dot product of a small kernel with a sliding window of the input. That is the only thing a convolutional layer does — repeatedly, with learned kernels, at every position.
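
To make the discrete form concrete, here is a minimal sketch in numpy (the helper name conv1d_valid is just for this example): each output entry is one dot product between the flipped kernel and a sliding window of the input.

    import numpy as np

    def conv1d_valid(f, g):
        # discrete convolution (f * g)(x) = sum_t f(t) g(x - t), 'valid' positions only
        g_flipped = g[::-1]                    # true convolution flips the kernel
        n = len(f) - len(g) + 1
        return np.array([np.dot(f[x:x + len(g)], g_flipped) for x in range(n)])

    f = np.array([0., 0., 1., 2., 3., 2., 1., 0.])
    g = np.array([1., 2., 1.])                 # a small blur kernel
    print(conv1d_valid(f, g))                  # same result as np.convolve(f, g, mode='valid')
    print(np.convolve(f, g, mode='valid'))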

What makes this operation special is a single algebraic identity. Translate the input by s, then convolve, and you get the same answer as convolving first and translating the output by s: Ts(f ∗ g) = (Tsf) ∗ g. A CNN layer is a translation-equivariant linear map. In the language of representation theory, the convolutions are exactly the linear maps that commute with the action of the translation group ℤᵈ on functions: choose a kernel and you get an equivariant map, and every translation-equivariant linear map arises this way. Every other architecture choice in a CNN — pooling, padding, dilation — is a knob on this same group-theoretic structure.
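
The identity is easy to check numerically. The sketch below works on a circular domain (periodic boundary, so there are no edge effects) and computes circular convolution through the FFT; numpy is the only dependency, and the signal and kernel are arbitrary examples.

    import numpy as np

    def circ_conv(f, g):
        # circular convolution via the convolution theorem: transform, multiply, invert
        return np.fft.ifft(np.fft.fft(f) * np.fft.fft(g)).real

    rng = np.random.default_rng(0)
    f = rng.standard_normal(64)             # arbitrary signal
    g = np.zeros(64)
    g[:3] = [1.0, 2.0, 1.0]                 # small kernel, zero-padded to the signal length

    s = 5                                   # translation amount
    lhs = circ_conv(np.roll(f, s), g)       # translate, then convolve: (Tsf) * g
    rhs = np.roll(circ_conv(f, g), s)       # convolve, then translate: Ts(f * g)
    print(np.allclose(lhs, rhs))            # True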

Equivariance is not invariance. Equivariant means "the output transforms in a predictable way when the input transforms" — structure is preserved across the layer. Invariant means "the output does not change at all" — structure is discarded. Convolution is equivariant. Pooling adds a small dose of invariance. The full network composes them, and the field of geometric deep learning generalizes the same idea to other groups: rotations, permutations, gauges.

Interactive: 1D Sliding Window

Draw a signal, pick a kernel, and watch the kernel slide across one position at a time. The output at each position is a single dot product. Try the derivative kernel — it spikes wherever the input changes.

Click and drag on the input row to draw your own signal. The yellow window slides one position at a time, computing a dot product with the kernel.

At every position p, the kernel computes output[p] = Σₖ input[p+k] · kernel[k]. The derivative kernel [-1, 0, 1] spikes wherever the input changes — it is a discrete approximation of a first derivative. The blur kernel averages nearby values. Every CNN filter is one of these dot products, repeated at every position.
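
Since the formula above is cross-correlation (no kernel flip), np.correlate computes it directly. A quick check with the derivative kernel on a toy step signal:

    import numpy as np

    signal = np.array([1., 1., 1., 1., 5., 5., 5., 5.])   # flat, then a step up
    kernel = np.array([-1., 0., 1.])                      # discrete derivative

    # output[p] = sum_k signal[p + k] * kernel[k], 'valid' positions only
    print(np.correlate(signal, kernel, mode='valid'))
    # [0. 0. 4. 4. 0. 0.]  -- zero where the signal is flat, spikes around the step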

Interactive: 2D Image Filter

The same dot product, run over every position of a small grayscale image. Each kernel is a tiny pattern that the filter responds to: edges, blurs, sharpens. Stack many of these and you have a CNN.

input (48×48)

kernel (3×3):
 -1   0   1
 -2   0   2
 -1   0   1

Cyan cells are positive weights (this pixel adds to the output), magenta are negative (this pixel subtracts). Edge kernels arrange positive and negative on opposite sides — the filter responds maximally to brightness changes in that direction.

output (46×46)

The same dot product as in 1D, just slid over a 2D grid. The output is smaller because the kernel cannot extend past the edges (we use 'valid' padding). Stack many 2D convolutions and you get a CNN — every layer is just the same translation-equivariant linear operation, with learned kernels.
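
A sketch of the same computation with scipy.signal.correlate2d (cross-correlation, which is what deep learning libraries call convolution); the random 48×48 array is only a stand-in for the image, and the kernel is the edge filter shown above.

    import numpy as np
    from scipy.signal import correlate2d

    rng = np.random.default_rng(0)
    image = rng.standard_normal((48, 48))              # stand-in for a grayscale image

    kernel = np.array([[-1., 0., 1.],
                       [-2., 0., 2.],
                       [-1., 0., 1.]])                 # the edge kernel shown above

    edges = correlate2d(image, kernel, mode='valid')   # kernel never extends past the border
    print(image.shape, '->', edges.shape)              # (48, 48) -> (46, 46)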

Interactive: Translation Equivariance

Two pipelines, same final result. Translate-then-convolve equals convolve-then-translate. This is the algebraic identity that makes CNNs the right architecture for images — every learned filter automatically generalizes across the whole spatial domain.

Pipeline A  →  translate, then convolve

translated input

conv(translated)

Pipeline B  →  convolve, then translate

original input

translate(conv)

Move the shift sliders. The two pipelines give pixel-identical outputs in the interior (any tiny residual is from boundary clipping when the translation pushes content off the visible canvas). That is translation equivariance: Ts(f ∗ k) = (Tsf) ∗ k. It is the defining property of a CNN layer — and the reason convolution shows up in physics, signal processing, and ML wherever translations are a symmetry. See Representation Theory for the full algebraic picture.
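
The two pipelines can be reproduced in a few lines, assuming numpy and scipy. np.roll wraps content around the border (the analogue of clipping in the demo), so the comparison crops to the interior before measuring the residual.

    import numpy as np
    from scipy.signal import correlate2d

    rng = np.random.default_rng(0)
    img = rng.standard_normal((48, 48))
    k = np.array([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])

    dy, dx = 3, 5   # translation amounts
    a = correlate2d(np.roll(img, (dy, dx), axis=(0, 1)), k, mode='valid')   # translate, then convolve
    b = np.roll(correlate2d(img, k, mode='valid'), (dy, dx), axis=(0, 1))   # convolve, then translate

    m = max(dy, dx) + 1   # crop the rows/cols touched by the wrapped border
    print(np.abs(a[m:-m, m:-m] - b[m:-m, m:-m]).max())   # ~0: the pipelines agree in the interior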

Interactive: Pooling and Local Invariance

Max pooling keeps only the strongest activation in each window. The result is a smaller feature map that hardly changes when you nudge the input — pooling deliberately throws away precise spatial information in exchange for invariance.

feature map (8×8)

max pool (4×4)

Hover an output cell to see which input window it came from.

Invariance check  →  shift the input and re-pool

shifted feature map

max pool of shifted

Convolution is equivariant to translation — translate the input, the output translates the same way. Pooling adds a controlled amount of invariance: small input shifts often leave the pooled output (almost) unchanged, because the max in a 2×2 window doesn't care which of those 4 cells produced it. Equivariance preserves structure across layers; invariance discards what the network has decided not to care about. CNNs interleave both.
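
A minimal numpy sketch of 2×2 max pooling and the invariance check; the feature map is a single strong activation, chosen so the effect is easy to see.

    import numpy as np

    def max_pool_2x2(x):
        # non-overlapping 2x2 max pooling: keep only the strongest value in each window
        h, w = x.shape
        return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

    fmap = np.zeros((8, 8))
    fmap[3, 3] = 1.0                # one strong activation

    nudged = np.zeros((8, 8))
    nudged[2, 2] = 1.0              # the same activation, nudged by one pixel

    print(np.array_equal(fmap, nudged))                              # False: the raw maps differ
    print(np.array_equal(max_pool_2x2(fmap), max_pool_2x2(nudged)))  # True: the pooled maps agree

Move the activation two pixels instead of one, out of its 2×2 window, and the pooled maps stop agreeing: the invariance is only local, which is exactly the trade the network is making.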

The math objects

  • Convolution: (f ∗ g)(x) = Σₜ f(t) g(x − t). In every deep learning library, the operation is actually cross-correlation, which skips the kernel flip, but the algebraic property of equivariance is the same (see the check after this list).
  • Translation group ℤᵈ: the abelian group of integer shifts on a d-dimensional grid. It acts on the space of functions f : ℤᵈ → ℝ by (Tsf)(x) = f(x − s).
  • Equivariance: a map L is G-equivariant if L ∘ Tg = Tg ∘ L for every g ∈ G. Convolution is the prototypical example for the translation group. Schur's lemma plus a Fourier argument shows that every translation-equivariant linear map on functions is a convolution.
  • Equivariance vs invariance: equivariance means structure is preserved (the output transforms predictably). Invariance means structure is discarded (the output does not transform at all). A pooled-then-classified CNN goes equivariant → equivariant → ... → invariant.
  • Geometric deep learning: swap the translation group for rotations (SO(2), SO(3)), permutations (graph neural networks), or gauge transformations (mesh CNNs). The corresponding equivariant linear maps are no longer ordinary convolutions, but the recipe — pick your group, find its equivariant maps — is the same.
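
As noted in the first bullet, libraries implement cross-correlation rather than true convolution; the only difference is a kernel flip, which this small numpy check makes explicit.

    import numpy as np

    f = np.array([1., 2., 3., 4., 5.])
    k = np.array([1., 0., -1.])

    conv = np.convolve(f, k, mode='valid')          # true convolution: the kernel is flipped
    corr = np.correlate(f, k[::-1], mode='valid')   # cross-correlation with the pre-flipped kernel
    print(conv, corr, np.allclose(conv, corr))      # identical outputs: they differ only by the flip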

Key takeaways

  • Convolution is the dot product of a kernel with every sliding window of the input — the same operation in 1D, 2D, and beyond.
  • A CNN layer is a translation-equivariant linear map, and every translation-equivariant linear map is convolution with some kernel: Ts(f ∗ k) = (Tsf) ∗ k.
  • Equivariance preserves structure. Invariance discards it. Convolutions are equivariant; pooling moves the network toward invariance.
  • Generalize to other groups (rotations, permutations) and you get the field of geometric deep learning. The mathematics is representation theory.