Convolutions & Equivariance

Convolution as group action. CNNs are equivariant maps — representation theory inside every convolutional network.

Strip a convolutional neural network down to its mathematical skeleton and you find one operation: convolution. In continuous form, (f ∗ g)(x) = ∫ f(t) g(x − t) dt. Discretized, it is the dot product of a small kernel with a sliding window of the input. That is the only thing a convolutional layer does — repeatedly, with learned kernels, at every position.
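
To make the discrete form concrete, here is a minimal sketch in numpy (the helper name conv1d_valid is just for this example): each output entry is one dot product between the flipped kernel and a sliding window of the input.

    import numpy as np

    def conv1d_valid(f, g):
        # discrete convolution (f * g)(x) = sum_t f(t) g(x - t), 'valid' positions only
        g_flipped = g[::-1]                    # true convolution flips the kernel
        n = len(f) - len(g) + 1
        return np.array([np.dot(f[x:x + len(g)], g_flipped) for x in range(n)])

    f = np.array([0., 0., 1., 2., 3., 2., 1., 0.])
    g = np.array([1., 2., 1.])                 # a small blur kernel
    print(conv1d_valid(f, g))                  # same result as np.convolve(f, g, mode='valid')
    print(np.convolve(f, g, mode='valid'))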

What makes this operation special is a single algebraic identity. Translate the input by s, then convolve, and you get the same answer as convolving first and translating the output by s: Ts(f ∗ g) = (Tsf) ∗ g. A CNN layer is a translation-equivariant linear map. In the language of representation theory, the convolutions are exactly the linear maps that commute with the action of the translation group ℤᵈ on functions: choose a kernel and you get an equivariant map, and every translation-equivariant linear map arises this way. Every other architecture choice in a CNN — pooling, padding, dilation — is a knob on this same group-theoretic structure.
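
The identity is easy to check numerically. The sketch below works on a circular domain (periodic boundary, so there are no edge effects) and computes circular convolution through the FFT; numpy is the only dependency, and the signal and kernel are arbitrary examples.

    import numpy as np

    def circ_conv(f, g):
        # circular convolution via the convolution theorem: transform, multiply, invert
        return np.fft.ifft(np.fft.fft(f) * np.fft.fft(g)).real

    rng = np.random.default_rng(0)
    f = rng.standard_normal(64)             # arbitrary signal
    g = np.zeros(64)
    g[:3] = [1.0, 2.0, 1.0]                 # small kernel, zero-padded to the signal length

    s = 5                                   # translation amount
    lhs = circ_conv(np.roll(f, s), g)       # translate, then convolve: (Tsf) * g
    rhs = np.roll(circ_conv(f, g), s)       # convolve, then translate: Ts(f * g)
    print(np.allclose(lhs, rhs))            # True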

Equivariance is not invariance. Equivariant means "the output transforms in a predictable way when the input transforms" — structure is preserved across the layer. Invariant means "the output does not change at all" — structure is discarded. Convolution is equivariant. Pooling adds a small dose of invariance. The full network composes them, and the field of geometric deep learning generalizes the same idea to other groups: rotations, permutations, gauges.

Interactive: 1D Sliding Window

Draw a signal, pick a kernel, and watch the kernel slide across one position at a time. The output at each position is a single dot product. Try the derivative kernel — it spikes wherever the input changes.

Click and drag on the input row to draw your own signal. The yellow window slides one position at a time, computing a dot product with the kernel.

At every position p, the kernel computes output[p] = Σₖ input[p+k] · kernel[k]. The derivative kernel [-1, 0, 1] spikes wherever the input changes — it is a discrete approximation of a first derivative. The blur kernel averages nearby values. Every CNN filter is one of these dot products, repeated at every position.
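
Since the formula above is cross-correlation (no kernel flip), np.correlate computes it directly. A quick check with the derivative kernel on a toy step signal:

    import numpy as np

    signal = np.array([1., 1., 1., 1., 5., 5., 5., 5.])   # flat, then a step up
    kernel = np.array([-1., 0., 1.])                      # discrete derivative

    # output[p] = sum_k signal[p + k] * kernel[k], 'valid' positions only
    print(np.correlate(signal, kernel, mode='valid'))
    # [0. 0. 4. 4. 0. 0.]  -- zero where the signal is flat, spikes around the step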

Interactive: 2D Image Filter

The same dot product, run over every position of a small grayscale image. Each kernel is a tiny pattern that the filter responds to: edges, blurs, sharpens. Stack many of these and you have a CNN.

input (48×48)

kernel (3×3):
 -1   0   1
 -2   0   2
 -1   0   1

Cyan cells are positive weights (this pixel adds to the output), magenta are negative (this pixel subtracts). Edge kernels arrange positive and negative on opposite sides — the filter responds maximally to brightness changes in that direction.

output (46×46)

The same dot product as in 1D, just slid over a 2D grid. The output is smaller because the kernel cannot extend past the edges (we use 'valid' padding). Stack many 2D convolutions and you get a CNN — every layer is just the same translation-equivariant linear operation, with learned kernels.
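
A sketch of the same computation with scipy.signal.correlate2d (cross-correlation, which is what deep learning libraries call convolution); the random 48×48 array is only a stand-in for the image, and the kernel is the edge filter shown above.

    import numpy as np
    from scipy.signal import correlate2d

    rng = np.random.default_rng(0)
    image = rng.standard_normal((48, 48))              # stand-in for a grayscale image

    kernel = np.array([[-1., 0., 1.],
                       [-2., 0., 2.],
                       [-1., 0., 1.]])                 # the edge kernel shown above

    edges = correlate2d(image, kernel, mode='valid')   # kernel never extends past the border
    print(image.shape, '->', edges.shape)              # (48, 48) -> (46, 46)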

Interactive: Translation Equivariance

Two pipelines, same final result. Translate-then-convolve equals convolve-then-translate. This is the algebraic identity that makes CNNs the right architecture for images — every learned filter automatically generalizes across the whole spatial domain.

Pipeline A  →  translate, then convolve

translated input

conv(translated)

Pipeline B  →  convolve, then translate

original input

translate(conv)

Move the shift sliders. The two pipelines give pixel-identical outputs in the interior (any tiny residual is from boundary clipping when the translation pushes content off the visible canvas). That is translation equivariance: Ts(f ∗ k) = (Tsf) ∗ k. It is the defining property of a CNN layer — and the reason convolution shows up in physics, signal processing, and ML wherever translations are a symmetry. See Representation Theory for the full algebraic picture.
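
The two pipelines can be reproduced in a few lines, assuming numpy and scipy. np.roll wraps content around the border (the analogue of clipping in the demo), so the comparison crops to the interior before measuring the residual.

    import numpy as np
    from scipy.signal import correlate2d

    rng = np.random.default_rng(0)
    img = rng.standard_normal((48, 48))
    k = np.array([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])

    dy, dx = 3, 5   # translation amounts
    a = correlate2d(np.roll(img, (dy, dx), axis=(0, 1)), k, mode='valid')   # translate, then convolve
    b = np.roll(correlate2d(img, k, mode='valid'), (dy, dx), axis=(0, 1))   # convolve, then translate

    m = max(dy, dx) + 1   # crop the rows/cols touched by the wrapped border
    print(np.abs(a[m:-m, m:-m] - b[m:-m, m:-m]).max())   # ~0: the pipelines agree in the interior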

Interactive: Pooling and Local Invariance

Max pooling keeps only the strongest activation in each window. The result is a smaller feature map that hardly changes when you nudge the input — pooling deliberately throws away precise spatial information in exchange for invariance.

feature map (8×8)

max pool (4×4)

Hover an output cell to see which input window it came from.

Invariance check  →  shift the input and re-pool

shifted feature map

max pool of shifted

Convolution is equivariant to translation — translate the input, the output translates the same way. Pooling adds a controlled amount of invariance: small input shifts often leave the pooled output (almost) unchanged, because the max in a 2×2 window doesn't care which of those 4 cells produced it. Equivariance preserves structure across layers; invariance discards what the network has decided not to care about. CNNs interleave both.
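
A minimal numpy sketch of 2×2 max pooling and the invariance check; the feature map is a single strong activation, chosen so the effect is easy to see.

    import numpy as np

    def max_pool_2x2(x):
        # non-overlapping 2x2 max pooling: keep only the strongest value in each window
        h, w = x.shape
        return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

    fmap = np.zeros((8, 8))
    fmap[3, 3] = 1.0                # one strong activation

    nudged = np.zeros((8, 8))
    nudged[2, 2] = 1.0              # the same activation, nudged by one pixel

    print(np.array_equal(fmap, nudged))                              # False: the raw maps differ
    print(np.array_equal(max_pool_2x2(fmap), max_pool_2x2(nudged)))  # True: the pooled maps agree

Move the activation two pixels instead of one, out of its 2×2 window, and the pooled maps stop agreeing: the invariance is only local, which is exactly the trade the network is making.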

The math objects

  • Convolution: (f ∗ g)(x) = Σₜ f(t) g(x − t). In every deep learning library, the operation is actually cross-correlation, which skips the kernel flip, but the algebraic property of equivariance is the same (see the check after this list).
  • Translation group ℤᵈ: the abelian group of integer shifts on a d-dimensional grid. It acts on the space of functions f : ℤᵈ → ℝ by (Tsf)(x) = f(x − s).
  • Equivariance: a map L is G-equivariant if L ∘ Tg = Tg ∘ L for every g ∈ G. Convolution is the prototypical example for the translation group. Schur's lemma plus a Fourier argument shows that every translation-equivariant linear map on functions is a convolution.
  • Equivariance vs invariance: equivariance means structure is preserved (the output transforms predictably). Invariance means structure is discarded (the output does not transform at all). A pooled-then-classified CNN goes equivariant → equivariant → ... → invariant.
  • Geometric deep learning: swap the translation group for rotations (SO(2), SO(3)), permutations (graph neural networks), or gauge transformations (mesh CNNs). The corresponding equivariant linear maps are no longer ordinary convolutions, but the recipe — pick your group, find its equivariant maps — is the same.
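
As noted in the first bullet, libraries implement cross-correlation rather than true convolution; the only difference is a kernel flip, which this small numpy check makes explicit.

    import numpy as np

    f = np.array([1., 2., 3., 4., 5.])
    k = np.array([1., 0., -1.])

    conv = np.convolve(f, k, mode='valid')          # true convolution: the kernel is flipped
    corr = np.correlate(f, k[::-1], mode='valid')   # cross-correlation with the pre-flipped kernel
    print(conv, corr, np.allclose(conv, corr))      # identical outputs: they differ only by the flip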

Key takeaways

  • Convolution is the dot product of a kernel with every sliding window of the input — the same operation in 1D, 2D, and beyond.
  • A CNN layer is a translation-equivariant linear map, and every translation-equivariant linear map is convolution with some kernel: Ts(f ∗ k) = (Tsf) ∗ k.
  • Equivariance preserves structure. Invariance discards it. Convolutions are equivariant; pooling moves the network toward invariance.
  • Generalize to other groups (rotations, permutations) and you get the field of geometric deep learning. The mathematics is representation theory.