Convolution as a group action. CNNs are equivariant maps — representation theory inside every neural network.
Strip a convolutional neural network down to its mathematical skeleton and you find one operation: convolution. In continuous form, (f ∗ g)(x) = ∫ f(t) g(x − t) dt. Discretized, it is the dot product of a small kernel with a sliding window of the input. That is the only thing a convolutional layer does — repeatedly, with learned kernels, at every position.
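To make the sliding-window picture concrete, here is a minimal NumPy sketch of 1D 'valid' convolution (the helper name conv1d_valid is ours, not from any library or demo). It flips the kernel so the result matches the integral definition above:

```python
import numpy as np

def conv1d_valid(signal, kernel):
    """Discrete convolution as a sliding dot product ('valid' positions only)."""
    k = len(kernel)
    # Flip the kernel so this matches (f * g)(x) = sum_t f(t) g(x - t)
    flipped = kernel[::-1]
    return np.array([np.dot(signal[p:p + k], flipped)
                     for p in range(len(signal) - k + 1)])

f = np.array([0., 0., 1., 1., 1., 0., 0.])
g = np.array([1., 2., 1.])            # a small blur kernel
print(conv1d_valid(f, g))             # same as np.convolve(f, g, mode='valid')
```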
What makes this operation special is a single algebraic identity. Translate the input by s, then convolve, and you get the same answer as convolving first and translating the output by s: Tₛ(f ∗ g) = (Tₛf) ∗ g. A CNN layer is a translation-equivariant linear map. In the language of representation theory, a linear map on functions commutes with the action of the translation group ℤᵈ exactly when it is a convolution; the kernel is the only remaining freedom. Every other architecture choice in a CNN — pooling, padding, dilation — is a knob on this same group-theoretic structure.
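The identity is easy to check numerically. A sketch using circular convolution, so that translation (np.roll) is an exact symmetry with no boundary effects; circ_conv is an illustrative helper, not a library call:

```python
import numpy as np

rng = np.random.default_rng(0)
f = rng.normal(size=32)               # arbitrary signal
g = np.array([-1., 0., 1.])           # any kernel

def circ_conv(f, g):
    """Circular convolution: out[x] = sum_t g(t) f(x - t), wrapping at the ends."""
    out = np.zeros_like(f)
    for t, gt in enumerate(g):
        out += gt * np.roll(f, t)
    return out

s = 5                                 # shift amount
lhs = np.roll(circ_conv(f, g), s)     # convolve, then translate
rhs = circ_conv(np.roll(f, s), g)     # translate, then convolve
print(np.allclose(lhs, rhs))          # True: T_s(f * g) = (T_s f) * g
```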
Equivariance is not invariance. Equivariant means "the output transforms in a predictable way when the input transforms" — structure is preserved across the layer. Invariant means "the output does not change at all" — structure is discarded. Convolution is equivariant. Pooling adds a small dose of invariance. The full network composes them, and the field of geometric deep learning generalizes the same idea to other groups: rotations, permutations, gauges.
Click and drag on the input row to draw your own signal. The yellow window slides one position at a time, computing a dot product with the kernel.
At every position p, the kernel computes output[p] = Σₖ input[p+k] · kernel[k]. (Strictly, this is cross-correlation, i.e. convolution with a flipped kernel; deep-learning convention calls it convolution, and translation equivariance holds either way.) The derivative kernel [-1, 0, 1] spikes wherever the input changes — it is a discrete approximation of the first derivative. The blur kernel averages nearby values. Every CNN filter is one of these dot products, repeated at every position.
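A quick sketch of both kernels acting on a step signal, using the widget's formula directly (the helper name xcorr1d is ours):

```python
import numpy as np

def xcorr1d(signal, kernel):
    """The widget's formula: output[p] = sum_k input[p+k] * kernel[k]."""
    k = len(kernel)
    return np.array([np.dot(signal[p:p + k], kernel)
                     for p in range(len(signal) - k + 1)])

step = np.array([0., 0., 0., 1., 1., 1.])
print(xcorr1d(step, np.array([-1., 0., 1.])))    # spikes at the jump: [0. 1. 1. 0.]
print(xcorr1d(step, np.array([1., 1., 1.]) / 3)) # blur: [0. 0.333 0.667 1.]
```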
input (48×48)
kernel (3×3)
Cyan cells are positive weights (this pixel adds to the output), magenta are negative (this pixel subtracts). Edge kernels arrange positive and negative on opposite sides — the filter responds maximally to brightness changes in that direction.
output (46×46)
The same dot product as in 1D, just slid over a 2D grid. The output is smaller because the kernel cannot extend past the edges (we use 'valid' padding). Stack many 2D convolutions and you get a CNN — every layer is just the same translation-equivariant linear operation, with learned kernels.
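In code, the 2D case is the same loop with one more index. A NumPy sketch matching the widget's shapes (conv2d_valid is an illustrative helper; a real CNN would call an optimized library routine):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """2D 'valid' cross-correlation: the same dot product slid over a grid."""
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.random.default_rng(0).normal(size=(48, 48))
edge = np.array([[-1., 0., 1.],
                 [-1., 0., 1.],
                 [-1., 0., 1.]])       # positive and negative on opposite sides
print(conv2d_valid(image, edge).shape) # (46, 46): the kernel can't cross the edge
```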
Pipeline A → translate, then convolve
translated input
conv(translated)
Pipeline B → convolve, then translate
original input
translate(conv)
Move the shift sliders. The two pipelines give pixel-identical outputs in the interior (any tiny residual is from boundary clipping when the translation pushes content off the visible canvas). That is translation equivariance: Tₛ(f ∗ k) = (Tₛf) ∗ k. It is the defining property of a CNN layer — and the reason convolution shows up in physics, signal processing, and ML wherever translations are a symmetry. See Representation Theory for the full algebraic picture.
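The same check in 2D, assuming SciPy's convolve2d with periodic ('wrap') boundaries so the identity holds exactly and no clipping residual appears:

```python
import numpy as np
from scipy.signal import convolve2d

img = np.random.default_rng(1).normal(size=(16, 16))
ker = np.array([[0., 1., 0.],
                [1., -4., 1.],
                [0., 1., 0.]])

dy, dx = 3, 5                         # the demo's shift sliders
# Pipeline A: translate, then convolve
a = convolve2d(np.roll(img, (dy, dx), axis=(0, 1)), ker,
               mode='same', boundary='wrap')
# Pipeline B: convolve, then translate
b = np.roll(convolve2d(img, ker, mode='same', boundary='wrap'),
            (dy, dx), axis=(0, 1))
print(np.allclose(a, b))              # True, pixel for pixel
```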
feature map (8×8)
max pool (4×4)
Hover an output cell to see which input window it came from.
Invariance check → shift the input and re-pool
shifted feature map
max pool of shifted
Convolution is equivariant to translation — translate the input, the output translates the same way. Pooling adds a controlled amount of invariance: small input shifts often leave the pooled output (almost) unchanged, because the max in a 2×2 window doesn't care which of those 4 cells produced it. Equivariance preserves structure across layers; invariance discards what the network has decided not to care about. CNNs interleave both.
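A sketch of that window effect in NumPy (maxpool2x2 is our illustrative helper): a one-pixel shift that stays inside a pooling window leaves the output unchanged, while a shift that crosses a window boundary translates the output instead:

```python
import numpy as np

def maxpool2x2(x):
    """Non-overlapping 2x2 max pooling (assumes even height and width)."""
    H, W = x.shape
    return x.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

fm = np.zeros((8, 8))
fm[2, 2] = 1.0                        # a single strong activation
pooled = maxpool2x2(fm)

fm_shift1 = np.roll(fm, 1, axis=1)    # 1-pixel shift stays in the same 2x2 window
print(np.array_equal(pooled, maxpool2x2(fm_shift1)))  # True: invariant

fm_shift2 = np.roll(fm, 2, axis=1)    # 2-pixel shift crosses a window boundary
print(np.array_equal(pooled, maxpool2x2(fm_shift2)))  # False: output translates
```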