Kernels & Feature Spaces

Positive-definite kernels are inner products in implicit Hilbert spaces. The kernel trick makes nonlinear linear.

A kernel is a function K(x, y) that measures similarity between two inputs. Mercer's theorem (and, for general positive-definite kernels, the Moore–Aronszajn theorem) says that whenever K is symmetric and positive-definite, there is some feature map φ — possibly into an infinite-dimensional space — for which K(x, y) = ⟨φ(x), φ(y)⟩. The kernel is literally an inner product in that implicit space.

The kernel trick is the engineering payoff: any algorithm that touches its inputs only through inner products (linear regression, SVMs, PCA, ridge regression) can be run inside the lifted feature space without ever computing φ. You replace every xᵀy with a K(x, y) call and inherit a nonlinear method for free. Three kernels cover most of the territory: linear (no lift), polynomial (1 + xᵀy)^d (finite lift to all monomials of degree ≤ d), and the RBF exp(−γ‖x − y‖²) (lift to an infinite-dimensional space of radial basis functions).
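
For concreteness, here is a minimal NumPy sketch of those three kernels as functions of a pair of input vectors; the defaults d = 3 and γ = 2.0 are illustrative choices, not canonical values:

    import numpy as np

    def linear_kernel(x, y):
        # No lift: the plain inner product x.y
        return x @ y

    def poly_kernel(x, y, d=3):
        # (1 + x.y)^d lifts to all monomials of degree <= d
        return (1.0 + x @ y) ** d

    def rbf_kernel(x, y, gamma=2.0):
        # exp(-gamma * ||x - y||^2) lifts to an infinite-dimensional space
        return np.exp(-gamma * np.sum((x - y) ** 2))

    a = np.array([1.0, 0.5])
    b = np.array([-0.3, 2.0])
    print(linear_kernel(a, b), poly_kernel(a, b), rbf_kernel(a, b))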

Interactive: Kernel Decision Boundary

Train a kernel-ridge classifier on a nonlinearly separable 2D dataset. Switch kernels and watch the boundary morph from a straight line to circles to lobes — all from the same inner-product machinery.

We train a kernel-ridge classifier, with coefficients α = (K + λI)⁻¹ y, on the displayed points. The coloured background is the continuous decision function f(x); the white curve is f(x) = 0. Switch from linear to polynomial or RBF and the boundary morphs from a straight line into circles, lobes, and contours — all without ever computing φ explicitly.
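
A minimal NumPy sketch of the same fit, assuming an RBF kernel with γ = 2, ridge parameter λ = 0.1, and a made-up disc-versus-ring dataset (the widget's own data and settings may differ):

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy nonlinearly separable data: inner disc (+1) vs. outer ring (-1).
    r = np.concatenate([0.5 * rng.random(40), 1.0 + 0.5 * rng.random(40)])
    theta = 2 * np.pi * rng.random(80)
    X = np.column_stack([r * np.cos(theta), r * np.sin(theta)])
    y = np.concatenate([np.ones(40), -np.ones(40)])

    gamma, lam = 2.0, 0.1

    def rbf_gram(A, B):
        # Pairwise squared distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
        d2 = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
        return np.exp(-gamma * d2)

    K = rbf_gram(X, X)
    alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)   # alpha = (K + lambda*I)^-1 y

    def f(X_new):
        # f(x) = sum_j alpha_j K(x, x_j); the decision boundary is f(x) = 0.
        return rbf_gram(X_new, X) @ alpha

    print("training accuracy:", np.mean(np.sign(f(X)) == y))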

Interactive: φ-Map Visualizer

Apply the explicit feature map φ(x, y) = (x, y, x² + y²) to concentric-circle data. Slide the lift from 0 to 1: the two classes that overlap radially in 2D separate into different heights in 3D, where a flat plane splits them.

Slide lift from 0 to 1 to apply the explicit feature map φ(x, y) = (x, y, x² + y²). At t = 0 the two concentric-circle classes lie flat and overlap radially — no straight line in 2D separates them. At t = 1 they sit at different heights, and a horizontal plane (the linear classifier in the lifted space) cleanly slices them apart. This is the geometric content of the kernel trick: nonlinear in 2D = linear in 3D.
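
The same lift in code, as a NumPy sketch on made-up concentric-circle data (the widget's point cloud and noise level are assumptions here):

    import numpy as np

    rng = np.random.default_rng(1)
    theta = 2 * np.pi * rng.random(100)
    inner = np.arange(100) < 50                      # first 50 points: inner circle

    # Two concentric circles at radii ~0.5 and ~1.5 with a little noise.
    r = np.where(inner, 0.5, 1.5) + 0.05 * rng.standard_normal(100)
    pts = np.column_stack([r * np.cos(theta), r * np.sin(theta)])

    # Explicit lift phi(x, y) = (x, y, x^2 + y^2): the third coordinate is the height.
    lifted = np.column_stack([pts, (pts**2).sum(axis=1)])

    # In 3D the horizontal plane z = 1 separates the two classes cleanly.
    print(np.all(lifted[inner, 2] < 1.0), np.all(lifted[~inner, 2] > 1.0))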

Interactive: Kernel Matrix Viewer

Pick a kernel and a small set of points; the n×n Gram matrix Kᵢⱼ = K(xᵢ, xⱼ) is the kernel's entire view of your data. Different kernels paint very different similarity geometries.

The widget shows the 16 sample points on the left and the 16×16 Gram matrix on the right; the default kernel is the RBF with γ = 2.00, K(x, y) = exp(−γ‖x − y‖²). Hover any cell in the matrix — the corresponding pair of points lights up on the left.

Every kernel is a similarity score on pairs of inputs. Switch kernels and the matrix repaints: the linear Gram matrix XXᵀ has rank at most 2 for 2D inputs, the polynomial kernel sharpens it nonlinearly, and the RBF gives a band-diagonal glow that decays with distance. Same points, three different geometries — none of which require knowing φ.
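
As a sketch of how those three Gram matrices could be computed, assuming 16 random 2D points (the viewer's actual points, degree, and γ may differ):

    import numpy as np

    rng = np.random.default_rng(2)
    X = rng.standard_normal((16, 2))             # 16 points in 2D

    G = X @ X.T                                  # linear Gram matrix
    K_poly = (1.0 + G) ** 3                      # polynomial kernel, degree 3
    d2 = (X**2).sum(1)[:, None] + (X**2).sum(1)[None, :] - 2 * G
    K_rbf = np.exp(-2.0 * d2)                    # RBF kernel, gamma = 2

    print(np.linalg.matrix_rank(G))              # at most 2: the linear kernel's view of 2D data
    print(K_rbf.diagonal().min())                # each point has similarity exp(0) = 1 with itself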

The math objects

  • Positive-definite kernel: a symmetric K such that the Gram matrix [K(xᵢ, xⱼ)] is positive semi-definite for any finite sample. This is the structural condition that makes K an inner product on some Hilbert space.
  • Feature map φ: the embedding into the implicit space. For polynomial degree 2 in 2D, φ has six coordinates (1, √2 x, √2 y, x², y², √2 xy); a numerical check appears after this list. For RBF, φ is infinite-dimensional — and yet K(x, y) is a single exponential.
  • Reproducing-kernel Hilbert space (RKHS): the completion of the span of {K(·, x)} with the inner product ⟨K(·, x), K(·, y)⟩ = K(x, y). Every kernel induces a unique RKHS; every kernel method is a linear method inside that RKHS.
  • Kernel ridge regression: α = (K + λI)⁻¹ y solves the regularized least-squares problem in the feature space. The prediction f(x) = Σⱼ αⱼ K(x, xⱼ) is a linear combination of kernel evaluations against the training set — no feature vectors ever materialise.
  • The kernel trick: wherever an algorithm uses ⟨x, y⟩ on raw inputs, replace it with K(x, y). The algorithm now runs in the (potentially infinite-dimensional) feature space at the cost of an n×n Gram matrix that grows with the dataset size, not the feature dimension.
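
As referenced in the feature-map item above, a quick NumPy check that the degree-2 polynomial kernel (1 + xᵀy)² equals the inner product under the six-coordinate map, for two arbitrarily chosen points:

    import numpy as np

    def poly2_kernel(x, y):
        # The kernel as a single expression on raw 2D inputs.
        return (1.0 + x @ y) ** 2

    def phi(v):
        # Explicit degree-2 feature map: (1, sqrt(2)x, sqrt(2)y, x^2, y^2, sqrt(2)xy)
        x, y = v
        return np.array([1.0, np.sqrt(2) * x, np.sqrt(2) * y,
                         x * x, y * y, np.sqrt(2) * x * y])

    a = np.array([0.7, -1.2])
    b = np.array([2.0, 0.3])
    print(poly2_kernel(a, b), phi(a) @ phi(b))   # identical up to floating point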

Key takeaways

  • A positive-definite kernel is an inner product in an implicit feature space: K(x, y) = ⟨φ(x), φ(y)⟩.
  • The kernel trick swaps every dot product for a kernel call — nonlinear algorithms emerge from linear ones.
  • Linear, polynomial (1 + xᵀy)^d, and RBF exp(−γ‖x − y‖²) cover most practical needs.
  • The Gram matrix K is the kernel's entire view of your data; algorithms depend on the data only through K.
  • Kernel ridge regression — α = (K + λI)⁻¹ y — gives a flexible nonlinear classifier from one matrix solve.