Least squares is orthogonal projection onto a column space. Regularization is projection with a soft constraint.
Linear regression is usually introduced as “fit a line through some points by minimizing the sum of squared errors.” That is true, but it hides the geometry that makes the whole thing make sense. The right picture is this: your data y is a single vector living in a high-dimensional space — one coordinate per observation. The columns of the design matrix X are also vectors in that space, and they span a subspace called the column space. Linear regression finds the closest point in the column space to y. That closest point is the orthogonal projection of y onto the subspace.
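To make the picture concrete, here is a minimal NumPy sketch (the matrix, the data vector, and the variable names are invented for illustration): it builds a small design matrix, projects y onto its column space by solving least squares, and splits y into a fitted part and a residual.

```python
import numpy as np

# Toy setup, invented for illustration: 6 observations, 2 predictors.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))      # columns of X span a 2-D subspace of R^6
y = rng.normal(size=6)           # the data: one coordinate per observation

# Least squares picks the combination of the columns closest to y.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

y_hat = X @ beta                 # orthogonal projection of y onto the column space
r = y - y_hat                    # residual: the part of y the columns cannot reach
```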
Once you see it as projection, every algebraic identity becomes obvious. The normal equations Xᵀ X β = Xᵀ y are saying exactly that the residual r = y − X β is orthogonal to every column of X. Ridge regression — adding λ I to Xᵀ X — is projection with a soft constraint that pulls β toward zero. Even the bias-variance tradeoff becomes a story about how big a subspace you allow the projection to land in.
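The orthogonality claim is easy to check numerically. A small sketch, again with made-up data: the residual from a least-squares fit has zero dot product with every column of X, which is exactly what the normal equations say.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 3))                  # made-up design matrix
y = rng.normal(size=8)                       # made-up data vector

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
r = y - X @ beta                             # residual

# The residual is orthogonal to every column of X (zero up to floating point)...
print(X.T @ r)
# ...which is the same statement as the normal equations X'X beta = X'y.
print(np.allclose(X.T @ X @ beta, X.T @ y))  # True
```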
Drag any blue point. The green line is the OLS fit; pink dashes are residuals (vertical distance from each point to the line). The line minimizes the sum of squared residuals.
The OLS solution comes from the normal equations Xᵀ X β = Xᵀ y. Drag a point far from the trend and watch the line tilt — outliers swing the slope hard because the loss is quadratic in the residuals.
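The same solve in the simplest setting, a line through points (the numbers here are placeholders, not the demo's data): fit slope and intercept from the normal equations, then move one point far off the trend and watch the slope change.

```python
import numpy as np

# Five points near the line y = 2x + 1 (values chosen only to illustrate).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.0, 9.1])

def ols_line(x, y):
    """Slope and intercept from the normal equations for a line fit."""
    X = np.column_stack([x, np.ones_like(x)])   # design matrix [x, 1]
    return np.linalg.solve(X.T @ X, X.T @ y)    # (X'X)^{-1} X'y

print(ols_line(x, y))             # close to slope 2, intercept 1

y_outlier = y.copy()
y_outlier[-1] = 30.0              # drag one point far from the trend
print(ols_line(x, y_outlier))     # the slope swings hard: the loss is quadratic
```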
Here the column space is the standard xy-plane, so the projection of y is simply (y₁, y₂, 0).
The white arrow is the data y. The two emerald arrows are the columns of X — they span the green plane. The amber arrow ŷ is the projection of y onto that plane; the pink dashed segment is the residual r = y − ŷ and meets the plane at a right angle. Notice ‖y‖² = ‖ŷ‖² + ‖r‖², the Pythagorean theorem in disguise.
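That Pythagorean split is easy to verify numerically; a quick sketch with arbitrary numbers:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(3, 2))      # two columns spanning a plane in R^3
y = rng.normal(size=3)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta                 # projection onto the plane
r = y - y_hat                    # residual, perpendicular to the plane

# ||y||^2 = ||y_hat||^2 + ||r||^2, because y_hat and r are orthogonal.
print(np.allclose(y @ y, y_hat @ y_hat + r @ r))  # True
```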
Degree-9 polynomial fit to noisy data. The dashed white curve is the true generating function. As λ increases, ridge regression shrinks the wiggly coefficients toward zero — the green curve smooths from a manic overfit into a near-line.
Ridge replaces Xᵀ X with Xᵀ X + λ I in the normal equations, which is the same as minimizing ‖y − X β‖² + λ ‖β‖²: the fit trades closeness to y against a soft penalty on ‖β‖² — projection with a constraint. Slide λ from 10⁻⁸ to 10² and watch the high-degree coefficients collapse first; the model goes from memorizing noise to reproducing only what is reliably there.
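A sketch of the ridge solve behind a demo like this one, with an invented target function and invented λ values (the actual demo's settings may differ): build degree-9 polynomial features, solve (Xᵀ X + λ I) β = Xᵀ y, and watch the coefficient norm shrink as λ grows.

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(-1, 1, 30)
y = np.sin(2.5 * x) + 0.2 * rng.normal(size=x.size)   # noisy samples of a smooth truth

degree = 9
X = np.vander(x, degree + 1, increasing=True)          # columns 1, x, x^2, ..., x^9

def ridge(X, y, lam):
    """Solve (X'X + lam*I) beta = X'y."""
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

for lam in [1e-8, 1e-3, 1e0, 1e2]:
    beta = ridge(X, y, lam)
    # The norm collapses as lambda grows: the wiggly high-degree terms die first.
    print(f"lambda={lam:g}  ||beta|| = {np.linalg.norm(beta):.3f}")
```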