Linear Regression as Projection

Least squares is orthogonal projection onto a column space. Regularization is projection with a soft constraint.

Linear regression is usually introduced as “fit a line through some points by minimizing the sum of squared errors.” That is true, but it hides the geometry that makes everything else fall into place. The right picture is this: your data y is a single vector living in a high-dimensional space — one coordinate per observation. The columns of the design matrix X are also vectors in that space, and they span a subspace called the column space. Linear regression finds the point in the column space closest to y. That closest point is the orthogonal projection of y onto the subspace.

Once you see it as projection, every algebraic identity becomes obvious. The normal equations Xᵀ X β = Xᵀ y are saying exactly that the residual r = y − X β is orthogonal to every column of X. Ridge regression — adding λ I to Xᵀ X — is projection with a soft constraint that pulls β toward zero. Even the bias-variance tradeoff becomes a story about how big a subspace you allow the projection to land in.
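
To make this concrete, here is a minimal numpy sketch (the data and variable names are my own, for illustration): build a small design matrix, solve the normal equations, and confirm the residual is orthogonal to every column.

```python
import numpy as np

rng = np.random.default_rng(0)

# A small design matrix: an intercept column and one predictor (illustrative data).
n = 9
x = np.linspace(0.0, 1.0, n)
X = np.column_stack([np.ones(n), x])   # columns span a 2D subspace of R^n
y = 0.5 + 2.0 * x + rng.normal(scale=0.2, size=n)

# Normal equations: X^T X beta = X^T y.
beta = np.linalg.solve(X.T @ X, X.T @ y)

y_hat = X @ beta   # orthogonal projection of y onto the column space of X
r = y - y_hat      # residual

# The projection condition: the residual is orthogonal to every column.
print(X.T @ r)     # ~ [0, 0] up to floating-point noise
```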

Interactive: Drag-to-Fit Regression Line

A 2D scatter of nine points with the OLS line drawn live in green. The pink dashes are residuals — vertical distances from each point to the line. Drag any blue point and watch the line tilt to keep the sum of squared residuals minimal.

The OLS solution comes from the normal equations Xᵀ X β = Xᵀ y. Drag a point far from the trend and watch the line tilt — outliers swing the slope hard because the loss is quadratic in the residuals.
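
That sensitivity is a three-line experiment in code (a sketch with made-up points, not the widget's data): fit a line, drag one point far off the trend, and refit.

```python
import numpy as np

def ols_line(x, y):
    """Intercept and slope of the least-squares line through (x, y)."""
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.solve(X.T @ X, X.T @ y)

x = np.arange(9, dtype=float)
y = 1.0 + 0.5 * x        # nine points exactly on a line
print(ols_line(x, y))    # [1.0, 0.5]

y[-1] += 10.0            # drag one point far above the trend
print(ols_line(x, y))    # slope jumps to ~1.17: the quadratic loss
                         # punishes large residuals hard
```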

Interactive: Projection onto a Column Space

The full geometric picture in 3D. y is the white arrow; the two emerald arrows are the columns of X, spanning the green plane. The amber arrow ŷ is the projection; the pink dashed segment is the residual, meeting the plane at a right angle.

Here the column space is the standard xy-plane, so the projection of y is just (y₁, y₂, 0). For the configuration shown:

y = (1.60, 1.40, 2.20), ŷ = (1.60, 1.40, 0.00), r = (0.00, 0.00, 2.20)
‖y‖ = 3.059, ‖ŷ‖ = 2.126, ‖r‖ = 2.200
r · c₁ = 0 and r · c₂ = 0: the residual is orthogonal to both columns.

The white arrow is the data y. The two emerald arrows are the columns of X — they span the green plane. The amber arrow ŷ is the projection of y onto that plane; the pink dashed segment is the residual r = y − ŷ and meets the plane at a right angle. Notice ‖y‖² = ‖ŷ‖² + ‖r‖², the Pythagorean theorem in disguise.
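
The widget's readout is easy to reproduce (a short sketch; the two columns are my explicit basis for the xy-plane):

```python
import numpy as np

# Columns of X: the standard basis of the xy-plane inside R^3.
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])
y = np.array([1.60, 1.40, 2.20])

# Hat matrix H = X (X^T X)^{-1} X^T; here it simply zeroes the z-coordinate.
H = X @ np.linalg.inv(X.T @ X) @ X.T
y_hat = H @ y   # (1.60, 1.40, 0.00)
r = y - y_hat   # (0.00, 0.00, 2.20)

print(r @ X[:, 0], r @ X[:, 1])                  # 0.0 0.0: right angle to the plane
print(np.isclose(y @ y, y_hat @ y_hat + r @ r))  # True: ‖y‖² = ‖ŷ‖² + ‖r‖²
```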

Interactive: Ridge Regression Tames a Degree-9 Polynomial

Same noisy data, same overcomplete polynomial basis; the dashed white curve is the true generating function. As you raise λ from 10⁻⁸ to 10², ridge shrinks the wiggly high-degree coefficients toward zero, smoothing the green curve from a manic overfit into the signal underneath.

Bar chart: coefficient magnitudes |βₖ| for the basis terms x¹ … x⁹, scaled by the largest.

Ridge replaces Xᵀ X with Xᵀ X + λ I in the normal equations. Equivalently, it minimizes ‖y − X β‖² + λ ‖β‖² — projection with a soft penalty on the size of the coefficients. Slide λ from 10⁻⁸ to 10² and watch the high-degree coefficients collapse first; the model goes from memorizing noise to reproducing only what is reliably there.
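
A compact version of the same experiment as a sketch (synthetic data with a sine as the true function, standing in for the widget's): fit the degree-9 basis at several λ and watch ‖β‖ collapse.

```python
import numpy as np

rng = np.random.default_rng(1)

# Noisy samples of a smooth target on [0, 1] (stand-in for the widget's data).
n = 30
x = np.sort(rng.uniform(0.0, 1.0, n))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=n)

# Overcomplete degree-9 polynomial basis: columns x^1 ... x^9.
X = np.column_stack([x**k for k in range(1, 10)])

def ridge(X, y, lam):
    """beta = (X^T X + lam*I)^(-1) X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

for lam in (1e-8, 1e-4, 1e0, 1e2):
    beta = ridge(X, y, lam)
    print(f"lambda = {lam:7.0e}   ||beta|| = {np.linalg.norm(beta):12.2f}")
# ||beta|| falls by orders of magnitude as lambda grows; the wild
# high-degree coefficients are the first to shrink toward zero.
```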

The math objects

  • Design matrix X: an n × p matrix whose columns are the predictor vectors. Each column is one feature, viewed across all n observations. The span of those columns is a p-dimensional subspace of Rⁿ — the column space.
  • Target y: a single vector in Rⁿ. The fitted ŷ = X β̂ is forced to live in the column space; the residual r = y − ŷ is whatever is left over.
  • Normal equations: Xᵀ X β = Xᵀ y. Rearranged, they say Xᵀ (y − X β) = 0 — the residual is orthogonal to every column of X. That is the geometric condition for projection.
  • Hat matrix: H = X (Xᵀ X)⁻¹ Xᵀ. It satisfies H² = H (idempotent) and Hᵀ = H (symmetric) — the algebraic fingerprint of an orthogonal projection. Then ŷ = H y. Both properties are checked numerically in the sketch after this list.
  • Ridge solution: β̂_ridge = (Xᵀ X + λ I)⁻¹ Xᵀ y. The added λ I makes the system always invertible, even when X has near-collinear columns, and shrinks every coefficient toward zero by an amount that grows with λ.
  • Pythagorean identity: ‖y‖² = ‖ŷ‖² + ‖r‖². The total sum of squares splits cleanly into “explained by the model” plus “orthogonal residual.” With an intercept and centered data, this decomposition is the source of R² and analysis of variance.
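
The hat-matrix fingerprints above are one-liners to verify (a sketch on a random full-rank design):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(9, 3))           # random design: n = 9 observations, p = 3 columns

H = X @ np.linalg.inv(X.T @ X) @ X.T  # hat matrix: projects onto the column space

print(np.allclose(H @ H, H))          # True: idempotent, projecting twice changes nothing
print(np.allclose(H, H.T))            # True: symmetric, so the projection is orthogonal
print(np.isclose(np.trace(H), 3.0))   # True: trace equals the subspace dimension p
```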

Key takeaways

  • The fitted vector ŷ is the orthogonal projection of y onto the column space of X.
  • The normal equations Xᵀ X β = Xᵀ y are the algebraic statement “the residual is orthogonal to every column.”
  • ‖y‖² = ‖ŷ‖² + ‖r‖² — the model and the residual decompose Pythagorean-style.
  • Ridge regression replaces Xᵀ X with Xᵀ X + λ I — projection with a soft penalty that shrinks coefficients.
  • The bias-variance tradeoff is a story about the size of the column space the projection is allowed to use.