Linear Regression as Projection

Least squares is orthogonal projection onto a column space. Regularization is projection with a soft constraint.

Linear regression is usually introduced as “fit a line through some points by minimizing the sum of squared errors.” That is true, but it hides the geometry that makes everything else fall into place. The right picture is this: your data y is a single vector living in a high-dimensional space — one coordinate per observation. The columns of the design matrix X are also vectors in that space, and they span a subspace called the column space. Linear regression finds the point in the column space closest to y. That closest point is the orthogonal projection of y onto the subspace.

Once you see it as projection, every algebraic identity becomes obvious. The normal equations Xᵀ X β = Xᵀ y are saying exactly that the residual r = y − X β is orthogonal to every column of X. Ridge regression — adding λ I to Xᵀ X — is projection with a soft constraint that pulls β toward zero. Even the bias-variance tradeoff becomes a story about how big a subspace you allow the projection to land in.
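
To make this concrete, here is a minimal numpy sketch (the data and variable names are my own, for illustration): build a small design matrix, solve the normal equations, and confirm the residual is orthogonal to every column.

```python
import numpy as np

rng = np.random.default_rng(0)

# A small design matrix: an intercept column and one predictor (illustrative data).
n = 9
x = np.linspace(0.0, 1.0, n)
X = np.column_stack([np.ones(n), x])   # columns span a 2D subspace of R^n
y = 0.5 + 2.0 * x + rng.normal(scale=0.2, size=n)

# Normal equations: X^T X beta = X^T y.
beta = np.linalg.solve(X.T @ X, X.T @ y)

y_hat = X @ beta   # orthogonal projection of y onto the column space of X
r = y - y_hat      # residual

# The projection condition: the residual is orthogonal to every column.
print(X.T @ r)     # ~ [0, 0] up to floating-point noise
```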

Interactive: Drag-to-Fit Regression Line

A 2D scatter of nine points with the OLS line drawn live in green. The pink dashes are residuals — vertical distances from each point to the line. Drag any blue point and watch the line tilt to keep the sum of squared residuals minimal.

The OLS solution comes from the normal equations Xᵀ X β = Xᵀ y. Drag a point far from the trend and watch the line tilt — outliers swing the slope hard because the loss is quadratic in the residuals.
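
That sensitivity is a three-line experiment in code (a sketch with made-up points, not the widget's data): fit a line, drag one point far off the trend, and refit.

```python
import numpy as np

def ols_line(x, y):
    """Intercept and slope of the least-squares line through (x, y)."""
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.solve(X.T @ X, X.T @ y)

x = np.arange(9, dtype=float)
y = 1.0 + 0.5 * x        # nine points exactly on a line
print(ols_line(x, y))    # [1.0, 0.5]

y[-1] += 10.0            # drag one point far above the trend
print(ols_line(x, y))    # slope jumps to ~1.17: the quadratic loss
                         # punishes large residuals hard
```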

Interactive: Projection onto a Column Space

The full geometric picture in 3D. y is the white arrow; the two emerald arrows are the columns of X, spanning the green plane. The amber arrow ŷ is the projection; the pink dashed segment is the residual, meeting the plane at a right angle.

Here the column space is the standard xy-plane, so the projection of y is just (y₁, y₂, 0). For the configuration shown:

y = (1.60, 1.40, 2.20), ŷ = (1.60, 1.40, 0.00), r = (0.00, 0.00, 2.20)
‖y‖ = 3.059, ‖ŷ‖ = 2.126, ‖r‖ = 2.200
r · c₁ = 0 and r · c₂ = 0: the residual is orthogonal to both columns.

The white arrow is the data y. The two emerald arrows are the columns of X — they span the green plane. The amber arrow ŷ is the projection of y onto that plane; the pink dashed segment is the residual r = y − ŷ and meets the plane at a right angle. Notice ‖y‖² = ‖ŷ‖² + ‖r‖², the Pythagorean theorem in disguise.
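
The widget's readout is easy to reproduce (a short sketch; the two columns are my explicit basis for the xy-plane):

```python
import numpy as np

# Columns of X: the standard basis of the xy-plane inside R^3.
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])
y = np.array([1.60, 1.40, 2.20])

# Hat matrix H = X (X^T X)^{-1} X^T; here it simply zeroes the z-coordinate.
H = X @ np.linalg.inv(X.T @ X) @ X.T
y_hat = H @ y   # (1.60, 1.40, 0.00)
r = y - y_hat   # (0.00, 0.00, 2.20)

print(r @ X[:, 0], r @ X[:, 1])                  # 0.0 0.0: right angle to the plane
print(np.isclose(y @ y, y_hat @ y_hat + r @ r))  # True: ‖y‖² = ‖ŷ‖² + ‖r‖²
```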

Interactive: Ridge Regression Tames a Degree-9 Polynomial

Same noisy data, same overcomplete polynomial basis; the dashed white curve is the true generating function. As you raise λ from 10⁻⁸ to 10², ridge shrinks the wiggly high-degree coefficients toward zero, smoothing the green curve from a manic overfit into the signal underneath.

Bar chart: coefficient magnitudes |βₖ| for the basis terms x¹ … x⁹, scaled by the largest.

Ridge replaces Xᵀ X with Xᵀ X + λ I in the normal equations. Equivalently, it minimizes ‖y − X β‖² + λ ‖β‖² — projection with a soft penalty on the size of the coefficients. Slide λ from 10⁻⁸ to 10² and watch the high-degree coefficients collapse first; the model goes from memorizing noise to reproducing only what is reliably there.
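
A compact version of the same experiment as a sketch (synthetic data with a sine as the true function, standing in for the widget's): fit the degree-9 basis at several λ and watch ‖β‖ collapse.

```python
import numpy as np

rng = np.random.default_rng(1)

# Noisy samples of a smooth target on [0, 1] (stand-in for the widget's data).
n = 30
x = np.sort(rng.uniform(0.0, 1.0, n))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=n)

# Overcomplete degree-9 polynomial basis: columns x^1 ... x^9.
X = np.column_stack([x**k for k in range(1, 10)])

def ridge(X, y, lam):
    """beta = (X^T X + lam*I)^(-1) X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

for lam in (1e-8, 1e-4, 1e0, 1e2):
    beta = ridge(X, y, lam)
    print(f"lambda = {lam:7.0e}   ||beta|| = {np.linalg.norm(beta):12.2f}")
# ||beta|| falls by orders of magnitude as lambda grows; the wild
# high-degree coefficients are the first to shrink toward zero.
```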

The math objects

  • Design matrix X: an n × p matrix whose columns are the predictor vectors. Each column is one feature, viewed across all n observations. The span of those columns is a p-dimensional subspace of Rⁿ — the column space.
  • Target y: a single vector in Rⁿ. The fitted ŷ = X β̂ is forced to live in the column space; the residual r = y − ŷ is whatever is left over.
  • Normal equations: Xᵀ X β = Xᵀ y. Rearranged, they say Xᵀ (y − X β) = 0 — the residual is orthogonal to every column of X. That is the geometric condition for projection.
  • Hat matrix: H = X (Xᵀ X)⁻¹ Xᵀ. It satisfies H² = H (idempotent) and Hᵀ = H (symmetric) — the algebraic fingerprint of an orthogonal projection. Then ŷ = H y. Both properties are checked numerically in the sketch after this list.
  • Ridge solution: β̂_ridge = (Xᵀ X + λ I)⁻¹ Xᵀ y. The added λ I makes the system always invertible, even when X has near-collinear columns, and shrinks every coefficient toward zero by an amount that grows with λ.
  • Pythagorean identity: ‖y‖² = ‖ŷ‖² + ‖r‖². The total sum of squares splits cleanly into “explained by the model” plus “orthogonal residual.” With an intercept and centered data, this decomposition is the source of R² and analysis of variance.
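
The hat-matrix fingerprints above are one-liners to verify (a sketch on a random full-rank design):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(9, 3))           # random design: n = 9 observations, p = 3 columns

H = X @ np.linalg.inv(X.T @ X) @ X.T  # hat matrix: projects onto the column space

print(np.allclose(H @ H, H))          # True: idempotent, projecting twice changes nothing
print(np.allclose(H, H.T))            # True: symmetric, so the projection is orthogonal
print(np.isclose(np.trace(H), 3.0))   # True: trace equals the subspace dimension p
```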

Key takeaways

  • The fitted vector ŷ is the orthogonal projection of y onto the column space of X.
  • The normal equations Xᵀ X β = Xᵀ y are the algebraic statement “the residual is orthogonal to every column.”
  • ‖y‖² = ‖ŷ‖² + ‖r‖² — the model and the residual decompose Pythagorean-style.
  • Ridge regression replaces Xᵀ X with Xᵀ X + λ I — projection with a soft penalty that shrinks coefficients.
  • The bias-variance tradeoff is a story about the size of the column space the projection is allowed to use.