Loss Landscapes & Gradient Descent

A loss function is a surface — training is rolling down it. Compare gradient descent, momentum, and Newton on real landscapes.

Training a machine learning model means choosing parameters that make some loss function L as small as possible. If your model has two parameters, the loss is a surface over the plane — a landscape, with hills, valleys, ridges, and saddles. Training is the act of rolling a ball down that surface. The math is the geometry of the surface; the algorithm is the rolling rule.

The most basic rule is gradient descent: at each step, take the direction of steepest descent — the negative gradient of L — and move a small distance in that direction. The gradient is the local first derivative; it tells you only what is happening immediately around your current point. Curvature, the second derivative, is encoded in the Hessian. Different optimizers use different amounts of this geometric information, and that choice determines whether you smoothly reach the minimum, oscillate, escape a saddle, or diverge.
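In code the rule is a single line inside a loop. Below is a minimal sketch in plain NumPy, using an illustrative stretched quadratic, learning rate, and starting point rather than the widget's exact settings:

```python
import numpy as np

def loss(theta):
    # A stretched quadratic bowl: shallow along x, steep along y.
    x, y = theta
    return 0.5 * x**2 + 5.0 * y**2

def grad(theta):
    # Gradient of the loss above, computed by hand.
    x, y = theta
    return np.array([x, 10.0 * y])

theta = np.array([4.0, 1.5])   # starting point (illustrative)
lr = 0.15                      # learning rate (step size)

for step in range(50):
    theta = theta - lr * grad(theta)   # steepest-descent update

print(theta, loss(theta))      # ends close to the minimum at (0, 0)
```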

Interactive: Loss Surface Explorer

Pick a surface, click anywhere to set a starting point, and run gradient descent. The four surfaces — bowl, Rosenbrock, saddle, Himmelblau — each break gradient descent in a different way.
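For reference, the four surfaces are standard test functions. A sketch of plausible definitions (the bowl's stretch factor is an assumed choice; Rosenbrock, the saddle, and Himmelblau are the usual textbook forms):

```python
import numpy as np

def bowl(x, y):
    # Stretched quadratic: one shallow axis, one steep axis (stretch factor assumed).
    return 0.5 * x**2 + 5.0 * y**2

def rosenbrock(x, y):
    # Narrow curved valley; minimum at (1, 1).
    return (1 - x)**2 + 100.0 * (y - x**2)**2

def saddle(x, y):
    # Saddle point at the origin: curves up along x, down along y.
    return x**2 - y**2

def himmelblau(x, y):
    # Four minima of equal depth (all with value 0).
    return (x**2 + y - 11)**2 + (x + y**2 - 7)**2
```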

Bowl: a stretched quadratic. Gradient descent zig-zags down the long axis.

Click anywhere to set the starting point. Press Run to descend along the gradient. The dashed crosshair marks a known minimum (when one exists). Try setting the learning rate too high — gradient descent will overshoot or diverge.
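What counts as "too high" can be made concrete: on a quadratic, gradient descent diverges once the learning rate exceeds 2 divided by the largest curvature. A small sketch with illustrative numbers (not the widget's defaults):

```python
import numpy as np

def grad(theta):
    # Gradient of the stretched bowl 0.5*x**2 + 5.0*y**2; largest curvature is 10.
    x, y = theta
    return np.array([x, 10.0 * y])

def run(lr, steps=30):
    theta = np.array([3.0, 2.0])
    for _ in range(steps):
        theta = theta - lr * grad(theta)
    return theta

print(run(lr=0.15))   # below 2/10 = 0.2: converges toward (0, 0)
print(run(lr=0.25))   # above 0.2: the y-coordinate grows without bound
```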

Interactive: Optimizer Race — GD vs Momentum vs Newton

Same surface, same starting point, three optimizers running in parallel. Watch how using more geometric information — velocity for momentum, the full Hessian for Newton — changes the trajectory.

Gradient descent zig-zags down narrow valleys; momentum carries velocity through the bends; Newton's method uses the Hessian to jump straight toward the minimum. Click the canvas to relocate the starting point.
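A minimal sketch of the three update rules, using a strongly stretched quadratic as a stand-in for the narrow valley (the step size, momentum coefficient, and surface are illustrative assumptions; on a quadratic the Hessian is constant, so Newton lands on the minimum in one step):

```python
import numpy as np

H = np.array([[1.0, 0.0],
              [0.0, 100.0]])        # constant Hessian: a long, narrow quadratic valley

def loss(t): return 0.5 * t @ H @ t
def grad(t): return H @ t

start = np.array([4.0, 1.5])
lr, beta, steps = 0.018, 0.9, 200   # illustrative settings, not the widget's

gd = start.copy()
for _ in range(steps):              # gradient descent: zig-zags across the steep axis
    gd = gd - lr * grad(gd)

mom, v = start.copy(), np.zeros(2)
for _ in range(steps):              # momentum (heavy ball): velocity smooths the zig-zag
    v = beta * v - lr * grad(mom)
    mom = mom + v

newton = start - np.linalg.solve(H, grad(start))   # Newton: exact in one step on a quadratic

print(loss(gd), loss(mom), loss(newton))           # momentum well below GD; Newton at zero
```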

Interactive: Saddle Points & the Hessian

At any point on the surface, the eigenvalues of the Hessian classify the local geometry. Both positive: a minimum. Both negative: a maximum. Mixed signs: a saddle, where one direction attracts and another repels.
Example readout at (0.05, 0.05): λ₁ = 2.000, λ₂ = −2.000, trace H = 0.000, det H = −4.000 — a saddle point.

The signs of the Hessian's eigenvalues classify the local geometry. On the saddle x² − y², λ₁ = +2 and λ₂ = −2: gradient descent gets attracted along the x-axis but pushed away along the y-axis. Drop the ball exactly on the saddle point — gradient descent never moves. Drop it slightly off-axis and it slides away.
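That readout is easy to reproduce: for x² − y² the Hessian is constant, so the classification is the same at every point. A minimal sketch:

```python
import numpy as np

# Hessian of f(x, y) = x**2 - y**2 (constant everywhere).
H = np.array([[ 2.0,  0.0],
              [ 0.0, -2.0]])

eigs = np.linalg.eigvalsh(H)
print(eigs)                           # [-2.  2.]  -> mixed signs: a saddle
print(np.trace(H), np.linalg.det(H))  # 0.0 -4.0
```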

Interactive: SGD Noise on a Multi-Modal Landscape

Stochastic gradient descent uses a noisy estimate of the gradient. That noise is not just an artifact — it can carry the optimizer over a ridge into a deeper basin that batch GD would never find.

Himmelblau's function: four global minima of equal depth. Click anywhere to relocate the start.

Batch gradient descent commits to whichever minimum its initial gradient points toward. Stochastic gradient descent jitters along the way — and that jitter can carry it across ridges into a different basin entirely. Crank the noise slider up and watch SGD wander.
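A hedged sketch of the difference: batch GD follows the exact Himmelblau gradient, while SGD adds Gaussian noise to it. The start point, noise scale, step size, and seed below are illustrative; whether the noisy run actually hops basins, and into which one, depends on those choices:

```python
import numpy as np

def grad(t):
    # Analytic gradient of Himmelblau's function.
    x, y = t
    gx = 4 * x * (x**2 + y - 11) + 2 * (x + y**2 - 7)
    gy = 2 * (x**2 + y - 11) + 4 * y * (x + y**2 - 7)
    return np.array([gx, gy])

rng = np.random.default_rng(0)
lr, noise, steps = 0.01, 20.0, 500    # illustrative settings

batch = np.array([0.0, 0.0])
sgd   = np.array([0.0, 0.0])
for _ in range(steps):
    batch = batch - lr * grad(batch)                               # clean gradient
    sgd   = sgd - lr * (grad(sgd) + noise * rng.normal(size=2))    # noisy gradient

print(batch)   # deterministic: always the same minimum from this start
print(sgd)     # with enough noise, may land in a different one of the four minima
```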

The math objects

  • Loss function: a smooth scalar function L(θ) of the parameters θ. In practice θ is high-dimensional; we visualize 2D slices to build intuition that transfers.
  • Gradient ∇L: a vector field on parameter space pointing in the direction of steepest increase. The negative gradient is the steepest-descent direction.
  • Hessian H: the matrix of second partial derivatives. Its eigenvalues are the principal curvatures, and their signs classify minima, maxima, and saddles.
  • Newton step: θ ← θ − H⁻¹ ∇L. When the Hessian is positive definite, Newton's method converges quadratically — but each step costs a linear solve with the Hessian, which is why we settle for first-order methods at scale.
  • Momentum: a moving average of past gradients. It dampens oscillation across the steep directions of a long valley and speeds up motion along the gentle direction. Geometrically, it's a discretization of a second-order ODE — the trajectory of a damped massive ball (written out just after this list).
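Written out, one common form of the momentum update is v ← βv − η∇L(θ) followed by θ ← θ + v, where β is the momentum coefficient and η the learning rate (other variants fold η in differently). Eliminating v gives θₖ₊₁ − 2θₖ + θₖ₋₁ = −(1 − β)(θₖ − θₖ₋₁) − η∇L(θₖ): a discrete acceleration on the left, friction plus the gradient force on the right, which is exactly the damped-ball ODE in finite-difference form.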

Key takeaways

  • A loss function is a surface; training is descent.
  • The gradient gives the steepest direction; the Hessian gives the local shape.
  • Different optimizers exploit different amounts of curvature information.
  • Saddle points are where gradient descent slows — the escape direction is the negative-eigenvalue eigenvector of the Hessian.
  • Stochasticity is a feature, not a bug: SGD's noise lets it cross ridges that deterministic methods cannot.