A loss function is a surface — training is rolling down it. Compare gradient descent, momentum, and Newton on real landscapes.
Training a machine learning model means choosing parameters that make some loss function L as small as possible. If your model has two parameters, the loss is a surface over the plane — a landscape, with hills, valleys, ridges, and saddles. Training is the act of rolling a ball down that surface. The math is the geometry of the surface; the algorithm is the rolling rule.
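To make the landscape concrete, here is a minimal sketch in code: a hypothetical two-parameter line fit with mean squared error, evaluated on a grid of parameter values so the grid itself is the surface. The data, names, and ranges are invented purely for illustration.

```python
import numpy as np

# A hypothetical two-parameter model: y = w * x + b, with mean-squared-error loss.
# The toy data below exists only so that L(w, b) defines a surface over the plane.
xs = np.array([0.0, 1.0, 2.0, 3.0])
ys = np.array([1.0, 3.0, 5.0, 7.0])   # generated by y = 2x + 1

def loss(w, b):
    """Mean squared error of the line w*x + b against the toy data."""
    return np.mean((w * xs + b - ys) ** 2)

# Evaluating the loss on a grid of (w, b) values gives the landscape itself.
ws = np.linspace(-1, 5, 61)
bs = np.linspace(-2, 4, 61)
surface = np.array([[loss(w, b) for w in ws] for b in bs])

print(surface.min())       # lowest sampled point of the surface
print(loss(2.0, 1.0))      # the true minimum: exactly 0 at (w, b) = (2, 1)
```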
The most basic rule is gradient descent: at each step, take the direction of steepest descent — the negative gradient of L — and move a small distance in that direction. The gradient is the vector of first derivatives; it tells you only what is happening immediately around your current point. Curvature, the second-derivative information, is encoded in the Hessian. Different optimizers use different amounts of this geometric information, and that choice determines whether you smoothly reach the minimum, oscillate, escape a saddle, or diverge.
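A minimal sketch of that rule on a simple bowl whose gradient has a closed form; the bowl, learning rate, and starting point are arbitrary choices for illustration, not the surface used in the figures.

```python
import numpy as np

def L(p):
    """A simple bowl, L(x, y) = x^2 + y^2 (illustrative only)."""
    return p[0] ** 2 + p[1] ** 2

def grad_L(p):
    """The gradient: first-order information only, no curvature."""
    return np.array([2 * p[0], 2 * p[1]])

def gradient_descent(p0, lr=0.1, steps=100):
    """The basic rule: repeatedly step a small distance against the gradient."""
    p = np.array(p0, dtype=float)
    for _ in range(steps):
        p = p - lr * grad_L(p)
    return p

print(gradient_descent([3.0, -2.0]))   # converges toward the minimum at (0, 0)
```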
A stretched quadratic. Gradient descent zig-zags down the long axis.
Click anywhere to set the starting point. Press Run to descend along the gradient. The dashed crosshair marks a known minimum (when one exists). Try setting the learning rate too high — gradient descent will overshoot or diverge.
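The same experiment can be sketched in code. The quadratic below is an arbitrary stand-in for the widget's surface (curvatures 1 and 30), but it shows both behaviours: just below the stability threshold the iterates zig-zag yet converge; just above it they blow up.

```python
import numpy as np

def grad(p):
    """Gradient of a stretched quadratic, L(x, y) = 0.5*(x**2 + 30*y**2)."""
    return np.array([p[0], 30.0 * p[1]])

def descend(lr, steps=200):
    p = np.array([4.0, 1.0])
    path = [p.copy()]
    for _ in range(steps):
        p = p - lr * grad(p)
        path.append(p.copy())
    return np.array(path)

# Curvature is 1 along the long axis and 30 along the short one.  On a quadratic,
# gradient descent is stable only while lr < 2 / (largest curvature) = 2/30.
ok = descend(lr=0.065)
print(ok[:5, 1])    # the y-coordinate flips sign every step: the zig-zag across the valley
print(ok[-1])       # ...yet it still creeps toward the minimum at (0, 0)

bad = descend(lr=0.07, steps=100)
print(bad[-1])      # just past the threshold: the y-coordinate has exploded
```

The 2/(largest curvature) threshold is exact only for quadratics, but the overshoot intuition carries over locally to any smooth surface.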
Same surface, same starting point, three optimizers. Gradient descent zig-zags down narrow valleys; momentum carries velocity through the bends; Newton's method uses the Hessian to jump straight toward the minimum. Click the canvas to relocate the starting point.
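A rough sketch of the three update rules side by side on a stretched quadratic of the same flavour; the curvatures and hyperparameters are illustrative choices, not the widget's actual settings.

```python
import numpy as np

# A stretched quadratic: L(p) = 0.5 * p^T A p with curvatures 1 and 30.
A = np.diag([1.0, 30.0])

def grad(p):
    return A @ p

def hessian(p):
    return A                       # constant for a quadratic

def gradient_descent(p, lr=0.065, steps=60):
    for _ in range(steps):
        p = p - lr * grad(p)
    return p

def momentum(p, lr=0.095, beta=0.48, steps=60):
    v = np.zeros_like(p)
    for _ in range(steps):
        v = beta * v - lr * grad(p)    # velocity remembers previous gradients
        p = p + v
    return p

def newton(p, steps=1):
    for _ in range(steps):
        p = p - np.linalg.solve(hessian(p), grad(p))   # rescale the step by curvature
    return p

start = np.array([4.0, 1.0])
print(gradient_descent(start.copy()))  # after 60 steps, still visibly short of the minimum
print(momentum(start.copy()))          # the same 60 steps end orders of magnitude closer
print(newton(start.copy()))            # one curvature-corrected step lands exactly on (0, 0)
```

The Newton step is exact here only because the surface is quadratic; on a general loss it would be repeated, and the Hessian solve is what makes it expensive in high dimensions.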
The signs of the Hessian's eigenvalues classify the local geometry. On the saddle x² − y², λ₁ = +2 and λ₂ = −2: gradient descent is pulled toward the saddle along the x-direction but pushed away from it along the y-direction. Drop the ball exactly on the saddle point — gradient descent never moves, because the gradient there is zero. Drop it slightly off-axis and it slides away.
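A small sketch of that classification and of the two drop experiments, using the exact gradient and (constant) Hessian of x² − y²; the learning rate and step count are arbitrary.

```python
import numpy as np

def grad(p):
    """Gradient of the saddle f(x, y) = x**2 - y**2."""
    return np.array([2.0 * p[0], -2.0 * p[1]])

H = np.array([[2.0, 0.0],
              [0.0, -2.0]])          # the Hessian is constant for this surface

print(np.linalg.eigvalsh(H))         # [-2.  2.]: mixed signs, so a saddle rather than a minimum

def descend(p0, lr=0.1, steps=40):
    p = np.array(p0, dtype=float)
    for _ in range(steps):
        p = p - lr * grad(p)
    return p

print(descend([0.0, 0.0]))           # exactly on the saddle: the gradient is zero, so it never moves
print(descend([1.0, 1e-3]))          # slightly off-axis: x collapses toward 0 while y grows and slides away
```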
Himmelblau's function: four global minima of equal depth. Click anywhere to relocate the start.
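For reference, a sketch of Himmelblau's function and plain gradient descent on it; the learning rate, step count, and starting points are arbitrary choices.

```python
import numpy as np

def himmelblau(p):
    """Himmelblau's function: four global minima, all with value 0."""
    x, y = p
    return (x**2 + y - 11) ** 2 + (x + y**2 - 7) ** 2

def grad(p):
    x, y = p
    return np.array([
        4 * x * (x**2 + y - 11) + 2 * (x + y**2 - 7),
        2 * (x**2 + y - 11) + 4 * y * (x + y**2 - 7),
    ])

def descend(p0, lr=0.01, steps=2000):
    p = np.array(p0, dtype=float)
    for _ in range(steps):
        p = p - lr * grad(p)
    return p

# Different starting points can roll into different basins; every minimum sits at loss 0.
for start in [(0.0, 0.0), (-1.0, 4.0), (-4.0, -4.0), (4.0, -2.0)]:
    p = descend(start)
    print(start, "->", np.round(p, 3), "loss", round(himmelblau(p), 6))
```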
Batch gradient descent commits to whichever basin its starting point lies in and rolls deterministically to that minimum. Stochastic gradient descent jitters along the way — and that jitter can carry it across ridges into a different basin entirely. Crank the noise slider up and watch SGD wander.
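A rough sketch of that experiment: the same descent loop with Gaussian noise added to each gradient as a crude stand-in for minibatch noise. The noise scale, seeds, and step count are arbitrary knobs, and whether a particular run hops into a different basin depends on the draw.

```python
import numpy as np

def grad(p):
    """Gradient of Himmelblau's function, as in the previous sketch."""
    x, y = p
    return np.array([
        4 * x * (x**2 + y - 11) + 2 * (x + y**2 - 7),
        2 * (x**2 + y - 11) + 4 * y * (x + y**2 - 7),
    ])

def noisy_descent(p0, lr=0.005, noise=0.0, steps=5000, seed=0):
    """Gradient descent with Gaussian noise added to every gradient: a crude
    stand-in for SGD's minibatch noise, with the scale as a knob to turn."""
    rng = np.random.default_rng(seed)
    p = np.array(p0, dtype=float)
    for _ in range(steps):
        p = p - lr * (grad(p) + noise * rng.standard_normal(2))
    return p

start = (0.0, 0.0)
print(np.round(noisy_descent(start), 2))   # noise = 0: plain batch GD, the same basin every run
for seed in range(5):
    # With a large noise scale the trajectory may cross ridges; different seeds
    # can settle near different minima, and they jitter rather than stop exactly.
    print(np.round(noisy_descent(start, noise=35.0, seed=seed), 2))
```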