Learning rates, momentum, and Adam optimizer
Gradient descent is the workhorse of continuous optimization and machine learning. The idea is simple: move in the direction of steepest decrease (the negative gradient), taking steps proportional to a learning rate. The challenge lies in choosing the right step size and dealing with ill-conditioned landscapes.
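The update rule can be sketched in a few lines. This is a minimal illustration, assuming an analytic gradient and a fixed learning rate; the quadratic example function is mine, not from the demo.

```python
import numpy as np

def gradient_descent(grad, x0, lr=0.1, steps=100):
    """Vanilla gradient descent: x_{k+1} = x_k - lr * grad(x_k)."""
    x = np.asarray(x0, dtype=float)
    path = [x.copy()]
    for _ in range(steps):
        x = x - lr * grad(x)  # step along the negative gradient
        path.append(x.copy())
    return np.array(path)

# Example: minimize f(x, y) = x^2 + 10*y^2, whose gradient is (2x, 20y).
grad = lambda p: np.array([2 * p[0], 20 * p[1]])
path = gradient_descent(grad, [3.0, 2.0], lr=0.05)
print(path[-1])  # close to the minimum at (0, 0)
```

Returning the whole path, rather than just the final point, is what lets the demo trace the trajectory on the contour plot.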
Click to place a starting point on the contour plot. Watch gradient descent trace a path toward the minimum, following the steepest direction at each step. Adjust the learning rate to see how it affects convergence.
Key insight: On a convex function with a suitably small learning rate, GD converges to the global minimum from any starting point. The Rosenbrock function is non-convex, and its narrow curved valley slows convergence dramatically.
The learning rate controls step size. Too small and convergence is glacially slow. Too large and the algorithm oscillates or diverges. Three simultaneous runs show the dramatic effect of this single parameter.
Three gradient descent runs with different learning rates. Too small is slow, too large oscillates or diverges, and a good rate converges efficiently.
Key insight: The optimal learning rate scales inversely with the Lipschitz constant of the gradient. For a quadratic with Hessian eigenvalues λ_min and λ_max, the optimal constant step size is 2/(λ_max + λ_min).
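For a quadratic this optimum can be checked directly: at lr = 2/(λ_max + λ_min), the slowest and fastest eigen-directions contract at the same rate, (λ_max − λ_min)/(λ_max + λ_min) per step. A small sketch with an eigenvalue spread of my choosing:

```python
import numpy as np

# f(x) = 0.5 * x^T A x with A symmetric positive definite; GD step is
# x <- x - lr * A x. Eigenvalues 1 and 10 give condition number 10.
A = np.diag([1.0, 10.0])
lmin, lmax = 1.0, 10.0
lr_opt = 2 / (lmax + lmin)  # = 2/11

x = np.array([1.0, 1.0])
for _ in range(50):
    x = x - lr_opt * (A @ x)
# Both coordinates contract by |1 - lr*λ| = 9/11 per step,
# so after 50 steps the error is roughly (9/11)^50 ≈ 4e-5.
print(np.linalg.norm(x))
```

Any other constant rate makes one eigen-direction contract slower, so the worst-case rate is worse; that balance is exactly why 2/(λ_max + λ_min) is optimal.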
Momentum adds a velocity term that accumulates past gradients, helping the optimizer barrel through narrow valleys. Adam adapts the learning rate per-parameter using running averages of gradients and squared gradients. Compare all three on a challenging landscape.
Vanilla GD zig-zags across the narrow valley. Momentum smooths the path. Adam adapts per-parameter learning rates and converges fastest.
Key insight: Adam is a common default in deep learning because it adapts each parameter's step size to the scale of its recent gradients. Momentum damps the zig-zagging. Both outperform vanilla GD on ill-conditioned problems.
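The three update rules can be compared side by side. A sketch on an ill-conditioned quadratic of my choosing, f(x, y) = 0.5(x² + 25y²), with typical default hyperparameters rather than tuned ones:

```python
import numpy as np

grad = lambda p: np.array([p[0], 25.0 * p[1]])  # gradient of 0.5*(x^2 + 25*y^2)

def gd(x, lr=0.02, steps=200):
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

def momentum(x, lr=0.02, beta=0.9, steps=200):
    v = np.zeros_like(x)
    for _ in range(steps):
        v = beta * v + grad(x)  # velocity accumulates past gradients
        x = x - lr * v
    return x

def adam(x, lr=0.1, b1=0.9, b2=0.999, eps=1e-8, steps=200):
    m = np.zeros_like(x)
    s = np.zeros_like(x)
    for t in range(1, steps + 1):
        g = grad(x)
        m = b1 * m + (1 - b1) * g       # running mean of gradients
        s = b2 * s + (1 - b2) * g * g   # running mean of squared gradients
        m_hat = m / (1 - b1 ** t)       # bias correction for zero init
        s_hat = s / (1 - b2 ** t)
        x = x - lr * m_hat / (np.sqrt(s_hat) + eps)  # per-parameter step
    return x

x0 = np.array([3.0, 2.0])
print(gd(x0.copy()), momentum(x0.copy()), adam(x0.copy()))
```

With the same budget of steps, vanilla GD is still far from the minimum along the shallow direction, while momentum's velocity carries it through and Adam's per-parameter normalization makes the two directions behave similarly.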