Learning rates, momentum, and Adam optimizer
Gradient descent is the workhorse of continuous optimization and machine learning. The idea is simple: move in the direction of steepest decrease (the negative gradient), taking steps proportional to a learning rate. The challenge lies in choosing the right step size and dealing with ill-conditioned landscapes.
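The update rule can be sketched in a few lines. This is a minimal illustration, assuming an analytic gradient and a fixed learning rate; the quadratic example function is mine, not from the demo.

```python
import numpy as np

def gradient_descent(grad, x0, lr=0.1, steps=100):
    """Vanilla gradient descent: x_{k+1} = x_k - lr * grad(x_k)."""
    x = np.asarray(x0, dtype=float)
    path = [x.copy()]
    for _ in range(steps):
        x = x - lr * grad(x)  # step along the negative gradient
        path.append(x.copy())
    return np.array(path)

# Example: minimize f(x, y) = x^2 + 10*y^2, whose gradient is (2x, 20y).
grad = lambda p: np.array([2 * p[0], 20 * p[1]])
path = gradient_descent(grad, [3.0, 2.0], lr=0.05)
print(path[-1])  # close to the minimum at (0, 0)
```

Returning the whole path, rather than just the final point, is what lets the demo trace the trajectory on the contour plot.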
Click to place a starting point on the contour plot. Watch gradient descent trace a path toward the minimum, following the steepest direction at each step. Adjust the learning rate to see how it affects convergence.
Key insight: On a convex function with a suitably small learning rate, GD converges to the global minimum from any starting point. The Rosenbrock function is non-convex, and its narrow curved valley slows convergence dramatically.
The learning rate controls step size. Too small and convergence is glacially slow. Too large and the algorithm oscillates or diverges. Three simultaneous runs show the dramatic effect of this single parameter.
Three gradient descent runs with different learning rates. Too small is slow, too large oscillates or diverges, and a good rate converges efficiently.
Key insight: The optimal learning rate scales inversely with the Lipschitz constant of the gradient. For a quadratic with Hessian eigenvalues λ_min and λ_max, the optimal constant step size is 2/(λ_max + λ_min).
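For a quadratic this optimum can be checked directly: at lr = 2/(λ_max + λ_min), the slowest and fastest eigen-directions contract at the same rate, (λ_max − λ_min)/(λ_max + λ_min) per step. A small sketch with an eigenvalue spread of my choosing:

```python
import numpy as np

# f(x) = 0.5 * x^T A x with A symmetric positive definite; GD step is
# x <- x - lr * A x. Eigenvalues 1 and 10 give condition number 10.
A = np.diag([1.0, 10.0])
lmin, lmax = 1.0, 10.0
lr_opt = 2 / (lmax + lmin)  # = 2/11

x = np.array([1.0, 1.0])
for _ in range(50):
    x = x - lr_opt * (A @ x)
# Both coordinates contract by |1 - lr*λ| = 9/11 per step,
# so after 50 steps the error is roughly (9/11)^50 ≈ 4e-5.
print(np.linalg.norm(x))
```

Any other constant rate makes one eigen-direction contract slower, so the worst-case rate is worse; that balance is exactly why 2/(λ_max + λ_min) is optimal.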
Momentum adds a velocity term that accumulates past gradients, helping the optimizer barrel through narrow valleys. Adam adapts the learning rate per-parameter using running averages of gradients and squared gradients. Compare all three on a challenging landscape.
Vanilla GD zig-zags across the narrow valley. Momentum smooths the path. Adam adapts per-parameter learning rates and converges fastest.
Key insight: Adam is a common default in deep learning because it adapts each parameter's step size to the scale of its recent gradients. Momentum damps the zig-zagging. Both outperform vanilla GD on ill-conditioned problems.
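The three update rules can be compared side by side. A sketch on an ill-conditioned quadratic of my choosing, f(x, y) = 0.5(x² + 25y²), with typical default hyperparameters rather than tuned ones:

```python
import numpy as np

grad = lambda p: np.array([p[0], 25.0 * p[1]])  # gradient of 0.5*(x^2 + 25*y^2)

def gd(x, lr=0.02, steps=200):
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

def momentum(x, lr=0.02, beta=0.9, steps=200):
    v = np.zeros_like(x)
    for _ in range(steps):
        v = beta * v + grad(x)  # velocity accumulates past gradients
        x = x - lr * v
    return x

def adam(x, lr=0.1, b1=0.9, b2=0.999, eps=1e-8, steps=200):
    m = np.zeros_like(x)
    s = np.zeros_like(x)
    for t in range(1, steps + 1):
        g = grad(x)
        m = b1 * m + (1 - b1) * g       # running mean of gradients
        s = b2 * s + (1 - b2) * g * g   # running mean of squared gradients
        m_hat = m / (1 - b1 ** t)       # bias correction for zero init
        s_hat = s / (1 - b2 ** t)
        x = x - lr * m_hat / (np.sqrt(s_hat) + eps)  # per-parameter step
    return x

x0 = np.array([3.0, 2.0])
print(gd(x0.copy()), momentum(x0.copy()), adam(x0.copy()))
```

With the same budget of steps, vanilla GD is still far from the minimum along the shallow direction, while momentum's velocity carries it through and Adam's per-parameter normalization makes the two directions behave similarly.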