Loss landscapes, saddle points, and simulated annealing
Non-convex optimization is the wild west: multiple local minima, saddle points, and plateaus make finding the global optimum fundamentally difficult. Understanding the loss landscape is essential for designing algorithms that can navigate it.
Click to drop gradient descent balls from different starting points. Each follows the gradient to a local minimum -- but different starts often find different minima. This is the fundamental challenge of non-convex optimization.
Key insight: A simple but effective strategy is random restarts: run GD from many starting points and keep the best result. More sophisticated methods include basin-hopping and genetic algorithms.
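The restart strategy can be sketched in a few lines. This is a minimal illustration, not the demo's code: the objective `f` is a hypothetical 1D function chosen to have several local minima, and the learning rate and search range are arbitrary assumptions.

```python
import numpy as np

# Hypothetical 1D test objective with several local minima
# (an assumption, not the landscape used in the demo above)
def f(x):
    return np.sin(3 * x) + 0.1 * x**2

def grad_f(x):
    return 3 * np.cos(3 * x) + 0.2 * x  # analytic derivative of f

def gradient_descent(x0, lr=0.01, steps=500):
    # Plain gradient descent: follow -grad to the nearest local minimum
    x = x0
    for _ in range(steps):
        x = x - lr * grad_f(x)
    return x

def random_restarts(n_starts=40, seed=0):
    # Run GD from many random starting points, keep the best finisher
    rng = np.random.default_rng(seed)
    starts = rng.uniform(-4.0, 4.0, n_starts)
    finishes = [gradient_descent(x0) for x0 in starts]
    return min(finishes, key=f)

best = random_restarts()
print(f"best x = {best:.3f}, f(best) = {f(best):.3f}")
```

Each individual run still only finds a local minimum; the diversity of starting points is what gives the method its chance at the global one.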
A saddle point is a critical point that is a minimum in some directions and a maximum in others. The Hessian's eigenvalues reveal the type: all positive = minimum, all negative = maximum, mixed = saddle.
Eigenvalues of the Hessian determine the critical point type. Dashed lines show eigenvector directions. Balls oscillate to show stability vs instability.
Key insight: In high-dimensional optimization (like deep learning), saddle points are far more common than local minima. Algorithms that can escape saddle points are essential.
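The eigenvalue rule above translates directly into code. A minimal sketch, assuming a symmetric Hessian supplied as a NumPy array; the tolerance for "zero" eigenvalues is an arbitrary choice:

```python
import numpy as np

def classify_critical_point(H, tol=1e-8):
    """Classify a critical point from its Hessian's eigenvalues."""
    eigvals = np.linalg.eigvalsh(H)  # symmetric Hessian -> real eigenvalues
    if np.all(eigvals > tol):
        return "local minimum"       # curving up in every direction
    if np.all(eigvals < -tol):
        return "local maximum"       # curving down in every direction
    if np.any(eigvals > tol) and np.any(eigvals < -tol):
        return "saddle point"        # mixed curvature
    return "degenerate"  # an eigenvalue near zero: second-order test inconclusive

# f(x, y) = x^2 - y^2 has Hessian diag(2, -2) at the origin
print(classify_critical_point(np.diag([2.0, -2.0])))  # saddle point
print(classify_critical_point(np.diag([2.0, 2.0])))   # local minimum
```

The eigenvector directions printed as dashed lines in the demo are exactly the directions along which these signed curvatures act.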
Simulated annealing escapes local minima by occasionally accepting uphill moves. The probability of accepting a worse solution decreases with "temperature," which cools over time. Early on, the algorithm explores freely; later, it settles into the best basin found.
Gradient descent (gray) gets stuck in the nearest local minimum. Simulated annealing (indigo) accepts uphill moves with decreasing probability, often finding the global minimum.
Key insight: SA is inspired by metallurgical annealing. With a sufficiently slow (logarithmic) cooling schedule, SA converges to the global optimum with probability one -- but "sufficiently slow" can mean exponentially many steps in practice.
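The accept-or-reject loop described above can be sketched as follows. This is an illustrative toy, not the demo's implementation: the 1D objective, proposal width, starting temperature, and geometric cooling rate are all assumptions.

```python
import math
import random

# Hypothetical 1D objective with a local and a global minimum
# (an assumption; the demo's actual landscape may differ)
def f(x):
    return math.sin(3 * x) + 0.1 * x * x

def simulated_annealing(x0, T0=2.0, cooling=0.999, steps=20000, seed=0):
    rng = random.Random(seed)
    x, T = x0, T0
    best = x
    for _ in range(steps):
        candidate = x + rng.gauss(0.0, 1.0)  # propose a random neighbor
        delta = f(candidate) - f(x)
        # Metropolis rule: always accept downhill moves; accept uphill
        # moves with probability exp(-delta / T), which shrinks as T cools
        if delta < 0 or rng.random() < math.exp(-delta / T):
            x = candidate
        if f(x) < f(best):
            best = x
        T *= cooling  # geometric cooling schedule
    return best

best = simulated_annealing(x0=3.5)  # start near a non-global local minimum
print(f"best x = {best:.3f}, f(best) = {f(best):.3f}")
```

Early on, while T is large, nearly every move is accepted and the walker roams the landscape; as T shrinks, uphill acceptance collapses and the search becomes effectively greedy, settling into the best basin visited.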