Loss landscapes, saddle points, and simulated annealing
Non-convex optimization is the wild west: multiple local minima, saddle points, and plateaus make finding the global optimum fundamentally difficult. Understanding the loss landscape is essential for designing algorithms that can navigate it.
Click to drop gradient descent balls from different starting points. Each follows the gradient to a local minimum -- but different starts often find different minima. This is the fundamental challenge of non-convex optimization.
Key insight: A simple but effective strategy is random restarts: run GD from many starting points and keep the best result. More sophisticated methods include basin-hopping and genetic algorithms.
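The restart strategy can be sketched in a few lines. This is a minimal illustration, not the demo's code: the objective `f` is a hypothetical 1D function chosen to have several local minima, and the learning rate and search range are arbitrary assumptions.

```python
import numpy as np

# Hypothetical 1D test objective with several local minima
# (an assumption, not the landscape used in the demo above)
def f(x):
    return np.sin(3 * x) + 0.1 * x**2

def grad_f(x):
    return 3 * np.cos(3 * x) + 0.2 * x  # analytic derivative of f

def gradient_descent(x0, lr=0.01, steps=500):
    # Plain gradient descent: follow -grad to the nearest local minimum
    x = x0
    for _ in range(steps):
        x = x - lr * grad_f(x)
    return x

def random_restarts(n_starts=40, seed=0):
    # Run GD from many random starting points, keep the best finisher
    rng = np.random.default_rng(seed)
    starts = rng.uniform(-4.0, 4.0, n_starts)
    finishes = [gradient_descent(x0) for x0 in starts]
    return min(finishes, key=f)

best = random_restarts()
print(f"best x = {best:.3f}, f(best) = {f(best):.3f}")
```

Each individual run still only finds a local minimum; the diversity of starting points is what gives the method its chance at the global one.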
A saddle point is a critical point that is a minimum in some directions and a maximum in others. The Hessian's eigenvalues reveal the type: all positive = minimum, all negative = maximum, mixed = saddle.
Eigenvalues of the Hessian determine the critical point type. Dashed lines show eigenvector directions. Balls oscillate to show stability vs instability.
Key insight: In high-dimensional optimization (like deep learning), saddle points are far more common than local minima. Algorithms that can escape saddle points are essential.
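The eigenvalue rule above translates directly into code. A minimal sketch, assuming a symmetric Hessian supplied as a NumPy array; the tolerance for "zero" eigenvalues is an arbitrary choice:

```python
import numpy as np

def classify_critical_point(H, tol=1e-8):
    """Classify a critical point from its Hessian's eigenvalues."""
    eigvals = np.linalg.eigvalsh(H)  # symmetric Hessian -> real eigenvalues
    if np.all(eigvals > tol):
        return "local minimum"       # curving up in every direction
    if np.all(eigvals < -tol):
        return "local maximum"       # curving down in every direction
    if np.any(eigvals > tol) and np.any(eigvals < -tol):
        return "saddle point"        # mixed curvature
    return "degenerate"  # an eigenvalue near zero: second-order test inconclusive

# f(x, y) = x^2 - y^2 has Hessian diag(2, -2) at the origin
print(classify_critical_point(np.diag([2.0, -2.0])))  # saddle point
print(classify_critical_point(np.diag([2.0, 2.0])))   # local minimum
```

The eigenvector directions printed as dashed lines in the demo are exactly the directions along which these signed curvatures act.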
Simulated annealing escapes local minima by occasionally accepting uphill moves. The probability of accepting a worse solution decreases with "temperature," which cools over time. Early on, the algorithm explores freely; later, it settles into the best basin found.
Gradient descent (gray) gets stuck in the nearest local minimum. Simulated annealing (indigo) accepts uphill moves with decreasing probability, often finding the global minimum.
Key insight: SA is inspired by metallurgical annealing. With a sufficiently slow (logarithmic) cooling schedule, SA converges to the global optimum with probability one -- but "sufficiently slow" can mean exponentially many steps in practice.
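The accept-or-reject loop described above can be sketched as follows. This is an illustrative toy, not the demo's implementation: the 1D objective, proposal width, starting temperature, and geometric cooling rate are all assumptions.

```python
import math
import random

# Hypothetical 1D objective with a local and a global minimum
# (an assumption; the demo's actual landscape may differ)
def f(x):
    return math.sin(3 * x) + 0.1 * x * x

def simulated_annealing(x0, T0=2.0, cooling=0.999, steps=20000, seed=0):
    rng = random.Random(seed)
    x, T = x0, T0
    best = x
    for _ in range(steps):
        candidate = x + rng.gauss(0.0, 1.0)  # propose a random neighbor
        delta = f(candidate) - f(x)
        # Metropolis rule: always accept downhill moves; accept uphill
        # moves with probability exp(-delta / T), which shrinks as T cools
        if delta < 0 or rng.random() < math.exp(-delta / T):
            x = candidate
        if f(x) < f(best):
            best = x
        T *= cooling  # geometric cooling schedule
    return best

best = simulated_annealing(x0=3.5)  # start near a non-global local minimum
print(f"best x = {best:.3f}, f(best) = {f(best):.3f}")
```

Early on, while T is large, nearly every move is accepted and the walker roams the landscape; as T shrinks, uphill acceptance collapses and the search becomes effectively greedy, settling into the best basin visited.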