Entropy, KL divergence, and cross-entropy — why log-loss is the natural objective for classification.
Machine learning models output distributions, not numbers. A classifier doesn't say "dog" — it says "87% dog, 11% wolf, 2% other." Once predictions are distributions, the natural way to score them is the same machinery Claude Shannon built in 1948 to measure information: entropy, cross-entropy, and the KL divergence. These three quantities are not separate inventions — they sit on the same mathematical scaffold, and once you see it the loss function falls out for free.
Shannon entropy H(p) = − Σ p log p measures how much uncertainty a distribution carries. It is maximized by the uniform distribution; a one-hot distribution has zero entropy because the outcome is already certain. KL divergence KL(p ‖ q) = Σ p log(p/q) measures how far q is from p — but in a one-sided way: it's the number of extra bits you spend coding samples from p with a code optimized for q. KL is asymmetric, always non-negative, and zero only when p = q. Cross-entropy H(p, q) = − Σ p log q is just H(p) + KL(p ‖ q), so minimizing it is exactly minimizing KL when the true distribution p is fixed. That is why classifiers are trained with log-loss.
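To see the three quantities and the identity H(p, q) = H(p) + KL(p ‖ q) side by side, here is a minimal NumPy sketch; the helper names entropy, cross_entropy, and kl_divergence are illustrative, not from any particular library.

```python
import numpy as np

def entropy(p, base=2):
    """Shannon entropy H(p) = -sum p log p, with 0 log 0 treated as 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz])) / np.log(base)

def cross_entropy(p, q, base=2):
    """H(p, q) = -sum p log q; infinite if q puts zero mass where p does not."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(q[nz])) / np.log(base)

def kl_divergence(p, q, base=2):
    """KL(p || q) = H(p, q) - H(p): extra bits paid for coding p with q's code."""
    return cross_entropy(p, q, base) - entropy(p, base)

p = np.array([0.87, 0.11, 0.02])   # the classifier's "87% dog, 11% wolf, 2% other"
q = np.array([0.60, 0.30, 0.10])   # some other model's prediction

print(entropy(p))            # ~0.64 bits of uncertainty
print(kl_divergence(p, q))   # >= 0, and 0 only when p == q
print(cross_entropy(p, q))   # equals entropy(p) + kl_divergence(p, q)
print(kl_divergence(q, p))   # a different number: KL is asymmetric
```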
Drag any bar to change its probability — the others rescale so the total stays at 1. Try the Uniform preset and watch entropy hit the dashed maximum log₂(k). Try One-hot and watch entropy collapse to zero — perfect certainty has no information content.
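The two presets can be reproduced in a few lines; entropy_bits below is a hypothetical helper and k = 4 is just an example size.

```python
import numpy as np

def entropy_bits(p):
    """H(p) in bits, with 0 log 0 treated as 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

k = 4
uniform = np.full(k, 1 / k)     # the Uniform preset
one_hot = np.eye(k)[0]          # the One-hot preset

print(entropy_bits(uniform), np.log2(k))  # both 2.0: entropy sits at the dashed maximum
print(entropy_bits(one_hot))              # 0.0: a certain outcome carries no surprise
```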
Slide the means together — both KLs shrink. Slide them apart, or change the variances, and the two values diverge: KL is asymmetric. Set μ_p = μ_q and σ_p = σ_q to see KL = 0 exactly. The closed form for KL(p ‖ q) between Gaussians, in nats, is log(σ_q/σ_p) + (σ_p² + (μ_p−μ_q)²)/(2σ_q²) − 1/2.
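The closed form can be sanity-checked against a Monte Carlo estimate of E_{x∼p}[log p(x) − log q(x)]. In the sketch below, kl_gaussian and log_pdf are illustrative helpers and the parameter values are arbitrary.

```python
import numpy as np

def kl_gaussian(mu_p, sigma_p, mu_q, sigma_q):
    """Closed-form KL(p || q) between 1-D Gaussians, in nats."""
    return (np.log(sigma_q / sigma_p)
            + (sigma_p**2 + (mu_p - mu_q)**2) / (2 * sigma_q**2)
            - 0.5)

def log_pdf(x, mu, sigma):
    """Log-density of N(mu, sigma^2)."""
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma) - 0.5 * np.log(2 * np.pi)

mu_p, sigma_p, mu_q, sigma_q = 0.0, 1.0, 1.5, 2.0   # example values

# Monte Carlo estimate of KL(p || q) = E_{x~p}[log p(x) - log q(x)]
rng = np.random.default_rng(0)
x = rng.normal(mu_p, sigma_p, size=1_000_000)
mc = np.mean(log_pdf(x, mu_p, sigma_p) - log_pdf(x, mu_q, sigma_q))

print(kl_gaussian(mu_p, sigma_p, mu_q, sigma_q))   # ~0.60 nats, closed form
print(mc)                                          # agrees to a few decimals
print(kl_gaussian(mu_q, sigma_q, mu_p, sigma_p))   # ~1.93 nats: KL is asymmetric
```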
Single example, true label = 1. Cross-entropy reduces to −log p, where p here is the probability the model assigns to the true class. Confident-and-correct gives near-zero loss. Confident-and-wrong is punished without bound — that asymmetry is why log-loss penalizes overconfidence in classifiers.
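A quick table of −log p for a few predicted probabilities shows the blow-up; log_loss_true_label_1 is an illustrative helper, and it uses natural log as most frameworks do.

```python
import numpy as np

def log_loss_true_label_1(p_hat, eps=1e-12):
    """Cross-entropy for a single example whose true label is 1: -log p_hat."""
    return -np.log(np.clip(p_hat, eps, 1.0))

for p_hat in [0.999, 0.9, 0.5, 0.1, 0.001]:
    print(f"p_hat = {p_hat:>5}: loss = {log_loss_true_label_1(p_hat):.3f}")
# 0.999 -> 0.001  (confident and correct: essentially no loss)
# 0.001 -> 6.908  (confident and wrong: loss grows without bound as p_hat -> 0)
```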