Information Theory in ML

Entropy, KL divergence, and cross-entropy — why log-loss is the natural objective for classification.

Machine learning models output distributions, not numbers. A classifier doesn't say "dog" — it says "87% dog, 11% wolf, 2% other." Once predictions are distributions, the natural way to score them is the same machinery Claude Shannon built in 1948 to measure information: entropy, cross-entropy, and the KL divergence. These three quantities are not separate inventions — they sit on the same mathematical scaffold, and once you see it the loss function falls out for free.

Shannon entropy H(p) = − Σ p log p measures how much uncertainty a distribution carries. Maximum entropy is uniformity; a one-hot distribution has zero entropy because there's nothing to learn. KL divergence KL(p ‖ q) = Σ p log(p/q) measures how far q is from p — but in a one-sided way: it's the number of extra bits you spend coding samples from p with a code optimized for q. KL is asymmetric, always non-negative, and zero only when p = q. Cross-entropy H(p, q) = − Σ p log q is just H(p) + KL(p ‖ q): minimizing it is exactly minimizing KL when the true distribution p is fixed. That is why classifiers are trained with log-loss.
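To make those identities concrete, here is a minimal NumPy sketch (the distributions p and q below are made-up example values, not taken from the widgets) that computes all three quantities and checks H(p, q) = H(p) + KL(p ‖ q):

```python
import numpy as np

def entropy(p, base=2):
    """Shannon entropy H(p) = -sum p log p, with the 0·log 0 = 0 convention."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz])) / np.log(base)

def kl_divergence(p, q, base=2):
    """KL(p || q) = sum p log(p/q); assumes q > 0 wherever p > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nz = p > 0
    return np.sum(p[nz] * np.log(p[nz] / q[nz])) / np.log(base)

def cross_entropy(p, q, base=2):
    """H(p, q) = -sum p log q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(q[nz])) / np.log(base)

p = np.array([0.87, 0.11, 0.02])   # "true" distribution (example values)
q = np.array([0.70, 0.20, 0.10])   # model's prediction (example values)

print(entropy(p))           # ≈ 0.64 bits of uncertainty in p
print(kl_divergence(p, q))  # ≥ 0; the extra bits paid for coding p with q's code
print(cross_entropy(p, q))  # equals H(p) + KL(p || q)
print(np.isclose(cross_entropy(p, q), entropy(p) + kl_divergence(p, q)))  # True
```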

Interactive: Entropy Explorer

A categorical distribution over k classes shown as bars. Drag any bar to change its probability — the others rescale to keep the total at 1. Watch entropy hit log₂(k) at the uniform distribution and collapse to zero on a one-hot.
Readout at the uniform preset (k = 4): Entropy H(p) = 2.000 bits · Maximum log₂(k) = 2.000 bits · Efficiency H/Hₘₐₓ = 100.0%.

Try the Uniform preset and watch entropy hit the dashed maximum log₂(k). Try One-hot and watch entropy collapse to zero — perfect certainty has no information content.
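Those two endpoints are easy to reproduce numerically; a quick sketch assuming k = 4, using scipy.stats.entropy:

```python
import numpy as np
from scipy.stats import entropy  # handles the 0·log 0 = 0 convention

k = 4
uniform = np.full(k, 1.0 / k)
one_hot = np.eye(k)[0]

print(entropy(uniform, base=2))  # 2.0 bits = log2(k), the dashed maximum
print(entropy(one_hot, base=2))  # 0.0 bits: perfect certainty, zero entropy
```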

Interactive: KL Divergence Between Two Gaussians

Two univariate Gaussians on the same axes. Drag means and variances; KL is computed via the Gaussian closed form. Toggle the order to see the famous asymmetry: KL(p ‖ q) ≠ KL(q ‖ p) in general.
Readout, with p drawn in emerald and q in blue: KL(p ‖ q) = 1.0166 nats (1.4666 bits), KL(q ‖ p) = 2.2195 nats (3.2021 bits).

Slide the means together — both KLs go to zero. Slide them apart, or change the variances, and the two values diverge: KL is asymmetric. Set μ_p = μ_q and σ_p = σ_q to see KL = 0 exactly. The closed form for Gaussians is log(σ_q/σ_p) + (σ_p² + (μ_p−μ_q)²)/(2σ_q²) − 1/2.
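That closed form is short enough to write directly; a minimal sketch, assuming illustrative parameter values rather than the widget's defaults (divide nats by ln 2 for bits):

```python
import numpy as np

def kl_gaussian(mu_p, sigma_p, mu_q, sigma_q):
    """KL(N(mu_p, sigma_p^2) || N(mu_q, sigma_q^2)) in nats, univariate case."""
    return (np.log(sigma_q / sigma_p)
            + (sigma_p**2 + (mu_p - mu_q)**2) / (2 * sigma_q**2)
            - 0.5)

# Illustrative parameters: two Gaussians with different means and spreads.
print(kl_gaussian(0.0, 1.0, 1.0, 2.0))   # KL(p || q)
print(kl_gaussian(1.0, 2.0, 0.0, 1.0))   # KL(q || p): a different number — asymmetry
print(kl_gaussian(0.0, 1.0, 0.0, 1.0))   # 0.0 exactly when the Gaussians coincide
```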

Interactive: Cross-Entropy as Classification Loss

Binary first: true label is 1, slide the predicted probability and watch −log p plotted live. Switch to multi-class: four classes, true is class 2, sliders feed a softmax, and cross-entropy reduces to −log q[true].
Readout: predicted probability p (class = 1) = 0.700, loss −log p = 0.3567 nats. Limit behaviour: p → 1 gives loss → 0; p → 0 gives loss → ∞.

Single example, true label = 1. Cross-entropy reduces to −log p. Confident-and-correct gives zero loss. Confident-and-wrong is unboundedly punished — that asymmetry is why log-loss penalises overconfidence in classifiers.
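Both modes of the widget reduce to a few lines; a sketch with made-up logits, assuming a one-hot target so that only −log q[true] survives:

```python
import numpy as np

def softmax(z):
    """Map logits to a probability distribution, shifted for numerical stability."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Binary case: true label is 1, predicted probability p.
p = 0.7
print(-np.log(p))               # 0.3567 nats, matching the readout above

# Multi-class case: four classes, true class is index 2 (illustrative logits).
logits = np.array([1.0, 0.5, 2.0, -1.0])
q = softmax(logits)
true_class = 2
print(-np.log(q[true_class]))   # cross-entropy with a one-hot target
```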

The math objects

  • Shannon entropy: H(p) = − Σ p_i log p_i. Units are bits when the log is base 2, nats when natural. H ≥ 0 always; H = 0 iff p is one-hot; H = log(k) (its maximum over k classes) iff p is uniform.
  • KL divergence: KL(p ‖ q) = Σ p_i log(p_i / q_i). Always ≥ 0 (Gibbs' inequality), and zero iff p = q. Asymmetric — KL(p ‖ q) ≠ KL(q ‖ p) in general — so it is not a metric, but a one-sided "cost of using q when the truth is p."
  • Cross-entropy: H(p, q) = − Σ p_i log q_i = H(p) + KL(p ‖ q). When p is the (fixed) true distribution, minimizing H(p, q) over q is exactly minimizing KL(p ‖ q). For one-hot labels, H(p, q) collapses to −log q[true] — the log-loss every classifier optimizes.
  • Gaussian closed forms: the differential entropy of N(μ, σ²) is ½ log(2πe σ²) (depends only on σ); the KL between two univariate Gaussians is log(σ_q/σ_p) + (σ_p² + (μ_p − μ_q)²)/(2σ_q²) − 1/2. These are the workhorses of variational inference, where divergences between Gaussians appear in every ELBO.
  • Softmax: the canonical map from real-valued logits z to a probability distribution: q_i = exp(z_i) / Σ_j exp(z_j). Composed with cross-entropy, the gradient against z simplifies to q − p — the classic "softmax-cross-entropy" result that makes training neural networks tractable (see the sketch after this list).
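That last gradient identity is easy to verify numerically; a sketch with made-up logits and a one-hot target, comparing q − p against a central finite difference:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def loss(z, p):
    """Cross-entropy H(p, softmax(z)) in nats."""
    return -np.sum(p * np.log(softmax(z)))

z = np.array([1.0, -0.5, 2.0, 0.3])   # illustrative logits
p = np.array([0.0, 0.0, 1.0, 0.0])    # one-hot target, true class = 2

analytic = softmax(z) - p             # the q - p result

eps = 1e-6
numeric = np.array([
    (loss(z + eps * np.eye(4)[i], p) - loss(z - eps * np.eye(4)[i], p)) / (2 * eps)
    for i in range(4)
])

print(np.allclose(analytic, numeric, atol=1e-6))   # True
```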

Key takeaways

  • Entropy measures uncertainty; uniform is maximum, one-hot is zero.
  • KL divergence measures how far one distribution is from another — non-negative, asymmetric, zero iff equal.
  • Cross-entropy = entropy + KL; minimizing cross-entropy is minimizing KL when the truth is fixed.
  • For one-hot labels, cross-entropy collapses to −log q[true class] — the log-loss of every classifier.
  • The softmax-cross-entropy gradient is q − p — clean, invariant to shifting all logits by a constant, and the reason classification networks train at all.