KL Divergence

The extra bits you pay for coding with the wrong distribution. Cross-entropy and the bridge to ML.


Suppose you build the perfect compression code for a distribution q — the code lengths −log₂ q(x) are minimal in expectation under q. Then the world hands you data sampled from a different distribution, p. How many extra bits per symbol do you pay? The answer is the Kullback-Leibler divergence:

D(p ‖ q) = Σ p(x) log₂ [p(x) / q(x)]

That is the entire object. It is the gap between the average code length you actually achieve, called the cross-entropy H(p, q) = −Σ p log q, and the optimal rate H(p) you would have hit with a code matched to p. The bookkeeping lines up exactly:

H(p, q) = H(p) + D(p ‖ q)

KL is always nonnegative (Gibbs' inequality), zero if and only if p equals q, and, crucially, asymmetric: in general D(p ‖ q) does not equal D(q ‖ p), so KL is not a metric. It is a one-sided cost, and which distribution plays which role matters.
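
A quick numerical check of the identity, in Python with two illustrative categorical distributions (the numbers are hypothetical, not taken from any figure here):

```python
import numpy as np

p = np.array([0.5, 0.25, 0.125, 0.125])   # source distribution (illustrative)
q = np.array([0.25, 0.25, 0.25, 0.25])    # mismatched coding distribution

H_p  = -np.sum(p * np.log2(p))        # optimal rate H(p) = 1.75 bits
H_pq = -np.sum(p * np.log2(q))        # cross-entropy H(p, q) = 2.0 bits
D_pq =  np.sum(p * np.log2(p / q))    # KL divergence D(p ‖ q) = 0.25 bits

assert np.isclose(H_pq, H_p + D_pq)   # the bookkeeping: H(p, q) = H(p) + D(p ‖ q)
```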

See also: Information Theory in ML for KL and cross-entropy as the loss function for classification, and Shannon Entropy for the H(p) you are subtracting from H(p, q).

Interactive: The Asymmetry of KL

Two categorical distributions p (orange) and q (blue) over the same alphabet. Pick which one to edit, then drag any bar — the others rescale. Watch D(p ‖ q) and D(q ‖ p) drift apart as you push the distributions around. The Support Mismatch preset shows the dramatic case: q says a symbol is impossible while p still uses it, and KL diverges to infinity.
[Widget readout: D(p ‖ q) = 0.4830 bits, the extra cost when truth is p and the code is built for q; D(q ‖ p) = 0.5510 bits, the roles flipped and usually a different number.]

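The same experiment is a few lines of Python; a minimal sketch with illustrative distributions, where the helper returns infinity for exactly the support-mismatch case the preset shows:

```python
import numpy as np

def kl_bits(p, q):
    """D(p ‖ q) in bits; infinite if q is zero anywhere p is not."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                      # terms with p(x) = 0 contribute nothing
    if np.any(q[mask] == 0):
        return np.inf                 # q declares impossible an event p produces
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

p = [0.6, 0.3, 0.1]
q = [0.2, 0.5, 0.3]
print(kl_bits(p, q), kl_bits(q, p))   # two different numbers: the asymmetry

q_gap = [0.7, 0.3, 0.0]               # q drops the third symbol entirely
print(kl_bits(p, q_gap))              # inf: p still uses that symbol
print(kl_bits(q_gap, p))              # finite: the blow-up is one-sided
```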

Interactive: Cross-Entropy as Wasted Bits

A source emits symbols from p, but you encode them with a Huffman code built for q. The blue stacks under each symbol show that code's actual length; the dashed orange ticks are the optimal lengths −log₂ p_i. Slide the morph slider to walk q toward p — watch the cross-entropy collapse to H(p) and the wasted bits D(p ‖ q) collapse to zero.
[Widget readout: H(p) = 2.420 bits (optimal rate); H(p, q) = 3.128 bits (cross-entropy); D(p ‖ q) = 0.708 bits (wasted bits); L̄ = 3.220 bits (average Huffman code length).]

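The same accounting can be sketched in code: a standard heap-based Huffman construction builds the code for q, then we bill it against symbols drawn from p (both distributions below are illustrative):

```python
import heapq
import itertools
import math

def huffman_lengths(probs):
    """Code lengths (bits) of a Huffman code built for `probs` (symbol -> prob)."""
    count = itertools.count()   # tie-breaker so the heap never compares dicts
    heap = [(pr, next(count), {s: 0}) for s, pr in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:        # repeatedly merge the two least likely subtrees
        p1, _, d1 = heapq.heappop(heap)
        p2, _, d2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (p1 + p2, next(count), merged))
    return heap[0][2]

p = {'a': 0.5, 'b': 0.25, 'c': 0.125, 'd': 0.125}   # what the source emits
q = {'a': 0.1, 'b': 0.2, 'c': 0.3, 'd': 0.4}        # what the code was built for

lengths = huffman_lengths(q)
avg_len = sum(p[s] * lengths[s] for s in p)              # actual bits paid per symbol
H_p  = -sum(pr * math.log2(pr) for pr in p.values())     # optimal rate H(p)
H_pq = -sum(p[s] * math.log2(q[s]) for s in p)           # ideal-code cost H(p, q)
print(avg_len, H_pq, H_p)   # avg_len tracks H(p, q), plus integer-length overhead
```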

Interactive: KL Between Two Gaussians

Two univariate Gaussians on the same axes. Slide the means and variances; the closed-form KL is computed in both directions. Use the swap buttons to expose the asymmetry: swapping means leaves D unchanged when σ_p = σ_q, but swapping variances generally changes it.
[Widget readout: D(p ‖ q) = 1.0166 nats = 1.4666 bits; D(q ‖ p) = 2.2195 nats = 3.2021 bits.]

Closed form (in nats): D = log(σ_q/σ_p) + (σ_p² + (μ_p − μ_q)²) / (2σ_q²) − 1/2. When σ_p = σ_q, swapping the means leaves D unchanged, since the formula only sees (μ_p − μ_q)². But swapping the variances usually changes D, because σ_p and σ_q play different roles in the closed form. Setting μ_p = μ_q and σ_p = σ_q gives D = 0 exactly.
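
The closed form drops straight into code; a minimal sketch (the example parameters are arbitrary, not the widget's):

```python
import math

def gaussian_kl_nats(mu_p, sigma_p, mu_q, sigma_q):
    """D(N(mu_p, sigma_p²) ‖ N(mu_q, sigma_q²)) in nats, via the closed form."""
    return (math.log(sigma_q / sigma_p)
            + (sigma_p**2 + (mu_p - mu_q)**2) / (2 * sigma_q**2)
            - 0.5)

d_pq = gaussian_kl_nats(0.0, 1.0, 1.0, 2.0)   # ≈ 0.443 nats
d_qp = gaussian_kl_nats(1.0, 2.0, 0.0, 1.0)   # ≈ 1.307 nats: swapping roles changes D
print(d_pq, d_qp, d_pq / math.log(2))         # divide by ln 2 for bits
```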

The math objects

  • Relative entropy D(p ‖ q) = Σ p log(p/q): the average extra bits paid when symbols drawn from p are encoded with a code optimized for q. Always ≥ 0 (Gibbs' inequality), equal to zero only when p = q. If there is any x with p(x) > 0 and q(x) = 0, the divergence is infinite — q has declared an event impossible that p still produces.
  • Cross-entropy H(p, q) = H(p) + D(p ‖ q): the actual rate you pay coding p with a q-tuned code. Because H(p) is fixed (it depends only on the source), minimizing H(p, q) over q is exactly minimizing D(p ‖ q). That is why log-loss is the canonical training objective in classification.
  • Asymmetry: in general D(p ‖ q) ≠ D(q ‖ p). The two have different operational meanings — D(p ‖ q) is the cost of using q when nature produces p; D(q ‖ p) is the cost of using p when nature produces q. Variational inference uses the "reverse" direction to fit zero-forcing approximations.
  • Not a metric: KL fails the symmetry axiom and the triangle inequality, so it does not metrize the simplex. Symmetrized variants such as the Jensen-Shannon divergence (½ D(p ‖ m) + ½ D(q ‖ m), m = ½(p + q)) restore symmetry; the square root of JS is even a true metric. A code sketch follows this list.
  • Gibbs' inequality: the proof that D(p ‖ q) ≥ 0 reduces to the concavity of log via Jensen. Equivalently, log x ≤ x − 1, which gives the bound directly. The same inequality is the engine behind the maximum-entropy theorem, the data-processing inequality, and the convergence of EM.
  • Gaussian closed form: for two univariate normals, D(N(μ_p, σ_p²) ‖ N(μ_q, σ_q²)) = log(σ_q/σ_p) + (σ_p² + (μ_p − μ_q)²)/(2σ_q²) − 1/2. The asymmetry sits in the variances: σ_p and σ_q play different roles. Multivariate Gaussians have a similarly clean form involving the covariance trace and a log-determinant.
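
A minimal sketch of the Jensen-Shannon construction mentioned above, reusing the `kl_bits` helper from the asymmetry section:

```python
import numpy as np

def js_bits(p, q):
    """Jensen-Shannon divergence in bits: symmetric, finite, bounded by 1."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)                 # the mixture midpoint
    return 0.5 * kl_bits(p, m) + 0.5 * kl_bits(q, m)

p = [0.6, 0.3, 0.1]
q = [0.7, 0.3, 0.0]                   # support mismatch: D(p ‖ q) would be infinite
print(js_bits(p, q), js_bits(q, p))   # equal by construction, and finite
```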

Key takeaways

  • D(p ‖ q) = Σ p log(p/q) is the extra bits per symbol you pay for coding samples from p with a code optimal for q.
  • Cross-entropy H(p, q) = H(p) + D(p ‖ q): minimizing one (over q) is minimizing the other.
  • KL is nonnegative, zero iff p = q, and asymmetric — D(p ‖ q) is generally not D(q ‖ p).
  • If q assigns probability zero to a symbol p still produces, KL is infinite — the cost of being told an actual event is impossible.
  • The Gaussian closed form makes KL between normals a one-line formula — the workhorse of variational inference.
  • KL is not a metric, but it underlies most divergences in statistics: cross-entropy, log-likelihood, mutual information, and Jensen-Shannon.