KL Divergence

The extra bits you pay for coding with the wrong distribution. Cross-entropy and the bridge to ML.


Suppose you build the perfect compression code for a distribution q — the code lengths −log₂ q(x) are minimal in expectation under q. Then the world hands you data sampled from a different distribution, p. How many extra bits per symbol do you pay? The answer is the Kullback-Leibler divergence:

D(p ‖ q) = Σ p(x) log₂ [p(x) / q(x)]

That is the entire object. It is the gap between the average code length you actually achieve, called the cross-entropy H(p, q) = −Σ p log q, and the optimal rate H(p) you would have hit with a code matched to p. The bookkeeping lines up exactly:

H(p, q) = H(p) + D(p ‖ q)

KL is always nonnegative (Gibbs' inequality), zero if and only if p equals q, and, crucially, asymmetric: in general D(p ‖ q) does not equal D(q ‖ p), so KL is not a metric. It is a one-sided cost, and which distribution plays which role matters.
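
A quick numerical check of the identity, in Python with two illustrative categorical distributions (the numbers are hypothetical, not taken from any figure here):

```python
import numpy as np

p = np.array([0.5, 0.25, 0.125, 0.125])   # source distribution (illustrative)
q = np.array([0.25, 0.25, 0.25, 0.25])    # mismatched coding distribution

H_p  = -np.sum(p * np.log2(p))        # optimal rate H(p) = 1.75 bits
H_pq = -np.sum(p * np.log2(q))        # cross-entropy H(p, q) = 2.0 bits
D_pq =  np.sum(p * np.log2(p / q))    # KL divergence D(p ‖ q) = 0.25 bits

assert np.isclose(H_pq, H_p + D_pq)   # the bookkeeping: H(p, q) = H(p) + D(p ‖ q)
```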

See also: Information Theory in ML for KL and cross-entropy as the loss function for classification, and Shannon Entropy for the H(p) you are subtracting from H(p, q).

Interactive: The Asymmetry of KL

Two categorical distributions p (orange) and q (blue) over the same alphabet. Pick which one to edit, then drag any bar — the others rescale. Watch D(p ‖ q) and D(q ‖ p) drift apart as you push the distributions around. The Support Mismatch preset shows the dramatic case: q says a symbol is impossible while p still uses it, and KL diverges to infinity.
[Widget readout: D(p ‖ q) = 0.4830 bits, the extra cost when truth is p and the code is built for q; D(q ‖ p) = 0.5510 bits, the roles flipped and usually a different number.]

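The same experiment is a few lines of Python; a minimal sketch with illustrative distributions, where the helper returns infinity for exactly the support-mismatch case the preset shows:

```python
import numpy as np

def kl_bits(p, q):
    """D(p ‖ q) in bits; infinite if q is zero anywhere p is not."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                      # terms with p(x) = 0 contribute nothing
    if np.any(q[mask] == 0):
        return np.inf                 # q declares impossible an event p produces
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

p = [0.6, 0.3, 0.1]
q = [0.2, 0.5, 0.3]
print(kl_bits(p, q), kl_bits(q, p))   # two different numbers: the asymmetry

q_gap = [0.7, 0.3, 0.0]               # q drops the third symbol entirely
print(kl_bits(p, q_gap))              # inf: p still uses that symbol
print(kl_bits(q_gap, p))              # finite: the blow-up is one-sided
```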

Interactive: Cross-Entropy as Wasted Bits

A source emits symbols from p, but you encode them with a Huffman code built for q. The blue stacks under each symbol show that code's actual length; the dashed orange ticks are the optimal lengths −log₂ p_i. Slide the morph slider to walk q toward p — watch the cross-entropy collapse to H(p) and the wasted bits D(p ‖ q) collapse to zero.
[Widget readout: H(p) = 2.420 bits (optimal rate); H(p, q) = 3.128 bits (cross-entropy); D(p ‖ q) = 0.708 bits (wasted bits); L̄ = 3.220 bits (average Huffman code length).]

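The same accounting can be sketched in code: a standard heap-based Huffman construction builds the code for q, then we bill it against symbols drawn from p (both distributions below are illustrative):

```python
import heapq
import itertools
import math

def huffman_lengths(probs):
    """Code lengths (bits) of a Huffman code built for `probs` (symbol -> prob)."""
    count = itertools.count()   # tie-breaker so the heap never compares dicts
    heap = [(pr, next(count), {s: 0}) for s, pr in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:        # repeatedly merge the two least likely subtrees
        p1, _, d1 = heapq.heappop(heap)
        p2, _, d2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (p1 + p2, next(count), merged))
    return heap[0][2]

p = {'a': 0.5, 'b': 0.25, 'c': 0.125, 'd': 0.125}   # what the source emits
q = {'a': 0.1, 'b': 0.2, 'c': 0.3, 'd': 0.4}        # what the code was built for

lengths = huffman_lengths(q)
avg_len = sum(p[s] * lengths[s] for s in p)              # actual bits paid per symbol
H_p  = -sum(pr * math.log2(pr) for pr in p.values())     # optimal rate H(p)
H_pq = -sum(p[s] * math.log2(q[s]) for s in p)           # ideal-code cost H(p, q)
print(avg_len, H_pq, H_p)   # avg_len tracks H(p, q), plus integer-length overhead
```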

Interactive: KL Between Two Gaussians

Two univariate Gaussians on the same axes. Slide the means and variances; the closed-form KL is computed in both directions. Use the swap buttons to expose the asymmetry: swapping means leaves D unchanged when σ_p = σ_q, but swapping variances generally changes it.
[Widget readout: D(p ‖ q) = 1.0166 nats = 1.4666 bits; D(q ‖ p) = 2.2195 nats = 3.2021 bits.]

Closed form (in nats): D = log(σ_q/σ_p) + (σ_p² + (μ_p − μ_q)²) / (2σ_q²) − 1/2. When σ_p = σ_q, swapping the means leaves D unchanged, since the formula only sees (μ_p − μ_q)². But swapping the variances usually changes D, because σ_p and σ_q play different roles in the closed form. Setting μ_p = μ_q and σ_p = σ_q gives D = 0 exactly.
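
The closed form drops straight into code; a minimal sketch (the example parameters are arbitrary, not the widget's):

```python
import math

def gaussian_kl_nats(mu_p, sigma_p, mu_q, sigma_q):
    """D(N(mu_p, sigma_p²) ‖ N(mu_q, sigma_q²)) in nats, via the closed form."""
    return (math.log(sigma_q / sigma_p)
            + (sigma_p**2 + (mu_p - mu_q)**2) / (2 * sigma_q**2)
            - 0.5)

d_pq = gaussian_kl_nats(0.0, 1.0, 1.0, 2.0)   # ≈ 0.443 nats
d_qp = gaussian_kl_nats(1.0, 2.0, 0.0, 1.0)   # ≈ 1.307 nats: swapping roles changes D
print(d_pq, d_qp, d_pq / math.log(2))         # divide by ln 2 for bits
```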

The math objects

  • Relative entropy D(p ‖ q) = Σ p log(p/q): the average extra bits paid when symbols drawn from p are encoded with a code optimized for q. Always ≥ 0 (Gibbs' inequality), equal to zero only when p = q. If there is any x with p(x) > 0 and q(x) = 0, the divergence is infinite — q has declared an event impossible that p still produces.
  • Cross-entropy H(p, q) = H(p) + D(p ‖ q): the actual rate you pay coding p with a q-tuned code. Because H(p) is fixed (it depends only on the source), minimizing H(p, q) over q is exactly minimizing D(p ‖ q). That is why log-loss is the canonical training objective in classification.
  • Asymmetry: in general D(p ‖ q) ≠ D(q ‖ p). The two have different operational meanings — D(p ‖ q) is the cost of using q when nature produces p; D(q ‖ p) is the cost of using p when nature produces q. Variational inference uses the "reverse" direction to fit zero-forcing approximations.
  • Not a metric: KL fails the symmetry axiom and the triangle inequality, so it does not metrize the simplex. Symmetrized variants such as the Jensen-Shannon divergence (½ D(p ‖ m) + ½ D(q ‖ m), m = ½(p + q)) restore symmetry; the square root of JS is even a true metric. A code sketch follows this list.
  • Gibbs' inequality: the proof that D(p ‖ q) ≥ 0 reduces to the concavity of log via Jensen. Equivalently, log x ≤ x − 1, which gives the bound directly. The same inequality is the engine behind the maximum-entropy theorem, the data-processing inequality, and the convergence of EM.
  • Gaussian closed form: for two univariate normals, D(N(μ_p, σ_p²) ‖ N(μ_q, σ_q²)) = log(σ_q/σ_p) + (σ_p² + (μ_p − μ_q)²)/(2σ_q²) − 1/2. The asymmetry sits in the variances: σ_p and σ_q play different roles. Multivariate Gaussians have a similarly clean form involving the covariance trace and a log-determinant.
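
A minimal sketch of the Jensen-Shannon construction mentioned above, reusing the `kl_bits` helper from the asymmetry section:

```python
import numpy as np

def js_bits(p, q):
    """Jensen-Shannon divergence in bits: symmetric, finite, bounded by 1."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)                 # the mixture midpoint
    return 0.5 * kl_bits(p, m) + 0.5 * kl_bits(q, m)

p = [0.6, 0.3, 0.1]
q = [0.7, 0.3, 0.0]                   # support mismatch: D(p ‖ q) would be infinite
print(js_bits(p, q), js_bits(q, p))   # equal by construction, and finite
```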

Key takeaways

  • D(p ‖ q) = Σ p log(p/q) is the extra bits per symbol you pay for coding samples from p with a code optimal for q.
  • Cross-entropy H(p, q) = H(p) + D(p ‖ q): minimizing one (over q) is minimizing the other.
  • KL is nonnegative, zero iff p = q, and asymmetric — D(p ‖ q) is generally not D(q ‖ p).
  • If q assigns probability zero to a symbol p still produces, KL is infinite — the cost of being told an actual event is impossible.
  • The Gaussian closed form makes KL between normals a one-line formula — the workhorse of variational inference.
  • KL is not a metric, but it underlies most divergences in statistics: cross-entropy, log-likelihood, mutual information, and Jensen-Shannon.