The cost, in extra bits, of coding for the wrong distribution. Cross-entropy and the bridge to ML.
Suppose you build the perfect compression code for a distribution q — the code lengths −log₂ q(x) are minimal in expectation under q. Then the world hands you data sampled from a different distribution, p. How many extra bits per symbol do you pay? The answer is the Kullback-Leibler divergence:
D(p ‖ q) = Σ p(x) log₂ [p(x) / q(x)]
That is the entire object. It is the gap between the average code length you actually achieve, called the cross-entropy H(p, q) = −Σ p log₂ q, and the optimal rate H(p) you would have hit with a code matched to p. The bookkeeping lines up exactly:
H(p, q) = H(p) + D(p ‖ q)
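To make the bookkeeping concrete, here is a minimal Python sketch, with two made-up four-symbol distributions (not the demo's presets), that computes H(p), H(p, q), and D(p ‖ q) in bits and checks that cross-entropy really is entropy plus divergence:

```python
import math

def entropy(p):
    """Shannon entropy H(p) in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(p, q): expected bits/symbol when data from p is coded with lengths -log2 q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    """D(p ‖ q) in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.125, 0.125]   # what the data actually follows
q = [0.25, 0.25, 0.25, 0.25]    # what the code was built for

print(cross_entropy(p, q))      # 2.0 bits/symbol actually paid
print(entropy(p), kl(p, q))     # 1.75 optimal + 0.25 wasted
print(entropy(p) + kl(p, q))    # 2.0 again: H(p, q) = H(p) + D(p ‖ q)
```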
KL is always nonnegative (Gibbs' inequality), zero if and only if p equals q, and, crucially, asymmetric: in general D(p ‖ q) does not equal D(q ‖ p), so KL is not a metric. It is a one-sided cost, and which side you measure it from matters.
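The asymmetry is easy to see numerically. A quick sketch with two hypothetical distributions (the helper is repeated so the snippet stands alone):

```python
import math

def kl(p, q):
    """D(p ‖ q) in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.9, 0.1]   # heavily skewed
q = [0.5, 0.5]   # uniform

print(kl(p, q))  # ≈ 0.531 bits: coding skewed data with a code built for the uniform q
print(kl(q, p))  # ≈ 0.737 bits: coding uniform data with a code built for the skewed p
```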
See also: Information Theory in ML for KL and cross-entropy as the loss function for classification, and Shannon Entropy for the H(p) you are subtracting from H(p, q).
Pick which distribution you are editing, then drag any orange or blue bar up or down. The other bars in that distribution rescale automatically. Notice that D(p ‖ q) and D(q ‖ p) are usually different — that is the famous asymmetry of KL. Try the Support Mismatch preset: q assigns probability zero to a symbol p still uses, and D(p ‖ q) blows up to infinity.
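The blow-up in the Support Mismatch preset is not a numerical artifact; it falls straight out of the formula. A small sketch (the distributions are made up for illustration):

```python
import math

def kl(p, q):
    """D(p ‖ q) in bits, with the two zero-probability conventions spelled out."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue            # 0 * log(0 / q) is taken as 0: p never sends that symbol
        if qi == 0:
            return math.inf     # p uses a symbol that q gave zero probability: infinite penalty
        total += pi * math.log2(pi / qi)
    return total

print(kl([0.6, 0.3, 0.1], [0.7, 0.3, 0.0]))   # inf: q is missing a symbol p still uses
print(kl([0.7, 0.3, 0.0], [0.6, 0.3, 0.1]))   # ≈ 0.156 bits: the other direction is finite
```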
Symbols come from the source distribution p but you encode them with a Huffman code built for q. The blue stacks under each symbol show that code's actual length; the dashed orange tick is the ideal length −log₂ p_i. Drag the slider to morph q toward p — the code lengths shift to match, the cross-entropy H(p, q) falls toward the entropy H(p), and the gap D(p ‖ q) — the wasted bits per symbol — collapses to zero. Cross-entropy is exactly H(p) plus that waste.
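The same experiment can be run offline. The sketch below builds an actual Huffman code for a hypothetical q, measures how many bits per symbol it costs when the data really follows p, then morphs q toward p, as the slider does, to watch the waste collapse (all numbers are illustrative, not the demo's presets):

```python
import heapq
import math

def huffman_lengths(probs):
    """Code lengths (bits) of a Huffman code built for the dict {symbol: probability}."""
    heap = [(w, i, {sym: 0}) for i, (sym, w) in enumerate(probs.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}  # one level deeper
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

p = {"A": 0.1, "B": 0.2, "C": 0.3, "D": 0.4}   # source distribution
q = {"A": 0.4, "B": 0.3, "C": 0.2, "D": 0.1}   # distribution the code was designed for

H_p = -sum(w * math.log2(w) for w in p.values())
lengths = huffman_lengths(q)
avg_len = sum(p[s] * lengths[s] for s in p)     # the blue stacks, averaged under p
H_pq = -sum(p[s] * math.log2(q[s]) for s in p)  # ideal (non-integer) lengths for q
print(avg_len, H_pq, H_p)                       # 2.6, ≈2.50, ≈1.85 bits/symbol

# Morph q toward p and watch H(p, q_t) fall toward H(p), i.e. D(p ‖ q_t) toward 0.
for t in (0.0, 0.25, 0.5, 0.75, 1.0):
    q_t = {s: (1 - t) * q[s] + t * p[s] for s in p}
    H_pqt = -sum(p[s] * math.log2(q_t[s]) for s in p)
    print(f"t={t:.2f}  H(p,q_t)={H_pqt:.3f}  D(p‖q_t)={H_pqt - H_p:.3f}")
```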
Closed form for the KL between two univariate Gaussians p = N(μ_p, σ_p²) and q = N(μ_q, σ_q²), with the natural log, so D is in nats: D = log(σ_q/σ_p) + (σ_p² + (μ_p − μ_q)²) / (2σ_q²) − 1/2. When σ_p = σ_q, swapping the means leaves D unchanged, because the formula only sees (μ_p − μ_q)². But swapping the variances usually changes D, because σ_p and σ_q play different roles in the closed form. Setting μ_p = μ_q and σ_p = σ_q gives D = 0 exactly.
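Here is that closed form in code, plus a Monte Carlo sanity check. The parameter values are arbitrary; this is a sketch, and everything is in nats to match the formula above:

```python
import math
import random

def kl_gauss(mu_p, sigma_p, mu_q, sigma_q):
    """D(p ‖ q) in nats for univariate Gaussians p = N(mu_p, sigma_p^2), q = N(mu_q, sigma_q^2)."""
    return (math.log(sigma_q / sigma_p)
            + (sigma_p**2 + (mu_p - mu_q)**2) / (2 * sigma_q**2)
            - 0.5)

# Equal sigmas: swapping the means changes nothing (both print 2.0).
print(kl_gauss(0.0, 1.0, 2.0, 1.0), kl_gauss(2.0, 1.0, 0.0, 1.0))

# Equal means: swapping the variances does change D (≈ 0.318 vs ≈ 0.807).
print(kl_gauss(0.0, 1.0, 0.0, 2.0), kl_gauss(0.0, 2.0, 0.0, 1.0))

# Monte Carlo estimate of E_p[log p(x) - log q(x)] should match the closed form.
def log_pdf(x, mu, sigma):
    return -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

mu_p, s_p, mu_q, s_q = 0.3, 1.2, -0.5, 0.8
xs = [random.gauss(mu_p, s_p) for _ in range(200_000)]
mc = sum(log_pdf(x, mu_p, s_p) - log_pdf(x, mu_q, s_q) for x in xs) / len(xs)
print(mc, kl_gauss(mu_p, s_p, mu_q, s_q))   # agree to roughly two decimal places
```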