How much knowing one variable tells you about another. The chain rule for entropy.
One distribution gives you entropy. A joint distribution over a pair of variables gives you the entire grammar of information theory: joint entropy, conditional entropy, and the centerpiece of this lesson, mutual information. The objects are easy to write down, and they satisfy a single accounting identity that everything else in the module rests on.
For a pair of discrete random variables (X, Y) with joint distribution p(x, y), the joint entropy is H(X, Y) = −Σ_x Σ_y p(x, y) log₂ p(x, y), and the conditional entropy follows by the chain rule: H(X | Y) = H(X, Y) − H(Y). Combining these with the marginal entropies gives the mutual information:
I(X ; Y) = H(X) + H(Y) − H(X, Y) = H(X) − H(X | Y) = H(Y) − H(Y | X).
In words: I(X ; Y) is the number of bits of uncertainty about X that you eliminate by observing Y. It is symmetric in X and Y, always nonnegative, and equals zero exactly when X and Y are independent.
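A few lines of code make the identity concrete. Here is a minimal sketch (Python with NumPy; the helper names are ours, not from any library) that computes every quantity above from a joint probability table:

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a probability vector, with 0 log 0 := 0."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def information_quantities(joint):
    """H(X), H(Y), H(X,Y), H(X|Y), I(X;Y) from a table joint[i, j] = P(X=x_i, Y=y_j)."""
    joint = np.asarray(joint, dtype=float)
    p_x = joint.sum(axis=1)              # marginal P(X): sum over y
    p_y = joint.sum(axis=0)              # marginal P(Y): sum over x
    h_x, h_y = entropy(p_x), entropy(p_y)
    h_xy = entropy(joint.ravel())        # joint entropy H(X, Y)
    h_x_given_y = h_xy - h_y             # chain rule: H(X|Y) = H(X,Y) - H(Y)
    i_xy = h_x + h_y - h_xy              # the accounting identity
    return h_x, h_y, h_xy, h_x_given_y, i_xy

# A noisy copy: X uniform on {0, 1}, Y agrees with X 90% of the time.
joint = np.array([[0.45, 0.05],
                  [0.05, 0.45]])
h_x, h_y, h_xy, h_x_given_y, i_xy = information_quantities(joint)
print(f"H(X) = {h_x:.3f}   H(Y) = {h_y:.3f}   H(X,Y) = {h_xy:.3f}")
print(f"H(X|Y) = {h_x_given_y:.3f}   I(X;Y) = {i_xy:.3f}")  # I = H(X) - H(X|Y)
```

For this table, I(X ; Y) = 1 − H(0.1) ≈ 0.531 bits: observing Y removes about half of the one bit of uncertainty in X.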
Each cell of the heatmap is one joint probability P(X = x_i, Y = y_j). Drag a cell up or down to redistribute probability mass; the other cells rescale so the total still sums to 1. The bars on top and right are the marginals P(Y) and P(X). Try the "Y = X" preset: H(X | Y) collapses to 0 and I(X ; Y) jumps to log₂ n. The "Independent" preset drives I(X ; Y) to 0 with no change to the marginals.
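Both presets are easy to reproduce away from the widget. A short sketch under the same assumptions (NumPy again; an n = 4 alphabet chosen only for illustration) confirms the two limits:

```python
import numpy as np

def mutual_information(joint):
    """I(X;Y) in bits from a table joint[i, j] = P(X = x_i, Y = y_j)."""
    joint = np.asarray(joint, dtype=float)
    p_x = joint.sum(axis=1, keepdims=True)       # marginal P(X)
    p_y = joint.sum(axis=0, keepdims=True)       # marginal P(Y)
    mask = joint > 0
    return np.sum(joint[mask] * np.log2(joint[mask] / (p_x * p_y)[mask]))

n = 4
diag = np.eye(n) / n                 # "Y = X" preset: uniform mass on the diagonal
indep = np.full((n, n), 1 / n**2)    # "Independent" preset with the same uniform marginals

print(f'"Y = X":        I(X;Y) = {mutual_information(diag):.3f}  (log2 {n} = {np.log2(n):.3f})')
print(f'"Independent":  I(X;Y) = {mutual_information(indep):.3f}')
```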
Two binary variables X and Y. The amber crescent is H(X | Y), what you still don't know about X after seeing Y. The cyan crescent is H(Y | X). The lens where the two circles overlap is I(X ; Y), the bits they share.
Slide the agreement q toward 1 (Y copies X) and the circles slide together — I(X ; Y) climbs to H(X) and the conditional entropies vanish. Slide q to 1/2 and the circles separate completely — I(X ; Y) drops to zero, recovering H(X, Y) = H(X) + H(Y) for independent variables. q = 0 means Y always disagrees with X, which is just as informative as q = 1.
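The whole slider reduces to one formula. Assuming X is uniform (an assumption of this sketch, not something the widget requires), Y is uniform too and H(Y | X) = H(q), so I(X ; Y) = 1 − H(q) with H the binary entropy function. A quick sketch traces the sweep:

```python
import numpy as np

def h2(q):
    """Binary entropy in bits, with 0 log 0 := 0 via clipping."""
    q = np.clip(q, 1e-12, 1 - 1e-12)
    return -q * np.log2(q) - (1 - q) * np.log2(1 - q)

# X uniform; Y agrees with X with probability q, so H(Y | X) = H(q)
# and I(X ; Y) = H(Y) - H(Y | X) = 1 - H(q).
for q in [0.0, 0.25, 0.5, 0.75, 1.0]:
    print(f"q = {q:4.2f}   I(X;Y) = {1 - h2(q):.3f} bits")
```

The printout is symmetric around q = 1/2, which is exactly the claim that q = 0 is as informative as q = 1.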
Binary symmetric channel: input X ~ Bernoulli(p), output Y = X ⊕ noise, where the noise flips each bit independently with probability ε. I(X ; Y) measures how many bits per use the channel actually delivers.
For any noise level ε, the curve of I(X ; Y) versus p peaks at p = 1/2: a uniform input is optimal. The peak value, C = 1 − H(ε), where H(ε) is the binary entropy of the noise, is the Shannon capacity of the binary symmetric channel. At ε = 0 (noiseless) you get a full bit per use; at ε = 1/2 (pure noise) you get zero bits; and at ε = 1 (deterministic flip) you again get a full bit, because the flip is reversible. Lesson 5 will turn this number into a coding theorem.
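The capacity claim is easy to check by sweeping p at a few noise levels and comparing each peak against 1 − H(ε). A sketch using the standard BSC decomposition I(X ; Y) = H(Y) − H(ε), which holds because H(Y | X) = H(ε):

```python
import numpy as np

def h2(p):
    """Binary entropy in bits; clipping avoids log(0) at the endpoints."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def bsc_mutual_information(p, eps):
    """I(X;Y) for X ~ Bernoulli(p) through a BSC with flip probability eps."""
    p_y1 = p * (1 - eps) + (1 - p) * eps   # P(Y = 1)
    return h2(p_y1) - h2(eps)              # I = H(Y) - H(Y|X)

ps = np.linspace(0.01, 0.99, 99)
for eps in [0.0, 0.11, 0.5, 1.0]:
    curve = bsc_mutual_information(ps, eps)
    # At eps = 0.5 the curve is identically zero, so the "peak" location is arbitrary.
    print(f"eps = {eps:4.2f}   peak I = {curve.max():.3f} at p = {ps[np.argmax(curve)]:.2f}   "
          f"1 - H(eps) = {1 - h2(eps):.3f}")
```

Every row should show the peak at p = 0.50 (where the curve is not flat) and a peak value matching 1 − H(ε), including the classic C ≈ 0.5 at ε = 0.11.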