Joint, Conditional & Mutual Information

How much knowing one variable tells you about another. The chain rule for entropy.

One random variable gives you entropy. Two variables, taken jointly, give you the entire grammar of information theory: joint entropy, conditional entropy, and the centerpiece of this lesson — mutual information. The objects are easy to write down, and they satisfy a single accounting identity that everything else in the module rests on.

For a pair of discrete random variables (X, Y) with joint distribution p(x, y), the joint entropy is H(X, Y) = −Σ_{x, y} p(x, y) log₂ p(x, y), and the conditional entropy is H(X | Y) = H(X, Y) − H(Y). Subtracting the joint entropy from the sum of the marginal entropies gives the mutual information:

I(X ; Y) = H(X) + H(Y) − H(X, Y) = H(X) − H(X | Y) = H(Y) − H(Y | X).

In words: I(X ; Y) is the number of bits of uncertainty about X that you eliminate by observing Y. It is symmetric in X and Y, always nonnegative, and equals zero exactly when X and Y are independent.
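
These identities are mechanical enough to verify directly. Here is a minimal sketch (plain Python, log base 2; the 4 × 4 uniform table is an illustrative choice, not anything fixed by the lesson) that computes all six quantities from a joint table:

```python
import math

def entropy(probs):
    """Shannon entropy in bits; terms with p = 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def info_quantities(joint):
    """joint: 2-D list of nonnegative entries summing to 1."""
    px = [sum(row) for row in joint]                  # marginal P(X)
    py = [sum(col) for col in zip(*joint)]            # marginal P(Y)
    hx, hy = entropy(px), entropy(py)
    hxy = entropy([p for row in joint for p in row])  # H(X, Y)
    return {
        "H(X)": hx, "H(Y)": hy, "H(X,Y)": hxy,
        "H(X|Y)": hxy - hy,          # chain rule: H(X|Y) = H(X,Y) - H(Y)
        "H(Y|X)": hxy - hx,
        "I(X;Y)": hx + hy - hxy,     # mutual information
    }

# Independent uniform pair on a 4x4 grid.
uniform = [[1 / 16] * 4 for _ in range(4)]
for name, value in info_quantities(uniform).items():
    print(f"{name} = {value:.3f}")   # H(X) = H(Y) = 2, H(X,Y) = 4, I = 0
```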

Interactive: A Joint Distribution by Hand

A draggable n-by-n heatmap of P(X, Y), shown alongside the marginals P(X) and P(Y). Drag any cell to redistribute mass; the six readouts — H(X), H(Y), H(X, Y), H(X | Y), H(Y | X), and I(X ; Y) — update live. Try the Y = X preset to push I(X ; Y) up to log₂ n, and the Independent preset to drive it back to zero.
Live readouts (Independent preset shown): H(X) = 2.000, H(Y) = 2.000, H(X, Y) = 4.000, H(X | Y) = 2.000, H(Y | X) = 2.000, I(X ; Y) = 0.000.

Each cell of the heatmap is one joint probability P(X = x_i, Y = y_j). Drag a cell up or down to redistribute probability mass; the other cells rescale so the total still sums to 1. The bars on top and right are the marginals P(Y) and P(X). Try the "Y = X" preset: H(X | Y) collapses to 0 and I(X ; Y) jumps to log₂ n. The "Independent" preset drives I(X ; Y) to 0 with no change to the marginals.
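
To check the two presets without dragging, a short continuation of the sketch above (reusing its info_quantities helper; n = 4 is an arbitrary illustrative grid size) reproduces both endpoints:

```python
n = 4

# "Y = X" preset: all probability mass on the diagonal.
diag = [[1 / n if i == j else 0.0 for j in range(n)] for i in range(n)]
print({k: round(v, 3) for k, v in info_quantities(diag).items()})
# -> H(X|Y) = 0.0, I(X;Y) = log2(n) = 2.0 bits

# "Independent" preset with the same uniform marginals.
indep = [[1 / n**2] * n for _ in range(n)]
print({k: round(v, 3) for k, v in info_quantities(indep).items()})
# -> I(X;Y) = 0.0, H(X,Y) = H(X) + H(Y) = 4.0 bits
```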

Interactive: The Information Venn Diagram

The classic picture: two overlapping circles whose total area is H(X, Y), the X-only crescent is H(X | Y), the Y-only crescent is H(Y | X), and the intersection is I(X ; Y). Slide the agreement parameter q toward 1 and the circles slide together. Slide it to 1/2 and they pull apart — independence collapses I(X ; Y) to zero.

Two binary variables X and Y. The amber crescent is H(X | Y) — what you still don't know about X after seeing Y. The cyan crescent is H(Y | X). Their intersection is I(X ; Y), the bits of overlap.

Live readouts at the current q: H(X) = 1.000, H(Y) = 1.000, H(X, Y) = 1.610, H(X | Y) = 0.610, H(Y | X) = 0.610, I(X ; Y) = 0.390.

Slide the agreement q toward 1 (Y copies X) and the circles slide together — I(X ; Y) climbs to H(X) and the conditional entropies vanish. Slide q to 1/2 and the circles separate completely — I(X ; Y) drops to zero, recovering H(X, Y) = H(X) + H(Y) for independent variables. q = 0 means Y always disagrees with X, which is just as informative as q = 1.
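
The readout numbers are easy to reproduce. A minimal sketch, assuming X is a fair coin and Y agrees with X with probability q (the widget's slider; q = 0.85 matches the readouts above):

```python
import math

def h2(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def venn(q):
    hx = hy = 1.0          # fair-coin marginals
    hxy = 1.0 + h2(q)      # joint entropy of the 2x2 table
    return {"H(X,Y)": hxy, "H(X|Y)": hxy - hy,
            "H(Y|X)": hxy - hx, "I(X;Y)": hx + hy - hxy}

for q in (0.5, 0.85, 1.0):
    print(q, {k: round(v, 3) for k, v in venn(q).items()})
# q = 0.5  -> I = 0       (circles fully apart)
# q = 0.85 -> I ~ 0.390   (the readouts above)
# q = 1.0  -> I = 1       (circles coincide)
```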

Interactive: Mutual Information Through a Noisy Channel

A binary symmetric channel: input X is Bernoulli(p), output Y is X with each bit flipped independently with probability ε. The plot shows I(X ; Y) as a function of p for the current noise level. The curve always peaks at p = 1/2 — a uniform input is optimal — and the peak value C = 1 − H(ε), where H(ε) is the binary entropy of the flip probability, is the Shannon capacity of the channel, the topic of lesson 5.

Binary symmetric channel: input X ~ Bernoulli(p), output Y = X ⊕ noise, where the noise flips each bit independently with probability ε. I(X ; Y) measures how many bits per use the channel actually delivers.

Live readouts: I(X ; Y) at this p = 0.5310, capacity C = 0.5310, noise H(ε) = 0.4690.

For any noise level ε, the curve I(X ; Y) versus p peaks at p = 1/2 — a uniform input is optimal. The peak value, C = 1 − H(ε), is the Shannon capacity of the binary symmetric channel. At ε = 0 (noiseless) you get a full bit per use; at ε = 1/2 (pure noise) you get zero bits; and at ε = 1 (deterministic flip) you again get a full bit because the flip is reversible. Lesson 5 will turn this number into a coding theorem.
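
A minimal sketch of the same curve, assuming the widget's parameterization (input X ~ Bernoulli(p), crossover probability ε), confirms both the peak location and the capacity formula:

```python
import math

def h2(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_mi(p, eps):
    """I(X;Y) = H(Y) - H(Y|X) for the binary symmetric channel."""
    p_y1 = p * (1 - eps) + (1 - p) * eps   # P(Y = 1) after the flip noise
    return h2(p_y1) - h2(eps)              # H(Y|X) = h2(eps) for every x

eps = 0.1
curve = [(p / 100, bsc_mi(p / 100, eps)) for p in range(101)]
p_star, peak = max(curve, key=lambda t: t[1])
print(p_star, round(peak, 4), round(1 - h2(eps), 4))
# -> 0.5  0.531  0.531   (peak at p = 1/2 equals C = 1 - H(eps))
```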

The math objects

  • Joint distribution P(X, Y): a 2-D table of nonnegative numbers summing to 1. Row sums give the marginal P(X); column sums give P(Y).
  • Joint entropy H(X, Y) = E[−log p(X, Y)]: the entropy of the joint distribution viewed as one big distribution over pairs. It is bounded above by H(X) + H(Y), with equality if and only if X and Y are independent.
  • Conditional entropy H(X | Y) = H(X, Y) − H(Y): the average uncertainty left in X once you know Y. Equivalently, the expected entropy of the conditional distributions p(x | y) averaged over y. It is always nonnegative and at most H(X).
  • Chain rule: H(X, Y) = H(X) + H(Y | X) = H(Y) + H(X | Y). Decompose the uncertainty in (X, Y) any way you like.
  • Mutual information I(X ; Y): the reduction in uncertainty about X you gain by observing Y. Symmetric in X and Y, nonnegative, and zero iff X ⊥ Y. Equivalently, the KL divergence D(p(x, y) ‖ p(x) p(y)) between the joint and the product of the marginals — a measure of how far the joint is from being independent (checked numerically in the sketch after this list).
  • Channel capacity (preview): for a noisy channel that produces Y from X, the maximum I(X ; Y) over all input distributions p(x) is the channel capacity C. Lesson 5 promotes this number into a coding theorem: C bits per channel use is the best you can ever do.
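
As promised in the mutual-information item above, here is a quick numerical check that I(X ; Y) equals D(p(x, y) ‖ p(x) p(y)); the 2 × 2 joint is an arbitrary illustrative choice:

```python
import math

# An arbitrary 2x2 joint distribution, chosen only for illustration.
joint = [[0.30, 0.20],
         [0.10, 0.40]]
px = [sum(row) for row in joint]        # marginal P(X)
py = [sum(col) for col in zip(*joint)]  # marginal P(Y)

# KL divergence D(p(x,y) || p(x)p(y)) in bits.
kl = sum(joint[i][j] * math.log2(joint[i][j] / (px[i] * py[j]))
         for i in range(2) for j in range(2) if joint[i][j] > 0)

def H(ps):
    return -sum(p * math.log2(p) for p in ps if p > 0)

# I(X;Y) via the entropy identity H(X) + H(Y) - H(X,Y).
mi = H(px) + H(py) - H([p for row in joint for p in row])
print(round(kl, 6), round(mi, 6))   # the two numbers agree
```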

Key takeaways

  • I(X ; Y) measures the bits of overlap between X and Y. It is symmetric, nonnegative, and zero iff X and Y are independent.
  • H(X | Y) is what is left over: the uncertainty about X that observing Y does not resolve.
  • The chain rule H(X, Y) = H(X) + H(Y | X) lets you build joint entropies one variable at a time.
  • For independent variables H(X, Y) = H(X) + H(Y), and the Venn diagram circles separate completely.
  • For a deterministic relation Y = f(X) the conditional H(Y | X) collapses to zero, and I(X ; Y) saturates at H(Y).
  • For the binary symmetric channel I(X ; Y) is maximized at p = 1/2; the maximum, 1 − H(ε), is its Shannon capacity.