Shannon Entropy

H(X) = −Σ p log p. The fundamental measure of uncertainty in a probability distribution.


In 1948, Claude Shannon asked a deceptively simple question: how much information is in a message? He answered by defining a single quantity — the entropy of a probability distribution — that turned out to be the right answer to almost every question about information, coding, and communication. Compression bounds, channel capacity, machine learning loss functions, even thermodynamic free energy: they are all entropy in disguise.

The formula is short. For a discrete random variable X taking values with probabilities p₁, p₂, …, p_n, the Shannon entropy is H(X) = −Σᵢ pᵢ log pᵢ. The base of the logarithm sets the unit: base 2 gives bits, base e gives nats, base 10 gives hartleys. The convention 0 log 0 = 0 (justified by continuity, since x log x → 0 as x → 0⁺) makes the formula well defined when some outcomes are impossible.
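To make the definition concrete, here is a minimal Python sketch (the helper name `entropy` is my own choice, not any particular library's API) that computes H for a finite distribution in any base, honoring the 0 log 0 = 0 convention:

```python
import math

def entropy(probs, base=2.0):
    """Shannon entropy H(X) = -sum(p * log p), skipping zero-probability
    outcomes to honor the 0 log 0 = 0 convention."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

dist = [0.5, 0.25, 0.125, 0.125]
print(entropy(dist))               # 1.75 bits
print(entropy(dist, base=math.e))  # ~1.213 nats
print(entropy(dist, base=10))      # ~0.527 hartleys
```

Same distribution, same uncertainty; only the unit changes with the base.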

Interactive: Drag a Distribution

Click and drag any bar up or down to reshape the probability distribution. The remaining bars rescale automatically so the total always sums to 1. Watch how the entropy peaks at the uniform distribution and drops to zero as one outcome takes over.
[Widget readouts: presets; H(X) = 2.000 bits, max H = log₂ k = 2.000 bits, efficiency = 100.0%]

Entropy is maximized when the distribution is uniform, where every outcome is equally surprising; a near-deterministic distribution carries almost no information.
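You can check the same behavior numerically. The sketch below reuses the `entropy` helper from the earlier snippet (assumed to be in scope) and compares a uniform distribution against increasingly lopsided ones:

```python
# Reuses entropy() from the earlier sketch (base-2 logs by default).
uniform = [0.25, 0.25, 0.25, 0.25]
skewed  = [0.70, 0.10, 0.10, 0.10]
spiked  = [0.97, 0.01, 0.01, 0.01]

for p in (uniform, skewed, spiked):
    print(p, "->", round(entropy(p), 3), "bits")
# uniform -> 2.0 bits (the log2(4) maximum)
# skewed  -> 1.357 bits
# spiked  -> 0.242 bits (nearly deterministic, nearly zero entropy)
```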

Interactive: The Binary Entropy Curve

For a coin with bias p, the entropy is H(p) = −p log₂ p − (1−p) log₂ (1−p). The result is the iconic concave curve, peaking at exactly 1 bit when p = 1/2 and falling to 0 at the deterministic ends. Drag the marker to see the value at any bias.

H(p) = −p log₂ p − (1−p) log₂ (1−p). Click or drag along the curve to move the marker.

A fair coin (p = 1/2) carries exactly 1 bit. A coin biased to almost always come up heads (p near 1) carries almost no information — you already know the answer. The curve is symmetric: a coin that always comes up tails is just as predictable as one that always comes up heads.
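The curve is easy to reproduce. A small sketch, again with names of my own choosing:

```python
import math

def binary_entropy(p):
    """H(p) = -p log2 p - (1 - p) log2 (1 - p), with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.01, 0.1, 0.25, 0.5, 0.75, 0.9, 0.99):
    print(f"p = {p:<4}  H(p) = {binary_entropy(p):.3f} bits")
# Peaks at exactly 1.000 bit at p = 0.5 and is symmetric: H(p) = H(1 - p).
```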

Interactive: Self-Information

The self-information of a single outcome is I(x) = −log p(x). Rare events carry more bits than common ones — that is precisely why a winning lottery ticket is so newsworthy. Click the named benchmarks to see well-known events on the curve.

Self-information of a single outcome: I(x) = −log₂ p(x). Rare outcomes carry more bits.

[Widget readout: probability = 1.000e-1, self-information = 3.322 bits]
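Self-information is a one-liner. In the sketch below the benchmark events are illustrative examples of my own (a coin flip, a die roll, a card draw), not necessarily the named benchmarks in the widget:

```python
import math

def self_information(p):
    """I(x) = -log2 p(x): the surprise of a single outcome, in bits."""
    return -math.log2(p)

events = {
    "fair coin lands heads":            1 / 2,
    "fair die rolls a six":             1 / 6,
    "random card is the ace of spades": 1 / 52,
    "outcome with probability 0.1":     0.1,
}
for name, p in events.items():
    print(f"{name}: {self_information(p):.3f} bits")
# The last line prints 3.322 bits, matching the readout above.
```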

Interactive: Entropy of Real Text

The single-letter entropy of English is around 4.0 to 4.2 bits — well below the 4.70 bits a uniform 26-letter alphabet would carry. That gap is redundancy, and it is exactly what every compression algorithm exploits.

It is a truth universally acknowledged that a single man in possession of a good fortune must be in want of a wife. However little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families that he is considered the rightful property of some one or other of their daughters.

[Readout for the excerpt above: H of letters = 4.107 bits, uniform max log₂ 26 = 4.700 bits, redundancy = 12.6%]

Repeated text has almost zero entropy; pseudo-uniform text approaches the upper bound.
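To estimate the same quantity yourself, count letter frequencies and plug them into the entropy formula. A rough sketch (it keeps only the letters a to z and folds case, so its numbers will differ slightly from the widget's depending on the exact text and filtering):

```python
import math
from collections import Counter

def letter_entropy(text):
    """Entropy of the single-letter frequency distribution (a-z, case-folded), in bits."""
    letters = [c for c in text.lower() if "a" <= c <= "z"]
    counts = Counter(letters)
    n = len(letters)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

excerpt = ("It is a truth universally acknowledged that a single man in "
           "possession of a good fortune must be in want of a wife.")
h = letter_entropy(excerpt)
print(f"H = {h:.3f} bits per letter, redundancy = {1 - h / math.log2(26):.1%}")
```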

The math objects

  • Probability distribution: a finite list p₁, …, p_n of nonnegative numbers summing to 1. The math objects of information theory are functions of these distributions.
  • Entropy H(X): the expected value of the self-information, E[−log p(X)]. It measures the average uncertainty you have about a sample from X — equivalently, the average number of bits needed to identify it.
  • Maximum entropy: for a discrete X with k possible values, H(X) ≤ log k, with equality if and only if X is uniform. This is one of the most useful inequalities in all of statistics.
  • Self-information I(x) = −log p(x): a per-outcome quantity. Rare outcomes carry more bits because observing them reduces your uncertainty more.
  • Units (bit / nat / hartley): just a choice of logarithm base (2, e, 10). One bit ≈ 0.693 nats ≈ 0.301 hartleys. Engineers tend to use bits; theorists tend to use nats because natural logs play nicely with calculus. A quick numerical check of these conversions, and of the maximum-entropy bound above, appears in the sketch after this list.
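Both the unit conversions and the maximum-entropy bound are easy to verify numerically. A short sketch (random distributions, nothing clever):

```python
import math
import random

# Unit conversions: one bit expressed in nats and hartleys.
print(math.log(2))    # ~0.693 nats per bit
print(math.log10(2))  # ~0.301 hartleys per bit

# H(X) <= log2 k: check on random distributions over k outcomes.
def entropy_bits(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

k = 8
for _ in range(1000):
    weights = [random.random() for _ in range(k)]
    probs = [w / sum(weights) for w in weights]
    assert entropy_bits(probs) <= math.log2(k) + 1e-9
print("every sampled distribution satisfied H <= log2 k =", math.log2(k))
```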

Key takeaways

  • Entropy is the average uncertainty in a distribution, measured in bits.
  • The uniform distribution has the maximum possible entropy: log₂ k.
  • A deterministic distribution (one outcome with probability 1) has zero entropy.
  • Rare events carry more self-information than common ones — that's why surprise feels informative.
  • English text has roughly 4.1 bits per letter, well below the 4.7-bit ceiling; the rest is redundancy.