H(X) = −Σ p log p. The fundamental measure of uncertainty in a probability distribution.
In 1948, Claude Shannon asked a deceptively simple question: how much information is in a message? He answered by defining a single quantity — the entropy of a probability distribution — that turned out to be the right answer to almost every question about information, coding, and communication. Compression bounds, channel capacity, machine learning loss functions, even thermodynamic free energy: they are all entropy in disguise.
The formula is short. For a discrete random variable X taking values with probabilities p₁, p₂, …, p_n, the Shannon entropy is H(X) = −Σᵢ pᵢ log pᵢ. The base of the logarithm chooses the unit: base 2 gives bits, base e gives nats, base 10 gives hartleys. The convention 0 log 0 = 0 (from continuity) makes the formula well defined when some outcomes are impossible.
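As a concrete sketch of the definition, here is a small Python function that computes H for a list of probabilities, with a selectable log base and the 0 log 0 = 0 convention handled by skipping zero terms. The function name and the sum-to-1 check are my own choices for illustration, not part of the article.

```python
import math

def shannon_entropy(probs, base=2):
    """Shannon entropy H = -sum(p * log(p)) of a discrete distribution.

    base=2 gives bits, base=math.e gives nats, base=10 gives hartleys.
    Outcomes with p == 0 are skipped, matching the 0 log 0 = 0 convention.
    """
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return -sum(p * math.log(p, base) for p in probs if p > 0)
```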
Drag any bar up or down. The other bars rescale automatically so the distribution still sums to 1. Entropy is maximized when the distribution is uniform — every outcome equally surprising. A near-deterministic distribution carries almost no information.
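Using the sketch above, a quick check of that claim for four outcomes: the uniform distribution hits the maximum of log₂ 4 = 2 bits, while a heavily skewed one sits far below it.

```python
print(shannon_entropy([0.25, 0.25, 0.25, 0.25]))  # uniform over 4 outcomes: 2.0 bits
print(shannon_entropy([0.97, 0.01, 0.01, 0.01]))  # near-deterministic: ~0.24 bits
```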
H(p) = −p log₂ p − (1−p) log₂ (1−p). Click or drag along the curve to move the marker.
A fair coin (p = 1/2) carries exactly 1 bit. A coin biased to almost always come up heads (p near 1) carries almost no information — you already know the answer. The curve is symmetric: a coin that always comes up tails is just as predictable as one that always comes up heads.
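The curve itself is easy to evaluate directly. A sketch in the same style (the function name is mine), plugging in the fair coin, a heavily biased coin, and its mirror image:

```python
import math

def binary_entropy(p):
    """H(p) = -p log2 p - (1-p) log2 (1-p), with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(binary_entropy(0.5))   # fair coin: 1.0 bit
print(binary_entropy(0.99))  # heavily biased coin: ~0.08 bits
print(binary_entropy(0.01))  # symmetric: the same ~0.08 bits
```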
Self-information of a single outcome: I(x) = −log₂ p(x). Rare outcomes carry more bits.
It is a truth universally acknowledged that a single man in possession of a good fortune must be in want of a wife. However little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families that he is considered the rightful property of some one or other of their daughters.
Single-letter entropy of English is around 4.0 to 4.2 bits — well below the 4.70 bits a uniform 26-letter alphabet would carry. The gap is redundancy, and it is exactly what compression algorithms exploit. Repeated text has almost zero entropy; pseudo-uniform text approaches the upper bound.
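One rough way to reproduce that number is to estimate letter probabilities from their frequencies in a sample text and plug them into the entropy formula. The helper below is a sketch under that assumption; a short passage gives a noisy estimate, but the opening sentence above already lands near the quoted range, at about 4.1 bits per letter.

```python
import math
from collections import Counter

def letter_entropy(text):
    """Empirical single-letter entropy (bits) of the letters in text."""
    letters = [c for c in text.lower() if c.isalpha()]
    counts = Counter(letters)
    total = len(letters)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

sample = ("It is a truth universally acknowledged that a single man in "
          "possession of a good fortune must be in want of a wife.")
print(letter_entropy(sample))  # about 4.1 bits per letter for this sentence
```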