Source Coding & Huffman

Shannon's first theorem: entropy is the optimal compression bound. Huffman codes come within one bit of it.

A source coder turns symbols into bits. The question Shannon answered in 1948, and the one every compression algorithm since has inherited, is: how few bits, on average, can you possibly use? The answer is exactly the entropy of the source. The source-coding theorem states that the average code length L of any uniquely-decodable code satisfies L ≥ H(X). And a constructive scheme matches the bound: Huffman coding brings L within one bit of H(X) for any discrete source, and the gap vanishes entirely when the optimal lengths −log₂ pᵢ are integers, i.e. when every probability is a power of ½.

The recipe is intuitive. Common symbols deserve short codes; rare symbols can afford long ones. Huffman's algorithm just keeps merging the two least-probable nodes into a new parent until a single binary tree remains. Reading the tree top-down — left as 0, right as 1 — gives a prefix-free codebook, the optimal one. The Kraft inequality Σ 2^(−ℓᵢ) ≤ 1 turns out to be the combinatorial budget that all prefix codes must respect: each codeword of length ℓ consumes 2^(−ℓ) of a unit pie. Equality means the tree is full and no budget is wasted.
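
The loop is short enough to write out. A minimal sketch in Python (my own illustration, not the widget's code): a heap holds the forest, and each iteration fuses the two lightest trees, prefixing 0 to one side's codes and 1 to the other's.

    import heapq
    from itertools import count

    def huffman_code(probs):
        """Build a Huffman codebook {symbol: bitstring} from {symbol: probability}."""
        tie = count()  # tie-breaker so the heap never compares the dict payloads
        heap = [(p, next(tie), {s: ""}) for s, p in probs.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            p0, _, left = heapq.heappop(heap)   # lightest tree gets bit 0
            p1, _, right = heapq.heappop(heap)  # next lightest gets bit 1
            merged = {s: "0" + c for s, c in left.items()}
            merged.update({s: "1" + c for s, c in right.items()})
            heapq.heappush(heap, (p0 + p1, next(tie), merged))
        return heap[0][2]

    print(huffman_code({"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}))
    # {'a': '0', 'b': '10', 'c': '110', 'd': '111'} (up to tie-breaking)

The example distribution is dyadic, so the code lengths equal −log₂ pᵢ exactly and L = H.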

Interactive: Huffman Tree Builder

Pick a probability distribution (or sketch your own with the sliders) and step through Huffman's algorithm. At each merge the two lowest-probability nodes in the forest become children of a new parent. The codes for each symbol fall out of the final tree.
[Interactive widget: five-symbol preset E, T, A, O, I; the tree is built in four merges. Readout: H(X) = 2.287 bits, Huffman average L = 2.326 bits, gap L − H = 0.039 bits.]

Step through the algorithm: at each merge, the two lowest-probability nodes in the forest are combined under a new internal parent. Read codewords by walking the final tree top-down (left = 0, right = 1). Shannon's source-coding theorem guarantees L ≥ H(X); Huffman comes within one bit.
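
If the preset uses the relative frequencies of E, T, A, O, I in typical English text (roughly 12.7 : 9.1 : 8.2 : 7.5 : 7.0, renormalized to sum to 1; an assumption on my part), the widget's readout falls out directly:

    from math import log2

    # Assumed preset: English letter frequencies for E, T, A, O, I,
    # renormalized. The exact numbers are not stated in the text.
    freq = {"E": 12.7, "T": 9.1, "A": 8.2, "O": 7.5, "I": 7.0}
    total = sum(freq.values())
    p = {s: f / total for s, f in freq.items()}

    # Entropy in bits per symbol.
    H = -sum(q * log2(q) for q in p.values())

    # Codeword lengths read off the Huffman tree for this distribution
    # (O and I merge first, then A and T, then E with the O-I node).
    lengths = {"E": 2, "T": 2, "A": 2, "O": 3, "I": 3}
    L = sum(p[s] * lengths[s] for s in p)

    print(f"H = {H:.3f}, L = {L:.3f}, gap = {L - H:.3f}")
    # H = 2.287, L = 2.326, gap = 0.039: matching the readout above.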

Interactive: Compression Compared

The same text encoded three ways: a fixed-length code (⌈log₂ k⌉ bits per symbol), Huffman's variable-length code, and the entropy lower bound. The Huffman bar should sit between fixed and entropy — never below the entropy line, never more than one bit above it.

IT IS A TRUTH UNIVERSALLY ACKNOWLEDGED THAT A SINGLE MAN IN POSSESSION OF A GOOD FORTUNE MUST BE IN WANT OF A WIFE HOWEVER LITTLE KNOWN THE FEELINGS OR VIEWS OF SUCH A MAN MAY BE ON HIS FIRST ENTERING A NEIGHBOURHOOD THIS TRUTH IS SO WELL FIXED IN THE MINDS OF THE SURROUNDING FAMILIES

[Interactive widget readout: symbols k = 23, length 231 chars, Huffman / fixed = 83.2%, slack L − H = 9.4 bits total.]
First 30 symbols, Huffman-encoded:
0011011001111010001011000111000101101101100011110010111110100001111010000000000010101101000101010010101011111110110100000001010010

Three encodings of the same stream. The fixed-length code wastes bits whenever the distribution is non-uniform. Huffman shrinks toward the entropy lower bound (Shannon's theorem); the gap L − H stays under one bit per symbol. With a single repeated symbol the entropy is zero, yet Huffman must still spend 1 bit per symbol; that degenerate case is where block coding does meaningfully better.
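
These numbers can be checked offline. The readout's k = 23 and 231 characters match the excerpt with spaces stripped, so presumably spaces are not encoded; under that assumption, a sketch using the merge-cost identity (the Huffman total equals the sum of the weights of all merges, so no explicit tree is needed):

    import heapq
    from collections import Counter
    from math import ceil, log2

    text = ("IT IS A TRUTH UNIVERSALLY ACKNOWLEDGED THAT A SINGLE MAN IN "
            "POSSESSION OF A GOOD FORTUNE MUST BE IN WANT OF A WIFE HOWEVER "
            "LITTLE KNOWN THE FEELINGS OR VIEWS OF SUCH A MAN MAY BE ON HIS "
            "FIRST ENTERING A NEIGHBOURHOOD THIS TRUTH IS SO WELL FIXED IN "
            "THE MINDS OF THE SURROUNDING FAMILIES").replace(" ", "")

    counts = Counter(text)
    n, k = len(text), len(counts)            # n = 231, k = 23

    # Entropy of the empirical distribution, bits per symbol.
    H = -sum(c / n * log2(c / n) for c in counts.values())

    # Total Huffman bits: sum the weight of every merge.
    heap = list(counts.values())
    heapq.heapify(heap)
    huffman_bits = 0
    while len(heap) > 1:
        a, b = heapq.heappop(heap), heapq.heappop(heap)
        huffman_bits += a + b
        heapq.heappush(heap, a + b)

    fixed_bits = n * ceil(log2(k))           # 5 bits/symbol, fixed
    print(f"fixed {fixed_bits}, huffman {huffman_bits} "
          f"({huffman_bits / fixed_bits:.1%}), entropy {n * H:.1f}")

The output should reproduce the 83.2% ratio and the 9.4-bit slack shown in the readout.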

Interactive: The Kraft Inequality

A binary tree of depth 3-5; click any node to claim it as a codeword. Each claim consumes 2^(−ℓ) of the budget [0, 1] and grays out its ancestors and descendants (which would break the prefix property). The bar shows Σ 2^(−ℓᵢ); equality with 1 means the tree is full.

Click any node to claim it as a codeword. The Kraft inequality says Σ 2^(−ℓᵢ) ≤ 1 for any prefix code — every length-ℓ codeword consumes 2^(−ℓ) of the unit budget. A claim grays out its ancestors and descendants (they would no longer give a prefix-free code). Equality Σ = 1 happens exactly when the tree is full — no wasted budget. Try the Huffman 1, 2, 3, 3 preset: ½ + ¼ + ⅛ + ⅛ = 1, the canonical full tree.
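
The budget arithmetic is easy to check in a couple of lines (a small sketch; kraft_sum is my name, not the widget's):

    from fractions import Fraction

    def kraft_sum(lengths):
        """Budget used by codewords of the given lengths: sum of 2^(-l)."""
        return sum(Fraction(1, 2 ** l) for l in lengths)

    print(kraft_sum([1, 2, 3, 3]))  # 1: the full-tree preset, zero waste
    print(kraft_sum([2, 2, 3]))     # 5/8: valid, but budget left unused
    print(kraft_sum([1, 1, 2]))     # 5/4 > 1: no prefix code exists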

The math objects

  • Source-coding theorem (Shannon, 1948): for any uniquely-decodable code over a source X with entropy H(X) bits, the average codeword length L = Σ pᵢ ℓᵢ satisfies L ≥ H(X). Conversely, there exists a prefix code with L < H(X) + 1. Block-coding the source closes the remaining gap to zero in the limit.
  • Prefix code: a code in which no codeword is a prefix of another. Equivalent to placing codewords at the leaves of a binary tree, which is why every prefix code can be drawn and decoded by walking from the root.
  • Kraft inequality: Σ 2^(−ℓᵢ) ≤ 1. Necessary and sufficient for codeword lengths {ℓᵢ} to admit a prefix code; the sufficiency direction is constructive (see the sketch after this list). Equality means the binary tree is full: no leaf is wasted.
  • Huffman's algorithm: repeatedly merge the two lowest-probability nodes into a new parent until a single tree remains. The final tree is the optimal prefix code for the given distribution. Cost: O(n log n) with a heap.
  • Optimality gap L − H: always less than 1 bit for Huffman, and exactly zero when every probability is a power of ½ (so the optimal length ℓᵢ = −log₂ pᵢ is an integer). Real compressors close the gap further with arithmetic coding and block coding.
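
The sufficiency half of Kraft deserves a concrete demonstration: given any lengths with Σ 2^(−ℓᵢ) ≤ 1, codewords can be handed out greedily in order of increasing length (canonical coding). A minimal sketch, with a function name of my choosing:

    def code_from_lengths(lengths):
        """Assign prefix-free codewords to lengths satisfying Kraft."""
        assert sum(2 ** -l for l in lengths) <= 1, "Kraft violated"
        codes, next_code, prev_len = [], 0, 0
        for l in sorted(lengths):
            next_code <<= l - prev_len      # grow to the new length
            codes.append(format(next_code, f"0{l}b"))
            next_code += 1                  # skip the claimed subtree
            prev_len = l
        return codes

    print(code_from_lengths([1, 2, 3, 3]))  # ['0', '10', '110', '111']

Each assignment steps the running counter past the subtree the new codeword claims, which is exactly the 2^(−ℓ) slice of budget from the widget above.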

Key takeaways

  • Entropy H(X) is the absolute lower bound on the average code length of any lossless source code.
  • Huffman coding constructs the optimal prefix code: greedy bottom-up merging of the two smallest probabilities.
  • The Huffman gap L − H stays under one bit per symbol — and reaches zero when every probability is a power of ½.
  • The Kraft inequality Σ 2^(−ℓᵢ) ≤ 1 is the budget every prefix code obeys; equality means the binary tree is full.
  • Fixed-length coding wastes bits whenever the source distribution is non-uniform — exactly the redundancy compressors exploit.