The Asymptotic Equipartition Property

Long sequences from a source are almost certainly typical — and almost equally probable.

Take an i.i.d. source X with entropy H(X) and look at a very long sample sequence X₁, X₂, …, X_n. The Asymptotic Equipartition Property — the AEP — says that almost every such sequence has essentially the same probability: roughly 2^(−nH(X)). Equivalently, the per-symbol log-probability concentrates on H(X):

−(1/n) log P(X₁…X_n) → H(X) almost surely as n → ∞. This is just the law of large numbers applied to the random variable −log p(X), whose expected value is exactly the entropy.
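
A minimal numerical sketch of this convergence (assuming NumPy; the three-symbol distribution below is illustrative, not one used in the article): sample many length-n sequences, compute −(1/n) log₂ P for each, and watch the sample mean settle on H(X) as n grows.

    import numpy as np

    rng = np.random.default_rng(0)
    p = np.array([0.7, 0.2, 0.1])        # illustrative source distribution, not from the article
    H = -(p * np.log2(p)).sum()          # entropy H(X) in bits

    def per_symbol_rate(n, samples=600):
        """Sample length-n i.i.d. sequences and return -(1/n) log2 P(sequence) for each."""
        seqs = rng.choice(len(p), size=(samples, n), p=p)
        logP = np.log2(p)[seqs].sum(axis=1)   # log2 P(x1..xn) = sum of log2 p(xi)
        return -logP / n

    for n in (10, 100, 1000):
        rates = per_symbol_rate(n)
        print(f"n={n:5d}  mean={rates.mean():.4f}  std={rates.std():.4f}  (H = {H:.4f})")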

The consequence is dramatic. Take a binary source: out of the 2^n possible sequences of length n, almost all of the probability mass sits on a much smaller subset — the typical set T_ε^(n), of size ≈ 2^(nH(X)). When the source is biased, that subset is a tiny sliver of the full cube. This is the mathematical reason compression works: long messages from a structured source live on a vanishingly small fraction of the possible sequence space, and we only need enough bits to address that small fraction.

Interactive: Watch the Histogram Concentrate

Pick a source distribution. We sample 600 length-n sequences and plot a histogram of −(1/n) log P(seq) for each. As n grows, the histogram tightens around H(X). The shaded band marks the ε-typical region: sequences whose per-symbol rate falls within ε of H.
Widget readout for the selected source: H(X) = 0.610, sample mean 0.614, sample std 0.222, 52.2% of samples typical.

Each bar is the count of sampled length-n sequences whose per-symbol log-probability lies in that bin. The dashed line marks H(X). As you crank up n, the histogram tightens around H — that is the AEP. Sequences whose rate lands inside the shaded band [H − ε, H + ε] are the ε-typical sequences. The standard deviation should shrink like roughly 1/√n.
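
A quick check of that 1/√n claim, as a sketch assuming NumPy; the bias p = 0.15 is an illustrative choice, not taken from the widget. The standard deviation of −(1/n) log₂ P across samples should drop by about √10 each time n grows tenfold, matching the closed-form value √(p(1−p)) · |log₂(p/(1−p))| / √n.

    import numpy as np

    rng = np.random.default_rng(1)
    p = 0.15                                   # illustrative bias, not from the article
    H = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

    def rate_std(n, samples=2000):
        """Std of the per-symbol rate -(1/n) log2 P over many length-n Bernoulli(p) sequences."""
        ones = rng.binomial(n, p, size=samples)               # number of 1s in each sequence
        logP = ones * np.log2(p) + (n - ones) * np.log2(1 - p)
        return (-logP / n).std()

    for n in (10, 100, 1000, 10000):
        predicted = np.sqrt(p * (1 - p)) * abs(np.log2(p / (1 - p))) / np.sqrt(n)
        print(f"n={n:6d}  std={rate_std(n):.4f}  predicted ~{predicted:.4f}")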

Interactive: Typical vs Atypical Sequences

The full sequence space has 2^n elements. The typical set has only ≈ 2^(nH(p)). For a fair coin those numbers agree. As bias grows, the typical set becomes a vanishing sliver of the cube — yet it still carries almost all of the probability.

For a binary source with bias p, the full sequence space has 2^n elements, but the typical set has only ≈ 2^(nH(p)) of them. The ratio is exact — the rectangles are area-scaled.

Widget readout at n = 20: H(p) = 0.610, 2^n = 1,048,576, 2^(nH) ≈ 4,695, ratio ≈ 223×.

For a fair coin (p = 1/2) the typical set is the entire sequence space — there is nothing to compress. Push p toward 1 and the typical set becomes a vanishing sliver of the full cube, even though it still carries almost all of the probability. That gap is exactly Shannon's compression budget.
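
A sketch of that size comparison in code, assuming NumPy; the values n = 20 and p = 0.15 are chosen to roughly match the readout above and are otherwise illustrative.

    import numpy as np

    def binary_entropy(p):
        """H(p) in bits for a Bernoulli(p) source."""
        if p in (0.0, 1.0):
            return 0.0
        return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

    n, p = 20, 0.15                  # roughly matches the readout above; illustrative values
    H = binary_entropy(p)
    full = 2.0 ** n                  # all length-n binary sequences
    typical = 2.0 ** (n * H)         # approximate size of the typical set
    print(f"H(p) = {H:.3f}")
    print(f"2^n = {full:,.0f},  2^(nH) ≈ {typical:,.0f},  ratio ≈ {full / typical:,.0f}×")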

Interactive: Achievable Compression Rate

Sample a long binary sequence with bias p, then encode typical sequences with ⌈n(H + ε)⌉ bits and atypical ones literally with n bits. As n grows, the achieved per-symbol rate converges to H(p) — Shannon's source coding theorem in action.

“Compress” by encoding typical sequences with ⌈n(H + ε)⌉ bits and atypical ones literally with n bits. The achieved rate converges to H(p) — Shannon's source coding theorem.

Widget readout: H(p) limit 0.610, achieved rate 0.837, 58.4% of sequences typical, overhead 22.7%.

Increase n and the achieved rate squeezes down toward H(p), the information-theoretic floor. Tighten ε (smaller band) and the floor is closer, but the typical-set probability shrinks unless n is large enough to compensate. This is exactly the trade-off in Shannon's source coding theorem: for any rate above H, there exists an n large enough that we can encode typical sequences within budget and ignore the vanishing tail.
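
A sketch of this coding scheme, assuming NumPy; the values of p, ε, and the block lengths are illustrative. Each sampled block gets one flag bit plus either ⌈n(H + ε)⌉ bits (typical) or n bits (atypical); with ε held fixed the achieved rate approaches H + ε, and shrinking ε pushes the floor toward H provided n keeps up.

    import numpy as np

    rng = np.random.default_rng(2)
    p, eps = 0.15, 0.05                              # illustrative bias and tolerance
    H = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

    def achieved_rate(n, blocks=2000):
        """Average bits/symbol: ceil(n(H+eps)) bits for typical blocks, n bits for atypical, +1 flag bit."""
        ones = rng.binomial(n, p, size=blocks)
        rate = -(ones * np.log2(p) + (n - ones) * np.log2(1 - p)) / n
        typical = np.abs(rate - H) <= eps
        bits = np.where(typical, np.ceil(n * (H + eps)), n) + 1
        return bits.mean() / n, typical.mean()

    for n in (50, 200, 1000, 5000):
        r, frac = achieved_rate(n)
        print(f"n={n:5d}  achieved={r:.3f}  H={H:.3f}  typical={frac:.1%}")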

The math objects

  • The empirical entropy rate: for a sequence x = (x₁, …, x_n) drawn i.i.d. from a source with distribution p, the per-symbol log-probability is −(1/n) log P(x) = −(1/n) Σ log p(x_i). Its expected value is H(X), and the law of large numbers makes it converge to H(X) in probability.
  • The typical set T_ε^(n): all sequences x with |−(1/n) log P(x) − H(X)| ≤ ε. By construction every typical sequence has probability between 2^(−n(H+ε)) and 2^(−n(H−ε)) — they are all roughly equiprobable, hence the name.
  • Two AEP guarantees (Cover & Thomas, Theorem 3.1.2): for every ε > 0 and large enough n, P(X^n ∈ T_ε^(n)) ≥ 1 − ε, and (1 − ε) · 2^(n(H − ε)) ≤ |T_ε^(n)| ≤ 2^(n(H + ε)). Almost all of the probability sits on a set of size approximately 2^(nH); both guarantees are checked numerically in the sketch after this list.
  • Source coding consequence: any rate R > H(X) is achievable — assign one short codeword to each typical sequence (about ⌈n(H + ε)⌉ bits suffice to address them all) and accept a vanishingly small failure probability on the atypical tail. No rate below H is achievable. This is Shannon's first theorem.
  • Why “equipartition”: within the typical set, sequences differ in probability only by factors of 2^(±nε), which becomes negligible compared to the absolute probability 2^(−nH). For practical purposes the typical set is uniform.
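
A brute-force check of those two guarantees for a Bernoulli(p) source, as a sketch assuming NumPy; p, ε, and the block lengths are illustrative. Grouping sequences by their count of 1s makes the typical-set probability and size exact without enumerating all 2^n sequences, and the printout shows both bounds being met once n is large enough.

    import numpy as np
    from math import comb

    p, eps = 0.15, 0.1                           # illustrative values, not from the article
    H = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

    def typical_set_stats(n):
        """Exact P(T_eps^(n)) and |T_eps^(n)| for Bernoulli(p), grouping sequences by their number of 1s."""
        prob, size = 0.0, 0
        for k in range(n + 1):
            logP = k * np.log2(p) + (n - k) * np.log2(1 - p)   # log2-prob of any sequence with k ones
            if abs(-logP / n - H) <= eps:                      # all comb(n, k) such sequences are typical together
                prob += comb(n, k) * p**k * (1 - p)**(n - k)
                size += comb(n, k)
        return prob, size

    for n in (16, 64, 256, 1024):
        prob, size = typical_set_stats(n)
        lower, upper = (1 - eps) * 2 ** (n * (H - eps)), 2 ** (n * (H + eps))
        print(f"n={n:5d}  P(T)={prob:.4f} (goal ≥ {1 - eps:.2f})  |T|={size:.3g}  bounds [{lower:.3g}, {upper:.3g}]")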

Key takeaways

  • −(1/n) log P(X₁…X_n) converges to H(X): the per-symbol log-probability obeys the law of large numbers, and its limit is exactly the entropy.
  • Almost all probability mass lives on the typical set, of size ≈ 2^(nH(X)) — much smaller than the full 2^n cube when the source is biased.
  • Within the typical set, every sequence has approximately the same probability ≈ 2^(−nH(X)). That's the “equipartition”.
  • Shannon's source coding theorem follows immediately: H(X) is the optimal per-symbol bit rate, achievable by encoding only typical sequences.
  • The AEP is asymptotic: at small n the histogram is wide. The magic only happens once sequences are long enough for the law of large numbers to take hold.