Long sequences from a source are almost certainly typical — and almost equally probable.
Take an i.i.d. source X with entropy H(X) and look at a very long sample sequence X₁, X₂, …, X_n. The Asymptotic Equipartition Property — the AEP — says that almost every such sequence has essentially the same probability: roughly 2^(−nH(X)). Equivalently, the negative per-symbol log-probability concentrates on H(X):
−(1/n) log₂ P(X₁…X_n) → H(X) almost surely as n → ∞, with logs taken base 2 so that entropy is measured in bits. This is just the law of large numbers applied to the i.i.d. terms −log₂ p(Xᵢ), whose common expected value is exactly the entropy.
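A minimal sketch of that concentration, assuming a Bernoulli(p) source and NumPy; the bias p, the sequence lengths, and the sample counts below are illustrative choices, not values from the text. Because the probability of a binary sequence depends only on how many ones it contains, it is enough to sample that count.

```python
import numpy as np

def entropy_bits(p):
    """Binary entropy H(p) in bits."""
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def empirical_rates(p, n, num_samples, rng):
    """Return -(1/n) log2 P(X1..Xn) for num_samples i.i.d. Bernoulli(p) sequences.
    The probability of a sequence depends only on its number of ones,
    so sampling that count directly is equivalent to sampling whole sequences."""
    k = rng.binomial(n, p, size=num_samples)                # ones per sequence
    log2_prob = k * np.log2(p) + (n - k) * np.log2(1 - p)   # log2 P(x1..xn)
    return -log2_prob / n

rng = np.random.default_rng(0)
p = 0.1                                                     # illustrative bias
H = entropy_bits(p)
for n in (100, 1_000, 10_000, 100_000):
    rates = empirical_rates(p, n, num_samples=5_000, rng=rng)
    print(f"n={n:7d}  mean={rates.mean():.4f}  std={rates.std():.4f}  H(p)={H:.4f}")
```

The mean stays pinned at H(p) while the spread collapses as n grows, which is exactly the tightening described in the histogram below.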
The consequence is dramatic. Out of the 2^n possible binary sequences of length n, almost all of the probability mass sits on a much smaller subset — the typical set T_ε^(n) of size ≈ 2^(nH(X)). When the source is biased, that subset is a tiny sliver of the full cube. This is the mathematical reason compression works: long messages from a structured source live on a vanishingly small fraction of the possible sequence space, and we only need enough bits to address that small fraction.
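For modest block lengths the "small subset, most of the mass" claim can be checked exactly rather than by sampling: group sequences by their number of ones (every sequence in a group shares one probability), mark the groups whose rate falls within ε of H, and total up both the count and the probability mass. The parameters p = 0.1 and ε = 0.1 here are again illustrative.

```python
import math

p, eps = 0.1, 0.1                                     # illustrative bias and tolerance
H = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

for n in (20, 200, 1_000):
    typical_count, typical_mass = 0, 0.0
    # Every sequence with exactly k ones has the same probability, so group by k.
    for k in range(n + 1):
        prob_seq = (p ** k) * ((1 - p) ** (n - k))    # probability of one such sequence
        rate = -math.log2(prob_seq) / n               # -(1/n) log2 P(x1..xn)
        if abs(rate - H) <= eps:                      # this k-slice is epsilon-typical
            typical_count += math.comb(n, k)
            typical_mass += math.comb(n, k) * prob_seq
    fraction = typical_count / 2 ** n
    print(f"n={n:5d}  P(T_eps)={typical_mass:.3f}  |T_eps|/2^n={fraction:.3e}")
```

As n grows, the typical set's probability climbs toward 1 while its share of the full sequence space keeps shrinking.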
Each bar is the count of sampled length-n sequences whose empirical rate −(1/n) log₂ P(x₁…x_n) lies in that bin. The dashed line marks H(X). As you crank up n, the histogram tightens around H; that is the AEP. Sequences whose rate lands inside the shaded band [H − ε, H + ε] are the ε-typical sequences. By the central limit theorem, the spread shrinks like 1/√n.
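A quick way to test the 1/√n claim, under the same illustrative Bernoulli(0.1) assumption: multiply the sample standard deviation of the rates by √n and check that the product stays roughly constant.

```python
import numpy as np

rng = np.random.default_rng(1)
p = 0.1                                                    # illustrative bias
for n in (100, 400, 1_600, 6_400):
    k = rng.binomial(n, p, size=5_000)                     # ones per sampled sequence
    rates = -(k * np.log2(p) + (n - k) * np.log2(1 - p)) / n
    # If the spread falls like 1/sqrt(n), this product stays roughly constant.
    print(f"n={n:5d}  std={rates.std():.5f}  std*sqrt(n)={rates.std() * np.sqrt(n):.3f}")
```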
For a binary source with bias p, the full sequence space has 2^n elements, but the typical set has only ≈ 2^(nH(p)) of them. The rectangles are drawn area-scaled, so the ratio you see is exactly 2^(nH(p)) / 2^n = 2^(−n(1−H(p))).
For a fair coin (p = 1/2) the typical set is the entire sequence space — there is nothing to compress. Push p toward 1 and the typical set becomes a vanishing sliver of the full cube, even though it still carries almost all of the probability. That gap is exactly Shannon's compression budget.
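That sliver is just the ratio 2^(nH(p)) / 2^n = 2^(−n(1−H(p))); a short sketch with an illustrative block length shows how fast it collapses once p moves away from 1/2.

```python
import math

def entropy_bits(p):
    """Binary entropy in bits, with the convention H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

n = 1_000                                            # illustrative block length
for p in (0.5, 0.4, 0.25, 0.1, 0.01):
    log2_ratio = n * (entropy_bits(p) - 1)           # log2(2^{nH(p)} / 2^n)
    print(f"p={p:4.2f}  H(p)={entropy_bits(p):.3f}  typical/total ≈ 2^{log2_ratio:.0f}")
```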
“Compress” by indexing each typical sequence with ⌈n(H + ε)⌉ bits and copying each atypical one verbatim with n bits, plus one flag bit so the decoder knows which case it is reading. As n grows the achieved rate falls to roughly H + ε, and since ε can be made as small as we like, the rate can be pushed arbitrarily close to H(p) — Shannon's source coding theorem.
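A toy version of that scheme, assuming (as is conventional) the one extra flag bit described above; p, ε, and the sample counts are illustrative. It measures the average bits per symbol the code actually spends on simulated data.

```python
import numpy as np

def entropy_bits(p):
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def achieved_rate(p, n, eps, num_samples, rng):
    """Average bits/symbol of the two-part code on simulated Bernoulli(p) data:
    1 flag bit + ceil(n(H+eps)) index bits for typical sequences,
    1 flag bit + n bits (a verbatim copy) for atypical ones."""
    H = entropy_bits(p)
    k = rng.binomial(n, p, size=num_samples)                 # ones per sequence
    rates = -(k * np.log2(p) + (n - k) * np.log2(1 - p)) / n
    typical = np.abs(rates - H) <= eps
    lengths = np.where(typical, 1 + np.ceil(n * (H + eps)), 1 + n)
    return lengths.mean() / n, typical.mean()

rng = np.random.default_rng(2)
p, eps = 0.1, 0.05                                           # illustrative choices
print(f"H(p) = {entropy_bits(p):.4f}")
for n in (100, 1_000, 10_000):
    rate, frac_typical = achieved_rate(p, n, eps, num_samples=5_000, rng=rng)
    print(f"n={n:6d}  fraction typical={frac_typical:.3f}  rate={rate:.4f}")
```

For large n essentially every sampled sequence is typical and the rate settles near H + ε; closing the remaining gap to H(p) is what shrinking ε buys, which is the trade-off discussed next.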
Increase n and the achieved rate squeezes down toward H(p), the information-theoretic floor. Tighten ε (a narrower band) and the target rate H + ε sits closer to that floor, but the typical-set probability shrinks unless n is large enough to compensate. This is exactly the trade-off in Shannon's source coding theorem: for any rate above H, there exists an n large enough that we can encode typical sequences within budget and ignore the vanishing tail.
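The trade-off can also be computed exactly instead of sampled, since the number of ones in a length-n Bernoulli(p) sequence is Binomial(n, p): sum the binomial mass of the typical counts to get P(T_ε), then take the expectation of the two-part code length sketched above (the one-bit flag is still an assumption of mine). Working in the log domain keeps large n numerically safe.

```python
import math

def entropy_bits(p):
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def log2_comb(n, k):
    """log2 of the binomial coefficient C(n, k), via lgamma to stay in floats."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)) / math.log(2)

def expected_rate(p, n, eps):
    """Exact expected bits/symbol of the two-part typical-set code
    (one flag bit assumed), computed in the log domain so large n is safe."""
    H = entropy_bits(p)
    typical_mass = 0.0
    for k in range(n + 1):
        log2_prob_seq = k * math.log2(p) + (n - k) * math.log2(1 - p)
        if abs(-log2_prob_seq / n - H) <= eps:                # this k-slice is typical
            typical_mass += 2.0 ** (log2_comb(n, k) + log2_prob_seq)
    typical_len = 1 + math.ceil(n * (H + eps))                # flag bit + index
    atypical_len = 1 + n                                      # flag bit + verbatim copy
    bits = typical_mass * typical_len + (1 - typical_mass) * atypical_len
    return bits / n, typical_mass

p = 0.1                                                       # illustrative bias
print(f"H(p) = {entropy_bits(p):.4f}")
for eps in (0.10, 0.05, 0.02):
    for n in (100, 2_000, 20_000):
        rate, mass = expected_rate(p, n, eps)
        print(f"eps={eps:.2f}  n={n:6d}  P(typical)={mass:.3f}  rate={rate:.4f}")
```

Smaller ε lowers the eventual rate but needs a larger n before P(typical) gets close to 1, which is the trade-off described above.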