Probabilistic Data Structures

Explore skip lists, Bloom filters, and HyperLogLog—structures that trade exactness for remarkable efficiency using probability

Trading Certainty for Efficiency

Sometimes "probably correct" is good enough — especially when it means using 100× less memory or handling millions of events per second. Probabilistic data structures make this tradeoff explicit and quantifiable.

In this section, you'll explore structures that use randomization to achieve remarkable efficiency: skip lists for O(log n) expected search, Bloom filters for constant-space set membership, Count-Min sketches for stream frequency estimation, and HyperLogLog for counting unique elements with just bytes of memory.

Demo 1: Skip Lists

Skip lists achieve expected O(log n) search by randomly promoting nodes to "express lanes": a node that appears at one level also appears at the next level up with probability p. No rotations or rebalancing needed — just coin flips!
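Here is a minimal sketch in Python (the class names and the MAX_LEVEL cap are illustrative choices, not from the demo): search scans right along the highest lane and drops down when the next key would overshoot, and insertion flips coins to choose a level.

```python
import random

MAX_LEVEL = 16  # illustrative cap on tower height
P = 0.5         # promotion probability

class SkipNode:
    def __init__(self, key, level):
        self.key = key
        self.forward = [None] * (level + 1)  # one next-pointer per level

def random_level():
    """Coin flips give a geometric level distribution."""
    level = 0
    while random.random() < P and level < MAX_LEVEL:
        level += 1
    return level

class SkipList:
    def __init__(self):
        self.level = 0
        self.head = SkipNode(None, MAX_LEVEL)

    def search(self, key):
        node = self.head
        for lvl in range(self.level, -1, -1):      # start in the express lane
            while node.forward[lvl] and node.forward[lvl].key < key:
                node = node.forward[lvl]           # move right
        node = node.forward[0]                     # drop to the base level
        return node is not None and node.key == key

    def insert(self, key):
        update = [self.head] * (MAX_LEVEL + 1)     # rightmost node < key, per level
        node = self.head
        for lvl in range(self.level, -1, -1):
            while node.forward[lvl] and node.forward[lvl].key < key:
                node = node.forward[lvl]
            update[lvl] = node
        lvl = random_level()
        self.level = max(self.level, lvl)
        new = SkipNode(key, lvl)
        for i in range(lvl + 1):                   # splice into every level it reaches
            new.forward[i] = update[i].forward[i]
            update[i].forward[i] = new

sl = SkipList()
for k in [1, 3, 4, 6, 7, 9, 12, 15, 17, 19, 21, 25]:  # the demo's keys
    sl.insert(k)
print(sl.search(12))  # True
```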

[Interactive demo: a skip list over the keys 1, 3, 4, 6, 7, 9, 12, 15, 17, 19, 21, 25 with levels L0–L4 and p = 0.5. Searching for 12 finds it in 10 steps along the path 4 → 4 → 4 → 4 → 4 → 6 → 7 → 9 → 9 → 12 (a repeated key marks a drop down one level). Stats panel: Max Level 5 (expected ≈ 3.6), E[Search] 7.2 comparisons, Total Pointers 18 (expected 24.0), Space Overhead 2.00× = n / (1 − p).]

Level Distribution (100 nodes, p = 0.5)

[Bar chart: node count at each level L0–L7, following the geometric distribution P(level ≥ k) = p^(k−1).]
Why p = 0.5?
p     E[Search]   E[Space]   Time × Space
0.1      20.0       111.1          2222
0.2      14.3       125.0          1788
0.3      12.7       142.9          1821
0.4      12.6       166.7          2094
0.5      13.3       200.0          2658
0.6      15.0       250.0          3756
0.7      18.4       333.3          6148
0.8      25.8       500.0         12899
0.9      48.6      1000.0         48565

p = 0.5 is a common default that balances search time and space: lower p means fewer pointers (less space) but more comparisons per search, and the time × space product in the table bottoms out around p ≈ 0.2.
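The table can be reproduced from two standard expected-cost formulas, assuming n = 100 as above: a search visits about log_{1/p}(n) / p nodes, and the list holds about n / (1 − p) pointers in total (the space formula appears in the demo's stats panel).

```python
import math

n = 100  # list size used in the table above

print(f"{'p':>4} {'E[Search]':>10} {'E[Space]':>9} {'Time x Space':>13}")
for p in [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]:
    search = math.log(n, 1 / p) / p   # expected comparisons per search
    space = n / (1 - p)               # expected total pointers
    print(f"{p:>4} {search:>10.1f} {space:>9.1f} {search * space:>13.0f}")
```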

Demo 2: Bloom Filters

A Bloom filter tests set membership using k hash functions and m bits. It may say "probably yes" when the answer is no (false positive), but never says "no" when the answer is yes.
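A minimal sketch in Python. Deriving the k bit positions by salting a single SHA-256 with the function index is a common trick, used here as an assumption rather than the demo's exact scheme.

```python
import hashlib

class BloomFilter:
    def __init__(self, m, k):
        self.m, self.k = m, k        # m bits, k hash functions
        self.bits = [False] * m

    def _positions(self, item):
        # Derive k positions by salting SHA-256 with the function index.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def __contains__(self, item):
        # Any clear bit -> definitely absent; all set -> "probably present".
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter(m=100, k=7)
bf.add("alice")
print("alice" in bf)  # always True: no false negatives
print("bob" in bf)    # usually False; True would be a false positive
```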

Bit Array (100 bits)

[Interactive demo: a 100-bit array; adding items sets bits via k hashes. Stats panel: bits set, items added, current False Positive Rate, Optimal k = (m/n) × ln(2), Bits per Element = m/n, and the best FPR achievable with the optimal k.]

FPR vs Filter Size

[Chart: more bits per element → lower false positive rate.]

Key Formulas

False Positive Rate:
p ≈ (1 − e^(−kn/m))^k
Optimal Hash Count:
k_opt = (m/n) × ln(2)
For 1% FPR: need ≈ 10 bits per element with k ≈ 7 hash functions
How many bits do I need?

To store n items with FPR of p:

m = −n × ln(p) / (ln 2)²

n = 1,000, p = 1% → 9,586 bits
n = 1,000,000, p = 1% → 9,585,059 bits
n = 1,000,000, p = 0.1% → 14,377,588 bits
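These numbers drop straight out of the formula; a small helper (the name bloom_size is ours) reproduces them and picks k as well:

```python
import math

def bloom_size(n, p):
    """Bits m and hash count k to hold n items at false positive rate p."""
    m = math.ceil(-n * math.log(p) / math.log(2) ** 2)
    k = round(m / n * math.log(2))
    return m, k

print(bloom_size(1_000, 0.01))       # (9586, 7)
print(bloom_size(1_000_000, 0.001))  # (14377588, 10)
```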

Demo 3: Count-Min Sketch

Process a stream of items and estimate frequencies using constant space. The Count-Min Sketch uses d hash functions with w counters per row; with probability at least 1 − δ, each estimate exceeds the true count by at most ε·N, where N is the total stream length.
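A compact sketch in Python, sized from ε and δ as in the error guarantees below (the salted SHA-256 trick for deriving d hash functions is again an assumption, not necessarily the demo's scheme):

```python
import hashlib
import math

class CountMinSketch:
    def __init__(self, epsilon, delta):
        self.w = math.ceil(math.e / epsilon)     # counters per row
        self.d = math.ceil(math.log(1 / delta))  # number of rows
        self.table = [[0] * self.w for _ in range(self.d)]

    def _columns(self, item):
        for row in range(self.d):
            digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.w

    def add(self, item, count=1):
        for row, col in enumerate(self._columns(item)):
            self.table[row][col] += count

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum row is closest.
        return min(self.table[row][col]
                   for row, col in enumerate(self._columns(item)))

cms = CountMinSketch(epsilon=0.1, delta=0.01)  # gives w = 28, d = 5
for word in ["cat", "dog", "cat", "cat", "fish"]:
    cms.add(word)
print(cms.estimate("cat"))  # >= 3; it can only overestimate
```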

Counter Matrix (3 × 10)

[Interactive demo: a 3 × 10 matrix of counters (rows h0, h1, h2), all initially 0. Stats panel: Total Count N (items added), Error Bound ε = e/10 ≈ 0.272, Failure Prob δ = 1/e^3 ≈ 0.0498, Space = 10 × 3 counters × 4 bytes = 120 B.]

Error Guarantees

With probability ≥ 1 − δ = 95.02%, each estimate is within ε × N = 0.272 × N of the true count, where N is the total number of items added.

For ε = 0.1, δ = 0.01: need w = ⌈e/ε⌉ = 28 columns and d = ⌈ln(1/δ)⌉ = 5 rows.

Demo 4: HyperLogLog Cardinality

Count unique elements in a stream using registers of just log log n bits each! HyperLogLog observes the maximum number of leading zeros in the hashes — if you've seen a long run of zeros, you've probably seen many items.

The Key Intuition

If you flip fair coins, the probability of seeing k heads before a tail is 1/2^k. So if the maximum run of heads you've ever seen is k, you've probably flipped about 2^k coins.

HyperLogLog applies this to hash leading zeros: if the maximum number of leading zeros is k, the cardinality is ≈ 2^k. To tame the variance of a single maximum, it splits items across many registers and averages the per-register estimates.
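A toy implementation in Python, following the standard recipe: the first b hash bits pick a register, the remaining bits feed the leading-zero count, and a harmonic mean combines the registers. The bias constant α ≈ 0.7213/(1 + 1.079/m) is the published one; the small-range (linear counting) correction is omitted for brevity.

```python
import hashlib

class HyperLogLog:
    def __init__(self, b=6):
        self.b = b                  # index bits: m = 2^b registers
        self.m = 1 << b
        self.registers = [0] * self.m

    def add(self, item):
        h = int.from_bytes(
            hashlib.sha256(str(item).encode()).digest()[:8], "big")
        bucket = h >> (64 - self.b)              # first b bits pick a register
        rest = h & ((1 << (64 - self.b)) - 1)    # remaining 64 - b bits
        rho = (64 - self.b) - rest.bit_length() + 1  # leading zeros + 1
        self.registers[bucket] = max(self.registers[bucket], rho)

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)    # bias-correction constant
        inv_sum = sum(2.0 ** -r for r in self.registers)
        return alpha * self.m * self.m / inv_sum

hll = HyperLogLog(b=6)              # 64 registers, as in the demo
for i in range(10_000):
    hll.add(f"user_{i}")
print(round(hll.estimate()))        # ~10,000, typically within ~13%
```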

Registers (64 buckets, 64 bytes)

[Interactive demo: 6 index bits select among 64 registers, all initially 0 (64 empty registers, max value 0). Stats panel: Estimated cardinality (via linear counting while registers are mostly empty), Actual unique items, and Error relative to σ = 13.0%.]

Accuracy

Standard error: 1.04 / √m = 1.04 / √64 ≈ 13.0%

With 64 registers (64 bytes), estimates typically land within ±26% (two standard errors) of the true value.

Leading Zeros Demonstration

Hash each item and count leading zeros:

[Demo output for ten items: item_0 … item_9 hash to 0 leading zeros each, except item_7 (5 zeros) and item_8 (2 zeros).]
Max zeros: 5 → Estimate: 2^5 = 32 (actual: 10)
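A single maximum is a noisy estimator: one lucky hash (item_7's five zeros here) drags the estimate from 10 up to 32, which is exactly the variance that HyperLogLog's many registers average away. Counting leading zeros is itself simple; here is a small sketch, with SHA-256 standing in for the demo's hash (so the zero counts won't match the demo's exactly):

```python
import hashlib

def leading_zeros(item, bits=32):
    """Number of leading zero bits in a `bits`-wide hash of item."""
    h = int.from_bytes(hashlib.sha256(item.encode()).digest(), "big")
    h &= (1 << bits) - 1          # keep only the low `bits` bits
    return bits - h.bit_length()

items = [f"item_{i}" for i in range(10)]
max_zeros = max(leading_zeros(s) for s in items)
print(f"Max zeros: {max_zeros} -> estimate 2^{max_zeros} = {2 ** max_zeros}")
```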

Demo 5: Hash Function Quality

All these structures depend on good hash functions. Test for uniformity and the avalanche effect, and explore the birthday paradox that determines when collisions become likely.

[Interactive demo stats for the selected hash function: Uniformity PASS (χ² = 90.6), Collisions 900 (90.0% of items), Avalanche 2.1 bits flipped (ideal: 16).]

Bucket Distribution (100 buckets)

[Histogram: expected 10.0 items per bucket; observed range 4–18.]

Avalanche Effect

A good hash changes ~50% of output bits when any input bit changes.

[Demo readout: 2.1 of 16 ideal bit flips ≈ 13% of ideal.]
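A sketch of how such an avalanche score can be measured, assuming a 32-bit hash (truncated SHA-256 stands in for the hashes being tested): flip each input bit once and average how many output bits change.

```python
import hashlib

def h(data: bytes) -> int:
    """Stand-in 32-bit hash (truncated SHA-256)."""
    return int.from_bytes(hashlib.sha256(data).digest()[:4], "big")

def avalanche_score(data: bytes) -> float:
    """Average output bits flipped per single-bit input change (ideal: 16 of 32)."""
    base = h(data)
    flips = []
    for byte in range(len(data)):
        for bit in range(8):
            mutated = bytearray(data)
            mutated[byte] ^= 1 << bit            # flip one input bit
            flips.append(bin(base ^ h(bytes(mutated))).count("1"))
    return sum(flips) / len(flips)

print(avalanche_score(b"hello"))  # a strong hash lands near 16.0
```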

Birthday Paradox

m = 365 buckets

50% Collision After: 23 items ≈ 1.18 × √365

Formula: n ≈ √(2m × ln 2) ≈ 1.177 × √m

[Chart: collision probability climbs from near 0% at n = 1 through 50% at n ≈ 23 toward 100%.]
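A quick check of the approximation, and what it predicts for a 32-bit hash:

```python
import math

def collision_threshold(m):
    """Smallest n with a ~50% chance of some collision among m buckets."""
    return math.ceil(math.sqrt(2 * m * math.log(2)))

print(collision_threshold(365))      # 23, the classic birthday result
print(collision_threshold(2 ** 32))  # 77164: 32-bit hashes collide fast
```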
Compare All Hash Functions
Function    Uniformity   Collisions   Avalanche
naive       OK           955          2.0
djb2        OK           900          2.1
fnv1a       FAIL         900          7.7
javaStyle   OK           900          2.0

Probabilistic Toolbox Complete!

You now command powerful approximate data structures:

  • Skip lists — O(log n) expected search with simple randomization
  • Bloom filters — constant space for set membership (no false negatives)
  • Count-Min Sketch — stream frequency with bounded error
  • HyperLogLog — cardinality estimation with just bytes of memory
  • Birthday paradox — predicts when collisions become likely

Next: We'll study amortized analysis — how to prove that expensive operations "pay for themselves" using potential functions.