Explore skip lists, Bloom filters, and HyperLogLog—structures that trade exactness for remarkable efficiency using probability
Sometimes "probably correct" is good enough — especially when it means using 100× less memory or handling millions of events per second. Probabilistic data structures make this tradeoff explicit and quantifiable.
In this section, you'll explore structures that use randomization to achieve remarkable efficiency: skip lists for O(log n) expected search, Bloom filters for constant-space set membership, Count-Min sketches for stream frequency estimation, and HyperLogLog for counting unique elements with just bytes of memory.
Skip lists achieve expected O(log n) search by randomly promoting nodes to "express lanes." With probability p, a node appears at each higher level. No rotations needed — just coin flips!
| p | E[Search] | E[Space] | Time × Space |
|---|---|---|---|
| 0.1 | 20.0 | 111.1 | 2222 |
| 0.2 | 14.3 | 125.0 | 1788 |
| 0.3 | 12.7 | 142.9 | 1821 |
| 0.4 | 12.6 | 166.7 | 2094 |
| 0.5 | 13.3 | 200.0 | 2658 |
| 0.6 | 15.0 | 250.0 | 3756 |
| 0.7 | 18.4 | 333.3 | 6148 |
| 0.8 | 25.8 | 500.0 | 12899 |
| 0.9 | 48.6 | 1000.0 | 48565 |
p = 0.5 balances search time and space. Lower p = less space, more search time.
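The coin-flip promotion rule can be sketched as a minimal skip list. The class and method names and the `MAX_LEVEL` cap are illustrative choices, not a fixed specification:

```python
import random

class Node:
    def __init__(self, key, level):
        self.key = key
        self.forward = [None] * (level + 1)  # forward[i] = next node at level i

class SkipList:
    """Minimal skip list sketch: promotion probability p, expected O(log n) search."""
    MAX_LEVEL = 16

    def __init__(self, p=0.5):
        self.p = p
        self.level = 0
        self.head = Node(None, self.MAX_LEVEL)

    def random_level(self):
        # Coin flips: keep promoting while a biased coin lands heads.
        lvl = 0
        while random.random() < self.p and lvl < self.MAX_LEVEL:
            lvl += 1
        return lvl

    def insert(self, key):
        # Record the rightmost node at each level that precedes the new key.
        update = [self.head] * (self.MAX_LEVEL + 1)
        node = self.head
        for i in range(self.level, -1, -1):
            while node.forward[i] and node.forward[i].key < key:
                node = node.forward[i]
            update[i] = node
        lvl = self.random_level()
        self.level = max(self.level, lvl)
        new = Node(key, lvl)
        for i in range(lvl + 1):
            new.forward[i] = update[i].forward[i]
            update[i].forward[i] = new

    def contains(self, key):
        node = self.head
        for i in range(self.level, -1, -1):  # start in the top "express lane"
            while node.forward[i] and node.forward[i].key < key:
                node = node.forward[i]
        node = node.forward[0]
        return node is not None and node.key == key
```

Note that insertion needs no rebalancing: the structure's balance comes entirely from the randomness in `random_level`.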
A Bloom filter tests set membership using k hash functions and m bits. Its answers are "probably yes" or "definitely no": it may report membership for an item that was never added (a false positive), but it never misses an item that was added (no false negatives).
To store n items with a target false-positive rate p, the optimal sizing is m = −n·ln p / (ln 2)² bits and k = (m/n)·ln 2 hash functions.
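The standard sizing formulas, m = −n·ln p / (ln 2)² and k = (m/n)·ln 2, can be computed directly. This helper is a small sketch:

```python
import math

def bloom_params(n, p):
    """Optimal Bloom filter sizing for n items at false-positive rate p:
    returns (m bits, k hash functions)."""
    m = math.ceil(-n * math.log(p) / math.log(2) ** 2)
    k = max(1, round(m / n * math.log(2)))
    return m, k

# One million items at a 1% false-positive rate needs roughly
# 9.6 bits per item and 7 hash functions.
m, k = bloom_params(1_000_000, 0.01)
```

Notice that m depends only on n and p, not on the size of the items themselves, which is where the space savings come from.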
Process a stream of items and estimate frequencies in constant space. The Count-Min Sketch uses d hash functions, each indexing into its own row of w counters; with w = ⌈e/ε⌉ and d = ⌈ln(1/δ)⌉, every estimate is within ε·N of the true count with probability at least 1 − δ.
For example, with d = 3 rows and w = 10 counters, ε = e/10 ≈ 0.272, so with probability ≥ 1 − e⁻³ ≈ 95.02% each estimate exceeds the true count by at most 0.272 × N, where N is the total number of items in the stream.
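A minimal Count-Min Sketch can be sketched as follows; the seeded SHA-256 hashing is an illustrative choice (any family of independent hash functions works):

```python
import hashlib

class CountMinSketch:
    """d rows of w counters; estimates are within eps*N of the truth
    with probability >= 1 - delta, where eps = e/w and delta = e^-d."""

    def __init__(self, w=10, d=3):
        self.w, self.d = w, d
        self.table = [[0] * w for _ in range(d)]

    def _hash(self, item, row):
        # One independent hash per row, derived by seeding with the row index.
        h = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(h[:8], "big") % self.w

    def add(self, item, count=1):
        for row in range(self.d):
            self.table[row][self._hash(item, row)] += count

    def estimate(self, item):
        # Every row can only overestimate (collisions add, never subtract),
        # so the minimum across rows is the tightest estimate.
        return min(self.table[row][self._hash(item, row)]
                   for row in range(self.d))
```

Because counters only ever overestimate, the sketch is biased upward; the min over rows is what keeps that bias small.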
Count unique elements in a stream using just log log n space! HyperLogLog observes the maximum number of leading zeros in hashes — if you've seen a long run of zeros, you've probably seen many items.
If you flip fair coins, the probability of seeing k heads in a row is 1/2^k. So if the longest run of heads you've ever seen is k, you've probably flipped about 2^k coins.
HyperLogLog applies this to the leading zeros of hashes: if the maximum number of leading zeros seen is k, the cardinality is roughly 2^k.
Standard error: 1.04 / √m = 1.04 / √64 ≈ 13.0%
With 64 registers (64 bytes), estimates fall within ±13% (one standard error) of the true value about two-thirds of the time, and within ±26% about 95% of the time.
Hash each item and count leading zeros:
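A toy version of this, using the 64-register configuration above, might look like the following. It uses the raw 2^k-style estimator with a harmonic mean across registers and skips the small- and large-range corrections of the full algorithm, so treat it as a sketch:

```python
import hashlib

def hll_estimate(items, b=6):
    """Toy HyperLogLog: m = 2^b registers; each register keeps the longest
    leading-zero run seen among hashes routed to its bucket."""
    m = 1 << b
    registers = [0] * m
    for item in items:
        h = int.from_bytes(
            hashlib.sha256(str(item).encode()).digest()[:8], "big")
        bucket = h >> (64 - b)                    # first b bits pick a register
        rest = h & ((1 << (64 - b)) - 1)          # remaining bits give the run
        rank = (64 - b) - rest.bit_length() + 1   # leading zeros + 1
        registers[bucket] = max(registers[bucket], rank)
    # Harmonic mean of 2^register across buckets, scaled by a bias constant.
    alpha = 0.709  # approximate correction constant for m = 64
    z = sum(2.0 ** -r for r in registers)
    return alpha * m * m / z
```

Averaging across m buckets is what tames the huge variance of a single max-run observation, which is why the error shrinks like 1.04/√m.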
All these structures depend on good hash functions. Test for uniformity, the avalanche effect, and explore the birthday paradox, which determines collision probability.
A good hash changes ~50% of output bits when any input bit changes.
| Function | Uniformity | Collisions | Avalanche |
|---|---|---|---|
| naive | OK | 955 | 2.0 |
| djb2 | OK | 900 | 2.1 |
| fnv1a | FAIL | 900 | 7.7 |
| javaStyle | OK | 900 | 2.0 |
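The avalanche property can be measured empirically by flipping single input bits and counting how many output bits change. This sketch assumes a hash mapping 32-bit ints to 32-bit ints; the `xorshift_mult` mixer below is an illustrative stand-in, not one of the functions in the table:

```python
import random

def xorshift_mult(x):
    # Illustrative 32-bit mixer: xor-shift plus multiply.
    x ^= x >> 16
    x = (x * 0x45d9f3b) & 0xFFFFFFFF
    x ^= x >> 16
    return x

def avalanche_deviation(hash_fn, trials=200, bits=32):
    """Flip one random input bit per trial, measure the fraction of output
    bits that change, and return the mean absolute deviation from the
    ideal 50%, in percentage points (lower is better)."""
    total_dev = 0.0
    for _ in range(trials):
        x = random.getrandbits(bits)
        bit = 1 << random.randrange(bits)
        flipped = hash_fn(x) ^ hash_fn(x ^ bit)
        frac = bin(flipped).count("1") / bits
        total_dev += abs(frac - 0.5)
    return 100 * total_dev / trials
```

A hash with good avalanche behavior should score close to the statistical floor for 32 output bits; a hash that barely mixes (like shifting the input) scores far higher.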
You now command powerful approximate data structures: skip lists for expected O(log n) ordered search, Bloom filters for compact set membership, Count-Min sketches for stream frequencies, and HyperLogLog for cardinality estimation, each trading a tunable amount of error for dramatic space savings.
Next: We'll study amortized analysis — how to prove that expensive operations "pay for themselves" using potential functions.