Probabilistic Data Structures

Explore skip lists, Bloom filters, and HyperLogLog—structures that trade exactness for remarkable efficiency using probability

Trading Certainty for Efficiency

Sometimes "probably correct" is good enough — especially when it means using 100× less memory or handling millions of events per second. Probabilistic data structures make this tradeoff explicit and quantifiable.

In this section, you'll explore structures that use randomization to achieve remarkable efficiency: skip lists for O(log n) expected search, Bloom filters for constant-space set membership, Count-Min sketches for stream frequency estimation, and HyperLogLog for counting unique elements with just bytes of memory.

Demo 1: Skip Lists

Skip lists achieve expected O(log n) search by randomly promoting nodes to "express lanes": a node that appears at one level also appears at the next level up with probability p. No rotations or rebalancing needed — just coin flips!
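Here is a minimal sketch in Python (the class names and the MAX_LEVEL cap are illustrative choices, not from the demo): search scans right along the highest lane and drops down when the next key would overshoot, and insertion flips coins to choose a level.

```python
import random

MAX_LEVEL = 16  # illustrative cap on tower height
P = 0.5         # promotion probability

class SkipNode:
    def __init__(self, key, level):
        self.key = key
        self.forward = [None] * (level + 1)  # one next-pointer per level

def random_level():
    """Coin flips give a geometric level distribution."""
    level = 0
    while random.random() < P and level < MAX_LEVEL:
        level += 1
    return level

class SkipList:
    def __init__(self):
        self.level = 0
        self.head = SkipNode(None, MAX_LEVEL)

    def search(self, key):
        node = self.head
        for lvl in range(self.level, -1, -1):      # start in the express lane
            while node.forward[lvl] and node.forward[lvl].key < key:
                node = node.forward[lvl]           # move right
        node = node.forward[0]                     # drop to the base level
        return node is not None and node.key == key

    def insert(self, key):
        update = [self.head] * (MAX_LEVEL + 1)     # rightmost node < key, per level
        node = self.head
        for lvl in range(self.level, -1, -1):
            while node.forward[lvl] and node.forward[lvl].key < key:
                node = node.forward[lvl]
            update[lvl] = node
        lvl = random_level()
        self.level = max(self.level, lvl)
        new = SkipNode(key, lvl)
        for i in range(lvl + 1):                   # splice into every level it reaches
            new.forward[i] = update[i].forward[i]
            update[i].forward[i] = new

sl = SkipList()
for k in [1, 3, 4, 6, 7, 9, 12, 15, 17, 19, 21, 25]:  # the demo's keys
    sl.insert(k)
print(sl.search(12))  # True
```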

[Interactive demo: a skip list over the keys 1, 3, 4, 6, 7, 9, 12, 15, 17, 19, 21, 25 with levels L0–L4 and p = 0.5. Searching for 12 finds it in 10 steps along the path 4 → 4 → 4 → 4 → 4 → 6 → 7 → 9 → 9 → 12 (a repeated key marks a drop down one level). Stats panel: Max Level 5 (expected ≈ 3.6), E[Search] 7.2 comparisons, Total Pointers 18 (expected 24.0), Space Overhead 2.00× = n / (1 − p).]

Level Distribution (100 nodes, p = 0.5)

[Bar chart: node count at each level L0–L7, following the geometric distribution P(level ≥ k) = p^(k−1).]
Why p = 0.5?
p     E[Search]   E[Space]   Time × Space
0.1      20.0       111.1          2222
0.2      14.3       125.0          1788
0.3      12.7       142.9          1821
0.4      12.6       166.7          2094
0.5      13.3       200.0          2658
0.6      15.0       250.0          3756
0.7      18.4       333.3          6148
0.8      25.8       500.0         12899
0.9      48.6      1000.0         48565

p = 0.5 is a common default that balances search time and space: lower p means fewer pointers (less space) but more comparisons per search, and the time × space product in the table bottoms out around p ≈ 0.2.
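The table can be reproduced from two standard expected-cost formulas, assuming n = 100 as above: a search visits about log_{1/p}(n) / p nodes, and the list holds about n / (1 − p) pointers in total (the space formula appears in the demo's stats panel).

```python
import math

n = 100  # list size used in the table above

print(f"{'p':>4} {'E[Search]':>10} {'E[Space]':>9} {'Time x Space':>13}")
for p in [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]:
    search = math.log(n, 1 / p) / p   # expected comparisons per search
    space = n / (1 - p)               # expected total pointers
    print(f"{p:>4} {search:>10.1f} {space:>9.1f} {search * space:>13.0f}")
```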

Demo 2: Bloom Filters

A Bloom filter tests set membership using k hash functions and m bits. It may say "probably yes" when the answer is no (false positive), but never says "no" when the answer is yes.
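A minimal sketch in Python. Deriving the k bit positions by salting a single SHA-256 with the function index is a common trick, used here as an assumption rather than the demo's exact scheme.

```python
import hashlib

class BloomFilter:
    def __init__(self, m, k):
        self.m, self.k = m, k        # m bits, k hash functions
        self.bits = [False] * m

    def _positions(self, item):
        # Derive k positions by salting SHA-256 with the function index.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def __contains__(self, item):
        # Any clear bit -> definitely absent; all set -> "probably present".
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter(m=100, k=7)
bf.add("alice")
print("alice" in bf)  # always True: no false negatives
print("bob" in bf)    # usually False; True would be a false positive
```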

Bit Array (100 bits)

[Interactive demo: a 100-bit array; adding items sets bits via k hashes. Stats panel: bits set, items added, current False Positive Rate, Optimal k = (m/n) × ln(2), Bits per Element = m/n, and the best FPR achievable with the optimal k.]

FPR vs Filter Size

[Chart: more bits per element → lower false positive rate.]

Key Formulas

False Positive Rate:
p ≈ (1 − e^(−kn/m))^k
Optimal Hash Count:
k_opt = (m/n) × ln(2)
For 1% FPR: need ≈ 10 bits per element with k ≈ 7 hash functions
How many bits do I need?

To store n items with FPR of p:

m = −n × ln(p) / (ln 2)²

n = 1,000, p = 1% → 9,586 bits
n = 1,000,000, p = 1% → 9,585,059 bits
n = 1,000,000, p = 0.1% → 14,377,588 bits
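These numbers drop straight out of the formula; a small helper (the name bloom_size is ours) reproduces them and picks k as well:

```python
import math

def bloom_size(n, p):
    """Bits m and hash count k to hold n items at false positive rate p."""
    m = math.ceil(-n * math.log(p) / math.log(2) ** 2)
    k = round(m / n * math.log(2))
    return m, k

print(bloom_size(1_000, 0.01))       # (9586, 7)
print(bloom_size(1_000_000, 0.001))  # (14377588, 10)
```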

Demo 3: Count-Min Sketch

Process a stream of items and estimate frequencies using constant space. The Count-Min Sketch uses d hash functions with w counters per row; with probability at least 1 − δ, each estimate exceeds the true count by at most ε·N, where N is the total stream length.
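A compact sketch in Python, sized from ε and δ as in the error guarantees below (the salted SHA-256 trick for deriving d hash functions is again an assumption, not necessarily the demo's scheme):

```python
import hashlib
import math

class CountMinSketch:
    def __init__(self, epsilon, delta):
        self.w = math.ceil(math.e / epsilon)     # counters per row
        self.d = math.ceil(math.log(1 / delta))  # number of rows
        self.table = [[0] * self.w for _ in range(self.d)]

    def _columns(self, item):
        for row in range(self.d):
            digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.w

    def add(self, item, count=1):
        for row, col in enumerate(self._columns(item)):
            self.table[row][col] += count

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum row is closest.
        return min(self.table[row][col]
                   for row, col in enumerate(self._columns(item)))

cms = CountMinSketch(epsilon=0.1, delta=0.01)  # gives w = 28, d = 5
for word in ["cat", "dog", "cat", "cat", "fish"]:
    cms.add(word)
print(cms.estimate("cat"))  # >= 3; it can only overestimate
```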

Counter Matrix (3 × 10)

[Interactive demo: a 3 × 10 matrix of counters (rows h0, h1, h2), all initially 0. Stats panel: Total Count N (items added), Error Bound ε = e/10 ≈ 0.272, Failure Prob δ = 1/e^3 ≈ 0.0498, Space = 10 × 3 counters × 4 bytes = 120 B.]

Error Guarantees

With probability ≥ 1 − δ = 95.02%, each estimate is within ε × N = 0.272 × N of the true count, where N is the total number of items added.

For ε = 0.1, δ = 0.01: need w = ⌈e/ε⌉ = 28 columns and d = ⌈ln(1/δ)⌉ = 5 rows.

Demo 4: HyperLogLog Cardinality

Count unique elements in a stream using registers of just log log n bits each! HyperLogLog observes the maximum number of leading zeros in the hashes — if you've seen a long run of zeros, you've probably seen many items.

The Key Intuition

If you flip fair coins, the probability of seeing k heads before a tail is 1/2^k. So if the maximum run of heads you've ever seen is k, you've probably flipped about 2^k coins.

HyperLogLog applies this to hash leading zeros: if the maximum number of leading zeros is k, the cardinality is ≈ 2^k. To tame the variance of a single maximum, it splits items across many registers and averages the per-register estimates.
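A toy implementation in Python, following the standard recipe: the first b hash bits pick a register, the remaining bits feed the leading-zero count, and a harmonic mean combines the registers. The bias constant α ≈ 0.7213/(1 + 1.079/m) is the published one; the small-range (linear counting) correction is omitted for brevity.

```python
import hashlib

class HyperLogLog:
    def __init__(self, b=6):
        self.b = b                  # index bits: m = 2^b registers
        self.m = 1 << b
        self.registers = [0] * self.m

    def add(self, item):
        h = int.from_bytes(
            hashlib.sha256(str(item).encode()).digest()[:8], "big")
        bucket = h >> (64 - self.b)              # first b bits pick a register
        rest = h & ((1 << (64 - self.b)) - 1)    # remaining 64 - b bits
        rho = (64 - self.b) - rest.bit_length() + 1  # leading zeros + 1
        self.registers[bucket] = max(self.registers[bucket], rho)

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)    # bias-correction constant
        inv_sum = sum(2.0 ** -r for r in self.registers)
        return alpha * self.m * self.m / inv_sum

hll = HyperLogLog(b=6)              # 64 registers, as in the demo
for i in range(10_000):
    hll.add(f"user_{i}")
print(round(hll.estimate()))        # ~10,000, typically within ~13%
```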

Registers (64 buckets, 64 bytes)

[Interactive demo: 6 index bits select among 64 registers, all initially 0 (64 empty registers, max value 0). Stats panel: Estimated cardinality (via linear counting while registers are mostly empty), Actual unique items, and Error relative to σ = 13.0%.]

Accuracy

Standard error: 1.04 / √m = 1.04 / √64 ≈ 13.0%

With 64 registers (64 bytes), estimates typically land within ±26% (two standard errors) of the true value.

Leading Zeros Demonstration

Hash each item and count leading zeros:

[Demo output for ten items: item_0 … item_9 hash to 0 leading zeros each, except item_7 (5 zeros) and item_8 (2 zeros).]
Max zeros: 5 → Estimate: 2^5 = 32 (actual: 10)
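A single maximum is a noisy estimator: one lucky hash (item_7's five zeros here) drags the estimate from 10 up to 32, which is exactly the variance that HyperLogLog's many registers average away. Counting leading zeros is itself simple; here is a small sketch, with SHA-256 standing in for the demo's hash (so the zero counts won't match the demo's exactly):

```python
import hashlib

def leading_zeros(item, bits=32):
    """Number of leading zero bits in a `bits`-wide hash of item."""
    h = int.from_bytes(hashlib.sha256(item.encode()).digest(), "big")
    h &= (1 << bits) - 1          # keep only the low `bits` bits
    return bits - h.bit_length()

items = [f"item_{i}" for i in range(10)]
max_zeros = max(leading_zeros(s) for s in items)
print(f"Max zeros: {max_zeros} -> estimate 2^{max_zeros} = {2 ** max_zeros}")
```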

Demo 5: Hash Function Quality

All these structures depend on good hash functions. Test for uniformity and the avalanche effect, and explore the birthday paradox that determines when collisions become likely.

[Interactive demo stats for the selected hash function: Uniformity PASS (χ² = 90.6), Collisions 900 (90.0% of items), Avalanche 2.1 bits flipped (ideal: 16).]

Bucket Distribution (100 buckets)

[Histogram: expected 10.0 items per bucket; observed range 4–18.]

Avalanche Effect

A good hash changes ~50% of output bits when any input bit changes.

[Demo readout: 2.1 of 16 ideal bit flips ≈ 13% of ideal.]
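A sketch of how such an avalanche score can be measured, assuming a 32-bit hash (truncated SHA-256 stands in for the hashes being tested): flip each input bit once and average how many output bits change.

```python
import hashlib

def h(data: bytes) -> int:
    """Stand-in 32-bit hash (truncated SHA-256)."""
    return int.from_bytes(hashlib.sha256(data).digest()[:4], "big")

def avalanche_score(data: bytes) -> float:
    """Average output bits flipped per single-bit input change (ideal: 16 of 32)."""
    base = h(data)
    flips = []
    for byte in range(len(data)):
        for bit in range(8):
            mutated = bytearray(data)
            mutated[byte] ^= 1 << bit            # flip one input bit
            flips.append(bin(base ^ h(bytes(mutated))).count("1"))
    return sum(flips) / len(flips)

print(avalanche_score(b"hello"))  # a strong hash lands near 16.0
```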

Birthday Paradox

m = 365 buckets

50% Collision After: 23 items ≈ 1.18 × √365

Formula: n ≈ √(2m × ln 2) ≈ 1.177 × √m

[Chart: collision probability climbs from near 0% at n = 1 through 50% at n ≈ 23 toward 100%.]
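A quick check of the approximation, and what it predicts for a 32-bit hash:

```python
import math

def collision_threshold(m):
    """Smallest n with a ~50% chance of some collision among m buckets."""
    return math.ceil(math.sqrt(2 * m * math.log(2)))

print(collision_threshold(365))      # 23, the classic birthday result
print(collision_threshold(2 ** 32))  # 77164: 32-bit hashes collide fast
```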
Compare All Hash Functions
Function    Uniformity   Collisions   Avalanche
naive       OK           955          2.0
djb2        OK           900          2.1
fnv1a       FAIL         900          7.7
javaStyle   OK           900          2.0

Probabilistic Toolbox Complete!

You now command powerful approximate data structures:

  • Skip lists — O(log n) expected search with simple randomization
  • Bloom filters — constant space for set membership (no false negatives)
  • Count-Min Sketch — stream frequency with bounded error
  • HyperLogLog — cardinality estimation with just bytes of memory
  • Birthday paradox — predicts when collisions become likely

Next: We'll study amortized analysis — how to prove that expensive operations "pay for themselves" using potential functions.