Bayesian Inference

Posterior equals likelihood times prior, normalized. Watch beliefs update as evidence arrives.

Bayesian inference treats unknown parameters as random variables and uses probability to encode what you believe about them. Start with a prior P(θ) — your beliefs before seeing data. Observe data D, and compute the likelihood P(D | θ) — how plausible the data is for each θ. Bayes' rule combines them into the posterior:

P(θ | D) = P(D | θ) · P(θ) / P(D)

The denominator P(D) — the evidence — just normalizes things; the shape of the posterior is set by the numerator. For a handful of special prior–likelihood pairs called conjugate families, the posterior lives in the same family as the prior, and the update is a one-line formula. The first two demos below show the two classic cases: Beta–Binomial for proportions, and Normal–Normal for an unknown mean. The third demo shows that a familiar workhorse — ridge regression — is just a MAP estimate in disguise.
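
To make the formula concrete, here is a minimal numerical sketch of Bayes' rule on a grid: discretize θ, multiply the prior by the likelihood pointwise, and normalize. The Beta(2, 2) prior and the 7-heads-in-10-flips data are illustrative choices, not taken from the demos below.

    import numpy as np
    from scipy.stats import beta, binom

    theta = np.linspace(0, 1, 1001)          # grid over the parameter
    prior = beta.pdf(theta, 2, 2)            # P(θ): an assumed Beta(2, 2) prior
    likelihood = binom.pmf(7, 10, theta)     # P(D | θ): 7 heads in 10 flips
    unnorm = likelihood * prior              # numerator of Bayes' rule
    posterior = unnorm / unnorm.sum()        # normalizing plays the role of P(D)

    print(theta[np.argmax(posterior)])       # posterior mode ≈ 0.667

Because Beta is conjugate to the Binomial, the grid result matches the closed form: the posterior is exactly Beta(2 + 7, 2 + 3) = Beta(9, 5).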

Interactive: Beta–Binomial Updater

Coin flips and a Beta prior. Each head bumps α by one, each tail bumps β by one. Watch the posterior tighten as the evidence accumulates and the prior gets washed out.
[Widget readouts: observations 0H / 0T · posterior mean 0.5000 · posterior mode (MAP) · posterior std 0.2887]

Each flip moves the posterior parameters by exactly one: a head bumps α, a tail bumps β. With no data the posterior equals the prior. As evidence accumulates the curve narrows and centers on the empirical proportion — the prior's influence dwindles. Try a confident prior like Beta(8, 8) and watch how many flips it takes for the posterior to forget it.
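
A sketch of this update in code, assuming a uniform Beta(1, 1) prior; the flip sequence is made up for illustration:

    # Each head bumps α, each tail bumps β: the entire Beta–Binomial update.
    alpha, beta = 1.0, 1.0                   # uniform prior: Beta(1, 1)
    for flip in "HHTHTHHH":                  # hypothetical data: 6 heads, 2 tails
        if flip == "H":
            alpha += 1
        else:
            beta += 1

    mean = alpha / (alpha + beta)                      # posterior mean
    mode = (alpha - 1) / (alpha + beta - 2)            # MAP, valid for α, β > 1
    var = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1))
    print(f"Beta({alpha:.0f}, {beta:.0f}): mean={mean:.4f}, "
          f"mode={mode:.4f}, std={var ** 0.5:.4f}")

With 6 heads and 2 tails the posterior is Beta(7, 3): mean 0.7, mode 0.75, std ≈ 0.138.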

Interactive: Gaussian Posterior over the Mean

Estimating the mean μ of a normal with known variance. Click on the strip to add observations; the posterior is Gaussian with precision equal to prior precision plus n times data precision.

Click anywhere on the strip below to add an observation. The posterior over μ narrows toward the sample mean as more data arrives.

tip: a tight prior (small σ₀²) resists the data; a loose prior lets the data dominate
[Widget readouts: n = 0 · sample mean x̄ · posterior mean 0.000 · posterior σ 2.000]

With known data variance σ², the posterior over μ is Gaussian: the precisions add, and the posterior mean is a precision-weighted average of the prior mean and the sample mean. With one observation, the posterior already shifts noticeably; with many, it pins down μ almost exactly — and you can see the prior's pull fade.
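
A sketch of the same update in code. The N(0, 2²) prior and unit data variance mirror the demo's starting state; the observations are made up:

    import numpy as np

    mu0, sigma0 = 0.0, 2.0                   # prior over μ: N(0, 2²)
    sigma = 1.0                              # known data standard deviation
    x = np.array([1.2, 0.8, 1.5, 1.1])       # hypothetical observations

    tau0 = 1 / sigma0 ** 2                   # prior precision
    tau = 1 / sigma ** 2                     # per-observation data precision
    n = len(x)

    tau_post = tau0 + n * tau                            # precisions add
    mu_post = (tau0 * mu0 + tau * x.sum()) / tau_post    # precision-weighted mean
    print(mu_post, tau_post ** -0.5)         # ≈ 1.082 and ≈ 0.485

The sample mean is 1.15, but the posterior mean lands near 1.08: four observations still leave a little room for the prior's pull toward zero.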

Interactive: MAP vs MLE — Ridge Regression as a Gaussian Prior

The same dataset, fit two ways. MLE picks the line of maximum likelihood (OLS). MAP adds a Gaussian prior on the slope — the result is ridge regression. Slide λ to watch the MAP line flatten toward zero slope.

Click on the canvas to add a data point; click an existing point to remove it. Slide λ to strengthen the Gaussian prior on the slope — the MAP line bends toward zero slope (ridge regression).

[Widget readouts]
MLE (no prior): ŷ = 0.359 + 1.383·x, SSE = 0.538
MAP (Gaussian prior, λ = 1.000): ŷ = 0.368 + 1.348·x, SSE = 0.585, loss + λ‖β‖² = 2.402

MLE picks the parameters that maximize the data likelihood — for Gaussian noise that's ordinary least squares. MAP adds a prior: a zero-mean Gaussian prior on the slope contributes a λ‖β‖² penalty, exactly recovering ridge regression. As λ → 0 the MAP line collapses to MLE; as λ → ∞ it flattens to ȳ. The point: MAP = MLE + regularizer, and the regularizer is a prior in disguise.
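
A closed-form sketch of both fits on a tiny made-up dataset. As in the demo, λ penalizes only the slope, so the intercept column of the design matrix is left unpenalized:

    import numpy as np

    x = np.array([0.1, 0.4, 0.5, 0.8, 0.9])
    y = np.array([0.5, 0.9, 1.1, 1.5, 1.6])       # hypothetical data points
    X = np.column_stack([np.ones_like(x), x])     # [intercept, slope] design matrix

    lam = 1.0
    P = np.diag([0.0, lam])                       # Gaussian prior on the slope only

    beta_mle = np.linalg.solve(X.T @ X, X.T @ y)        # OLS = MLE under Gaussian noise
    beta_map = np.linalg.solve(X.T @ X + P, X.T @ y)    # ridge = MAP

    print("MLE:", beta_mle)                       # steeper slope
    print("MAP:", beta_map)                       # slope shrunk toward zero

Setting lam = 0 makes the two solves identical; cranking it up drives the MAP slope to zero while the intercept absorbs ȳ.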

The math objects

  • Prior P(θ): a probability distribution over the parameter space, encoding what you believe before seeing data. A flat prior says you know nothing; a sharp prior encodes a confident guess.
  • Likelihood P(D | θ): the probability the model assigns to the observed data, viewed as a function of θ with D fixed. Not a probability over θ — that's what makes the posterior different.
  • Posterior P(θ | D): the prior reweighted by the likelihood and renormalized. It is the answer to “given what I've seen, what should I now believe?”
  • Conjugate prior: a prior that, paired with a particular likelihood, yields a posterior in the same family. Beta is conjugate to Binomial; Normal (over μ) is conjugate to Normal (over data with known variance). Conjugacy keeps the math closed-form.
  • MAP estimate: argmax_θ P(θ | D) — the posterior mode. With a flat prior, MAP equals MLE. With a Gaussian prior on linear regression coefficients, MAP equals ridge regression. Many regularizers are priors in disguise: L² penalty ↔ Gaussian prior, L¹ penalty ↔ Laplace prior (see the sketch after this list).
  • MLE: argmax_θ P(D | θ) — the parameter value that makes the observed data most likely. It ignores the prior, so it overfits when data is scarce.
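
A quick numerical check of the MAP bullet, reusing the 7-heads-in-10-flips coin from earlier. The grid, the flat prior, and the Gaussian-shaped prior on θ are all illustrative choices:

    import numpy as np
    from scipy.stats import binom, norm

    theta = np.linspace(0.001, 0.999, 999)           # grid over θ, step 0.001
    loglik = binom.logpmf(7, 10, theta)              # log P(D | θ): 7 heads in 10 flips

    flat = np.zeros_like(theta)                      # flat prior: log P(θ) is constant
    gauss = norm.logpdf(theta, loc=0.5, scale=0.1)   # assumed confident prior near 0.5

    print(theta[np.argmax(loglik + flat)])           # 0.700: flat prior, MAP = MLE
    print(theta[np.argmax(loglik + gauss)])          # ≈ 0.56: pulled toward 0.5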

Key takeaways

  • Bayes' rule: posterior ∝ likelihood × prior. The evidence is just a normalizer.
  • Conjugate priors keep the update in closed form — Beta + Binomial and Normal + Normal are the canonical pairs.
  • For Beta–Binomial, the update is literally α ← α + heads, β ← β + tails.
  • For Normal–Normal with known variance, precisions add: τ_post = τ_prior + n·τ_data.
  • MAP maximizes log-likelihood + log-prior; MLE drops the prior term. Ridge regression is MAP with a Gaussian prior on the coefficients.
  • As data accumulates, the posterior concentrates and the prior's influence dwindles — Bayesian and frequentist estimates agree in the limit.