Differential Entropy & Maximum Entropy

Continuous entropy h(X) = −∫ f log f. Why the Gaussian is nature's default distribution.

Shannon's entropy was defined for discrete random variables. To push the same idea into the continuous world we replace the sum with an integral and arrive at differential entropy: for a random variable X with density f(x), h(X) = −∫ f(x) log f(x) dx. The shape of the formula is identical to its discrete cousin, but its behaviour is meaningfully different — h can be negative, and its value depends on the units you measure x in.
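A quick numerical sanity check makes both points concrete. The sketch below (Python, assuming NumPy and SciPy are available; the helper names are illustrative) integrates −f log₂ f for a zero-mean Gaussian and compares it with the closed form ½ log₂(2π e σ²), which goes negative once σ is small enough:

```python
import numpy as np
from scipy.integrate import quad

def differential_entropy_bits(pdf, lo, hi):
    """Numerically evaluate h(X) = -integral of f(x) log2 f(x) dx over [lo, hi]."""
    def integrand(x):
        return -pdf(x) * np.log2(pdf(x))
    value, _ = quad(integrand, lo, hi)
    return value

def gaussian_pdf(sigma):
    """Density of N(0, sigma^2)."""
    return lambda x: np.exp(-x**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

for sigma in (1.0, 0.5, 0.1):
    h_num = differential_entropy_bits(gaussian_pdf(sigma), -12 * sigma, 12 * sigma)
    h_closed = 0.5 * np.log2(2 * np.pi * np.e * sigma**2)
    # At sigma = 0.1 both values are negative: the density exceeds 1 near the mean.
    print(f"sigma = {sigma:4.2f}:  numeric {h_num:+.4f} bits   closed form {h_closed:+.4f} bits")
```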

Differential entropy gets interesting through the maximum-entropy principle. Among all densities satisfying a given set of constraints, the one that maximizes h(X) is, in a precise sense, the "least biased" choice — it assumes nothing beyond the constraints themselves. Three classic results follow: the uniform distribution is max-entropy on a bounded interval, the exponential is max-entropy on the positive reals at a fixed mean, and the Gaussian is max-entropy on the whole real line at a fixed mean and variance. That last result is one of the deepest reasons Gaussians appear everywhere — in measurement noise, in statistical models, in the central limit theorem, in physics.
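The Gaussian case has a proof short enough to quote. Write g for the N(μ, σ²) density and let f be any density with the same mean and variance; with natural logarithms, non-negativity of relative entropy gives

```latex
0 \le D(f \,\|\, g)
  = \int f \ln\frac{f}{g}\,dx
  = -h(f) - \int f(x)\,\ln g(x)\,dx,
\qquad
-\int f \ln g\,dx
  = \tfrac{1}{2}\ln(2\pi\sigma^2) + \frac{\operatorname{Var}_f(X)}{2\sigma^2}
  = \tfrac{1}{2}\ln(2\pi e\,\sigma^2) = h(g).
```

The second computation works because −ln g(x) = ½ ln(2πσ²) + (x − μ)²/(2σ²) involves x only through its first two moments, which f shares with g. Combining the two displays gives h(f) ≤ h(g), with equality exactly when f = g almost everywhere; the uniform and exponential results follow from the same argument with a different reference density g.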

Interactive: The Differential Entropy Explorer

Pick a continuous distribution, drag the parameter sliders, and watch h(X) update in closed form. Toggle between bits and nats. The amber readout flips to red whenever the entropy goes negative — a phenomenon impossible for discrete entropy.
[Explorer readout at its default setting: family Gaussian, closed form h(X) = ½ log(2π e σ²), current value 2.0471 bits.]

The classic max-entropy density on the real line at fixed mean and variance. Push σ below 1/√(2πe) ≈ 0.242 to see h(X) go negative.
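For readers who want the same readout on the command line, a minimal sketch in Python (the function gaussian_entropy and its unit argument are illustrative, not part of the widget):

```python
import math

SIGMA_ZERO = 1 / math.sqrt(2 * math.pi * math.e)   # ~ 0.2420; below this, h(X) < 0

def gaussian_entropy(sigma, unit="bits"):
    """Closed-form differential entropy of N(mu, sigma^2): h = 0.5 * log(2*pi*e*sigma^2)."""
    h_nats = 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)
    return h_nats if unit == "nats" else h_nats / math.log(2)

for sigma in (1.0, SIGMA_ZERO, 0.1):
    print(f"sigma = {sigma:.4f}:  h = {gaussian_entropy(sigma):+.4f} bits"
          f"  ({gaussian_entropy(sigma, unit='nats'):+.4f} nats)")
```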

Interactive: The Three Canonical Max-Entropy Distributions

Three different constraints, three different answers. Bounded support gives the uniform; a positive variable with fixed mean gives the exponential; a real variable with fixed mean and variance gives the Gaussian. Adjust the constraint parameters and watch each density morph in lock-step.

Three different constraint sets, three different max-entropy distributions. Adjust each panel's constraint and watch its PDF (and entropy) change.

  • Uniform on [a, b]. Constraint: support is [a, b]. Closed form h = log(b − a); readout 1.000 bits.
  • Exponential with rate λ = 1/μ. Constraint: X ≥ 0 and E[X] = μ. Closed form h = 1 + ln μ (nats); readout 1.443 bits at λ = 1/μ = 1.000.
  • Gaussian N(μ, σ²). Constraint: E[X] = μ and Var(X) = σ². Closed form h = ½ log(2π e σ²); readout 2.047 bits.

Each density is the unique max-entropy choice given its constraint. Replace any of these distributions with a different shape that meets the same constraint, and h(X) will be strictly smaller — that is what "maximum entropy" literally means.
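If SciPy is available, the three closed forms are easy to check against scipy.stats, whose entropy() method returns differential entropy in nats; the parameter values below mirror the panel defaults and are otherwise arbitrary:

```python
import numpy as np
from scipy import stats

LN2 = np.log(2)

a, b  = 0.0, 2.0   # uniform support          -> h = ln(b - a)
mu    = 1.0        # exponential mean         -> h = 1 + ln(mu)
sigma = 1.0        # Gaussian std deviation   -> h = 0.5 * ln(2*pi*e*sigma^2)

cases = [
    ("uniform on [a, b]", np.log(b - a),                             stats.uniform(loc=a, scale=b - a)),
    ("exponential",       1 + np.log(mu),                            stats.expon(scale=mu)),
    ("Gaussian",          0.5 * np.log(2 * np.pi * np.e * sigma**2), stats.norm(scale=sigma)),
]

for name, h_closed, dist in cases:
    h_scipy = dist.entropy()   # differential entropy in nats
    print(f"{name:18s}  closed form = {h_closed / LN2:.4f} bits,  scipy = {h_scipy / LN2:.4f} bits")
```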

Interactive: Why the Gaussian Wins

Compare a Gaussian, a Laplace, and a uniform distribution — all standardized to mean 0 and variance 1. The Gaussian has the largest differential entropy. Drag the morph slider to interpolate from Gaussian toward Laplace and watch h drop strictly below the Gaussian value.

All three densities have mean 0 and variance 1. The Gaussian wins — on the real line at fixed (μ, σ²), no other density has a higher differential entropy.

The dashed white curve is the mixture (1 − t)·Gaussian + t·Laplace; at t = 0 the readout shows h(X) = 2.0471 bits, the pure-Gaussian value. Notice how it stays at or below that value as t increases.

  • Gaussian N(0, 1): h = 2.0471 bits (the maximum)
  • Laplace (μ = 0, b = 1/√2): h = 1.9427 bits
  • Uniform [−√3, √3]: h = 1.7925 bits

This is the maximum-entropy theorem in concrete form. Constrain the first two moments of a real-valued random variable, and the unique density that maximizes h(X) is the Gaussian. Anything else with the same mean and variance — heavier tails, lighter tails, compact support — has strictly less entropy.
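The comparison is equally easy to script: standardize all three families to mean 0 and variance 1 and read off their entropies. A sketch assuming SciPy, reproducing the three readouts above:

```python
import numpy as np
from scipy import stats

LN2 = np.log(2)

# Three densities with mean 0 and variance 1; the Gaussian should come out on top.
candidates = {
    "Gaussian N(0, 1)":            stats.norm(loc=0, scale=1),
    "Laplace (b = 1/sqrt(2))":     stats.laplace(loc=0, scale=1 / np.sqrt(2)),           # Var = 2 b^2 = 1
    "Uniform [-sqrt(3), sqrt(3)]": stats.uniform(loc=-np.sqrt(3), scale=2 * np.sqrt(3)), # Var = (b - a)^2 / 12 = 1
}

for name, dist in candidates.items():
    assert abs(dist.var() - 1.0) < 1e-9   # all share the same variance
    print(f"{name:28s}  h = {dist.entropy() / LN2:.4f} bits")
```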

The math objects

  • Probability density f(x): a non-negative function on the real line whose integral is 1. The continuous analogue of a probability mass function.
  • Differential entropy h(X) = −∫ f log f dx: the expected value of −log f(X). It can be negative — for example, a very narrow Gaussian has h(X) < 0 — because f(x) can exceed 1 in places.
  • Maximum-entropy principle: subject to a set of moment constraints, the density that maximizes h(X) is uniquely determined and has the form f(x) ∝ exp(λ₀ + Σᵢ λᵢ Tᵢ(x)) — an exponential family with one Lagrange multiplier per constraint.
  • Closed-form values: h_uniform = log(b − a), h_exp = 1 − ln λ in nats, h_Gaussian = ½ log(2π e σ²), h_Laplace = 1 + ln(2b) in nats. These are easy to remember and cover most cases that come up in practice.
  • Coordinate dependence: if Y = c X then h(Y) = h(X) + log |c|. So differential entropy depends on the units of measurement — meters versus millimeters changes h by log 1000 (a numerical check follows this list). This is the price you pay for replacing a sum by an integral.
  • Why it still matters: differential entropy appears everywhere downstream — in mutual information for continuous variables, in the channel capacity of the additive white Gaussian noise channel, and in the entropy power inequality.
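As a quick check of the coordinate-dependence rule, rescaling a Gaussian from meters to millimeters (c = 1000) should add exactly log₂ 1000 ≈ 9.9658 bits; a minimal sketch assuming SciPy:

```python
import numpy as np
from scipy import stats

LN2 = np.log(2)

# If Y = c * X, then h(Y) = h(X) + log2(c) bits.
c = 1000.0
h_x = stats.norm(scale=1.0).entropy() / LN2   # X ~ N(0, 1), measured in meters
h_y = stats.norm(scale=c).entropy() / LN2     # Y = 1000 * X, the same quantity in millimeters

print(f"h(X) = {h_x:.4f} bits, h(Y) = {h_y:.4f} bits, difference = {h_y - h_x:.4f} bits")
print(f"log2(1000) = {np.log2(c):.4f} bits")
```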

Key takeaways

  • Differential entropy h(X) = −∫ f log f dx replaces the sum in Shannon entropy with an integral.
  • Unlike discrete entropy, h(X) can be negative and depends on the units of x.
  • The maximum-entropy principle picks the "least biased" density consistent with a set of constraints.
  • Uniform, exponential, and Gaussian are the canonical max-entropy distributions for bounded support, fixed mean on (0, ∞), and fixed mean & variance on ℝ respectively.
  • The Gaussian's status as the max-entropy distribution at fixed (μ, σ²) is one of the deepest reasons it shows up so often in nature and statistics.