The Gaussian Cheatsheet

This is a collection of key derivations involving Gaussian distributions, which arise almost everywhere in machine learning.

Normalizing Constant

A Gaussian distribution with mean $\mu$ and variance $\sigma^2$ is given by

$$p(x) \propto \exp\left\{ - \frac{(x - \mu)^2}{2\sigma^2} \right\}$$

To derive the normalizing constant for this density, consider the following integral

$$\text{I} = \int_{-\infty}^{\infty} \exp\left\{ - \frac{(x - \mu)^2}{2\sigma^2} \right\} dx$$

$$\text{I}^2 = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \exp\left\{ - \frac{(x - \mu)^2}{2\sigma^2} \right\} \exp\left\{ - \frac{(y - \mu)^2}{2\sigma^2} \right\} dx\, dy$$

First, using the change of variables $u = x - \mu$ and $v = y - \mu$, we have

$$\text{I}^2 = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \exp\left\{ - \frac{u^2 + v^2}{2\sigma^2} \right\} du\, dv$$

Transforming to polar coordinates $u = r \cos{\theta}$ and $v = r \sin{\theta}$, the standard change of variables requires the Jacobian determinant $\left| \frac{\partial (u,v)}{\partial (r, \theta)} \right|$:

$$\left| \begin{bmatrix} \cos{\theta} & -r\sin{\theta} \\ \sin{\theta} & r\cos{\theta} \end{bmatrix} \right| = r$$

$$\text{I}^2 = \int_{0}^{2\pi} \int_{0}^{\infty} \exp\left\{ - \frac{r^2}{2\sigma^2} \right\} r\, dr\, d\theta = 2\pi \int_{0}^{\infty} \exp\left\{ - \frac{r^2}{2\sigma^2} \right\} r\, dr$$

Solving this final integral requires another change of variables. Let $z = \frac{r^2}{2\sigma^2}$, hence $\sigma^2 dz = r\, dr$.

$$\text{I}^2 = 2 \pi \sigma^2 \int_{0}^{\infty} e^{-z} dz = 2 \pi \sigma^2 \left[ -e^{-z} \right]_{0}^{\infty} = 2\pi\sigma^2$$

This implies $\text{I} = (2\pi\sigma^2)^{1/2}$, and hence the normalized density is

$$p(x) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left\{ - \frac{(x - \mu)^2}{2\sigma^2} \right\}$$
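This result is easy to verify numerically. Below is a minimal sketch using scipy; the particular values of $\mu$ and $\sigma$ are arbitrary choices for illustration.

```python
# Numerically check that the unnormalized Gaussian integrates to sqrt(2*pi*sigma^2).
# A minimal sketch; mu and sigma are arbitrary.
import numpy as np
from scipy.integrate import quad

mu, sigma = 1.3, 0.7
I, _ = quad(lambda x: np.exp(-(x - mu)**2 / (2 * sigma**2)), -np.inf, np.inf)
print(I, np.sqrt(2 * np.pi * sigma**2))  # both should print ~1.7547
```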

As a Maximum Entropy Distribution

Interestingly, the Gaussian also turns out to be the maximum entropy distribution over the real line among all densities with a fixed mean and (finite) second moment [1]. The differential entropy of a random variable $x \sim p(x)$ is defined as the expected information, $\mathbb{H}[x] = -\int_{-\infty}^{\infty} p(x) \log{p(x)}\, dx$.
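To make the claim concrete, here is a small numerical illustration (a sketch, not a proof): among a few familiar distributions matched to the same variance $\sigma^2$, the Gaussian attains the largest differential entropy. The scale parameters below are chosen so that each distribution has variance exactly $\sigma^2$.

```python
# Compare differential entropies at a common variance sigma^2.
# Laplace with scale b has variance 2*b^2; uniform on [-a, a] has variance a^2/3.
import numpy as np
from scipy import stats

sigma = 1.0
print("Gaussian:", stats.norm(scale=sigma).entropy())                  # ~1.4189 = 0.5*log(2*pi*e)
print("Laplace :", stats.laplace(scale=sigma / np.sqrt(2)).entropy())  # ~1.3466
print("Uniform :", stats.uniform(loc=-sigma * np.sqrt(3),
                                 scale=2 * sigma * np.sqrt(3)).entropy())  # ~1.2425
```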

To find the maximum entropy distribution, we formally write the constrained optimization problem:

$$\begin{aligned} \max_{p(x)}\quad & -\int_{-\infty}^{\infty} p(x) \log{p(x)}\, dx \\ \text{s.t.}\quad & \int_{-\infty}^{\infty} p(x)\, dx = 1 \\ & \int_{-\infty}^{\infty} x\, p(x)\, dx = \mu \\ & \int_{-\infty}^{\infty} (x - \mu)^2\, p(x)\, dx = \sigma^2 \end{aligned}$$

The constraints correspond to the normalization of the density, a fixed mean (first moment), and a fixed variance (second central moment). This can be converted into an unconstrained optimization problem using Lagrange multipliers [2]. The complete objective becomes

$$\begin{aligned} -\int_{-\infty}^{\infty} p(x) \log{p(x)}\, dx &+ \lambda_1 \left( \int_{-\infty}^{\infty} p(x)\, dx - 1 \right) + \lambda_2 \left( \int_{-\infty}^{\infty} x\, p(x)\, dx - \mu \right) \\ &+ \lambda_3 \left( \int_{-\infty}^{\infty} (x - \mu)^2 p(x)\, dx - \sigma^2 \right) \end{aligned}$$

Setting the functional derivative¹ of this objective to zero, $\frac{d [f(p(x))]}{d p(x)} = 0$, we get

$$-\log{p(x)} - 1 + \lambda_1 + \lambda_2 x + \lambda_3 (x - \mu)^2 = 0$$

$$p(x) = \exp\left\{ -1 + \lambda_1 + \lambda_2 x + \lambda_3 (x - \mu)^2 \right\}$$

where the $-1$ comes from differentiating $-p(x)\log{p(x)}$.
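Before recovering the multipliers analytically, it is worth sanity-checking that the solution of this program is indeed Gaussian. The sketch below (not part of the original derivation) solves a discretized version of the constrained problem on a finite grid and compares the optimizer's output to the Gaussian density; the grid range, size, and optimizer are arbitrary choices.

```python
# Solve a discretized maximum entropy problem and compare against the Gaussian pdf.
# A rough numerical sketch; the finite grid stands in for the real line.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

mu, sigma = 0.0, 1.0
x = np.linspace(-5, 5, 101)
dx = x[1] - x[0]

def neg_entropy(p):
    # Discretized differential entropy, negated for minimization.
    return np.sum(p * np.log(np.maximum(p, 1e-12))) * dx

constraints = [
    {"type": "eq", "fun": lambda p: np.sum(p) * dx - 1.0},                     # normalization
    {"type": "eq", "fun": lambda p: np.sum(x * p) * dx - mu},                  # mean
    {"type": "eq", "fun": lambda p: np.sum((x - mu)**2 * p) * dx - sigma**2},  # variance
]

p0 = np.full_like(x, 1.0 / (x[-1] - x[0]))  # start from a uniform density
res = minimize(neg_entropy, p0, method="SLSQP", constraints=constraints,
               bounds=[(0.0, None)] * len(x), options={"maxiter": 1000})

print(np.max(np.abs(res.x - norm.pdf(x, mu, sigma))))  # should be small
```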

To recover the precise values of the Lagrange multipliers, we substitute this form of $p(x)$ back into the constraints. The derivation is involved but straightforward. We first manipulate the exponent by completing the square², which allows us to reuse results from the previous section, and we repeatedly make use of the substitution $x^\prime = x - \frac{\beta}{2\alpha}$.

$$-1 + \lambda_1 + \lambda_2 x + \lambda_3 (x - \mu)^2 = \underbrace{\lambda_3}_{=\alpha} x^2 - \underbrace{(2\mu\lambda_3 - \lambda_2)}_{=\beta} x + \underbrace{(-1 + \lambda_1 + \lambda_3 \mu^2)}_{=\gamma}$$

$$p(x) = \exp\left\{ \alpha\left(x - \frac{\beta}{2\alpha} \right)^2 \right\} \exp\Bigg\{ -\underbrace{\frac{1}{2}\frac{\beta^2 - 4\alpha \gamma}{2\alpha}}_{= \delta} \Bigg\}$$

With these new definitions, the normalization constraint is

$$\int_{-\infty}^{\infty} \exp\left\{ \alpha \left( x^\prime \right)^2 \right\} dx^\prime = \exp\left\{ \delta \right\}$$

Using another change of variables $y = \sqrt{-\alpha}\, x^\prime$ (note that $\alpha < 0$ for a normalizable density), we reduce this integral to a familiar form, which we evaluate using the polar coordinate trick from earlier.

$$\begin{aligned} \frac{1}{\sqrt{-\alpha}} \int_{-\infty}^{\infty} \exp\left\{ -y^2 \right\} dy &= \exp\left\{ \delta \right\} \\ \implies \exp\left\{ \delta \right\} &= \sqrt{\frac{\pi}{-\alpha}} \qquad (1) \end{aligned}$$

Similarly, we substitute into the first moment constraint and re-apply the substitution $y = \sqrt{-\alpha}\, x^\prime$. The first term below is an integral of an odd function over the full domain and therefore vanishes.

$$\begin{aligned} \int_{-\infty}^{\infty} \left(x^\prime + \frac{\beta}{2\alpha}\right) \exp\left\{ \alpha (x^\prime)^2 \right\} dx^\prime &= \cancel{\int_{-\infty}^{\infty} x^\prime \exp\left\{ \alpha (x^\prime)^2 \right\} dx^\prime} + \frac{\beta}{2\alpha} \int_{-\infty}^{\infty} \exp\left\{ \alpha (x^\prime)^2 \right\} dx^\prime \\ &= \mu \exp\left\{ \delta \right\} \\ \implies \exp\left\{ \delta \right\} &= \frac{\beta}{2\mu\alpha} \sqrt{\frac{\pi}{-\alpha}} \qquad (2) \end{aligned}$$

Combining (1) with (2), we get $\beta = 2\mu\alpha$. Substituting $\alpha$ and $\beta$ as defined earlier, we get

$$\lambda_2 = 0$$

With the same approach and substitutions, we evaluate the integral in the second moment constraint. Since $\beta = 2\mu\alpha$ implies $\frac{\beta}{2\alpha} = \mu$, the two constant terms cancel:

$$\int_{-\infty}^{\infty} \left(x^\prime + \cancel{\frac{\beta}{2\alpha}} - \cancel{\mu} \right)^2 \exp\left\{ \alpha \left( x^\prime \right)^2 \right\} dx^\prime = \sigma^2 \exp\left\{ \delta \right\}$$

Focusing on the remaining term, we first apply the change of variables $y = \sqrt{-\alpha}\, x^\prime$.

$$\int_{-\infty}^{\infty} (x^\prime)^2 \exp\left\{ \alpha \left( x^\prime \right)^2 \right\} dx^\prime = \frac{1}{(-\alpha)^{3/2}} \int_{-\infty}^{\infty} y^2 \exp\left\{ -y^2 \right\} dy$$

To allow a further change of variables, note that the integrand is even. We can therefore double the integral over the positive half-line and substitute $z = y^2$:

$$\begin{aligned} \frac{2}{(-\alpha)^{3/2}} \int_{0}^{\infty} y^2 \exp\left\{ -y^2 \right\} dy &= \frac{1}{(-\alpha)^{3/2}} \int_{0}^{\infty} z^{1/2} \exp\left\{ -z \right\} dz \\ &= \frac{\Gamma(3/2)}{(-\alpha)^{3/2}} = \frac{1}{2}\sqrt{\frac{\pi}{(-\alpha)^3}} \end{aligned}$$

where we use the facts that $\Gamma(x + 1) = x\Gamma(x)$³ and $\Gamma(1/2) = \sqrt{\pi}$. Plugging everything back and using $\beta = 2\mu\alpha$, we get

$$\begin{aligned} \frac{1}{2}\sqrt{\frac{\pi}{(-\alpha)^3}} &= \sigma^2 \exp\left\{\delta\right\} \\ \frac{1}{2}\sqrt{\frac{\pi}{(-\alpha)^3}} &= \sigma^2 \sqrt{\frac{\pi}{-\alpha}} \\ \implies -\frac{1}{2\alpha} &= \sigma^2 \end{aligned}$$

Since $\alpha = \lambda_3$, this gives

$$\lambda_3 = - \frac{1}{2\sigma^2}$$

Substituting back into (1), we have

$$\begin{aligned} \exp\left\{ \frac{4\mu^2\alpha^2 - 4\alpha\gamma}{4\alpha} \right\} &= \sqrt{\frac{\pi}{-\alpha}} \\ \exp\left\{ \mu^2\lambda_3 + 1 - \lambda_1 - \lambda_3\mu^2 \right\} &= \sqrt{2\pi\sigma^2} \\ \implies \lambda_1 &= 1 - \frac{1}{2}\log{2\pi\sigma^2} \end{aligned}$$

Substituting $\lambda_1, \lambda_2, \lambda_3$ back into $p(x)$ gives $p(x) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left\{ -\frac{(x - \mu)^2}{2\sigma^2} \right\}$, i.e. $p(x) = \mathcal{N}(\mu, \sigma^2)$.
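As a final cross-check, the multiplier values can also be recovered numerically. The sketch below (with arbitrary $\mu$ and $\sigma$) solves the three constraint equations for $(\lambda_1, \lambda_2, \lambda_3)$ directly and compares against the closed forms derived above; the integration range is truncated since the tails are negligible.

```python
# Solve the moment constraints for the Lagrange multipliers and compare with
# lambda_1 = 1 - 0.5*log(2*pi*sigma^2), lambda_2 = 0, lambda_3 = -1/(2*sigma^2).
# A rough sketch; mu, sigma, the initial guess, and the cutoff are arbitrary.
import numpy as np
from scipy.integrate import quad
from scipy.optimize import fsolve

mu, sigma = 0.5, 2.0
LIM = 30.0  # truncate the real line; tails are negligible here

def p(x, l1, l2, l3):
    return np.exp(-1 + l1 + l2 * x + l3 * (x - mu)**2)

def residuals(lam):
    l1, l2, l3 = lam
    Z, _ = quad(lambda x: p(x, l1, l2, l3), -LIM, LIM)
    m, _ = quad(lambda x: x * p(x, l1, l2, l3), -LIM, LIM)
    v, _ = quad(lambda x: (x - mu)**2 * p(x, l1, l2, l3), -LIM, LIM)
    return [Z - 1.0, m - mu, v - sigma**2]

lam = fsolve(residuals, x0=[0.0, 0.0, -0.5])
print(lam)
print([1 - 0.5 * np.log(2 * np.pi * sigma**2), 0.0, -1 / (2 * sigma**2)])
```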

References

  1. Bishop, C.M., 2006. Pattern Recognition and Machine Learning. Springer.
  2. Boyd, S. & Vandenberghe, L., 2004. Convex Optimization. Cambridge University Press.

Footnotes


  1. See Appendix D in [1].
  2. For any general quadratic, $\alpha x^2 - \beta x + \gamma = \alpha \left(x - \frac{\beta}{2\alpha} \right)^2 - \frac{1}{2} \frac{\beta^2 - 4\alpha \gamma}{2\alpha}$.
  3. $\Gamma(x) = \int_{0}^{\infty} u^{x-1} e^{-u}\, du$ is the Gamma function.
