Machine Learning is all about dealing with uncertainty of outcomes and Bayesian inference provides us a principled way to reason about the same. We combine the observed data with priors to build (potentially complex) posteriors over the variables of interest and use those for answering subsequent questions. The ability to model a probability distribution called the posterior allows us to quantify the uncertainty claims for any downstream tasks.

$\overbrace{p(\Theta|\mathbf{X})}^{\text{posterior}} = \frac{\overbrace{p(\mathbf{X}|\Theta)}^{\text{likelihood}}\overbrace{p(\Theta)}^{\text{prior}}}{\underbrace{P(\mathbf{X})}_{\text{evidence}}}$

However, the posterior is a tricky term to compute because of the computationally intensive integral term in the denominator (the evidence). Without having access to the true closed form of the posterior, the next best thing to have is a representative set (more formally the *typical set* [2] from the posterior distributions' *high* density regions. Over the years, researchers have developed a handful of umbrella techniques, all of which converge to the true distribution in the limit of the number of samples - conjugate priors for mathematical convenience [3], Markov Chain Monte Carlo (MCMC) family of algorithms especially the Hamiltonian Monte Carlo (HMC) [1, 4], Variational Inference [7, 5] via approximate posteriors, Normalizing Flows [6] for deterministic transforms. All these methods have been immensely successful in modern Bayesian analysis.

However, like all feasible methods, these make trade offs - using conjugate priors limits the richness of the posterior densities and restricts us to only mathematically convenient likelihood-prior combinations; the MCMC family requires us to carefully design the transition kernel of the Markov chain and the SDE integrator for numerical stability; Variational Inference can lead to posteriors that under-estimate the support of the true distribution because of the nature of forward KL divergence optimization; Normalizing Flows demand an exact parametric form of the bijectors where the computation of the determinant of the Jacobian needs to be tractable. Today we will look at an approach which does away with all the previous pathologies (not to say it doesn't introduce new ones) called the kernelized Stein Gradient.

We will first summarize the key background topics needed to understand, state the formal results underlying the Stein gradient and then visualize via code how this technique can be used to build powerful samplers.

Kernels are a very neat idea where the objective is to map our raw input space $\mathcal{X}$ to an abstract vector space $\mathcal{H}$ so that we can define similarity in terms of dot products and get all the geometric niceties of angles, lengths and distances [8]. Formally, the most important equations can be summarized as

$\begin{aligned}
\mathbf{\Phi}:& \mathcal{X} \to \mathcal{H} \\
&\mathcal{x} \mapsto \mathbf{x} \triangleq \mathbf{\Phi}(\mathcal{x}) \\
k(x, x^\prime) &= \langle \mathbf{\Phi}(\mathcal{x}), \mathbf{\Phi}(\mathcal{x}^\prime) \rangle
\end{aligned}$

The key idea is that given a kernel function, we can construct a feature space (more formally known as the *Hilbert space* in functional analysis) such that the kernel computes a dot product in that feature space. The terminology kernel comes from a function $k$ which gives rise to the following integral operator $T_k$

$(T_kf)(x) = \int_{\mathcal{X}} k(x, x^\prime) f(x) dx^\prime$

Intuitively, one can imagine this operator to be a rich representation in terms of the linear combination over the full
space where the combination coefficients are defined by the kernel function. By extending this idea and a particular construction, we arrive at the idea of *representer of evaluation* which leads to the classic notion of *reproducing kernel Hilbert space (RKHS)*.

Let $p(x)$ be a smooth density supported on $\mathcal{X} \subseteq \mathbb{R}^d$ and $\mathbf{\phi}(x)$ is a smooth vector function, the Stein's identity states that under a Stein operator $\mathcal{A}_p$,

$\begin{aligned}
\mathbb{E}_{x \sim p}\left[ \mathcal{A}_p\phi(x) \right] &= 0 \\
\text{where}, \mathcal{A}_p\phi(x) &= \phi(x) \nabla_x \log{p(x)}^T + \nabla_x \phi(x)
\end{aligned}$

We need mild boundary conditions for the above to be true - either $p(x)\phi(x) = 0\text{ }\forall x \in \partial\mathcal{X}$ when $\mathcal{X}$ is compact or $\lim_{||x|| \to \infty}p(x)\phi(x) =0$ when $\mathcal{X} = \mathbb{R}^d$. A function $\phi$ belongs to the Stein class of $p$ if the Stein identity holds.

Let us consider another smooth density $q(x)$ and $\phi$ belongs to the Stein class of $q$ (but not $p$). After some simple manipulations, we can see that

$\begin{aligned}
\mathbb{E}_q\left[\mathcal{A}_p\phi(x)\right] &= \mathbb{E}_q\left[\mathcal{A}_p\phi(x) - \mathcal{A}_q\phi(x)\right] \\
&= \mathbb{E}_q\left[ (\nabla_x \log{p(x)} - \nabla_x \log{q(x)}) \phi(x)^T \right]
\end{aligned}$

Intuitively, this can be seen as a function weighted by the difference of the score functions of both the distributions. It has been shown [9] that

$\mathbb{E}_q\left[ (\nabla_x \log{p(x)} - \nabla_x \log{q(x)}) \phi(x)^T \right] = \mathbb{E}_q\left[trace(\mathcal{A}_p\phi(x))\right]$

Note that the use of trace for the matrix norm is only meant to make it a scalar and other matrix norms may be considered. If we consider the maximum value over some family of test functions, we get the *Stein Discrepancy* measure.

$\mathbb{S}(q, p) = \underset{\phi \in \mathcal{F}}{\max} \left\{ (\mathbb{E}_q\left[trace(\mathcal{A}_p\phi(x))\right])^2 \right\}$

This can be thought of as a maximum possible violation under the family of test functions away from $p$. However, it turns out that the computational tractability of this measure is critically dependent on the choice of family of test functions $\mathcal{F}$ (and mostly not possible).

If we choose the family of test functions to be the unit norm RKHS, it turns out we can find out a closed form for the Stein Discrepancy measure [10].

$\begin{aligned}
\mathbb{S}(q, p) &= \underset{\phi \in \mathcal{H}^d}{\max} \left\{ (\mathbb{E}_q\left[trace(\mathcal{A}_p\phi(x))\right])^2, ||\phi||_{\mathcal{H}^d} \leq 1 \right\} \\
&= ||\mathbf{\phi}^\star_{q,p}||_{\mathcal{H}^d}^2 \\
\text{where, } \mathbf{\phi}^\star_{q,p}(\cdot) &= \mathbb{E}_q\left[ \mathcal{A}_p k(x, \cdot) \right]
\end{aligned}$

This measure is zero if and only if $q = p$. The power of this result is realized by the *Stein Variational Gradient Descent*.

As it turns out, under a deterministic transform (a one step normalizing flow) $z = \mathbf{T}(x) = x + \epsilon \mathbf{\phi}(x), \text{where } x \sim q(x)$, we have [11].

$\nabla_\epsilon KL(q_{[\mathbf{T}]} || p) \bigg|_{\epsilon = 0} = - \mathbb{E}_q\left[trace(\mathcal{A}_p\phi(x))\right]$

The distribution of $q_{[\mathbf{T}]}$ can be given by the change of variable formula for probability distributions

$q_{[\mathbf{T}]}(z) = q(\mathbf{T}^{-1}(z))\cdot \det{\left|\nabla_z\mathbf{T}^{-1}(z)\right|}$

Combining this result with the previous discussion, we can conclude that $\phi^\star_{q,p}$ is the optimal direction of perturbation in the unit norm RKHS. This is the direction of the steepest descent that minimizes the KL divergence of the transformed distribution $q_{[\mathbf{T}]}$ in the zero centered ball of $\mathcal{H}^d$ and the magnitude of change is $\nabla_\epsilon KL(q_{[\mathbf{T}]} || p) \bigg|_{\epsilon = 0} = - \mathbb{S}(q, p)$

In practice, making this identity perturbation transform with every timestep $\epsilon$ brings the KL divergence down by a factor of $\epsilon \mathbb{S}(q,p)$. If we keep running this long enough, we should eventually converge to the true distribution $p$. Therefore, the ODE we are trying to simulate to convergence is

$\dot{x} = \phi^\star_{q,p}(x)$

$\phi^\star_{q,p}$ is an expectation and can be empirically estimated using a mean of $n$ *particles*. Before we go and see how this works in practice, it is important to see what the term $\phi^\star_{q,p}$ is really achieving

$\hat{\phi}^\star_{q,p}(x) = \frac{1}{n} \sum_{j = 1}^n \left[ k(x_j, x) \nabla_{x_j} \log{p(x_j)} + \nabla_{x_j} k(x_j, x) \right]$

If we consider just one particle $n = 1$ and all kernels where $\nabla_xk(x,x) =0$ (which is true for the *rbf* kernel), then what we achieve is plain old gradient descent and the particle would simply reach the mode of $p$. In the presence of more particles, the kernel acts like a repulsive force which encourage diversity of particles. This is actually a pretty neat result where we have established sort of a communication protocol between the particles via the gradient of the kernel function.

The experiments can be run via the Jupyter notebook. Click the badge above. Here are some results to show you the power of Stein Gradient.

Stein particles on Gaussian Distribution

Stein particles on Mixture of Two Gaussians

Stein particles on Mixture of Six Gaussians

MAP behavior with one particle on a Mixture of Six Gaussians (the particle may fall into any of the modes depending on initial position)

All these results use the *rbf* kernel with the median bandwidth heuristic for a total of 1000 gradient steps using Adam (the original work [11] uses *Adagrad* but it should be noted we can use any adaptive gradient descent scheme). See the notebook for more details.

It is time to go back to the fundamental equation in Bayesian learning - the Bayes theorem

$\overbrace{p(\Theta|\mathbf{X})}^{\text{posterior}} = \frac{\overbrace{p(\mathbf{X}|\Theta)}^{\text{likelihood}}\overbrace{p(\Theta)}^{\text{prior}}}{\underbrace{P(\mathbf{X})}_{\text{evidence}}}$

All the methods mentioned in the introduction make some or the other assumption about the parametric nature to get a tractable posterior. Using the Stein gradient, we are in a position to be non-parametric about the posterior. Additionally, we can work with an unnormalized density because the score function does not depend on the normalized constant during the simulation of the ODE described above - $\nabla_x \log{p(x)} = \nabla_x \log{\tilde{p}(x)} - \nabla_x \log{Z} = \log{\tilde{p}(x)}$. An example of this was seen in the experiments when we used the Mixture of Gaussians which were unnormalized. It should however be noted that we may need to introduce a stochastic gradient if the likelihood term is costly to evaluate just like in the Stochastic Hamiltonian Monte Carlo gradient. Overall, this is great news and an exciting approach to dig into!

- Betancourt, M. et al., 2017. The geometric foundations of hamiltonian monte carlo. Bernoulli, 23(4A), pp.2257–2298.
- Brooks, S. et al., 2011. Handbook of markov chain monte carlo, CRC press. Available at: http://www.mcmchandbook.net/.
- Gelman, A. et al., 2013. Bayesian data analysis, Chapman and Hall/CRC.
- Gorham, J. & Mackey, L., 2015. Measuring sample quality with Stein’s method. In Advances in Neural Information Processing Systems. pp. 226–234.
- Jordan, M.I. et al., 1999. An introduction to variational methods for graphical models. Machine learning, 37(2), pp.183–233.
- Liu, Q., Lee, J. & Jordan, M., 2016. A kernelized Stein discrepancy for goodness-of-fit tests. In International Conference on Machine Learning. pp. 276–284.
- Liu, Q. & Wang, D., 2016. Stein variational gradient descent: A general purpose bayesian inference algorithm. In Advances In Neural Information Processing Systems. pp. 2378–2386.
- Neal, R.M., 2012. MCMC using Hamiltonian dynamics, ArXiv e-prints, June.
- Ranganath, R., Gerrish, S. & Blei, D., 2014. Black box variational inference. In Artificial Intelligence and Statistics. pp. 814–822.
- Rezende, D.J. & Mohamed, S., 2015. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770.
- Scholkopf, B. & Smola, A.J., 2001. Learning with kernels: support vector machines, regularization, optimization, and beyond, MIT press.

Created: Sun Mar 17 2019