Machine Learning is all about dealing with uncertainty of outcomes and Bayesian inference provides us a principled way to reason about the same. We combine the observed data with priors to build (potentially complex) posteriors over the variables of interest and use those for answering subsequent questions. The ability to model a probability distribution called the posterior allows us to quantify the uncertainty claims for any downstream tasks.

However, the posterior is a tricky term to compute because of the computationally intensive integral term in the denominator (the evidence). Without having access to the true closed form of the posterior, the next best thing to have is a representative set (more formally the *typical set* *high* density regions. Over the years, researchers have developed a handful of umbrella techniques, all of which converge to the true distribution in the limit of the number of samples - conjugate priors for mathematical convenience

However, like all feasible methods, these make trade offs - using conjugate priors limits the richness of the posterior densities and restricts us to only mathematically convenient likelihood-prior combinations; the MCMC family requires us to carefully design the transition kernel of the Markov chain and the SDE integrator for numerical stability; Variational Inference can lead to posteriors that under-estimate the support of the true distribution because of the nature of forward KL divergence optimization; Normalizing Flows demand an exact parametric form of the bijectors where the computation of the determinant of the Jacobian needs to be tractable. Today we will look at an approach which does away with all the previous pathologies (not to say it doesn’t introduce new ones) called the kernelized Stein Gradient.

We will first summarize the key background topics needed to understand, state the formal results underlying the Stein gradient and then visualize via code how this technique can be used to build powerful samplers.

Kernels are a very neat idea where the objective is to map our raw input space

The key idea is that given a kernel function, we can construct a feature space (more formally known as the *Hilbert space* in functional analysis) such that the kernel computes a dot product in that feature space. The terminology kernel comes from a function

Intuitively, one can imagine this operator to be a rich representation in terms of the linear combination over the full
space where the combination coefficients are defined by the kernel function. By extending this idea and a particular construction, we arrive at the idea of *representer of evaluation* which leads to the classic notion of *reproducing kernel Hilbert space (RKHS)*.

Let

We need mild boundary conditions for the above to be true - either

Let us consider another smooth density

Intuitively, this can be seen as a function weighted by the difference of the score functions of both the distributions. It has been shown

Note that the use of trace for the matrix norm is only meant to make it a scalar and other matrix norms may be considered. If we consider the maximum value over some family of test functions, we get the *Stein Discrepancy* measure.

This can be thought of as a maximum possible violation under the family of test functions away from

If we choose the family of test functions to be the unit norm RKHS, it turns out we can find out a closed form for the Stein Discrepancy measure

This measure is zero if and only if *Stein Variational Gradient Descent*.

As it turns out, under a deterministic transform (a one step normalizing flow)

The distribution of

Combining this result with the previous discussion, we can conclude that

In practice, making this identity perturbation transform with every timestep

*particles*. Before we go and see how this works in practice, it is important to see what the term

If we consider just one particle *rbf* kernel), then what we achieve is plain old gradient descent and the particle would simply reach the mode of

The experiments can be run via the Jupyter notebook. Click the badge above. Here are some results to show you the power of Stein Gradient.

All these results use the *rbf* kernel with the median bandwidth heuristic for a total of 1000 gradient steps using Adam (the original work*Adagrad* but it should be noted we can use any adaptive gradient descent scheme). See the notebook for more details.

It is time to go back to the fundamental equation in Bayesian learning - the Bayes theorem

All the methods mentioned in the introduction make some or the other assumption about the parametric nature to get a tractable posterior. Using the Stein gradient, we are in a position to be non-parametric about the posterior. Additionally, we can work with an unnormalized density because the score function does not depend on the normalized constant during the simulation of the ODE described above -

If you see mistakes or want to suggest changes, please create issues or revisions against source.