This post is part of a series of tutorials on how one can use implicit models for variational inference. Here's a table of contents so far:

• Part I: Inference of single, global variable (Bayesian logistic regression)
• Part II: Amortised Inference via the Prior-Contrastive Method (Explaining Away Demo)
• Part III: Amortised Inference via a Joint-Contrastive Method (ALI, BiGAN)
• ➡️ Part IV (you are here): Using Denoisers instead of Discriminators

That's right, I went full George Lucas and skipped Part III because (a) it was your homework assignment to write that for me and (b) following up Part II with Part III is relatively boring and predictable and the stuff in this post is way more interesting! This is the rogue one. This is about replacing density ratio estimation (a.k.a. the discriminator) with denoising as the main tool to deal with implicit models.

#### Summary of this note

• I explain how denoisers can be used to estimate the gradients of log-densities, which in turn can be used in a variational algorithm
• I derive simple variational inference algorithms based on denoisers for Bayesian logistic regression and for amortised VI
• I discuss related work and why the reconstruction error should not be used as a substitute for the energy
• finally, I discuss the toy experiments I did in the associated IPython notebook

## Rationale

The key difficulty in using implicit models is that their log density (or, up to sign, their energy) is unknown. My way to understand GANs is that they use logistic regression to estimate the log density relative to some other distribution. In generative modelling we estimate the log density ratio to the target data distribution; in VI, the ratio to the prior, or between joint distributions. Crucially, training the discriminator only requires samples from the implicit model (and from the contrast distribution), which is what makes this possible.
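To make the density-ratio view concrete, here is a toy sketch of my own (not code from any of the papers): for two 1-D Gaussians $q = \mathcal{N}(1, 1)$ and $p = \mathcal{N}(0, 1)$, the optimal discriminator's logit equals $\log q(x) - \log p(x) = x - 0.5$, so a plain logistic regression fitted on samples should recover slope $\approx 1$ and intercept $\approx -0.5$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Samples from the "implicit model" q = N(1, 1) and the contrast p = N(0, 1).
xq = rng.normal(1.0, 1.0, size=20000)
xp = rng.normal(0.0, 1.0, size=20000)

x = np.concatenate([xq, xp])
y = np.concatenate([np.ones_like(xq), np.zeros_like(xp)])  # 1 = from q, 0 = from p

# Plain logistic regression on the scalar x, fitted by gradient descent.
w, b = 0.0, 0.0
for _ in range(2000):
    logits = w * x + b
    p_hat = 1.0 / (1.0 + np.exp(-logits))
    grad_w = np.mean((p_hat - y) * x)
    grad_b = np.mean(p_hat - y)
    w -= 0.5 * grad_w
    b -= 0.5 * grad_b

# For these two Gaussians log q(x) - log p(x) = x - 0.5, so the optimal
# discriminator's logit has slope 1 and intercept -0.5.
print(w, b)
```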

Denoising provides another mechanism to learn about the log density of a distribution only requiring samples. Instead of learning a log density ratio, the denoiser function learns the gradient of the log density, also known as the score or score function in statistics. We can then use these gradient estimates with the chain rule to devise an algorithm that maximises or minimises functionals of the log density, such as entropy, mutual information or KL-divergence.

## Derivation

Let's take an implicit probability distribution $q(x; \phi)$ over a $d$-dimensional Euclidean space $\mathbb{R}^d$. Let's say we sample from $q$ by squashing standard normal random vectors $z$ through a nonlinearity $G$, so that $z\sim \mathcal{N}(0,I), x=G(z; \phi)$ is the same as writing $x\sim q(x; \phi)$.
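As a minimal illustration of what "implicit" means operationally (the generator `G` below is a made-up toy, not the one used in the notebook): sampling from $q(x;\phi)$ is nothing more than pushing Gaussian noise through the nonlinearity, and at no point do we evaluate a density:

```python
import numpy as np

rng = np.random.default_rng(0)

def G(z, phi):
    # A toy "generator": an affine map followed by a squashing nonlinearity,
    # standing in for whatever parametric function the implicit model uses.
    return np.tanh(phi[0] * z + phi[1])

# Sampling x ~ q(x; phi): push z ~ N(0, I) through G.
z = rng.normal(size=(5, 2))
x = G(z, phi=(2.0, 0.5))
print(x.shape)
```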

Now consider training a denoising function $F:\mathbb{R}^d\rightarrow \mathbb{R}^d$ so as to minimise the average mean-squared reconstruction error, that is

$F^{*} = \operatorname{argmin}_F \mathbb{E}_{x\sim q(x; \phi)} \mathbb{E}_{\epsilon \sim \mathcal{N}(0, \sigma^2I)}\|F(x+\epsilon) - x\|^2$

In the formula above, $\epsilon$ is additive isotropic Gaussian noise, $x$ is sampled from $q(x; \phi)$, and $F$ tries to reconstruct the original $x$ from its noise-corrupted version $x+\epsilon$. As shown e.g. in (Alain and Bengio, 2014), the Bayes-optimal solution to this denoising problem approaches the following expression as the noise standard deviation $\sigma$ decreases:

$F^{*}(x) \approx x + \sigma^2 \frac{\partial \log q(x; \phi)}{\partial x}$
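This relation is easy to check in a case where everything is analytic. For $q = \mathcal{N}(0, s^2)$ the true score is $-x/s^2$, and the Bayes-optimal denoiser is linear, so fitting a linear $F$ by least squares on corrupted samples should recover the score via $(F(x) - x)/\sigma^2$. A toy sketch of my own (not the notebook's code):

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 0.1  # corruption noise std; the theory holds for small sigma
s = 1.0      # std of q = N(0, s^2); the true score is -x / s**2

# Samples from q and their noise-corrupted versions.
x = rng.normal(0.0, s, size=1_000_000)
y = x + rng.normal(0.0, sigma, size=x.shape)

# Fit a linear denoiser F(y) = a * y by least squares on (corrupted, clean) pairs.
a = np.sum(y * x) / np.sum(y * y)

# Recover the score estimate (F(x) - x) / sigma^2 and compare to -x / s^2.
x_test = np.array([-1.0, 0.5, 2.0])
score_hat = (a * x_test - x_test) / sigma**2
score_true = -x_test / s**2
print(score_hat, score_true)
```

The fitted slope comes out close to $s^2/(s^2+\sigma^2)$, the optimal linear denoiser, and the recovered score matches $-x/s^2$ up to an $O(\sigma^2)$ error, as the formula above predicts.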

#### Cons

• the theory holds in the limit of infinitesimally small noise, but in practice denoisers are trained with a finite noise level. This may blur some fine details of the distribution, and we have seen in (Sønderby et al., 2016) that samples from such a model do tend to be blurrier. Moreover, the denoising objective relies on sampling, which is inefficient in high dimensions. To solve both problems, the regularised autoencoder might be a better option.
• the denoiser function is only accurate around the support of the distribution. Far away from samples, the denoising function is never trained and can develop spurious modes. One can address this by taking an ensemble of denoisers, and only accepting a high gradient signal if all ensemble members agree on the direction. But generally speaking, denoising isn't accurate far away from samples, and this is where discrimination might be somewhat better.
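The sign-agreement gating mentioned above can be sketched as follows (a hypothetical helper of my own, not code from the post): given per-member score estimates, zero out the coordinates where too few ensemble members agree on the direction of the gradient:

```python
import numpy as np

def gated_score(score_estimates, agreement=0.8):
    """Average per-member score estimates, zeroing coordinates where fewer
    than `agreement` of the members agree on the sign of the gradient.

    score_estimates: array of shape (n_members, dim)
    """
    scores = np.asarray(score_estimates)
    signs = np.sign(scores)
    # Majority sign per coordinate, and the fraction of members matching it.
    majority = np.sign(signs.sum(axis=0))
    agree_frac = np.mean(signs == majority, axis=0)
    mean_score = scores.mean(axis=0)
    return np.where(agree_frac >= agreement, mean_score, 0.0)

# Members agree on the first coordinate but disagree on the second,
# so only the first coordinate's gradient signal survives.
ens = np.array([[ 1.0, -2.0],
                [ 1.2,  1.5],
                [ 0.9,  0.1],
                [ 1.1, -0.5]])
print(gated_score(ens))
```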