# Variational Inference using Implicit Models Part IV: Denoisers instead of Discriminators

This post is part of a series of tutorials on how one can use implicit models for variational inference. Here's a table of contents so far:

- Part I: Inference of single, global variable (Bayesian logistic regression)
- Part II: Amortised Inference via the Prior-Contrastive Method (Explaining Away Demo)
- Part III: Amortised Inference via a Joint-Contrastive Method (ALI, BiGAN)
- ➡️️Part IV (you are here): Using Denoisers instead of Discriminators

That's right, I went full George Lucas and skipped Part III because (a) it was your homework assignment to write that for me and (b) following up Part II with Part III is relatively boring and predictable and the stuff in this post is way more interesting! This is the rogue one. This is about replacing density ratio estimation (a.k.a. the discriminator) with *denoising* as the main tool to deal with implicit models.

#### Summary of this note

- I explain how denoisers can be used to estimate the gradients of log-densities, which in turn can be used in a variational algorithm
- I derive simple variational inference algorithms based on denoisers for Bayesian logistic regression and for amortised VI
- I discuss related work and why the reconstruction error should not be used as a substitute for the energy
- finally, I discuss the toy experiments I did in the associated IPython notebook

## Rationale

The key difficulty in using implicit models is that their log density (also known as energy) is unknown. My way to understand GANs is that they use logistic regression to estimate the log density relative to some other distribution. In generative modelling we measure the log density ratio to the target data distribution, in VI to the prior, or between joint distributions. Crucially, training the discriminator only requires samples from the implicit model (and from the contrast distribution), which is what makes this possible.

Denoising provides another mechanism to learn about the log density of a distribution only requiring samples. Instead of learning a log density ratio, the denoiser function learns the gradient of the log density, also known as the score or score function in statistics. We can then use these gradient estimates with the chain rule to devise an algorithm that maximises or minimises functionals of the log density, such as entropy, mutual information or KL-divergence.

## Derivation

Let's take an implicit probability distribution $q(x; \phi)$ over a $d$-dimensional Euclidean space $\mathbb{R}^d$. Let's say we sample from $q$ by squashing normal random vectors $z$ through a nonlinearity $G$, so that $z\sim \mathcal{N}(0,I), x=G(z; \phi)$ is the same as writing $x\sim q(x; \phi)$.
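To make the sampling procedure concrete, here is a minimal NumPy sketch. The generator `G` below (one tanh hidden layer) and the shapes of `phi` are made up for illustration and are not the architecture used in the notebook; the point is only that we can draw samples of $x$ without ever writing down $q(x; \phi)$.

```python
import numpy as np

def G(z, phi):
    """Hypothetical generator: a single tanh hidden layer followed by a linear map."""
    W1, b1, W2, b2 = phi
    return np.tanh(z @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d, hidden = 2, 16  # illustrative sizes, chosen arbitrarily

# phi collects all generator parameters (random values here, just for the demo)
phi = (
    rng.standard_normal((d, hidden)),
    np.zeros(hidden),
    rng.standard_normal((hidden, d)) / np.sqrt(hidden),
    np.zeros(d),
)

z = rng.standard_normal((1000, d))  # z ~ N(0, I)
x = G(z, phi)                       # x ~ q(x; phi), density never evaluated
```

We only ever touch `x` through samples, which is exactly the setting where both discriminators and denoisers are useful.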

Now consider training a denoising function $F:\mathbb{R}^d\rightarrow \mathbb{R}^d$ so as to minimise the average mean-squared reconstruction error, that is

$
F^{*} = \operatorname{argmin}_F \mathbb{E}_{x\sim q(x; \phi)} \mathbb{E}_{\epsilon \sim \mathcal{N}(0, \sigma_n^2 I)}\|F(x+\epsilon) - x\|^2
$

In the formula above, $\epsilon$ is additive isotropic Gaussian noise with standard deviation $\sigma_n$, $x$ is sampled from $q(x; \phi)$ and $F$ tries to reconstruct the original $x$ from its noise-corrupted version $x+\epsilon$. As mentioned in, e.g., (Alain and Bengio, 2014), the Bayes-optimal solution to this denoising problem approaches the following approximate solution as the noise variance $\sigma_n^2$ decreases:

$
F^{*}(x) \approx x + \sigma_n^2 \frac{\partial \log q(x; \phi)}{\partial x}
$
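A quick way to sanity-check this approximation is a case where everything is known in closed form. The sketch below assumes a toy one-dimensional Gaussian $q = \mathcal{N}(\mu, s^2)$ (my choice for illustration, not one of the notebook experiments). For Gaussian data the optimal denoiser is linear, so we can fit it by least squares and then read off a score estimate as $(F(y) - y)/\sigma_n^2$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy setup: q = N(mu, s^2), so the true score is -(x - mu) / s^2
mu, s = 1.0, 2.0
sigma_n = 0.2        # noise standard deviation
n = 1_000_000

x = mu + s * rng.standard_normal(n)          # clean samples from q
y = x + sigma_n * rng.standard_normal(n)     # noise-corrupted samples

# Fit a linear denoiser F(y) = a * y + b by least squares on (y, x) pairs.
# A linear F is Bayes-optimal only because q is Gaussian in this toy example.
A = np.stack([y, np.ones(n)], axis=1)
(a, b), *_ = np.linalg.lstsq(A, x, rcond=None)

def F(y):
    return a * y + b

def score_estimate(y):
    # Rearranging F(y) ≈ y + sigma_n^2 * d/dy log q(y)
    return (F(y) - y) / sigma_n**2

grid = np.linspace(mu - 2 * s, mu + 2 * s, 9)
true_score = -(grid - mu) / s**2
max_abs_error = np.max(np.abs(score_estimate(grid) - true_score))
print("fitted slope a:", a)
print("max abs score error on grid:", max_abs_error)
```

In this Gaussian case the denoiser recovers the score of the noise-smoothed density, $-(y - \mu)/(s^2 + \sigma_n^2)$, which is where the "$\approx$" comes from: the bias vanishes as $\sigma_n \to 0$, but dividing by $\sigma_n^2$ also amplifies any fitting error, so in practice there is a trade-off in choosing the noise level.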

Note that, of course, the optimal denoising behaviour depends on the data generating distribution $q(x; \phi)$. Hence, once we have trained a denoising function that is close to the Bayes optimum, we can extract from it an estimate of the score $\frac{\partial \log q(x; \phi)}{\partial x}$. In turn, we can use these score estimates to estimate the gradient of $q