# Choice of Recognition Models in VAEs: a regularisation view.

This post is based on a conversation I had recently about the importance of the flexibility of approximate posteriors in variational inference. Many people know that your choice of approximate posterior model class in VAEs biases your learning. But not everybody seems to be aware of the full intuition behind this.

## Quick recap

#### Most of you can skip this

I want to make this accessible to people with intermediate knowledge, so I'll start with a recap of what variational inference and learning are for, and how they work.

VI is useful for dealing with latent variable models. Let's assume that for each observation $x$ we assign a hidden variable $z$. Our model $p_\theta$ describes the joint distribution between $x$ and $z$. In such a model, typically:

- $p_\theta(z)$ is very easy 🐣,
- $p_\theta(x\vert z)$ is easy 🐹,
- $p_\theta(x, z)$ is easy 🐨,
- $p_\theta(x)$ is super-hard 🐍,
- $p_\theta(z\vert x)$ is mega-hard 🐲

to evaluate. Unfortunately, in machine learning the things we want to calculate are exactly the bad guys, 🐍 and 🐲:

- *inference* is evaluating $p_\theta(z\vert x)$ (🐲),
- *learning* (via maximum likelihood) involves $p_\theta(x)$ (🐍).

Variational lower bounds give us ways to approximately perform both inference and maximum likelihood parameter learning, by approximating the posterior 🐲 with a simpler, tamer distribution, $q_\psi(z\vert x)$ (🐰), called the approximate posterior or recognition model. Variational inference and learning involve maximising the evidence lower bound (ELBO):

$$ \operatorname{ELBO}(\theta, \psi) = \sum_n \left( \log p_\theta(x_n) - \operatorname{KL}\left[q_\psi(z\vert x_n)\| p_\theta(z\vert x_n)\right] \right) $$

or

$$ 💪 = \sum_n \left( \log 🐍 - \operatorname{KL}[🐰\|🐲] \right) $$

This expression is still full of 🐍s and 🐲s, but the nice thing about it is that it can be written in more convenient forms which only contain the good guys 🐣, 🐹, 🐨 and 🐰:

\begin{align} 💪 &= - \sum_n \mathbb{E}_🐰 \log\frac{🐰}{🐨} \\ &= \sum_n \mathbb{E}_🐰 \log 🐹 - \sum_n \operatorname{KL}[🐰\|🐣] \end{align}

The first line is what I refer to as the *joint contrastive* expression, the second I call *prior-contrastive*. See more on this in this series. Both expressions only contain nice, tame distributions and do not need explicit evaluation of either the marginal likelihood 🐍 or the posterior 🐲.
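For readers who want the missing step, here is a short derivation for a single data point $x$, written in plain notation rather than emojis. It only uses the factorisations $p_\theta(x)\,p_\theta(z\vert x) = p_\theta(x, z) = p_\theta(x\vert z)\,p_\theta(z)$ and the fact that $\log p_\theta(x)$ does not depend on $z$:

$$
\begin{aligned}
\log p_\theta(x) - \operatorname{KL}\left[q_\psi(z\vert x)\|p_\theta(z\vert x)\right]
&= \mathbb{E}_{q_\psi(z\vert x)}\left[\log p_\theta(x) - \log\frac{q_\psi(z\vert x)}{p_\theta(z\vert x)}\right] \\
&= -\mathbb{E}_{q_\psi(z\vert x)} \log\frac{q_\psi(z\vert x)}{p_\theta(x, z)} \\
&= \mathbb{E}_{q_\psi(z\vert x)} \log p_\theta(x\vert z) - \operatorname{KL}\left[q_\psi(z\vert x)\|p_\theta(z)\right]
\end{aligned}
$$

Summing over $n$ gives the two forms of the bound for the whole dataset.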

The ELBO is, as the name suggests, a lower bound on the model evidence (the log marginal likelihood). Therefore, maximising it with respect to $\theta$ and $\psi$ approximates maximum likelihood learning, while you can use the recognition model 🐰 instead of the intractable posterior 🐲 to perform tractable approximate inference.
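If you prefer to check this numerically, here is a minimal sketch on a toy model with a discrete latent variable, where everything (including the "hard" quantities) can be computed exactly. The model sizes, the random parameters and the helper `kl` are my own illustrative assumptions, not part of any particular VAE implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model (assumed for illustration): K latent states, V observable symbols.
K, V = 4, 6
prior = rng.dirichlet(np.ones(K))          # p(z), shape (K,)
lik = rng.dirichlet(np.ones(V), size=K)    # p(x|z), shape (K, V)
x = 2                                      # a single observed symbol

joint = prior * lik[:, x]                  # p(x, z) for every z
evidence = joint.sum()                     # p(x): tractable only because z is tiny
posterior = joint / evidence               # p(z|x)

q = rng.dirichlet(np.ones(K))              # an arbitrary recognition distribution q(z|x)


def kl(a, b):
    """KL divergence between two discrete distributions with full support."""
    return float(np.sum(a * (np.log(a) - np.log(b))))


# Definition: log evidence minus KL to the true posterior.
elbo_def = np.log(evidence) - kl(q, posterior)
# Joint-contrastive form: only needs q and the joint.
elbo_joint = -float(np.sum(q * (np.log(q) - np.log(joint))))
# Prior-contrastive form: expected log-likelihood minus KL to the prior.
elbo_prior = float(np.sum(q * np.log(lik[:, x]))) - kl(q, prior)

print(elbo_def, elbo_joint, elbo_prior, np.log(evidence))
```

All three ELBO values agree, and they sit below the log evidence. Setting `q = posterior` closes the gap, which is exactly the sense in which the KL term measures how far the recognition model is from the true posterior.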

## What is the effect of the approximate class?

The evidence lower bound is a function of both the model parameters $\theta$ and the recognition model parameters $\psi$. To reason about how variational learning behaves, we can maximise it point-wise with respect to $\psi$, for each $\theta$, to obtain:

$$
\operatorname{ELBO}(\theta) = \underbrace{\sum_n \log p_\theta(x_n)}_{\text{marginal likelihood}} + \underbrace{\left(- \min_{\psi} \sum_n \operatorname{KL}[q_\psi(z\vert x_n)\|p_\theta(z\vert x_n)]\right)}_{\text{tameness of posterior}}
$$

Or, with my favourite emojis:

$$ 💪(\theta) = \underbrace{\sum_n \log 🐍}_{\text{marginal likelihood}} + \underbrace{\left(- \min_{🐰\in\mathcal{Q}} \sum_n \operatorname{KL}[🐰\|🐲]\right)}_{\text{tameness of posterior}} $$

The first term is the marginal likelihood, which is what you want to maximise when doing maximum likelihood. The second term, $-\min_{🐰\in\mathcal{Q}}\sum_n\operatorname{KL}[🐰\|🐲]$, is the *tameness* of the posterior summed over the data: it measures how well 🐲 can be approximated by a 🐰. This second term can be seen as a regulariser which depends on the data, as well as on the family $\mathcal{Q}$ from which we can choose our approximate posterior.

### Is this a bug, or a feature?

You can think of this additional regulariser as either a bug or a feature. Compared with maximum likelihood, the second term biases learning towards models with simpler posteriors. If $q_\psi$ is a factorised Gaussian, then it biases us towards models in which the posterior is approximately factorised and Gaussian. If our recognition model provides a point estimate or a very narrow Gaussian, it biases us towards models with high mutual information between latents and observations (assuming our prior $p_\theta(z)$ has fixed entropy). The former is hard to justify in itself; the latter might actually be a desirable property.
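To make the factorised Gaussian case concrete, here is a small sketch under an assumption that is mine, not part of the argument above: a linear-Gaussian model with $z \sim \mathcal{N}(0, I)$ and $x\vert z \sim \mathcal{N}(Wz, \sigma^2 I)$. In this model the posterior is Gaussian with precision $\Lambda = I + W^\top W/\sigma^2$ for every $x$, and the best factorised Gaussian has variances $1/\Lambda_{ii}$, so the tameness penalty per data point is $\frac{1}{2}\left(\sum_i \log \Lambda_{ii} - \log\det\Lambda\right)$. The function names are hypothetical helpers.

```python
import numpy as np


def log_marginal(X, W, sigma2):
    """Sum over rows of X of log N(x | 0, W W^T + sigma2 * I)."""
    n, d = X.shape
    C = W @ W.T + sigma2 * np.eye(d)
    _, logdet = np.linalg.slogdet(C)
    quad = np.einsum("ni,ij,nj->", X, np.linalg.inv(C), X)
    return -0.5 * (n * (d * np.log(2 * np.pi) + logdet) + quad)


def tameness_penalty(W, sigma2):
    """min over factorised Gaussians q of KL[q || p(z|x)], per data point."""
    k = W.shape[1]
    Lam = np.eye(k) + W.T @ W / sigma2      # posterior precision, the same for every x
    _, logdet = np.linalg.slogdet(Lam)
    return 0.5 * (np.sum(np.log(np.diag(Lam))) - logdet)


rng = np.random.default_rng(0)
sigma2 = 0.1
W_orth = np.array([[2.0, 0.0], [0.0, 1.0], [0.0, 0.0]])   # orthogonal columns
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta), np.cos(theta)]])
W_rot = W_orth @ R       # same W W^T, hence exactly the same marginal likelihood

# Synthetic data drawn from the model itself (illustrative only).
X = rng.normal(size=(500, 2)) @ W_orth.T + np.sqrt(sigma2) * rng.normal(size=(500, 3))

for name, W in [("orthogonal", W_orth), ("rotated", W_rot)]:
    ll = log_marginal(X, W, sigma2)
    pen = X.shape[0] * tameness_penalty(W, sigma2)
    print(f"{name:10s}  log-lik={ll:.2f}  penalty={pen:.2f}  ELBO={ll - pen:.2f}")
```

Maximum likelihood cannot tell the two weight matrices apart, because rotating the latent space leaves $WW^\top$ unchanged. The ELBO with a factorised Gaussian recognition model can: the penalty is zero when the columns of $W$ are orthogonal (the posterior is then exactly factorised) and positive for the rotated version.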

#### When it's a bug: meaningless posteriors chosen for convenience

The choice of variational posteriors is often dictated by analytical convenience: the standard choice is a factorised Gaussian. But it's usually pretty hard to motivate why we want the latent variables to end up **conditionally** independent given observations, or why they would end up conditionally Gaussian. So the tameness regulariser is usually seen as a hindrance, something we'd like to get rid of, or at least decrease the contribution of. One way to do this is to relax the limitations on the posterior and add more flexibility. This motivated research on normalizing flows (Rezende and Mohamed, 2015), inverse autoregressive flows (Kingma et al., 2016), hybrid MCMC-variational methods (Salimans et al., 2014), operator variational inference (Ranganath et al., 2016) and adversarial variational Bayes (see this series of posts and references therein). You can understand these techniques as aiming to decrease the relative contribution of the tameness regulariser by relaxing the restrictions on the recognition model class. In other words, instead of a simple 🐰 they use a more flexible recognition model to approximate 🐲.

#### When it's a feature: meaningful posteriors

In some cases it may be more intuitive to think about your posterior than to think about the forward model or generative process. Take, for example, a latent variable image model in which non-negative integer hidden variables are meant to describe the number of objects of certain types in the image. It makes sense to reason about what you want your approximate posterior to do in this case. First, you want to have a guess about the total number of objects in the image. Then, conditioned on this total, you want some distribution over the counts for each class that keeps the sum constant, such as a multinomial. So it makes sense to choose your posterior to be a mixture of multinomials (e.g. a negative multinomial distribution, but probably with a less restrictive mixing distribution). This introduces some non-trivial conditional dependence between the variables, but these dependences are the right kind, the kind you actually expect the posterior to possess.
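As a rough sketch of what sampling from such a recognition model could look like, the snippet below draws the total from a negative binomial and then splits it across classes with a multinomial. The function name and all parameter values are hypothetical; in an actual recognition model, `r`, `p` and `class_probs` would be computed from the image $x$, for example by a neural network.

```python
import numpy as np


def sample_object_counts(rng, r, p, class_probs, num_samples):
    """Draw count vectors from a mixture of multinomials (illustrative helper).

    (r, p) parametrise a negative binomial over the total number of objects;
    class_probs are the multinomial probabilities of each object class.
    """
    # Step 1: guess the total number of objects in the image.
    totals = rng.negative_binomial(r, p, size=num_samples)
    # Step 2: distribute each total across classes, keeping the sum fixed.
    counts = np.stack([rng.multinomial(n, class_probs) for n in totals])
    return totals, counts


rng = np.random.default_rng(0)
totals, counts = sample_object_counts(
    rng, r=3.0, p=0.4, class_probs=[0.5, 0.3, 0.2], num_samples=5
)
print(totals)   # total number of objects per sample
print(counts)   # per-class counts; each row sums to the corresponding total
```

The per-class counts are dependent given $x$ because they share the total, which is precisely the kind of dependence a factorised posterior cannot express. Swapping the negative binomial for a more flexible distribution over totals gives the less restrictive mixing distribution mentioned above.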

If you choose your approximate posterior so that it makes sense to you, the tameness regulariser is actually your friend. You can think of it as a way to enforce your prior preferences over models: if you think posterior inference in the real world should work a particular way, then the ELBO will respect that preference and favour models which work that way.

### Bayesian perspective: the posterior tameness prior

From a Bayesian perspective, you can also think of maximising the ELBO as performing *maximum a posteriori* (MAP) model selection with a *data-dependent* prior over model parameters:

$$
p_{\mathcal{Q},\mathcal{D}}(\theta) \propto \exp\left(-\min_{q\in \mathcal{Q}} \sum_n \operatorname{KL}[q(z\vert x_n)\|p_\theta(z\vert x_n)]\right),
$$

where $\mathcal{Q}$ is the recognition model class and $\mathcal{D} = \{x_n\}$ is the dataset.
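To spell out why this gives back the ELBO, assume the normalising constant of $p_{\mathcal{Q},\mathcal{D}}$ exists and take the log of likelihood times prior:

$$
\begin{aligned}
\log p_\theta(\mathcal{D}) + \log p_{\mathcal{Q},\mathcal{D}}(\theta)
&= \sum_n \log p_\theta(x_n) - \min_{q\in \mathcal{Q}} \sum_n \operatorname{KL}[q(z\vert x_n)\|p_\theta(z\vert x_n)] + \text{constant} \\
&= \max_{q\in\mathcal{Q}} \operatorname{ELBO}(\theta, q) + \text{constant}
\end{aligned}
$$

The constant does not depend on $\theta$, so the MAP estimate under this prior is the same $\theta$ that maximises the ELBO.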

Firstly, this prior depends on the data, and in this sense many would argue it's not a truly Bayesian interpretation. But then, any time you optimise hyperparameters, you essentially end up with a data-dependent prior, so the question of data-dependent priors is a matter of taste. Secondly, $p_{\mathcal{Q},\mathcal{D}}$ may not even define a proper probability distribution over $\theta$: it may not have a finite normalisation constant when you integrate over $\theta\in\Theta$. For example, if $\mathcal{Q}$ is the set of all conditional distributions, then the minimum KL is zero for every $\theta$ and the prior is uniform, which is only proper if the parameter space $\Theta$ has finite volume.

But if you accept this $p_{\mathcal{Q},\mathcal{D}}$ as your prior over models, then it gives you a way to think about designing your recognition model.

You should think about choosing the recognition model family $\mathcal{Q}$ the same way you'd think about choosing a hyperprior in a hierarchical Bayesian model.

Often, people try to make $\mathcal{Q}$ as flexible as possible, which moves the prior towards a uniform one, somewhat resembling *objective Bayes*. But you can also be a *subjective Bayesian* and use this intuition to restrict $\mathcal{Q}$ in a meaningful way that captures your prior preferences. A nice example of this would be to use probabilistic programs or parametric heuristics as recognition models.