# A New Favourite Machine Learning Paper: Autoencoders vs. Probabilistic Models

This is a post for machine learning nerds, so if you're not one and have no intention to become one, you'll probably not care about or understand this.

### Probabilistic interpretation of ML algorithms

My favourite theoretical machine learning papers are ones that interpret heuristic learning algorithms in a probabilistic framework, and uncover that they are in fact doing something profound and meaningful. Being trained as a Bayesian, what I mean by profound is typically statistical inference or fitting statistical models. An example would be the k-means algorithm. K-means intuitively makes sense as an algorithm for clustering. But we only really understand what it does when we make the observation that it is actually a special case of expectation-maximisation in Gaussian mixture models. This interpretation as a special case of something more general allows us to understand the expected behaviour of the algorithm better. It allows us to make predictions about the situations in which it's likely to fail, and to meaningfully extend it to situations it doesn't handle well.
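Here is a minimal numerical sketch of the k-means connection. It assumes a Gaussian mixture with equal mixing weights and a shared spherical covariance $\sigma^2 I$, and the function names `kmeans_step` and `em_step` are made up for illustration. As $\sigma^2$ shrinks, the EM responsibilities harden into nearest-centre assignments, so one EM update of the means should coincide with one k-means update.

```python
import numpy as np

def squared_distances(X, centers):
    # Pairwise squared Euclidean distances, shape (n_points, n_centers).
    return ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)

def kmeans_step(X, centers):
    # Hard-assign each point to its nearest centre, then recompute the means.
    labels = squared_distances(X, centers).argmin(axis=1)
    new_centers = centers.astype(float).copy()
    for k in range(len(centers)):
        if np.any(labels == k):  # keep the old centre if a cluster is empty
            new_centers[k] = X[labels == k].mean(axis=0)
    return new_centers

def em_step(X, centers, sigma2):
    # E-step for a GMM with equal weights and shared covariance sigma2 * I
    # (an assumption of this sketch): responsibilities are a softmax of
    # the negative squared distances scaled by 1 / (2 * sigma2).
    logits = -squared_distances(X, centers) / (2.0 * sigma2)
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    resp = np.exp(logits)
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: responsibility-weighted means.
    weights = resp.sum(axis=0)
    new_centers = centers.astype(float).copy()
    nonzero = weights > 0
    new_centers[nonzero] = (resp.T @ X)[nonzero] / weights[nonzero, None]
    return new_centers

# Two well-separated blobs and a rough initial guess for the centres.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal([-3.0, 0.0], 0.5, size=(50, 2)),
    rng.normal([3.0, 0.0], 0.5, size=(50, 2)),
])
init = np.array([[-1.0, 0.0], [1.0, 0.0]])

hard = kmeans_step(X, init)
soft_small_var = em_step(X, init, sigma2=1e-6)   # nearly hard assignments
soft_large_var = em_step(X, init, sigma2=100.0)  # very soft assignments
```

Comparing `hard` with `soft_small_var` and `soft_large_var` shows the sense in which k-means is the small-variance corner of the mixture model: the first pair agree, while the large-variance update pulls both centres towards the overall mean.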

Here are some examples of my favourite papers of this kind:

- Sam Roweis and Zoubin Ghahramani's A Unifying Review of Linear Gaussian Models is probably my all-time favourite. Now it all looks trivial, but when it was published, it was an impactful paper that connected and interpreted principal components analysis, independent components analysis, Kalman filtering, HMMs, mixture modelling, and a number of other things as special cases of the same family of models. I feel like this paper contributed a lot to the development of generative modelling and unsupervised learning as a whole. Zoubin's related Unsupervised Learning note remains a timeless resource for learning about these techniques.
- Rich Turner and Maneesh Sahani's Maximum Likelihood interpretation of slow feature analysis did something very similar. It's probably less impactful than Sam and Zoubin's paper, but has a very similar flavour: it took Slow Feature Analysis from the neural network literature, and reinterpreted it in the same linear Gaussian model framework as PCA or Kalman filtering.

Our attempt (shameless plug) to do something similar was our paper with David Duvenaud, Optimally Weighted Herding is Bayesian Quadrature, in which we reinterpreted kernel herding in the framework of Bayesian quadrature, thereby giving a probabilistic foundation to a heuristic method.

### My new favourite paper

So here is my new favourite paper of this kind: Generalized Denoising Auto-Encoders as Generative Models. Here is my intuition of what this paper, and this line of research, is saying. It is a bit different from the actual results, and the math and the analogy are not watertight, but I still think this is a good way to explain it:

Denoising autoencoders (DAEs) work by training a deep neural network to reconstruct individual datapoints $x_i$ from noise-corrupted versions $z_i$. They typically have a low-dimensional middle or 'bottleneck' layer, which forces the undercomplete representation in this layer to capture much of the signal and regularity in the data, while ignoring the irrelevant dimensions and noise. DAEs are successful because they turn the hard problem of unsupervised learning (explain the data distribution $p(x)$) into a supervised problem (predict $x$ from $z$), which we can solve more easily with today's methods.
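To make the "predict $x$ from $z$" reformulation concrete, here is a toy sketch in NumPy. It is not the paper's setup. A linear encoder and decoder with a one-dimensional bottleneck stand in for the deep network, the corruption is additive Gaussian noise, and the loss is squared reconstruction error. These choices, the hyperparameters and the name `train_linear_dae` are all assumptions made for illustration.

```python
import numpy as np

def train_linear_dae(X, hidden_dim=1, noise_std=0.3, lr=0.01, steps=500, seed=0):
    """Toy linear denoising autoencoder trained by full-batch gradient descent."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W_enc = 0.1 * rng.standard_normal((d, hidden_dim))  # encoder: z -> bottleneck
    W_dec = 0.1 * rng.standard_normal((hidden_dim, d))  # decoder: bottleneck -> x_hat
    losses = []
    for _ in range(steps):
        # Corrupt the clean data with fresh noise: z_i = x_i + eps_i.
        Z = X + noise_std * rng.standard_normal(X.shape)
        H = Z @ W_enc          # bottleneck representation
        X_hat = H @ W_dec      # reconstruction of the *clean* x from noisy z
        err = X_hat - X
        losses.append(float((err ** 2).sum(axis=1).mean()))
        # Gradients of the mean squared reconstruction error.
        grad_out = 2.0 * err / n
        grad_dec = H.T @ grad_out
        grad_enc = Z.T @ (grad_out @ W_dec.T)
        W_dec -= lr * grad_dec
        W_enc -= lr * grad_enc
    return W_enc, W_dec, losses

# Clean data lying on a one-dimensional line in 2-D, so a 1-D bottleneck suffices.
data_rng = np.random.default_rng(1)
t = data_rng.standard_normal((200, 1))
X_line = t @ np.array([[1.0, 2.0]])

W_enc, W_dec, losses = train_linear_dae(X_line)
```

The supervised targets are the clean points `X_line` and the inputs are their corrupted copies, which is all the "supervision" a DAE needs. The `losses` list can be inspected to see the reconstruction error fall during training.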

To me, autoencoders always seemed like a hack and I didn't really have a good way to think about them. But it turns out we can understand denoising autoencoders as a special case of pseudo-likelihood learning (see e.g. this or this). The twist is this: instead of thinking about fitting a probabilistic model $p(x ; \theta)$ to data $x$, you learn a joint probability distribution $p(x,z ; \theta)$ of the data $x$ and its noise-corrupted version $z$. The noise corruption is artificially introduced by us, following a corruption distribution $p_{noise}$. The point is, if you learn the joint model $p(x,z ; \theta)$, that also implies a generative model $p(x ; \theta) = \int p(x,z ; \theta)\ dz$.

To fit the joint model to observations $(x_i, z_i)$, we are going to use score matching with the following pseudolikelihood scoring rule as objective function:

$$ \ell(\theta) = \sum_i S((x_i,z_i) , \theta) = - \sum_i \log(p(x_i\vert z_i ; \theta)) - \sum_i \log(p(z_i\vert x_i ; \theta)), $$

where $p(x\vert z; \theta)$ and $p(z\vert x; \theta)$ are the conditionals corresponding to the joint model $p(x,z ; \theta)$.

We know that pseudolikelihood is a strictly proper scoring rule, therefore if we minimise the loss function $\ell$ we get a consistent estimate of $\theta$ (again refer to this or this and references therein). So far we are not doing anything dodgy: it's all fitting statistical models to data.
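The propriety claim can be sanity-checked numerically on a small discrete example. The sketch below uses the population version of the objective (an expectation under a "true" joint table `P` instead of a sum over samples) and checks that no randomly drawn candidate joint `Q` achieves a lower expected pseudolikelihood score than `P` itself. The table sizes, the random candidates and the name `expected_pseudolikelihood` are assumptions of this illustration, and a random search is of course not a proof.

```python
import numpy as np

def expected_pseudolikelihood(P, Q):
    """Expected score -E_P[log Q(x|z) + log Q(z|x)] for joint tables indexed [x, z].

    Both tables are assumed strictly positive and normalised to sum to one.
    """
    q_x_given_z = Q / Q.sum(axis=0, keepdims=True)  # each column sums to one
    q_z_given_x = Q / Q.sum(axis=1, keepdims=True)  # each row sums to one
    return float(-(P * (np.log(q_x_given_z) + np.log(q_z_given_x))).sum())

rng = np.random.default_rng(0)

# A random "true" joint over 2 values of x and 3 values of z.
P = rng.dirichlet(np.ones(6)).reshape(2, 3)
best = expected_pseudolikelihood(P, P)

# Score gap of 1000 random candidate joints relative to the truth.
gaps = [
    expected_pseudolikelihood(P, rng.dirichlet(np.ones(6)).reshape(2, 3)) - best
    for _ in range(1000)
]
smallest_gap = min(gaps)
```

The gap decomposes into a sum of KL divergences between the true and candidate conditionals, which is why `smallest_gap` should never be negative.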

Now, we in fact know that the true distribution of $z \vert x$ is $p_{noise}$, so we can restrict our model class so that $p(z \vert x ; \theta) = p_{noise}(z \vert x)$. This way, the second term in the objective function becomes a constant, and what we are left with is exactly the denoising autoencoder training rule. BOOM. We derived autoencoders from score matching, which is a consistent way to estimate statistical models.
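Spelled out, substituting the restriction $p(z \vert x ; \theta) = p_{noise}(z \vert x)$ into the objective gives

$$
\ell(\theta) = - \sum_i \log p(x_i\vert z_i ; \theta) - \underbrace{\sum_i \log p_{noise}(z_i\vert x_i)}_{\text{independent of } \theta},
$$

so that

$$
\arg\min_\theta \ell(\theta) = \arg\min_\theta \left[ - \sum_i \log p(x_i\vert z_i ; \theta) \right].
$$

As an illustration only, suppose the reconstruction distribution is Gaussian, $p(x\vert z;\theta) = \mathcal{N}(x ; f_\theta(z), \sigma^2 I)$, where $f_\theta$ denotes the network output and $\sigma^2$ a fixed variance. Both symbols are assumptions introduced here, not part of the argument above. Then the remaining term is the familiar squared reconstruction error:

$$
- \sum_i \log p(x_i\vert z_i ; \theta) = \frac{1}{2\sigma^2} \sum_i \lVert x_i - f_\theta(z_i) \rVert^2 + \text{const}.
$$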

This is all very hand-wavy, but here is what it means: denoising autoencoder training can be interpreted as pseudolikelihood learning under the following condition. If the autoencoder neural network represents $p_{DAE}(x\vert z;\theta)$, and we used $p_{noise}$ to generate the noise, there must exist a joint probability distribution $p(x,z ; \theta)$ such that

$$
\frac{p(x,z ; \theta)}{\int p(x,z ; \theta)\ dx} = p_{DAE}(x\vert z; \theta),
$$

$$
\frac{p(x,z ; \theta)}{\int p(x,z ; \theta)\ dz} = p_{noise}(z\vert x).
$$
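For intuition, the two conditions are easy to check in a small discrete setting, where the integrals become sums over table rows and columns. The sketch below builds a joint from a made-up marginal `p_x` and a made-up corruption table `p_noise`, then reads off both conditionals. The numbers and the function names `joint_from_noise` and `conditionals` are hypothetical, chosen only to illustrate the compatibility requirement.

```python
import numpy as np

def joint_from_noise(p_x, p_noise):
    # p_x[x] is a marginal over x; p_noise[x, z] = p_noise(z | x), rows sum to one.
    return p_x[:, None] * p_noise

def conditionals(joint):
    # joint[x, z]; dividing by the column sums gives p(x | z),
    # dividing by the row sums gives p(z | x).
    p_x_given_z = joint / joint.sum(axis=0, keepdims=True)
    p_z_given_x = joint / joint.sum(axis=1, keepdims=True)
    return p_x_given_z, p_z_given_x

# Hypothetical data marginal over three states and a corruption process
# that keeps the state with probability 0.8.
p_x = np.array([0.7, 0.2, 0.1])
p_noise = np.array([
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
])

joint = joint_from_noise(p_x, p_noise)
p_dae, p_z_given_x = conditionals(joint)

# The generative model implied by the joint is its marginal over z.
p_x_implied = joint.sum(axis=1)
```

Here `p_dae` plays the role of the ideal denoising conditional: it is the unique $p(x \vert z)$ compatible with this joint, `p_z_given_x` recovers the corruption table, and `p_x_implied` is the generative model the joint carries with it.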

The importance of this observation is that machine learning researchers who are more comfortable with probabilistic models and score matching (e.g. maximum likelihood) may now be less likely to dismiss deep autoencoders as merely a hack.

In full honesty, a part of me still thinks of deep learning as a practical hack. It's an 'easy' way to throw computational power at the problem instead of modelling the priors/known invariances more explicitly, e.g. via graphical models. But even so, I'm now more tempted to look at unsupervised learning with autoencoders, not just as a practically useful hack, but as a clever training method. I now have a better handle on what it does, and how it fits into the framework I like to think in. This will easily make it to my list of favourite papers.