Connection between denoising and unsupervised learning

This is just a note showing the connection between learning to remove additive Gaussian noise and learning a probability distribution. I think this connection highlights why learning to denoise is, in fact, a form of unsupervised learning. This is despite the fact that the training procedure is more similar to supervised learning and is based on minimising squared loss.

I couldn't find a paper that describes this from the beginning, so I thought I'd write it down. People whose work is very relevant to this area are Harri Valpola, Pascal Vincent and colleagues.

For the sake of simplicity, let's consider denoising a one-dimensional quantity $x$. $x$ has distribution $p_x$, which is unknown to us, but we are given i.i.d. samples $x_i$ from it. Consider the problem of removing additive noise: let $\tilde{x} = x + \epsilon$ denote the noise-corrupted version of $x$, where $\epsilon \sim \mathcal{N}(0,\sigma_n^2) = p_{noise}$. Our aim is to find a denoising function $f$ that reconstructs $x$ from $\tilde{x}$ optimally under the mean squared loss:

$$MSE(f) = \mathbb{E}_{x,\tilde{x}}(x - f(\tilde{x}))^2.$$
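
As a rough illustration of this objective, the expectation can be estimated by Monte Carlo: draw samples of $x$, corrupt them, and average the squared reconstruction error. The sketch below assumes, purely for illustration, that $p_x$ is a standard normal and that $\sigma_n = 0.5$; the helper name `empirical_mse` and the two candidate denoisers are hypothetical and not part of the original note.

```python
import numpy as np

def empirical_mse(f, x, sigma_n, rng):
    """Monte Carlo estimate of MSE(f) = E[(x - f(x_tilde))^2]."""
    # Corrupt each clean sample with additive Gaussian noise of std sigma_n.
    x_tilde = x + rng.normal(0.0, sigma_n, size=x.shape)
    return np.mean((x - f(x_tilde)) ** 2)

rng = np.random.default_rng(0)
sigma_n = 0.5
# Assumption for illustration only: p_x is a standard normal.
x = rng.normal(0.0, 1.0, size=200_000)

# Two candidate denoisers: do nothing, or shrink towards the prior mean 0.
mse_identity = empirical_mse(lambda t: t, x, sigma_n, rng)
mse_shrink = empirical_mse(lambda t: 0.8 * t, x, sigma_n, rng)

print(f"identity: {mse_identity:.3f}, shrink by 0.8: {mse_shrink:.3f}")
```

Under these assumptions the identity function simply pays the noise variance $\sigma_n^2$, while shrinking towards the prior mean already does better, which hints that a good denoiser has to know something about $p_x$.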

It's interesting to think about what this function looks like in the 1D case. Here is a nice illustration from (Rasmus et al):

The optimal denoising function

We know that if $f$ can take an arbitrary shape, the optimal solution is the mean of the Bayesian posterior:

$$f^{*}(\tilde{x}) = \mathbb{E}_{x\vert \tilde{x}} [x] = \frac{\int x \cdot p_{noise}(\tilde{x}\vert x) p_x(x) dx}{\int p_{noise}(\tilde{x}\vert x) p_x(x) dx} $$

To reiterate, if we use the MSE to measure reconstruction loss, the Bayesian posterior mean $f^{*}$ is always the optimal reconstruction function. Any method that obtains a denoising function $f$ by minimising MSE is therefore trying to approximate $f^{*}$, and the MSE it achieves can never be lower than that of $f^{*}$.
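
To make the posterior mean formula concrete, here is a minimal sketch that evaluates the two integrals numerically on a grid. It assumes a Gaussian prior $p_x = \mathcal{N}(0, \sigma_x^2)$, chosen only because the posterior mean is then known in closed form, $\frac{\sigma_x^2}{\sigma_x^2 + \sigma_n^2}\tilde{x}$, which gives something to check against. The function names, the grid and the parameter values are my own assumptions.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def posterior_mean(x_tilde, prior_pdf, sigma_n, grid):
    """Approximate f*(x_tilde) by Riemann sums on a uniform grid.

    The grid spacing appears in both numerator and denominator, so it cancels.
    """
    weights = gaussian_pdf(x_tilde, grid, sigma_n) * prior_pdf(grid)
    return np.sum(grid * weights) / np.sum(weights)

# Assumed values for illustration only.
sigma_x, sigma_n = 1.0, 0.5
grid = np.linspace(-10.0, 10.0, 20001)

def prior(x):
    return gaussian_pdf(x, 0.0, sigma_x)

x_tilde = 1.2
numeric = posterior_mean(x_tilde, prior, sigma_n, grid)
closed_form = sigma_x**2 / (sigma_x**2 + sigma_n**2) * x_tilde

print(numeric, closed_form)
```

The same `posterior_mean` helper works for any prior density that can be evaluated on the grid, which is handy for checking approximations to $f^{*}$ when no closed form is available.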

Approximation for small $\sigma_n$

If the noise variance is low, we can approximate $f^{*}$ analytically by using a first-order (locally linear) Taylor expansion of the prior distribution $p_x$ around the noise-corrupted observation $\tilde{x}$:

$$p_x(x) \approx p_x(\tilde{x}) + \partial_{x}p_x(\tilde{x}) \cdot (x - \tilde{x})$$

We can substitute this approximation into the formula for $f^{*}$ and solve the integrals analytically (it's all Gaussians and linear functions, so it's easy). The final formula we obtain is:

$$f^{*}(\tilde{x}) \approx \tilde{x} + \sigma^2_{n}\cdot \frac{\partial_{x}p_x(\tilde{x})}{p_{x}(\tilde{x})} = \tilde{x} + \sigma^2_n \cdot \partial_x \log p_x(\tilde{x})$$
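
For completeness, here is a sketch of how the two integrals work out. Viewed as a function of $x$, the noise term $p_{noise}(\tilde{x}\vert x)$ is a Gaussian density with mean $\tilde{x}$ and variance $\sigma_n^2$, so under it $\mathbb{E}[x - \tilde{x}] = 0$ and $\mathbb{E}[x(x - \tilde{x})] = \mathbb{E}[(x - \tilde{x})^2] + \tilde{x}\,\mathbb{E}[x - \tilde{x}] = \sigma_n^2$. Plugging the linearised prior into the denominator and the numerator of $f^{*}$ gives:

$$\int p_{noise}(\tilde{x}\vert x)\left[p_x(\tilde{x}) + \partial_{x}p_x(\tilde{x}) \cdot (x - \tilde{x})\right] dx = p_x(\tilde{x})$$

$$\int x \cdot p_{noise}(\tilde{x}\vert x)\left[p_x(\tilde{x}) + \partial_{x}p_x(\tilde{x}) \cdot (x - \tilde{x})\right] dx = \tilde{x} \cdot p_x(\tilde{x}) + \sigma^2_n \cdot \partial_{x}p_x(\tilde{x})$$

Dividing the second expression by the first yields $\tilde{x} + \sigma^2_n \cdot \partial_{x}p_x(\tilde{x}) / p_x(\tilde{x})$.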

So the correction that the optimal denoising function applies to its input is driven by the gradient of the log density $\log p_x$ at the corrupted input $\tilde{x}$. This approximation holds whenever $p_x$ varies relatively slowly over the lengthscale of the noise distribution.
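
One way to get a feel for the quality of this approximation is to compare $\tilde{x} + \sigma_n^2 \cdot \partial_x \log p_x(\tilde{x})$ with the posterior mean computed by numerical integration. The sketch below does this for a hypothetical prior, an equal mixture of $\mathcal{N}(-2, 1)$ and $\mathcal{N}(2, 1)$, at a few noise levels. The prior, the evaluation point and all function names are assumptions made for illustration.

```python
import numpy as np

def normal_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

# Hypothetical prior: equal mixture of N(-2, 1) and N(2, 1).
def p_x(x):
    return 0.5 * normal_pdf(x, -2.0, 1.0) + 0.5 * normal_pdf(x, 2.0, 1.0)

def score(x):
    """d/dx log p_x(x), using d/dx N(x; m, 1) = -(x - m) N(x; m, 1)."""
    dp = (0.5 * -(x + 2.0) * normal_pdf(x, -2.0, 1.0)
          + 0.5 * -(x - 2.0) * normal_pdf(x, 2.0, 1.0))
    return dp / p_x(x)

def exact_posterior_mean(x_tilde, sigma_n, grid):
    # Riemann-sum version of the posterior mean; grid spacing cancels in the ratio.
    w = normal_pdf(x_tilde, grid, sigma_n) * p_x(grid)
    return np.sum(grid * w) / np.sum(w)

def approx_denoiser(x_tilde, sigma_n):
    # Small-noise approximation: input plus sigma_n^2 times the score.
    return x_tilde + sigma_n**2 * score(x_tilde)

grid = np.linspace(-12.0, 12.0, 48001)
x_tilde = 1.0
errors = {
    s: abs(exact_posterior_mean(x_tilde, s, grid) - approx_denoiser(x_tilde, s))
    for s in (0.1, 0.5, 1.0)
}
for s, err in errors.items():
    print(f"sigma_n = {s}: |exact - approx| = {err:.5f}")
```

For this particular prior the gap is tiny at $\sigma_n = 0.1$ and grows as the noise lengthscale becomes comparable to the width of the mixture components, which is the regime where the linearisation of $p_x$ stops being reasonable.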