July 16, 2015

Exchangeable Models via Recurrent Neural Networks?

This is a dump of my thoughts, it's an idea I had for a while and I thought I'd open it up for feedback and discussion.

Recurrent Neural Networks (RNNs) provide a rich framework for defining autoregressive processes: processes that are defined in terms of one-step-ahead predictive probabilities $p(x_n\vert x_{1:n-1})$. While RNNs are mostly used to model non-exchangeable data such as text, I started wondering if they could be useful in modeling exchangeable data for data-efficient unsupervised and supervised learning.

There are two ideas/concepts in this post, neither of them close to being solved. I didn't really have the time to do work on either of them any further, so I'm open to feedback from the community in case anyone things there's something in there:

1. Q1: Can we design RNNs or other autoregressive neural networks that produce exchangeable data?
2. Q2: Is there a de Finetti theorem for supervised learning?
3. putting these together, can we design an RNN for general exact Bayesian discriminative learning?

Exchangeability and Bayesian computation

Exchangeability is a core concept in Bayesian statistics and probabilistic unsupervised learning. A stochastic process (infinite sequence of random variables) is said to be exchangeable if for all $n$ and all permutation $\pi$:

$$P(x_1,\ldots,x_n) = p(x_{\pi(1)},\ldots,x_{\pi(n)})$$

That is, the order in which datapoints $x_i$ appear does not matter. Exchangeability is a generalisation of the independent and identically distributed (i.i.d.) approach and is intimately related to Bayesian analysis.

De Finetti's theorem says that if a process is exchangeable then it can be equivalently defined as a potentially infinite mixture of i.i.d. random processes, such as this:

$$P(x_1,\ldots,x_n) = \int p(\theta) \prod_{i=1}^n p(x_i\vert\theta) d\theta.$$

Here $\theta$ is som (potentially infinite dimensional) parameter vector (excuse the sloppy notation that doesn't make sense for nonparametric cases), conditioned on which the data is i.i.d. The connection to Bayesian statistics is this: if the data is exchangeable, then it makes sense to do Bayesian inference over i.i.d. models with some prior over the parameters, at least we know that there is at least one Bayesian model like this that is exactly correct.

Predictive characterisation of exchangeable processes

How does this relate to autoregressive processes? Well, you can characterise exchangeability in terms of one-step-ahead predictive distributions, as described in great detail and watertight notation in (Fortini and Petrone, 2012). Below is the gist of it:

Stochastic processes can be defined in terms of predictive distributions $p(x_{n+1}\vert x_{1:n})$ - I also like to call these autoregressive distributions. The de Finetti theorem can be equivalently formulated in terms of these autoregressive distributions. We know that exchangeability holds whenever the process satisfies:

1. $p(x_{n+1}\vert x_1,\ldots,x_n) = p(x_{n+1}\vert x_{\pi(1)},\ldots,x_{\pi(n)})$ for all $n$ and all permutations $\pi$
2. $P(X_{n+1} = x_{n+1},X_{n+2} = x_{n+2}\vert x_1,\ldots,x_n) = P(X_{n+1} = x_{n+2},X_{n+2} = x_{n+1}\vert x_1,\ldots,x_n)$ for all $n$.

If these conditions hold, then the data this process generates is exchangeable, and the de Finetti theorem can be written as:

$$p(x_{n+1}\vert x_{1:n}) = \int p(x_{n+1}\vert \theta) p(\theta \vert x_{1:n}) d\theta$$

Perhaps the connection to Bayesian inference is even clearer in this formulation, as essentially we can say that the predictive distribution can be written as an integral over the posterior distribution $p(\theta \vert x_{1:n})$.

What this is saying is that we can construct an arbitrary predictive "autoregressive" process, and as long as it satisfies the two conditions above, we can say that in fact we do exact inference in some Bayesian (nonparametric) model. We don't know what that Bayesian model is, but we know there is one Bayesian model whose predictive equations would be exactly the same as our autoregressive distributions.

For example, the Chinese Restaurant Process is often described in terms of a sequence of predictive distributions like this. We know that because it satisfies exchangeability, we can interpret the underlying computation as exact Bayesian averaging over i.i.d. models. Now, we happen to know that the underlying distribution is the Dirichlet process, but if we didn't know that, we'd still be happy using the CRP because it's exchangeable. See also the paper on the Indian Buffet Process for another example of how this reasoning works in practice.

Hypothesis 1: we can use RNNs to define exchangeable processes

Various flavours of recurrent neural networks can be used to define sequences of predictive probabilities like above, ones that are easy to compute. In fact, they can define really complicated autoregressive processes capable of generating text and stuff. My hypothesis is, if we could somehow force RNNs to produce exchangeable data, we could use them to model i.i.d. data, and rest assured that we're doing Bayesian inference under the hood.

Question 1: can we constrain the structure of RNNs in any way that would ensure both conditions above are met. If we could, we would have a rich family of Bayesian nonparametric models - defined implicitly - in which exact prediction is super easy, and scales (hopefully) linearly with data. Moreover, fitting parameters via standard backprop through time would now amount to maximum marginal likelihood.

Question 2: If we can't technically enforce exchangeability, can we add a term to the objective function to make the RNN approximately exchangeable? For example: train it on minibatches but with the added objective function that enforces that reading the minibatch backwards should give the same marginal likelihood as reading it in the original order.

Observation: If we had an exchangeable RNN, we could train it on multiple unsupervised learning problems over the same input space. Such system actually learns to learn. If you want to use it on a new dataset, you just feed it into the RNN, and it will give you Bayesian predictive probabilities without any additional computation. So it would be an ultimate general inference machine™.

Problem: If we want to model complex data, this approach requires an RNN that can define normalised probabilities over that complex space, which is not going to happen. So this is only useful in simple input spaces. Combining this with the next hypothesis might make the idea more useful.

Hypothesis 2: we can extend the predictive characterisation of exchangeability and the de Finetti theorem to supervised learning

Exchangeability deals with unsupervised learning, but perhaps we can come up with equivalent theory for supervised learning. In supervised learning we are given two sequences $x_{1:n}$ and $y_{1:n}$, and any model or learning algorithm is technically defined as a sequence of conditional probabilities $p(y_{1:n}\vert x_{1:n})$. Equivalently, we can define the same thing in terms of predictive probabilities $p(y_{n+1}\vert x_{n+1},x_{1:n},y_{1:n})$. The question is, can we come up with a concept of exchangeability for these "conditional random processes" which would ensure that we can represent the process as a mixture of conditionally i.i.d. models, e.g:

$$p(y_{1:n}\vert x_{1:n}) = \int p(\theta) \prod_{i=1}^{n} p(y_i\vert x_i, \theta) d\theta$$

or equivalently

$$p(y_{n+1}\vert x_{n+1},x_{1:n},y_{1:n}) = \int p(y_{n+1}\vert x_{n+1}, \theta) p(\theta \vert x_{1:n}, y_{1:n}) d\theta$$.

The main question is what those exchangeability conditions would be. The relevant example is Gaussian process regression: one can define Gaussian process regression in terms of posterior predictive distributions $p(y_{n+1}\vert x_{n+1},x_{1:n},y_{1:n})$ without a reference to the underlying nonparametric process. Ideally, you should just be able to check a few conditions, and if it holds, you could lean back and trust the fact that the predictive process carries out exact Bayesian inference over predictive models under the hood.

Here is a note I wrote up on this when Sonia Petrone visited the lab in 2010: http://inft.ly/35W7eAe

If we could extend the concept of exchangeability to the supervised learning case, and if we could construct exchangeable processes flexibly via RNNs, we would have a general purpose Bayesian supervised learning tool in which inference would be super easy.

Conclusion

Bayesian inference has a lot of advantages and it typically it performs better in small data situations. However, the class of models where exact Bayesian inference and prediction is possible is very small. If we could construct tractable exchangeable autoregressive processes via RNNs or something similar, we would suddenly have a wide class of Bayesian algorithms at our disposal where inference is trivial. So I think this is a genuinely interesting direction to look at.

So please, let me know what you think about these hypotheses and whether you think it's worth spending time working on either.