November 15, 2018

# Online Bayesian Deep Learning in Production at Tencent

Bayesian deep learning methods often look like a theoretical curiosity, rather than a practically useful tool, and I'm personally a bit skeptical about the practical usefulness of some of the work. However, there are some situations a decent method of handling and representing residual uncertainties about model parameters might prove crucial. These applications include active learning, reinforcement learning and online/continual learning.

So as I recently read a paper by Tencent, I was surprised to learn that the online Bayesian deep learning algorithm is apparently deployed in production to power click-through-rate prediction in their ad system. I thought, therefore, that this is worth a mention. Here is the paper:

Though the paper has a few typos and sometimes inconsistent notation (like going back and forth between using $\omega$ and $w$) which can make it tricky to read, I like the paper's layout: starting from desiderata - what the system has to be able to do - arriving at an elegant solution.

## Assumed Density Filtering

The method relies on the approximate Bayesian online-learning technique often referred to as assumed density filtering.

ADF has been independently discovered by the statistics, machine learning and control communities (for citations see (Minka, 2001). It is perhaps most elegantly described by Opper, (1998), and by Minka, (2001) who also extended the idea to develop expectation propagation, a highly influential method for approximate Bayesian inference not just for online settings.

ADF can be explained as recursive algorithm repeating the following steps:

• The starting point for the algorithm at time $t$ is a "temporal prior" distribution $q_{t-1}(w)$ over parameters $w$. This $q_{t-1}(w)$ incorporates all evidence from datapoints observed in previous timesteps, and it is assumed to approximate the posterior $p(w\vert x_{1:t-1})$ conditioned on all data oberved so far. $q_{t-1}(w)$ is assumed to take some simple form, like a Gaussian distribution factorized over the elements of $w$.
• Then, some new observations $x_t$ are observed. The Bayesian way to learn from these new observations would be to update the posterior: $p_t(w|x_t) \propto p(x_t\vert w)q_{t-1}(w)$
• However, this posterior might be complicated, and may not be available in a simple form. Therefore, at this point we approximate $p_t(w|x_t)$ by a new simple distribution $q_{t}$, which is in the same family as $q_{t-1}$. This step can involve a KL divergence-based approximation, e.g. (Opper, 1998, Minka, 2001), or Laplace approximation e.g. (Kirkpatrick et al, 2017, Huszár, 2017), or a more involved inner loop such as EP in MatchBox (Stern et al, 2009) or probabilistic backpropagation (Hernandez-Lobato and Adams, 2015) as in this paper.

## Probabilistic Backpropagation

The Tencent paper is based on probabilistic backpropagation (Hernandez-Lobato and Adams, 2015), which uses the ADF idea multiple times to perform the two non-trivial tasks required in a supervised Bayesian deep network: inference and learning.

• forward propagation: In Bayesian deep learning, we maintain a distribution $q(w)$ over neural network weights, and each value $w$ defines a conditional probability $p(y\vert x, w)$. To predict the label $y$ from the input $x$, we have to average over $q(w)$, that is calculate $p(y\vert x) = \int q(w)p(y\vert x,w)dw$, which is difficult due to the nonlinearities.
Probabilistic backprop uses an ADF-like algorithm to approximate this predictive distribution. Starting from the bottom of the network, it approximates the distribution of the first hidden layer's activations given the input $x$ with a Gaussian. The first hidden layer will have a distribution of activations because we have a distribution over the weights) It then propagates that distribution to the second layer, and approximates the result with a Gaussian. This process is repeated until the distribution of $y$ given $x$ is calculated in the last step, which can be done easily if one uses a probit activation at the last layer.
• backward propagation: Forward propagation allows us to make a prediction given an input. The second task we have to be able to perform is to incorporate evidence from a new datapoint $(x_t, y_t)$ by updating the distribution over weights $q_t(w)$ to $p_t(w|x_t, y_t) \propto p(y_t\vert x_t, w) q_{t-1}(w)$. We approximate this $p_t(w|x_t, y_t)$ in an inner loop, by first running probabilistic forward propagation, then a similar ADF-like sweep backwards in the network.

I find the similarity of this algorithm to gradient calculations in neural network training beautiful: forward propagation to perform inference, forward and backward propagation for learning and parameter updates.

Crucially though, there is no gradient descent, or indeed no gradients or iterative optimization taking place here. After a single forward-backward cycle, information about the new datapoint (or minibatch of data) is - approximately - incorporated into the updated distribution $q_{t}(w)$ of network weights. Once this is done, the mini-batch can be discarded and never revisited. This makes this method a great candidate for online learning in large-scale streaming data situations.

## Parallel operation

Another advantage of this method is that it can be parallelized in a data-parallel fashion: Multiple workers can update the posterior simultaneously on different minibatches of data. Expectation-propagation provides a meaningful way of combining the parameter updates computed by parallel workers in a meaningful fashion. Indeed, this is what the PBODL algorithm of Tencent does.