March 9, 2017

# The Spherical Kernel Divergence

This post is about an interesting kernel-based divergence measure I found when I wrote up my thesis on scoring rules and divergences in machine learning. It's been largely forgotten, I'm not aware of this being discussed elsewhere. I thought I'll put it out here in case people find it useful.

# Background

#### Scoring rules and divergences

Scoring rules are best described in terms of forecasting: Let's assume we want to forecast some - so far unobserved - quantity $x$. We can summarise our forecast as a probability distribution $Q$ over possible outcomes. A scoring rule $S(x,Q)$ assigns a penalty to our probabilistic forecast $Q$ in the light of the actual observation $x$.

A common scoring rule in machine learning is the log loss: $S_{\log}(x,Q) = - \log Q(x)$. When one does maximum likelihood training of $Q$, that boils down to minimising the expected score $\operatorname{argmin}_\theta \mathbb{E}_{x\sim P}S_{\log}(x,Q_\theta)$. But the logarithmic is just one of many useful examples of scoring rules.

Based on a scoring rule $S$ one can define a divergence, which, under some conditions, is a Bregman divergence:

$$d_S(P,Q) = \mathbb{E}_{x\sim P}S(x,Q) - \mathbb{E}_{x\sim P}S(x,P)$$

The scoring rule is called strictly proper, and the corresponding divergence a Bregman divergence, if and only if the following holds:

$$d(P,Q)=0 \iff P=Q$$

So strictly proper scoring rules are useful, because minimising the average score over some dataset allows us to consistently recover the true data distribution $P$, as long as our model class $Q$ contains $P$.

For the logarithmic score, the corresponding divergence is the KL-divergence

$$\operatorname{KL}[P|Q] = \mathbb{E}_{x\sim P}\log\frac{P(x)}{Q(x)}$$

There are many other useful scoring rules, and corresponding divergences, for a review see (Gneiting & Raftery, 2007) or chapter 1 of my thesis.

Finally, here is a lighthearted, graphical illustration of Bregman divergences, in terms of the entropy function $\mathbb{H}_S[Q] = \mathbb{E}_{x\sim Q}S(x,Q)$: ##### Maximum mean discrepancy and the kernel scoring rule

I'm going to show another divergence, maximum mean discrepancy (MMD), and the associated kernel scoring rule. MMD is used in two-sample testing, and more recently in Generative Moment Matching Networks (Li et al, (2016), Dziugaite et al. (2016)) for generative modelling.

\begin{align}
\operatorname{MMD}^2(x,y) &= \sup_{|f|_{\mathcal{H}_k}=1} \left( \mathbb{E}_{x\sim P}f(x) - \mathbb{E}_{x\sim Q}f(x) \right)^2\\
&= |\mu_P - \mu_Q|_{\mathcal{H}_k}^2\\
&= \mathbb{E}_{x,y\sim P} k(x,y) + \mathbb{E}_{x,y\sim Q} k(x,y) - 2 \mathbb{E}_{x\sim P,y\sim Q} k(x,y)
\end{align}

The first line explains why this divergence is called Maximum Mean Discrepancy: it finds the worst-case function $f$ (from the unit ball of the Hilbert-space, $|f|_{\mathcal{H}_k}=1$) whose average value under $P$ is most different from its average under $Q$. In this sense, $f$ plays a similar role to the adversarial discriminator in GANs: it's trying to characterise the differences between $P$ and $Q$.

The second line is the definition in terms of the so called mean embeddings: the functions $\mu_P = \mathbb{E}_{x\sim P}k(\cdot,x)$ and $\mu_Q = \mathbb{E}_{x\sim Q}k(\cdot,x)$ in the Hilbert space which uniquely characterise $P$ and $Q$, respectively.

The third line expresses MMD in terms of the kernel function $k$, and this is the formula we use when we actually want to compute this divergence.

The squared MMD is in fact a Bregman divergernce which corresponds to the kernel scoring rule, defined as follows:

\begin{align}
S_k(x,Q) &= k(x,x) - 2\mathbb{E}_{y\sim Q}k(x,y) + \mathbb{E}_{y,z\sim Q}k(y,z)\\
&= \sup_{|f|_{\mathcal{H}_k}=1} (f(x) - \mathbb{E}_{y\sim Q} f(y))^2\\
&= |k(\cdot,x) - \mu_Q|_{\mathcal{H}_k}^2\\
&= MMD^2(\delta_x, Q)
\end{align}

What is interesting about this scoring rule is that you can view it as the kernel generalisation of the ancient Brier scoring rule (Brier, 1950. Gneiting & Raftery, 2017) which measures $L_2$ norm rather than RKHS-norm:

\begin{align}
S_{Brier}(x,Q)&= |\delta_x - Q |_2^2\\
&= |Q|_2^2 - 2Q(x) + 1
\end{align}

Where $|Q|_2$ is the $L_2$-norm of $Q$ defined as $|Q|_2 = \sqrt{\mathbb{E}_{y\sim Q}Q(y)}$. The Brier score can be viewed as the kernel score with an extremely narrow kernel: $k(x,y)\rightarrow \delta(x-y)$.

# The spherical kernel score

And this brings us to the actual meat in this post. The Brier score has a younger sister, called the spherical score (Good, 1971.Gneiting & Raftery, 2017):

\begin{align}
S_{spherical}(x,Q) &= 1 - \frac{Q(x)}{|Q|_2} \\
&= 1 - \frac{Q(x)}{\sqrt{\mathbb{E}_{y\sim Q}Q(y)}}
\end{align}

The question arises, if the Brier/quadratic divergence has a kernel analog in MMD, can we also construct a kernel analog to the spherical score and its divergence? The answer is yes, and this is what I call the spherical kernel score.

\begin{align}
S_{spherical,k}(x,Q) &= 1 - \frac{\mu_Q(x)}{|\mu_Q|_{\mathcal{H}_k}|\mu_{\delta_x}|_{\mathcal{H}_k}} \\
&= 1 - \frac{\mathbb{E}_{y\sim Q}k(x,y)}{\sqrt{k(x,x)\mathbb{E}_{y,z\sim Q}k(y,z)}}
\end{align}

And the associated divergence, which I call spherical kernel divergence (SKD, $d_{spherical,k}$) becomes:

\begin{align}
d_{spherical,k}(P,Q) &= 1 - \frac{\langle\mu_P,\mu_Q\rangle_{\mathcal{H}_k}}{|\mu_Q|_{\mathcal{H}_k}|\mu_P|_{\mathcal{H}_k}}\\
&= 1 - \cos(\mu_P,\mu_Q)\\
& = 1 - \frac{\mathbb{E}_{x\sim P,y\sim Q}k(x,y)}{\sqrt{\mathbb{E}_{x,y\sim P}k(x,y)\mathbb{E}_{x,y\sim Q}k(x,y)}},
\end{align}

where $\langle\mu_P,\mu_Q\rangle_{\mathcal{H}_k}$ denotes inner product between the mean embeddings and $\cos(\mu_P,\mu_Q)$ is defined as

$$\cos(\mu_P,\mu_Q) = \frac{\langle\mu_P,\mu_Q\rangle_{\mathcal{H}_k}}{|\mu_P|_{\mathcal{H}_k}|\mu_Q|_{\mathcal{H}_k}}.$$

While MMD measures the analog of Euclidean distance between mean embeddings, the spherical kernel divergence measures the analog of cosine dissimilarity between them. Just as with MMD and the Brier divergence, the spherical kernel score reduces to the spherical score in the limit of infinitesimal kernel bandwidth $k(x,y)\rightarrow \delta(x-y)$.

#### Random projections interpretation

There is an interesting property of cosine dissimilarity, which is used to approximate cosine distance via locality sensitive hashing (LSH). This allows us to interpret the SKD as follows:

\begin{align}
d_{spherical,k}(P,Q) &= \mathbb{P}_{f\sim \mathcal{GP}_k} \left[ \operatorname{sign}(\langle f,\mu_P \rangle) \neq \operatorname{sign}(\langle f,\mu_P \rangle) \right]\\
&= \mathbb{P}_{f\sim \mathcal{GP}_k} \left[ \operatorname{sign}(\mathbb{E}_{x\sim P}f(x)) \neq \operatorname{sign}(\mathbb{E}_{x\sim Q}f(x))\right]
\end{align}

So, let's say we sample a bunch of smooth scalar test functions $f$ from a Gaussian process with covariance kernel $k$. We take the average of each test function on samples from $P$ as well as on samples from $Q$. Then we take the average of each test function under $P$ and under $Q$ and we look at the probability that the signs differ. So while the MMD cares about the squared difference between $\mathbb{E}_{x\sim P}f(x)$ and $\mathbb{E}_{x\sim Q}f(x)$, the SKD only cares about whether they have the same the same sign or not.

### Outstanding challenges

The obvious question is: How can we approximate the spherical kernel score in an unbiased way, from finite samples? This is a requirement if we want to use it for generative modelling - if we want to do stochastic gradient descent. BTW, hat tip to Dziugaite et al. (2016) who actually used the unbiased MMD estimate in their version of generative moment matching networks (Eq. (8)) , while Li et al, (2016) used a biased estimate (Eqn. (3)). I don't actually know what an unbiased estimate of the spherical kernel divergence would look like, so if you do, please leave a comment!