November 16, 2017

# A Cookbook for Machine Learning: Vol 1

This was a busy week, I had no time to read anything new, so I'm sharing a note that I wrote for myself, for no other reason than to understand things better. It's a kind of cookbook of various "transformations" you can apply to a machine learning problem to eventually turn it into something we know how to solve: seeking stable attractors of a tractable vector field.

The typical setup is: you have some model parameters $\theta$. You seek to optimize some objective criterion, but the optimization problem is intractable or hard in one of the ways listed below. You then apply the corresponding transformation to your problem if you can. If your problem is now one you can efficiently optimize, great. If not, you can recursively apply the transformations until it is.

UPDATE: Although I called this post a cookbook, as readers so rightly pointed out, it is too light on details to be considered a cookbook. Consider it as a demonstration of a way of thinking about machine learning research as a compiler that compiles an abstract machine learning objective into the canonical optimization problem of finding stable attractors of a tractable vector field.

For the first batch, I have written up the following problem transformations:

There are many more transformations not included here, like the duality principle, half-quadratic splitting, Lagrangian multipliers, etc. Feel free to leave comments about what else I should include next.

## Variational bounds

#### Typical problem:

My loss function $f(\theta)$ is intractable to compute, typically because it involves intractable marginalization. I can't evaluate it let alone minimize it.

#### Solution:

Let's construct a family of - typically differentiable - upper-bounds:
$$f(\theta) \leq \inf_\psi g(\theta, \psi),$$
and solve the optimization problem
$$\theta^\ast, \psi^\ast \leftarrow \operatorname{argmin}_{\theta, \psi} g(\theta, \psi)$$
instead. Technically, once optimization is finished, you can discard the auxiliary parameter $\psi^\ast$ - although often turns out to be meaningful and useful in itself, often for approximate inference such as the recognition model of VAEs.

Jensen's inequality: The mean value of a convex function is never lower than the value of the convex function applied to the mean. Generally appears in some variant of the standard evidence lower bound (ELBO) derivation below:

\begin{align}

• \log p(x) &= - \log \int p(x,y) dy \\
&= - \log \int q(y\vert x) \frac{p(y,x)}{q(y\vert x)}dy \\
&\leq - \int q(y\vert x) \log \frac{p(y,x)}{q(y\vert x)} dy
\end{align}

Reparametrization trick: In variational inference we often encounter gradients of the form:
$$\frac{\partial}{\partial \theta_i}\mathbb{E}_{x \sim q_\theta}\left[f(x, q_\theta(x))\right],$$
where the pdf of the variable appears in the integrand. If we can find a function $h:(\mathcal{E}, \Theta)\mapsto \mathcal{X}$ which is differentiable w.r.t. its second argument, and probability distribution $p_\epsilon$ over $\mathcal{E}$ which is easy to sample from, such that the following holds:
$$x = h(\epsilon, \theta), \epsilon \sim p_\epsilon \iff x \sim q_{\theta},$$
We can use the following reformulation of integrals we often encounter in variational upper bounds.
$$\frac{\partial}{\partial \theta_i}\mathbb{E}_{x \sim q_\theta}\left[f(x, q_\theta(x))\right] = \mathbb{E}_{\epsilon \sim p_\epsilon}\left[ \frac{\partial}{\partial \theta_i} f(h(\epsilon, \theta), q_\epsilon(h(\epsilon, \theta)))\right]$$
A Monte Carlo estimators to this expectation typically have substantially lower variance than REINFORCE estimators to the same quantity.

#### Typical problem:

I can't estimate my loss function $f(\theta)$ directly from samples, typically because the loss function depends on the pdf of the data distribution or that of my model, or both.

#### Solution:

We can sometimes construct an approximation so that
$$f(\theta) \approx h(\theta, \operatorname{argmin}_\psi g(\theta, \psi)),$$
and then we can solve the problem of finding a stable equilibrium of the two-player game where players minimize losses $g$ and $h$ with respect to $\psi$ and $\theta$, respectively.
Sometimes, the approximations may come in the form of lower-bounds, which is the case when $h=-g$:
$$f(\theta) \geq \sup_\psi g(\theta, \psi),$$
in which case we can solve the following minimax problem instead:
$$\theta^\ast \leftarrow \operatorname{argmin}_{\theta} \max_{\psi} g(\theta, \psi)$$

Bayes optimality in auxiliary tasks: When your loss function depends on densities of probability distributions you can easily sample from, often you can construct an auxiliary task, whose Bayes optimal solution depends on values of the densities in question. Examples of auxiliary tasks are: binary classification for likelihood ratio estimation, denoising or score matching for estimating the score function.

Convex conjugate: In case your loss function involves convex functional of the densities (such as in f-divergences), you can transform your problem by re-expressing it in terms of the convex conjugate. The expression for $f$ in terms of its convex conjugate $f^{\ast}$ is:
\begin{align}
f(u) &= \sup_{v \in \operatorname{dom}(f^{\ast})} \{ \langle u, v \rangle - f^{\ast}(v) \} \\
&\geq \sup_{\psi} \{ \langle u, v_\psi \rangle - f(v_\psi) \},
\end{align}
Note that if $u$ is a density function, then the inner product $\langle u,v_\psi\rangle$ is the expectation of $v_\psi$, which can be approximated by Monte Carlo sampling.

## Evolution strategies

#### Typical problem

My $f(\theta)$ is easy to evaluate but hard to optimize, perhaps because it contains discrete operations, or the function is piecewise constant. Backpropagation is not possible.

#### Solution

Observe that for any probability distribution $p_\psi$ over $\theta$ the following holds:
$$\min_{\theta}f(\theta) \leq \min_{\psi} \mathbb{E}_{\theta\sim p_\psi} f(\theta)$$
therefore in ES we concentrate of the following optimization problem instead:
$$\psi^{\ast} \leftarrow \operatorname{argmin}_{\psi} \mathbb{E}_{\theta\sim p_\psi} f(\theta)$$
Often, depending on the function $f$ and class of distributions $p_\psi$, a local minimum of $f$ can be recovered from a local minimum of $\psi$.

REINFORCE gradient estimator: It relies on the following trick
$$\frac{\partial}{\partial \psi_i} \mathbb{E}_{\theta\sim p_\psi}[f(\theta)] = \mathbb{E}_{\theta \sim p_\psi}\left[\frac{\partial}{\partial \psi_i}\log p_\psi(\theta)f(\theta)\right],$$
where the RHS can be easily approximated by Monte Carlo. The variance of Monte Carlo REINFORCE estimators tends to be quite high.

## Convex relaxation

#### Typical problem

My $f(\theta)$ is hard to optimize, because it has non-differentiable and non-convex components like the $\ell_0$-norm of a vector in sparse methods, or the Heaviside step function in classification.

#### Solution

Replace the non-convex component by a convex approximation, turning your objective into a now typically convex $g$
$$f(\theta) \approx g(\theta)$$

$\ell_1$ loss: In many sparse learning situations we wish to minimize counts of non-zero entries in a vector, which is known as $\ell_0$ loss. You can typically replace this by the $\ell_1$ norm of the vector.