<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[inFERENCe]]></title><description><![CDATA[posts on machine learning, statistics, opinions on things I'm reading in the space]]></description><link>http://www.inference.vc/</link><generator>Ghost 0.11</generator><lastBuildDate>Sat, 23 Jun 2018 02:51:17 GMT</lastBuildDate><atom:link href="http://www.inference.vc/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[ML beyond Curve Fitting: An Intro to Causal Inference and do-Calculus]]></title><description><![CDATA[<p>You might have come across <a href="https://www.basicbooks.com/titles/judea-pearl/the-book-of-why/9780465097609/">Judea Pearl's new book</a>, and a <a href="https://www.quantamagazine.org/to-build-truly-intelligent-machines-teach-them-cause-and-effect-20180515/">related interview</a> which was widely shared in my social bubble. In the interview, Pearl dismisses most of what we do in ML as curve fitting. 
While I believe that's an overstatement (conveniently ignores RL for example), it's a nice</p>]]></description><link>http://www.inference.vc/untitled/</link><guid isPermaLink="false">f01e8d56-5117-482e-8a87-82dec692f14d</guid><dc:creator><![CDATA[Ferenc Huszar]]></dc:creator><pubDate>Thu, 24 May 2018 14:07:33 GMT</pubDate><media:content url="http://www.inference.vc/content/images/2018/05/Causality_-building-a-bridge-1.png" medium="image"/><content:encoded><![CDATA[<img src="http://www.inference.vc/content/images/2018/05/Causality_-building-a-bridge-1.png" alt="ML beyond Curve Fitting: An Intro to Causal Inference and do-Calculus"><p>You might have come across <a href="https://www.basicbooks.com/titles/judea-pearl/the-book-of-why/9780465097609/">Judea Pearl's new book</a>, and a <a href="https://www.quantamagazine.org/to-build-truly-intelligent-machines-teach-them-cause-and-effect-20180515/">related interview</a> which was widely shared in my social bubble. In the interview, Pearl dismisses most of what we do in ML as curve fitting. While I believe that's an overstatement (conveniently ignores RL for example), it's a nice reminder that most productive debates are often triggered by controversial or outright arrogant comments. Calling machine learning alchemy was a great recent example. After reading the article, I decided to look into his famous do-calculus and the topic causal inference once <em>again</em>.</p>

<p>Again, because this has happened to me semi-periodically. I first learned do-calculus in a (very unpopular but advanced) undergraduate course on Bayesian networks. Since then, I have re-encountered it every 2-3 years in various contexts, but somehow it never really struck a chord. I always just thought "this stuff is difficult and/or impractical" and eventually forgot about it and moved on. I never realized how fundamental this stuff was, until now.</p>

<p>This time around, I think I fully grasped the significance of causal reasoning and I turned into a full-on believer. I know I'm late to the game but I almost think it's basic hygiene for people working with data and conditional probabilities to understand the basics of this toolkit, and I feel embarrassed for completely ignoring this throughout my career.</p>

<p>In this post I'll try to explain the basics, and convince you why you should think about this, too. If you work on deep learning, that's an even better reason to understand this. Pearl's comments may be unhelpful if interpreted as contrasting deep learning with causal inference. Rather, you should interpret them as highlighting causal inference as a huge, relatively underexplored application of deep learning. Don't get discouraged by causal diagrams looking a lot like Bayesian networks (not a coincidence, seeing as both were pioneered by Pearl): they don't compete with deep learning, they complement it.</p>

<h2 id="basics">Basics</h2>

<p>First of all, causal calculus differentiates between two types of conditional distributions one might want to estimate. <strong>tldr</strong>: in ML we usually estimate only one of them, but in some applications we should actually try to, or even have to, estimate the other one.</p>

<p>To set things up, let's say we have i.i.d. data sampled from some joint $p(x,y,z,\ldots)$. Let's assume we have lots of data and the best tools (say, deep networks) to fully estimate this joint distribution, or any property, conditional or marginal distribution thereof. In other words, let's assume $p$ is known and tractable. Say we are ultimately interested in how variable $y$ behaves given $x$. At a high level, one can ask this question in two ways:</p>

<ul>
<li><p>observational $p(y\vert x)$: What is the distribution of $Y$ given that I <strong>observe</strong> variable $X$ takes value $x$. This is what we usually estimate in supervised machine learning. It is a conditional distribution which can be calculated from $p(x,y,z,\ldots)$ as a ratio of two of its marginals. $p(y\vert x) = \frac{p(x,y)}{p(x)}$. We're all very familiar with this object and also know how to estimate this from data.</p></li>
<li><p>interventional $p(y\vert do(x))$: What is the distribution of $Y$ if I were to <strong>set</strong> the value of $X$ to $x$. This describes the distribution of $Y$ I would observe if I intervened in the data generating process by artificially forcing the variable $X$ to take value $x$, but otherwise simulating the rest of the variables according to the original process that generated the data. (note that the data generating procedure is NOT the same as the joint distribution $p(x,y,z,\ldots)$ and this is an important detail).</p></li>
</ul>
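<p>To make the observational case concrete, here is a minimal numpy sketch (the joint table is made up for illustration) computing $p(y\vert x)$ from a joint over two binary variables, exactly as the ratio of marginals above:</p>

```python
import numpy as np

# A toy joint distribution p(x, y) over two binary variables,
# stored as a 2x2 table: rows index x, columns index y.
# (Numbers are invented for illustration.)
p_xy = np.array([[0.30, 0.10],
                 [0.15, 0.45]])

p_x = p_xy.sum(axis=1, keepdims=True)   # marginal p(x)
p_y_given_x = p_xy / p_x                # p(y|x) = p(x,y) / p(x)

print(p_y_given_x)                      # each row sums to 1
```

<p>Estimating this object from samples, rather than computing it from a known table, is exactly what supervised learning does.</p>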

<h2 id="arenttheythesamething">Aren't they the same thing?</h2>

<p>No. $p(y\vert do(x))$ and $p(y\vert x)$ are not generally the same, and you can verify this with several simple thought experiments. Say, $Y$ is the pressure in my espresso machine's boiler, which ranges roughly between $0$ and $1.1$ bar depending on how long it's been turned on. Let $X$ be the reading of the built-in barometer. Let's say we jointly observe $X$ and $Y$ at random times. Assuming the barometer functions properly, $p(y|x)$ should be a unimodal distribution centered around $x$, with randomness due to measurement noise. However, $p(y|do(x))$ won't actually depend on the value of $x$ and is generally the same as $p(y)$, the marginal distribution of boiler pressure. This is because artificially setting my barometer to a value (say, by moving the needle) won't actually cause the pressure in the tank to go up or down.</p>
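<p>This thought experiment is easy to simulate. In the sketch below (all numbers invented for illustration), conditioning on an observed barometer reading pins down the pressure, while setting the needle by hand leaves the pressure at its marginal distribution:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Assumed data-generating process: boiler pressure y varies with how
# long the machine has been on; the barometer reads y plus small noise.
y = rng.uniform(0.0, 1.1, size=n)
x = y + rng.normal(0.0, 0.02, size=n)

# Observational p(y | x ~ 0.9): conditioning on the reading is informative.
sel = np.abs(x - 0.9) < 0.01
print(y[sel].mean())          # close to 0.9

# Interventional p(y | do(x = 0.9)): we move the needle by hand; the
# pressure is generated exactly as before, so y keeps its marginal.
y_do = rng.uniform(0.0, 1.1, size=n)
x_do = np.full(n, 0.9)        # the intervention on the barometer
print(y_do.mean())            # close to 0.55, the marginal mean
```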

<p>In summary, $y$ and $x$ are correlated or statistically dependent and therefore seeing $x$ allows me to predict the value of $y$, but $y$ is not caused by $x$, so setting the value of $x$ won't affect the distribution of $y$. Hence, $p(y\vert x)$ and $p(y\vert do(x))$ behave very differently. This simple example is just the tip of the iceberg. The differences between interventional and observational conditionals can be a lot more nuanced and hard to characterize when there are lots of variables with complex interactions.</p>

<h2 id="whichonedoiwant">Which one do I want?</h2>

<p>Depending on the application you want to solve, you should seek to estimate one of these conditionals. If your ultimate goal is diagnosis or forecasting (i.e. observing a naturally occurring $x$ and inferring the probable values of $y$) you want the observational conditional $p(y\vert x)$. This is what we already do in supervised learning, this is what Judea Pearl called curve fitting. This is all good for a range of important applications such as classification, image segmentation, super-resolution, voice transcription, machine translation, and many more.</p>

<p>In applications where you ultimately want to control or choose $x$ based on the conditional you estimated, you should seek to estimate $p(y\vert do(x))$ instead. For example, if $x$ is a medical treatment and $y$ is the outcome, you are not merely interested in observing a naturally occurring treatment $x$ and predicting the outcome; you want to <em>proactively choose</em> the treatment $x$ given your understanding of how it affects the outcome $y$. Similar situations occur in system identification, control and online recommender systems.</p>

<h2 id="whatexactlyisdpyvertdoxd">What exactly is $p(y\vert do(x))$?</h2>

<p>This is perhaps the main concept I hadn't grasped before. $p(y\vert do(x))$ is in fact a vanilla conditional distribution, but it is computed not from $p(x,z,y,\ldots)$, but from a different joint $p_{do(X=x)}(x,z,y,\ldots)$. This $p_{do(X=x)}$ is the joint distribution of data which we would observe if we actually carried out the intervention in question. $p(y\vert do(x))$ is the conditional distribution we would learn from data collected in <a href="https://en.wikipedia.org/wiki/Randomized_controlled_trial">randomized controlled trials</a> or A/B tests where the experimenter controls $x$. Note that actually carrying out the intervention or randomized trials may be impossible, impractical or unethical in many situations. You can't run an A/B test forcing half your subjects to smoke weed and the other half to smoke a placebo in order to understand the effect of marijuana on their health. Even if you can't directly estimate $p(y\vert do(x))$ from randomized experiments, the object still exists. The main question of causal inference and do-calculus is:</p>

<blockquote>
  <p>If I cannot measure $p(y\vert do(x))$ directly in a randomized controlled trial, can I estimate it based on data I observed outside of a controlled experiment?</p>
</blockquote>

<h2 id="howareallthesethingsrelated">How are all these things related?</h2>

<p>Let's start with a diagram that shows what's going on if we only care about $p(y\vert x)$, i.e. the simple supervised learning case:</p>

<p><img src="http://www.inference.vc/content/images/2018/05/Causality-0_-just-observational.png" alt="ML beyond Curve Fitting: An Intro to Causal Inference and do-Calculus"></p>

<p>Let's say we observe 3 variables, $x, z, y$, in this order. Data is sampled i.i.d. from some observable joint distribution over 3 variables, denoted by the blue factor graph labelled 'observable joint'. If you don't know what a factor graph is, it's not important, the circles represent random variables, the little square represents a joint distribution of the variables it's connected to. We are interested in predicting $y$ from $x$, and say that $z$ is a third variable which we do not want to infer but we can also measure (I included this for completeness). The observational conditional $p(y\vert x)$ is calculated from this joint via simple conditioning. From the training data we can build a model $q(y\vert x;\theta)$ to approximate this conditional, for example using a deep net minimizing cross-entropy or whatever.</p>
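<p>As a stand-in for the deep net, here is a hedged one-parameter sketch of fitting $q(y\vert x;\theta)$ by minimizing cross-entropy on samples from a made-up joint (logistic regression trained with plain gradient descent):</p>

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy samples from a joint p(x, y): binary y depends stochastically on x.
n = 5000
x = rng.normal(size=n)
p = 1.0 / (1.0 + np.exp(-(2.0 * x - 0.5)))   # true p(y=1|x)
y = rng.binomial(1, p)

# Fit q(y|x; theta) by minimizing cross-entropy with gradient descent
# (a two-parameter stand-in for the deep net in the text).
w, b = 0.0, 0.0
for _ in range(2000):
    q = 1.0 / (1.0 + np.exp(-(w * x + b)))
    w -= 0.5 * np.mean((q - y) * x)   # gradient of cross-entropy wrt w
    b -= 0.5 * np.mean(q - y)         # gradient of cross-entropy wrt b

print(w, b)   # should approach the true parameters (2.0, -0.5)
```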

<p>Now, what if we're actually interested in $p(y\vert do(x))$ rather than $p(y\vert x)$? This is what it looks like:</p>

<p><img src="http://www.inference.vc/content/images/2018/05/Causality-2_-two-distros.png" alt="ML beyond Curve Fitting: An Intro to Causal Inference and do-Calculus"></p>

<p>So, we still have the blue observed joint and data is still sampled from this joint. However, the object we wish to estimate is on the bottom right, the red intervention conditional $p(y\vert do(x))$. This is related to the intervention joint which is denoted by the red factor graph above it. It's a joint distribution over the same domain as $p$ but it's a different distribution. If we could sample from this red distribution (e.g. actually run a randomized controlled trial where we get to pick $x$), the problem would be solved by simple supervised learning. We could generate data from the red joint, and estimate a model directly from there. However, we assume this is not possible, and all we have is data sampled from the blue joint. We have to see if we can somehow estimate the red conditional $p(y\vert do(x))$ from the blue joint.</p>

<h3 id="causalmodels">Causal models</h3>

<p>If we want to establish a connection between the blue and the red joints, <em>we must</em> introduce additional assumptions about the causal structure of the data generating mechanism. The only way we can make predictions about how our distribution changes as a consequence of an intervention is if we know how the variables are causally related. This information about causal relationships is not captured in the joint distribution alone; we have to introduce something more expressive than that. Here is what this looks like:</p>

<p><img src="http://www.inference.vc/content/images/2018/05/Causality_-building-a-bridge--1-.png" alt="ML beyond Curve Fitting: An Intro to Causal Inference and do-Calculus"></p>

<p>In addition to the observable joint, we now also have a causal model of the world (top left). This causal model contains more detail than the joint distribution: it knows not only that pressure and barometer readings are dependent, but also that pressure causes the barometer to go up and not the other way around. The arrows in this model correspond to the assumed direction of causation, and the absence of an arrow represents the absence of direct causal influence between variables. The mapping from causal diagrams to joint distributions is many-to-one: several causal diagrams are compatible with the same joint distribution. Thus, it is generally impossible to conclusively choose between different causal explanations by looking at observed data alone.</p>

<p>Coming up with a causal model is a modeling step where we have to consider assumptions about how the world works, what causes what. Once we have a causal diagram, we can emulate the effect of intervention by mutilating the causal network: deleting all edges that lead into nodes in a $do$ operator. This is shown on the middle-top panel. The mutilated causal model then gives rise to a joint distribution denoted by the green factor graph. This joint has a corresponding conditional distribution $\tilde{p}(y\vert do(x))$, which we can use as our approximation of $p(y\vert do(x))$. If we got the causal structure qualitatively right (i.e. there are no missing nodes and we got the direction of arrows all correct), this approximation is exact and $\tilde{p}(y\vert do(x)) = p(y\vert do(x))$. If our causal assumptions are wrong, the approximation may be bogus.</p>
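<p>The mutilation step is easy to spell out on a small discrete example. Below is a sketch with a made-up model $z \rightarrow x$, $z \rightarrow y$, $x \rightarrow y$: cutting the edge into $x$ means $z$ keeps its marginal, which gives $\tilde{p}(y\vert do(x)) = \sum_z p(z)p(y\vert x,z)$ rather than the observational $\sum_z p(z\vert x)p(y\vert x,z)$:</p>

```python
import numpy as np

# A made-up causal model over binary variables with structure
# z -> x, z -> y, x -> y (z confounds the effect of x on y).
p_z = np.array([0.6, 0.4])                       # p(z)
p_x_z = np.array([[0.8, 0.2],                    # p(x|z): rows z, cols x
                  [0.3, 0.7]])
p_y1_xz = np.array([[0.9, 0.4],                  # p(y=1|x,z): rows x, cols z
                    [0.6, 0.1]])

# Observational p(y=1|x): weight by p(z|x), obtained via Bayes' rule.
p_zx = p_z[:, None] * p_x_z                      # joint p(z, x)
p_z_given_x = p_zx / p_zx.sum(axis=0)            # p(z|x), columns index x
p_y1_obs = (p_y1_xz.T * p_z_given_x).sum(axis=0) # p(y=1|x) for x = 0, 1

# Interventional p(y=1|do(x)): mutilate the graph by cutting z -> x,
# so z keeps its marginal p(z) regardless of the value we set x to.
p_y1_do = p_y1_xz @ p_z                          # sum_z p(z) p(y=1|x,z)

print(p_y1_obs, p_y1_do)                         # generally different
```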

<p>Critically, to get to this green stuff, and thereby to establish the bridge between observational data and interventional distributions, we had to combine data with additional assumptions, prior knowledge if you wish. Data alone would not enable us to do this.</p>

<h3 id="docalculus">Do-calculus</h3>

<p>Now the question is: how can we say anything about the green conditional when we only have data from the blue distribution? We are in a better situation than before, as we have the causal model relating the two. To cut a long story short, this is what the so-called <em>do-calculus</em> is for. Do-calculus allows us to massage the green conditional distribution until we can express it in terms of various marginals, conditionals and expectations under the blue distribution. It extends our toolkit for working with conditional probability distributions with four additional rules we can apply to conditionals containing the $do$ operator. These rules take into account the properties of the causal diagram. The details can't be compressed into a single blog post, but here is <a href="https://arxiv.org/abs/1305.5506">an introductory paper on them</a>.</p>

<p>Ideally, as a result of a do-calculus derivation you end up with an equivalent formula for $\tilde{p}(y\vert do(x))$ which no longer has any $do$ operators in it, so you can estimate it from observational data alone. If this is the case, we say that the causal query $\tilde{p}(y\vert do(x))$ is <em>identifiable</em>. Conversely, if this is not possible no matter how hard we try applying do-calculus, we call the causal query <em>non-identifiable</em>, which means that we won't be able to estimate it from the data we have. The diagram below summarizes this causal inference machinery in its full glory.</p>

<p><img src="http://www.inference.vc/content/images/2018/05/Causality_-do-calculus-estimand--1-.png" alt="ML beyond Curve Fitting: An Intro to Causal Inference and do-Calculus"></p>

<p>The new panel called "estimable formula" shows the equivalent expression for $\tilde{p}(y\vert do(x))$ obtained as a result of the derivation including several do-calculus rules. Notice how the variable $z$ which is completely irrelevant if you only care about $p(y\vert x)$ is now needed to perform causal inference. If we can't observe $z$ we can still do supervised learning, but we won't be able to answer causal inference queries $p(y\vert do(x))$.</p>
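<p>Here is a hedged numerical illustration of that point (a made-up confounded model, not the diagram in the figure): the naive conditional computed without $z$ is biased, while the back-door adjustment $\sum_z p(z)p(y\vert x,z)$, estimated from observational samples alone, recovers the interventional quantity:</p>

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

# Observational samples from an invented confounded model
# z -> x, z -> y, x -> y.
z = rng.binomial(1, 0.4, size=n)
x = rng.binomial(1, np.where(z == 1, 0.7, 0.2))
y = rng.binomial(1, 0.2 + 0.4 * x + 0.3 * z)

# Naive estimate of p(y=1|x=1): ignores the confounder z.
naive = y[x == 1].mean()

# Back-door adjustment: sum_z p(z) p(y=1|x=1, z). By construction the
# true p(y=1|do(x=1)) is 0.6 * 0.6 + 0.4 * 0.9 = 0.72.
adjusted = sum((z == v).mean() * y[(x == 1) & (z == v)].mean()
               for v in (0, 1))

print(naive, adjusted)   # naive is biased upward by the confounder
```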

<h2 id="howdoiknowmycausalmodelisrightorwrong">How do I know my causal model is right or wrong?</h2>

<p>You can never fully verify the validity and completeness of your causal diagram based on observed data alone. However, there are certain aspects of the causal model which are empirically testable. In particular, the causal diagram implies certain conditional independence or dependence relationships between sets of variables. These dependencies or independencies can be empirically tested, and if they are not present in the data, that is an indication that your causal model is wrong. Taking this idea forward you can attempt to perform full causal discovery: attempting to infer the causal model or at least aspects of it, from empirical data.</p>
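<p>One such testable implication is easy to check by simulation. In this made-up example the assumed chain $a \rightarrow b \rightarrow c$ implies that $a$ and $c$ are independent given $b$; if stratifying on $b$ did not remove the dependence between $a$ and $c$, the chain model would be refuted:</p>

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500_000

# Hypothetical causal chain a -> b -> c (all probabilities invented).
a = rng.binomial(1, 0.5, size=n)
b = rng.binomial(1, np.where(a == 1, 0.8, 0.3))
c = rng.binomial(1, np.where(b == 1, 0.7, 0.2))

# The chain implies a is independent of c given b: within each stratum
# of b, the value of a should carry no information about c.
gaps = []
for bv in (0, 1):
    p0 = c[(b == bv) & (a == 0)].mean()
    p1 = c[(b == bv) & (a == 1)].mean()
    gaps.append(abs(p0 - p1))

print(gaps)   # both close to zero: the implied independence holds
```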

<p>But the bottom line is: a full causal model is a form of prior knowledge that you have to add to your analysis in order to get answers to causal questions without actually carrying out interventions. Reasoning with data alone won't be able to give you this. Unlike priors in Bayesian analysis - which are a nice-to-have and can improve data-efficiency -  causal diagrams in causal inference are a must-have. With a few exceptions, all you can do without them is running randomized controlled experiments.</p>

<h2 id="summary">Summary</h2>

<p>Causal inference is indeed something fundamental. It allows us to answer "what-if-we-did-x" type questions that would normally require controlled experiments and explicit interventions to answer. And I haven't even touched on counterfactuals, which are even more powerful.</p>

<p>You can live without this in some cases. Often, you really just want to do normal inference. In other applications such as model-free RL, the ability to explicitly control certain variables may allow you to sidestep answering causal questions explicitly. But there are several situations, and very important applications, where causal inference offers the only method to solve the problem in a principled way.</p>

<p>I wanted to emphasize again that this is not a question of whether you work on deep learning or causal inference. You can, and in many cases you should, do both. Causal inference and do-calculus allow you to understand a problem and establish what needs to be estimated from data, based on your assumptions captured in a causal diagram. But once you've done that, you still need powerful tools to actually estimate that thing from data. Here, you can still use deep learning, SGD, variational bounds, etc. It is this intersection, deep learning applied to causal inference, which the recent article with Pearl claimed was under-explored.</p>

<p>UPDATE: In the comments below people actually pointed out some relevant papers (thanks!). If you are aware of any work, please add them there.</p>]]></content:encoded></item><item><title><![CDATA[The Lottery Ticket Hypothesis - Paper Recommendation]]></title><description><![CDATA[<p>I wanted to highlight a recent paper I came across, which is also a nice follow-up to my <a href="http://www.inference.vc/pruning-neural-networks-two-recent-papers/">earlier post</a> on pruning neural networks:</p>

<ul>
<li>J Frankle and M Carbin (2018) <a href="https://arxiv.org/abs/1803.03635">The Lottery Ticket Hypothesis: Training Pruned Neural Networks</a></li>
</ul>

<p>Many people working on network pruning observed that, starting from a wide</p>]]></description><link>http://www.inference.vc/the-lottery-ticket-hypothesis/</link><guid isPermaLink="false">bbb9aefb-1036-40e7-bdb9-012e1d21695f</guid><dc:creator><![CDATA[Ferenc Huszar]]></dc:creator><pubDate>Thu, 10 May 2018 13:42:38 GMT</pubDate><media:content url="http://www.inference.vc/content/images/2018/05/lotteryticket.jpg" medium="image"/><content:encoded><![CDATA[<img src="http://www.inference.vc/content/images/2018/05/lotteryticket.jpg" alt="The Lottery Ticket Hypothesis - Paper Recommendation"><p>I wanted to highlight a recent paper I came across, which is also a nice follow-up to my <a href="http://www.inference.vc/pruning-neural-networks-two-recent-papers/">earlier post</a> on pruning neural networks:</p>

<ul>
<li>J Frankle and M Carbin (2018) <a href="https://arxiv.org/abs/1803.03635">The Lottery Ticket Hypothesis: Training Pruned Neural Networks</a></li>
</ul>

<p>Many people working on network pruning observed that, starting from a wide network and pruning it, one obtains better performance than training the slim, pruned architecture from scratch (random initialization). This suggests that the added redundancy of an overly wide network is somehow useful in training.</p>

<p>In this paper, Frankle and Carbin present the <em>lottery ticket hypothesis</em> to explain these observations. According to the hypothesis, good performance results from lucky initialization of a subnetwork or subcomponent of the original network. Since fat networks have exponentially more component subnetworks, starting from a fatter network increases the effective number of lottery tickets, thereby increasing the chances of containing a winning ticket. According to this hypothesis, pruning effectively identifies the subcomponent which is the winning ticket.</p>

<p>It's important to note that good initialization is a somewhat underrated but extremely important component of SGD-based deep learning. Indeed, I often find that if you use non-trivial architectures, training is essentially stuck until you tweak the off-the-shelf initialization schemes so that it somehow starts to work.</p>

<p>To test the hypothesis the authors have designed a set of cool experiments. The experiments basically go like this:</p>

<ol>
<li>start from a fat network  </li>
<li>train it  </li>
<li>prune it  </li>
<li>take the resulting skinny network and reset weights to their initial values (they call this the winning ticket)  </li>
<li>train the winning ticket network  </li>
<li>as a control, reinitialize the weights of the winning ticket randomly, and train that</li>
</ol>
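<p>Steps 3-4, the heart of the recipe, can be sketched in a few lines. This is my own simplified stand-in, not the authors' code: magnitude pruning on a flat weight vector, followed by resetting the surviving weights to their initial values:</p>

```python
import numpy as np

def winning_ticket(init_w, trained_w, sparsity):
    """Prune the smallest-magnitude *trained* weights, then reset the
    survivors to their *initial* values (steps 3-4 of the recipe)."""
    threshold = np.quantile(np.abs(trained_w), sparsity)
    mask = np.abs(trained_w) >= threshold      # keep the large weights
    return init_w * mask, mask

rng = np.random.default_rng(4)
w_init = rng.normal(size=1000)
w_trained = w_init + rng.normal(scale=0.5, size=1000)  # stand-in for SGD

ticket, mask = winning_ticket(w_init, w_trained, sparsity=0.8)
print(mask.mean())   # roughly 0.2 of the weights survive
```

<p>The control in step 6 would replace <code>init_w</code> with a fresh random draw before applying the same mask.</p>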

<p>Here are some of the results from Figure 4.</p>

<p><img src="http://www.inference.vc/content/images/2018/05/Screen-Shot-2018-05-10-at-2.08.57-PM.png" alt="The Lottery Ticket Hypothesis - Paper Recommendation"></p>

<p>Observe two main things in these results. The winning tickets, without the remaining redundancy of the wide network, train faster than the wide network. In fact, the skinnier they are, the faster they train (within reason). However, if you reinitialize the networks' weights randomly (control), the resulting nets now train slower than the full network. Therefore, pruning is not just about finding the right architecture, it's also about identifying the 'winning ticket', which is a particularly luckily initialized subcomponent of the network.</p>

<p>I thought this paper was pretty cool. It underscores the role that (random) initialization plays in the successful training of neural networks, and gives an interesting justification for why redundancy and over-parametrization might actually be a desirable property when combined with SGD from random init.</p>

<p>There are various ways this line of research can be extended and built upon. For a starter, it would be great to see how the results hold up when better pruning schemes and more general classes of initialization approaches are used. It would be interesting to see if the training+pruning method consistently identifies the same winning ticket, or if it finds different winning tickets each time. One could look at various properties of randomly sampled tickets and compare them to the winning tickets. For example, I would be very curious to see what these winning tickets look like on the <a href="https://arxiv.org/abs/1703.00810">information plane</a>. Similarly, it would be great to find cheaper ways or heuristics to identify winning tickets without having to train and prune the fat network.</p>]]></content:encoded></item><item><title><![CDATA[Goals and Principles of Representation Learning]]></title><description><![CDATA[<p>This is a post about my takeaways from the DALI workshop on the <a href="http://dalimeeting.org/dali2018/workshopRepresentationLearning.html">Goals and Principles of Representation Learning</a> which we co-organized with DeepMinders <a href="http://shakirm.com/">Shakir Mohamed</a> and <a href="https://www.cs.toronto.edu/~amnih/">Andriy Mnih</a> and Twitter colleague <a href="http://theis.io/">Lucas Theis</a>. We had an amazing set of presentations, videos are <a href="https://www.youtube.com/playlist?list=PL-tWvTpyd1VAlbzhCpljlREd76Nlo1pOo">available here</a>.</p>

<p>DALI is my favourite meeting</p>]]></description><link>http://www.inference.vc/goals-and-principles-of-representation-learning/</link><guid isPermaLink="false">5ef782e6-8c14-4e59-a604-a23351e29bf9</guid><dc:creator><![CDATA[Ferenc Huszar]]></dc:creator><pubDate>Thu, 12 Apr 2018 13:51:41 GMT</pubDate><media:content url="http://www.inference.vc/content/images/2018/04/288kum-1.jpg" medium="image"/><content:encoded><![CDATA[<img src="http://www.inference.vc/content/images/2018/04/288kum-1.jpg" alt="Goals and Principles of Representation Learning"><p>This is a post about my takeaways from the DALI workshop on the <a href="http://dalimeeting.org/dali2018/workshopRepresentationLearning.html">Goals and Principles of Representation Learning</a> which we co-organized with DeepMinders <a href="http://shakirm.com/">Shakir Mohamed</a> and <a href="https://www.cs.toronto.edu/~amnih/">Andriy Mnih</a> and Twitter colleague <a href="http://theis.io/">Lucas Theis</a>. We had an amazing set of presentations, videos are <a href="https://www.youtube.com/playlist?list=PL-tWvTpyd1VAlbzhCpljlREd76Nlo1pOo">available here</a>.</p>

<p>DALI is my favourite meeting of the year. It's small, and <a href="http://dalimeeting.org/participants">participants</a> are all very experienced and engaged. To make the most of the high-quality audience we decided to focus on slightly controversial topics, and asked speakers to present their general thoughts, their unique perspectives and opinions rather than talk about their latest research project. Broadly, we asked three types of questions:</p>

<ol>
<li>What tasks do we use unsupervised representation learning for?  </li>
<li>What principles can help us do well on these tasks?  </li>
<li>How can we turn principles into actual objective functions and algorithms?</li>
</ol>

<h3 id="taskprinciplelossfunctonalgorithm">Task → Principle → Loss function → … → Algorithm</h3>

<p>From my perspective, the end-goal of any area of ML research should be for each algorithm to have a justification from first principles. Like the chain above starting from the Task all the way down to the precise algorithm we're going to run. </p>

<p>In my <a href="http://www.inference.vc/design-patterns/">machine learning cookbook post</a> I reviewed a number of fairly well understood problem transformations which one can apply to the <em>bottom</em> of this chain, typically to turn an intractable optimisation problem into a tractable one: variational inference, evolution strategies, convex relaxations, etc. In the workshop we focussed on the <em>top</em> of the chain, the <a href="http://blog.shakirm.com/2013/04/marrs-levels-of-analysis/">computational level understanding</a>: Do we actually have clarity on the ultimate task we are solving? If the task is given, is the principle we're applying actually justified? Is the loss function consistent with the principle?</p>

<h3 id="transferlearning">Transfer learning</h3>

<p>One thing that most speakers and attendees seemed to agree on is that the most interesting and most important task we use unsupervised representation learning for is to transfer knowledge from large datasets to new tasks. I'll try to capture one possible formulation of what this means:</p>

<ul>
<li>we have some large auxiliary dataset $\mathcal{D}$.</li>
<li>we have a future ML task $T$ which we have to learn to solve in the future. We may not have complete knowledge as to what this task is.</li>
<li>The goal of transfer learning is to <strong>increase our data-efficiency</strong> in solving $T$, using side information derived from the auxiliary dataset $\mathcal{D}$.</li>
</ul>

<p>There are many possible extensions of this basic setup. For example we may have a distribution of tasks, rather than a single task.</p>

<h3 id="disentanglement">Disentanglement</h3>

<p>One of the principles we discussed was disentanglement. It is somewhat unclear what exactly disentanglement is, but one way to capture this is to say "each coordinate in the representation corresponds to one meaningful factor of variation". Of course this is kind of a circular definition inasmuch as it's not easy to define a meaningful factor of variation. Irina defined disentangled representations as "factorized and interpretable", which, as a definition, has its own issues, too.</p>

<p>In the context of transfer learning though, I can see a vague argument for why seeking disentanglement might be a useful principle. One way to improve data efficiency of a ML algorithm is to endow it with inductive biases or to reduce the complexity of the function class in some sense. If we believe that future downstream tasks we may want to solve can be solved with simple, say linear, models on top of a disentangled representation, then seeking disentanglement makes sense. It may be possible to use this argument itself to formulate a more precise definition of or objective function for disentanglement.</p>

<h3 id="selfsupervision">Self-supervision</h3>

<p>The idea of self-supervision also featured prominently in talks. Self-supervision is the practice of defining auxiliary supervised learning tasks involving the unlabelled data, with hopes that solving these auxiliary tasks will require us to build representations which are also useful in the eventual downstream task we'd like to solve. Autoencoders, pseudolikelihood, denoising, <a href="http://www.inference.vc/notes-on-unsupervised-learning-of-visual-representations-by-solving-jigsaw-puzzles/">jigsaw-puzzles</a>, <a href="http://www.inference.vc/temporal-contrastive-learning-for-latent-variable-models/">temporal-contrastive learning</a> and so on are examples of this approach.</p>
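<p>As a toy illustration of the idea (my own sketch, not taken from any of the linked papers): give a model a denoising task on unlabelled data and, to solve it, the model must discover the data's underlying structure, here a 1-D subspace, with only the data itself as supervision:</p>

```python
import numpy as np

rng = np.random.default_rng(5)

# Unlabelled data living near a line in 2-D; no labels anywhere in sight.
n = 2000
t = rng.normal(size=(n, 1))
X = t @ np.array([[2.0, 1.0]])                      # clean data
X_noisy = X + rng.normal(scale=0.3, size=X.shape)   # corrupted inputs

# Self-supervised objective: reconstruct clean from noisy. A linear
# denoiser has a closed form; a deep net + SGD would play this role.
W = np.linalg.lstsq(X_noisy, X, rcond=None)[0]

# The learned W approximately projects onto the data direction: solving
# the auxiliary task recovered the underlying 1-D structure.
err_before = np.mean((X_noisy - X) ** 2)
err_after = np.mean((X_noisy @ W - X) ** 2)
print(err_before, err_after)
```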

<p>One very interesting aspect Harri talked about was the question of relevance: how can we make sure that an auxiliary learning task is relevant to the primary inference task we need to solve? In a semi-supervised learning setup Harri showed how you can adapt an auxiliary denoising task to be more relevant: you can calculate a sort of saliency map to determine which input dimensions the primary inference network pays attention to, and then use saliency-weighted denoising as the auxiliary task.</p>

<h3 id="arerepresentationsthesolution">Are representations the solution?</h3>

<p>After the talks we had an open discussion session, hoping for a heated debate with people representing opposing viewpoints. And the audience really delivered on this. 10 minutes or so into the discussion Zoubin walked in and stated "I don't believe in representation learning". I tried to capture the main points in the argument (Zoubin being anti-representation) below:</p>

<p><img src="http://www.inference.vc/content/images/2018/04/288kum.jpg" alt="Goals and Principles of Representation Learning"></p>

<p>Retrospectively, I would summarise Zoubin's (and others') criticism as follows: If we identified transfer learning as the primary task representation learning is supposed to solve, are we actually sure that representation learning is the way to solve it? Indeed, this is not one of the questions I asked in my slides, and it's a very good one. One can argue that there may be many ways to transfer information from some dataset over to a novel task. Learning a representation and transferring that is just one approach. Meta-learning, for example, might provide another approach.</p>

<h3 id="summary">Summary</h3>

<p>I have really enjoyed this workshop. The talks were great, participants were engaged, and the discussion at the end was a fantastic ending for DALI itself. I am hoping that, just like me, others have left the workshop with many things to talk about, and that the debates we had will inspire future work on representation learning (or anti-representation learning).</p>]]></content:encoded></item><item><title><![CDATA[Pruning Neural Networks: Two Recent Papers]]></title><description><![CDATA[<p>I wanted to briefly highlight two recent papers on pruning neural networks (disclaimer, one of them is ours):</p>

<ul>
<li>Christos Louizos, Max Welling, Diederik P. Kingma (2018) <a href="https://arxiv.org/abs/1712.01312">Learning Sparse Neural Networks through $L_0$ Regularization</a></li>
<li>Lucas Theis, Iryna Korshunova, Alykhan Tejani, Ferenc Huszár (2018) <a href="https://arxiv.org/abs/1801.05787">Faster gaze prediction with dense networks and</a></li></ul>]]></description><link>http://www.inference.vc/pruning-neural-networks-two-recent-papers/</link><guid isPermaLink="false">dc88fa1f-79e8-4a84-850f-4e5123ea80f8</guid><dc:creator><![CDATA[Ferenc Huszar]]></dc:creator><pubDate>Tue, 06 Feb 2018 15:41:46 GMT</pubDate><media:content url="http://www.inference.vc/content/images/2018/02/Screen-Shot-2018-02-05-at-5.05.39-PM-1.png" medium="image"/><content:encoded><![CDATA[<img src="http://www.inference.vc/content/images/2018/02/Screen-Shot-2018-02-05-at-5.05.39-PM-1.png" alt="Pruning Neural Networks: Two Recent Papers"><p>I wanted to briefly highlight two recent papers on pruning neural networks (disclaimer, one of them is ours):</p>

<ul>
<li>Christos Louizos, Max Welling, Diederik P. Kingma (2018) <a href="https://arxiv.org/abs/1712.01312">Learning Sparse Neural Networks through $L_0$ Regularization</a></li>
<li>Lucas Theis, Iryna Korshunova, Alykhan Tejani, Ferenc Huszár (2018) <a href="https://arxiv.org/abs/1801.05787">Faster gaze prediction with dense networks and Fisher pruning</a></li>
</ul>

<p>What I generally refer to as pruning in the title of this post is reducing or controlling the number of non-zero parameters, or the number of featuremaps actively used, in a neural network. At a high level, there are at least three ways one can go about this; pruning is really only one of them:</p>

<ul>
<li><em>regularization</em> modifies the objective function/learning problem so that the optimization is likely to find a neural network with a small number of parameters. Louizos et al (2018) choose this approach.</li>
<li><em>pruning</em> takes a large network and deletes features or parameters that are in some sense redundant; Theis et al (2018) is an example of this.</li>
<li><em>growing</em>: although less widespread, you can take a third approach where, starting from a small network, you incrementally add new units according to some growth criterion.</li>
</ul>

<h2 id="whydothis">Why do this?</h2>

<p>There are different reasons for pruning your network. The most obvious, perhaps, is to reduce computational cost while keeping the same performance. Removing features which aren't really used in your deep network architecture can speed up inference as well as training. You can also think of pruning as a form of architecture search: figuring out how many features you need in each layer for best performance.</p>

<p>The second argument is to improve generalization by reducing the number of parameters, and thus the redundancy in the parameter space. As we have seen in recent work on generalization in deep networks, the raw number of parameters ($L_0$ norm) is not actually a sufficient predictor of their generalization ability. That said, we empirically find that pruning a network tends to help generalization. Meanwhile, the community is developing (or, putting my Schmidhuber-hat on: maybe in some cases rediscovering) new parameter-dependent quantities to predict/describe generalization. The <a href="http://www.inference.vc/generalization-and-the-fisher-rao-norm-2/">Fisher-Rao norm</a> is a great example of these. Interestingly, Fisher pruning <a href="https://arxiv.org/abs/1801.05787">(Theis et al, 2018)</a> turns out to have a nice connection to the Fisher-Rao norm, and this may hint at a deeper relationship between pruning, parameter redundancy and generalization.</p>

<h2 id="dl_0dregularization">$L_0$ regularization</h2>

<p>I found the $L_0$ paper by <a href="https://arxiv.org/abs/1712.01312">Louizos et al, (2018)</a> very interesting in that it can be seen as a straightforward application of the machine learning problem transformations I wrote up in the <a href="http://www.inference.vc/design-patterns/">machine learning cookbook</a> a few months ago. It's a good illustration of how you can use these general ideas to go from formulating an intractable ML optimization problem to something practical you can run SGD on.</p>

<p>So I will summarize the paper as a series of steps, each changing the optimization problem:</p>

<ol>
<li>start from an ideal loss function which may be intractable to optimize: the usual training loss plus the $L_0$ norm of parameters, combined linearly. The $L_0$ norm simply counts non-zero entries in a vector, a non-differentiable piecewise constant function. This is a difficult, combinatorial optimization problem.  </li>
<li>apply <a href="http://www.inference.vc/evolution-strategies-variational-optimisation-and-natural-es-2/">variational optimization</a> to turn the non-differentiable function into a differentiable one. This generally works by introducing a probability distribution $p_{\psi}(\theta)$ over parameters $\theta$. Even if the objective is non-differentiable with respect to any $\theta$, the average loss under $p_{\psi}$ may be differentiable w.r.t. $\psi$. To find the optimal $\psi$, one can generally use a REINFORCE gradient estimator, which results in evolution strategies. But ES generally has very high variance, so we  </li>
<li>apply the <strong>reparametrization trick</strong> to $p_\psi$ to construct a lower-variance gradient estimator. This, however, only works for continuous variables. To deal with the discreteness, we turn to a  </li>
<li><strong>concrete relaxation</strong>, which approximates the discrete random variable by a continuous approximation. Now we have a lower-variance (compared to REINFORCE) gradient estimator which one can calculate via backprop and simple Monte Carlo sampling. You can use these gradients in SGD (Adam), which is what the paper does.</li>
</ol>
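Steps 2 to 4 can be sketched in a few lines. This is my own rough numpy sketch of the "hard concrete" gate following the formulas in Louizos et al; the constants are their suggested defaults, not something I've tuned:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hard_concrete_gate(log_alpha, beta=2.0 / 3.0, gamma=-0.1, zeta=1.1, rng=None):
    """Sample a 'hard concrete' gate: a reparametrized, concrete-relaxed
    Bernoulli mask, stretched and clipped so it can be exactly 0 or 1."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.uniform(1e-6, 1.0 - 1e-6, size=np.shape(log_alpha))
    s = sigmoid((np.log(u) - np.log(1.0 - u) + log_alpha) / beta)  # concrete sample
    return np.clip(s * (zeta - gamma) + gamma, 0.0, 1.0)           # stretch and clip

def expected_l0(log_alpha, beta=2.0 / 3.0, gamma=-0.1, zeta=1.1):
    """Differentiable surrogate for the expected number of non-zero gates."""
    return sigmoid(log_alpha - beta * np.log(-gamma / zeta))
```

The surrogate `expected_l0` stands in for the raw parameter count in the loss, and gradients can flow into `log_alpha` thanks to the reparametrization in step 3.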

<p>Interestingly, the connection between Eq. (3) and evolution strategies or variational optimization is not mentioned. Instead, a motivation based on a different connection to spike-and-slab priors is given. I recommend reading the paper, perhaps with this connection in mind.</p>

<p>The authors then show that this indeed works, and that it compares favourably to other methods designed to reduce the number of parameters. <br>
Thinking about the paper in terms of these steps converting from one problem to another allows you to generalize or improve the idea. For example, the <a href="https://arxiv.org/abs/1703.07370">REBAR</a> or <a href="https://arxiv.org/abs/1711.00123">RELAX</a> gradient estimators provide an unbiased and lower-variance alternative to the concrete relaxation, which may work very well here, too.</p>

<h2 id="fisherpruning">Fisher pruning</h2>

<p>The second paper I wanted to talk about is something from our own lab. Rather than being a purely methods paper, <a href="https://arxiv.org/abs/1801.05787">(Theis et al, 2018)</a> focusses on the specific application of building speedy neural networks to predict saliency in an image. The pruned network now powers the logic behind <a href="https://blog.twitter.com/engineering/en_us/topics/infrastructure/2018/Smart-Auto-Cropping-of-Images.html">cropping photos on Twitter</a>.</p>

<p>Our goal, too, was to reduce the computational cost of the network, specifically in the transfer learning setting: when building on top of a pre-trained neural network, you inherit a lot of complexity required to solve the original <em>source</em> task, much of which may be redundant for solving your <em>target</em> task. There is a difference in our high-level pruning objective: unlike the $L_0$ norm or group sparsity, we used a slightly more complicated formula to directly estimate the forward-pass runtime of the network. This is a quadratic function of the number of parameters at each layer, with interactions between neighbouring layers. Interestingly, this results in architectures which tend to alternate thick and thin layers, like the one below:</p>

<p><img src="http://www.inference.vc/content/images/2018/02/Screen-Shot-2018-02-05-at-5.05.39-PM.png" alt="Pruning Neural Networks: Two Recent Papers"></p>
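To make the runtime objective concrete, here is a toy cost model of my own (not the exact formula from the paper): the forward pass of a convolution stack costs roughly $n_{l-1} n_l k^2 H W$ multiply-adds per layer, so the total is quadratic in the layer widths, with terms coupling neighbouring layers only:

```python
def forward_cost(channels, hw=32 * 32, ksize=3):
    """Rough multiply-add count of a conv stack: each layer costs
    in_channels * out_channels * k^2 * H * W, so the total is quadratic
    in the layer widths, with neighbouring-layer interactions."""
    return sum(cin * cout * ksize ** 2 * hw
               for cin, cout in zip(channels[:-1], channels[1:]))

# alternating thick and thin layers can be cheaper than uniform widths,
# even with the same total number of hidden channels:
uniform = forward_cost([3, 64, 64, 64, 64])
alternating = forward_cost([3, 96, 32, 96, 32])
```

Under this toy model the alternating architecture is cheaper than the uniform one, which is consistent with the thick/thin pattern in the figure above.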

<p>We prune the trained network greedily by removing convolutional featuremaps one at a time. A meaningful principle for selecting the next feature map to prune is to minimize the resulting increase in training loss. Starting from this criterion, using a second order Taylor-expansion of the loss, making some more assumptions, we obtain the following <em>pruning signal</em> for keeping a parameter $\theta_i$:</p>

<p>$$
\Delta_i \propto F_i \theta_i^2,
$$</p>

<p>where $F_i$ denotes the $i^{th}$ diagonal entry of the Fisher information matrix. The above formula deals with removing a single parameter, but we can generalize this to removing entire featuremaps. Pruning proceeds by removing the parameter or featuremap with the smallest $\Delta$ in each iteration, and retraining the network between iterations. For more details, please see the paper.</p>
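As a concrete illustration, the pruning signal can be computed from per-example gradients. This is a minimal numpy sketch under the diagonal-Fisher assumption, where the empirical Fisher diagonal is just the mean squared per-example gradient:

```python
import numpy as np

def fisher_pruning_signal(per_example_grads, theta):
    """Delta_i = F_i * theta_i^2, with the diagonal empirical Fisher
    F_i estimated as the mean squared per-example gradient."""
    F = np.mean(per_example_grads ** 2, axis=0)  # shape: (n_params,)
    return F * theta ** 2

# two training examples, two parameters (toy numbers)
grads = np.array([[0.1, 2.0],
                  [-0.2, 1.5]])
theta = np.array([3.0, 0.1])
delta = fisher_pruning_signal(grads, theta)
prune_first = int(np.argmin(delta))  # remove the parameter with the smallest signal
```

In the paper this is done per featuremap rather than per individual parameter, with retraining between pruning iterations.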

<p>Adding to what's presented in the paper, I wanted to point out a few connections between Fisher pruning and ideas I discussed on this blog before.</p>

<h4 id="fisherraonorm">Fisher-Rao norm</h4>

<p>The first connection is to the Fisher-Rao norm. Assume for a minute that the Fisher information is diagonal - a big and unreasonable assumption in theory, but a pragmatic simplification resulting in useful algorithms in practice. With this assumption, the Fisher-Rao norm of $\theta$ becomes:</p>

<p>$$
\|\theta\|_{fr} = \sum_{i=1}^{I} F_i \theta_i^2
$$</p>

<p>Written in this form, you can hopefully see the connection between the FR-norm and the Fisher pruning criterion. Depending on the particular definition of Fisher information used, you can interpret the FR-norm, approximately, as</p>

<ul>
<li>the expected drop in training log likelihood (empirical Fisher info) as you remove a random parameter, or as</li>
<li>the approximate change in the conditional distribution defined by the model (model Fisher info) as you remove a parameter</li>
</ul>

<p>In the real world, the Fisher info matrix is not diagonal, and this is actually an important aspect of understanding generalization. For one, considering only diagonal entries makes Fisher pruning sensitive to certain reparametrizations (ones with non-diagonal Jacobian) of the network. But maybe there is a deeper connection to be observed here between Fisher-Rao norms and the redundancy of parameters.</p>

<h4 id="elasticweightconsolidation">Elastic Weight Consolidation</h4>

<p>Using the diagonal Fisher information values to guide pruning also bears resemblance to <a href="http://www.inference.vc/comment-on-overcoming-catastrophic-forgetting-in-nns-are-multiple-penalties-needed-2/">elastic weight consolidation</a> by <a href="http://www.pnas.org/content/114/13/3521.full">(Kirkpatrick et al, 2017)</a>. In EWC, the Fisher information values are used to establish which weights are more or less important for solving previous tasks. There, the algorithm was derived from the perspective of Bayesian on-line learning, but you can also motivate it from <a href="https://arxiv.org/abs/1712.03847">a Taylor expansion perspective</a>, just like Fisher pruning.</p>

<p>The metaphor I use to understand and explain EWC is that of a shared hard drive. (WARNING: like all metaphors, this may be completely missing the point). The parameters of a neural network are like a hard drive or storage volume of some sort. Training the NN on a task involves compressing the training data and <em>saving the information</em> to the hard drive. If you have no mechanism to keep data from being overwritten, it's going to be overwritten: in neural networks, catastrophic forgetting occurs the same way. EWC is like a protocol for sharing the hard-drive between multiple users, without the users overwriting each other's data. The Fisher information values in EWC can be seen as soft do-not-overwrite flags. After training on the first task, we calculate the Fisher information values which say which parameters store crucial information for the task. The ones with low Fisher value are redundant and can be reused to store new info. In this metaphor, it is satisfying to think about the sum of Fisher information values as measuring how full the hard-drive is, and pruning as throwing away parts of the drive not actually used to store anything.</p>

<h2 id="summary">Summary</h2>

<p>I wrote about two recent methods for automatically learning neural network architecture by figuring out which parameters/features to throw away. In my mind, both methods/papers are interesting in their own right. The $L_0$ approach seems like a simpler optimization algorithm that may be preferable to the iterative, remove-one-feature-at-a-time nature of Fisher pruning. However, Fisher pruning is more applicable to the scenario when you start from a large pretrained model in a transfer learning setting.</p>]]></content:encoded></item><item><title><![CDATA[My notes on (Liang et al., 2017): Generalization and the Fisher-Rao norm]]></title><description><![CDATA[<p>After last week's post on the generalization mystery, people have pointed me to recent work connecting the Fisher-Rao norm to generalization (thanks!):</p>

<ul>
<li>Tengyuan Liang, Tomaso Poggio, Alexander Rakhlin, James Stokes (2017) <a href="https://arxiv.org/abs/1711.01530">Fisher-Rao Metric, Geometry, and Complexity of Neural
Networks</a></li>
</ul>

<p>This <a href="http://www.mit.edu/~rakhlin/papers/myths.pdf">short presentation on generalization</a> by the coauthors Sasha Rakhlin is</p>]]></description><link>http://www.inference.vc/generalization-and-the-fisher-rao-norm-2/</link><guid isPermaLink="false">b8067af7-72bc-49fe-bb16-71c9bba34e29</guid><dc:creator><![CDATA[Ferenc Huszar]]></dc:creator><pubDate>Thu, 25 Jan 2018 23:43:49 GMT</pubDate><media:content url="http://www.inference.vc/content/images/2018/01/download-14-1.png" medium="image"/><content:encoded><![CDATA[<img src="http://www.inference.vc/content/images/2018/01/download-14-1.png" alt="My notes on (Liang et al., 2017): Generalization and the Fisher-Rao norm"><p>After last week's post on the generalization mystery, people have pointed me to recent work connecting the Fisher-Rao norm to generalization (thanks!):</p>

<ul>
<li>Tengyuan Liang, Tomaso Poggio, Alexander Rakhlin, James Stokes (2017) <a href="https://arxiv.org/abs/1711.01530">Fisher-Rao Metric, Geometry, and Complexity of Neural
Networks</a></li>
</ul>

<p>This <a href="http://www.mit.edu/~rakhlin/papers/myths.pdf">short presentation on generalization</a> by coauthor Sasha Rakhlin is also worth looking at - though I have to confess many of the references to learning theory are lost on me.</p>

<p>While I can't claim to have understood all the bounding and proofs going on in Section 4, I think I got the big picture, so I will try to summarize the main points below. In addition, I wanted to add some figures I made, which helped me understand the restricted model class the authors worked with and the "gradient structure" this restriction gives rise to. Feel free to point out if anything I say here is wrong or incomplete.</p>

<h2 id="normbasedcapacitycontrol">Norm-based capacity control</h2>

<p>The main mantra of this paper is along the lines of <a href="http://ieeexplore.ieee.org/abstract/document/661502/">results by Bartlett (1998)</a> who observed that in neural networks, generalization is about the size of the weights, not the number of weights. This theory underlies the use of techniques such as weight decay and even early stopping, since both can be seen as ways to keep the neural network's weight vector small. Reasoning about a neural network's generalization ability in terms of the size, or norm, of its weight vector is called <em>norm-based capacity control</em>.</p>

<p>The main contribution of Liang et al (2017) is proposing the Fisher-Rao norm as a measure of how big the networks' weights are, and hence as an indicator of a trained network's generalization ability. It is defined as follows:</p>

<p>$$
\|\theta\|_{fr} = \theta^\top I(\theta) \theta
$$</p>

<p>where $I$ is the Fisher information matrix:</p>

<p>$$
I(\theta) = \mathbb{E}_{x,y} \left[ \nabla_\theta \ell(f_\theta(x),y) \nabla_\theta \ell(f_\theta(x),y)^\top \right] <br>
$$</p>

<p>There are various versions of the Fisher information matrix, and therefore of the Fisher-Rao norm, depending on which distribution the expectation is taken under. The empirical form samples both $x$ and $y$ from the empirical data distribution. The model form samples $x$ from the data, but assumes that the loss is a log-loss of a probabilistic model, and we sample $y$ from this model.</p>
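For a model where everything is tractable, such as logistic regression, the two flavours are easy to write down. A sketch, computing the diagonal of each Fisher matrix:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fisher_diagonals(X, y, theta):
    """Diagonal Fisher information for logistic regression, in both flavours."""
    p = sigmoid(X @ theta)
    # empirical: gradients of -log p(y|x) evaluated at the *observed* labels y
    g = (p - y)[:, None] * X
    empirical = np.mean(g ** 2, axis=0)
    # model: expectation over y ~ Bernoulli(p); E[(p - y)^2] = p(1 - p)
    model = np.mean((p * (1 - p))[:, None] * X ** 2, axis=0)
    return empirical, model

X = np.array([[1.0, 0.0], [1.0, 2.0], [0.0, 1.0]])
y = np.array([1.0, 0.0, 1.0])
emp, mod = fisher_diagonals(X, y, theta=np.zeros(2))
```

At $\theta = 0$ the two coincide, since $(p-y)^2 = p(1-p) = 1/4$ for binary labels when $p = 1/2$; away from that point they generally differ.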

<p>Importantly, the Fisher-Rao norm is something which depends on the data distribution (at least the distribution of $x$). It is also invariant under reparametrization, which means that if there are two parameters $\theta_1$ and $\theta_2$ which implement the same function, their FR-norm is the same. Finally, it is a measure related to flatness inasmuch as the Fisher-information matrix approximates the Hessian at a minimum of the loss under certain conditions.</p>

<h2 id="summaryofthepapersmainpoints">Summary of the paper's main points:</h2>

<ul>
<li>in rectified linear networks without bias, the Fisher-Rao norm has an equivalent form which is easier to calculate and analyze (more on this below)</li>
<li>various other norms previously proposed to bound model complexity can be bounded by the Fisher-Rao norm</li>
<li>for linear networks without rectification (which are in fact just linear functions) the Rademacher complexity can be bounded by the FR-norm, and the bound does not depend on the number of layers or the number of units per layer. This suggests that the Fisher-Rao norm may be a good, model-size-independent proxy measure of generalization.</li>
<li>the authors do not prove bounds for more general classes of models, but provide intuitive arguments based on its relationship to other norms.</li>
<li>the authors also did a bunch of experiments to show how the FR-norm correlates with generalization performance. They looked at both vanilla SGD and the second-order stochastic method K-FAC. They looked at what happens if we mix random labels into the training set, and found that the FR-norm of the final solution seems to track the generalization gap.</li>
<li>there are still open questions, for example explaining what, specifically, makes SGD converge to better minima, and how this changes with increasing batchsize.</li>
</ul>

<h2 id="rectifiedneuralnetworkswithoutbias">Rectified neural networks without bias</h2>

<p>The one thing I wanted to add to this paper, is a little bit more detail on the particular model class - rectified linear networks <strong>without bias</strong> - that the authors studied here. This restriction turns out to guarantee some very interesting properties, without hurting the empirical performance of the networks (so the authors claim and to some degree demonstrate).</p>

<p>Let's first visualize what the output of a rectified multilayer perceptron <em>with biases</em> looks like. Here I used 3 hidden layers with 15 ReLUs in each and PyTorch-default random initialization. The network's input is 2D, and the output is 1D so I can easily plot contour surfaces:</p>

<h5 id="figure1rectifiednetworkwithbiases">Figure 1: rectified network with biases</h5>

<p><img src="http://www.inference.vc/content/images/2018/01/download-12.png" alt="My notes on (Liang et al., 2017): Generalization and the Fisher-Rao norm"></p>

<p>The left-hand panel shows the function itself. The panels next to it show the gradients with respect to $x_1$ and $x_2$ respectively. The function is piecewise linear (which is hard to see because there are many, many linear pieces), which means that the gradients are piecewise constant (which is more visually apparent).</p>

<p>The piecewise linear structure of $f$ becomes more apparent when we superimpose the contour plot of the gradients (black) on top of the contour plot of $f$ itself (red-blue):</p>

<p><img src="http://www.inference.vc/content/images/2018/01/download-14.png" alt="My notes on (Liang et al., 2017): Generalization and the Fisher-Rao norm"></p>

<p>These functions are clearly very flexible, and as we add more layers, the number of linear pieces grows exponentially.</p>

<p>Importantly, the above plot would look very similar had I plotted the function's output as a function of two components of $\theta$, keeping $x$ fixed. This is significantly more difficult to plot though, so I'm hoping you'll just believe me.</p>

<p>Now let's look at what happens when we remove all biases from the network, keeping only the weight matrices:</p>

<h5 id="figure2rectifiednetworkwithoutbiases">Figure 2: rectified network without biases</h5>

<p><img src="http://www.inference.vc/content/images/2018/01/download-15.png" alt="My notes on (Liang et al., 2017): Generalization and the Fisher-Rao norm"></p>

<p>Wow, the function looks very different now, doesn't it? At $x=0$ it always takes the value $0$. It is composed of wedge-shaped (or, in higher dimensions, generalized pyramid-shaped) regions within which the function is linear, but the slope in each wedge is different. Yet the surface is still continuous. Let's do the superimposed plot again:</p>

<p><img src="http://www.inference.vc/content/images/2018/01/download-17.png" alt="My notes on (Liang et al., 2017): Generalization and the Fisher-Rao norm"></p>

<p>It's less clear from these plots why a function like this can model data just as well as the more general piecewise linear one we get if we enable biases. One thing that helps is dimensionality: in high dimensions, the probability that two randomly sampled datapoints fall into the same "pyramid", i.e. share the same linear region, is extremely small. Unless your data has some structure that makes this likely to happen for many datapoints at once, you don't really have to worry about it, I guess.</p>
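The wedge structure is easy to verify numerically: a bias-free rectified network is positively homogeneous, so it is linear along every ray through the origin. A small numpy sketch (not the exact network from the figures):

```python
import numpy as np

rng = np.random.default_rng(0)
# a bias-free ReLU network: 2 -> 15 -> 15 -> 15 -> 1
Ws = [rng.normal(size=(m, n)) / np.sqrt(m)
      for m, n in zip([2, 15, 15, 15], [15, 15, 15, 1])]

def f(x):
    h = np.asarray(x, dtype=float)
    for W in Ws[:-1]:
        h = np.maximum(h @ W, 0.0)  # ReLU, no bias term
    return float(h @ Ws[-1])        # linear output layer

x = np.array([0.7, -1.3])
# scaling the input scales the output: f(c x) = c f(x) for c >= 0,
# which is exactly why the linear regions are wedges meeting at x = 0
```

Positive homogeneity means the whole function is determined by its values on the unit circle, matching the radial pattern in the plots above.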

<p>Furthermore, if my network had three input dimensions, but I only use two dimensions $x_1$ and $x_2$ to encode data and fix the third coordinate $x_3=1$, I can implement the same kind of functions over my inputs. This is called using homogeneous coordinates, and a bias-less network with homogeneous coordinates can be nearly as powerful as one with biases in terms of the functions it can model. Below is an example of a function a rectified network with no biases can implement when using homogeneous coordinates.</p>

<h4 id="figure3rectifiednetworkwithoutbiaseshomogeneouscoordinates">Figure 3: rectified network without biases, homogeneous coordinates</h4>

<p><img src="http://www.inference.vc/content/images/2018/01/download-18.png" alt="My notes on (Liang et al., 2017): Generalization and the Fisher-Rao norm"></p>

<p>This is because the third variable $x_3=1$ multiplied by its weights practically becomes a bias for the first hidden layer.</p>

<p>A second observation: if we consider $f_\theta(x)$ as a function of the weight matrix of a particular layer, keeping all other weights and the input fixed, the function behaves exactly the same way as it does with respect to the input $x$. The same radial pattern would be observed in $f$ if I plotted it as a function of a weight matrix (though weight matrices are rarely 2-D, so I can't really plot that).</p>

<h4 id="structureingradients">Structure in gradients</h4>

<p>The authors note that these functions satisfy the following formula: <br>
$$
f_\theta(x) = \nabla_x f_\theta(x)^\top x <br>
$$</p>

<p>(Moreover, I think these are the only continuous functions for which the above equality holds, but I leave this for keen readers to prove or disprove.)</p>

<p>Noting the symmetry between the network's inputs and weight matrices, a similar equality can be established with respect to parameters $\theta$:</p>

<p>$$
f_\theta(x) = \frac{1}{L+1}\nabla_\theta f_\theta(x)^\top \theta, <br>
$$</p>

<p>where $L$ is the number of layers.</p>

<h4 id="whywouldthisbethecase">Why would this be the case?</h4>

<p>Here's my explanation which differs slightly from the simple proof the authors give. A general rectified network is piecewise linear with respect to $x$, as discussed. The boundaries of the linear pieces, and the slope, changes as we change $\theta$. Let's fix $\theta$. Now, so long as $x$ and some $x_0$ fall within the same linear region, the function at $x$ equals its Taylor expansion around $x_0$:</p>

<p>\begin{align}
f_\theta(x) &amp;= \nabla_{x} f_\theta(x_0)^\top (x- x_0) + f_{\theta}(x_0) \\ <br>
   &amp;= \nabla_x f_\theta(x)^\top (x - x_0) + f_{\theta}(x_0)
\end{align}</p>

<p>Now, if we have no biases, all the linear segments are wedge-shaped, and they all meet at the origin $x=0$. So we can take the limit of the above Taylor expansion as $x_0\rightarrow 0$ (the limit is only needed technically, as the function is non-differentiable exactly at $x=0$). As $f_{\theta}(0)=0$, we find that</p>

<p>$$
f_\theta(x) = \nabla_x f_\theta(x)^\top x, <br>
$$</p>

<p>just as we wanted. Now, treating layer $l$'s weights $\theta^{(l)}$ as if they were the input to the network consisting of the subsequent layers, and the previous layer's activations as if they were the weight multiplying these inputs, we can derive a similar formula in terms of $\theta^{(l)}$:</p>

<p>$$
f_\theta(x) = \nabla_{\theta^{(l)}} f_\theta(x)^\top \theta^{(l)}, <br>
$$</p>

<p>Applying this formula to all layers $l=1,\ldots,L+1$ and averaging, we obtain:</p>

<p>$$
f_\theta(x) = \frac{1}{L+1}\nabla_\theta f_\theta(x)^\top \theta <br>
$$</p>

<p>We got the $L+1$ from the $L$ hidden layers plus the output layer.</p>
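The identity is easy to check numerically on a tiny bias-free network. A sketch, where central finite differences stand in for backprop and the seed is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=2)
theta = rng.normal(size=2 * 3 + 3 * 1)  # weights of a 2 -> 3 -> 1 bias-free ReLU net

def f(th):
    W1, W2 = th[:6].reshape(2, 3), th[6:].reshape(3, 1)
    return float(np.maximum(x @ W1, 0.0) @ W2)

# finite-difference gradient with respect to all parameters
eps = 1e-6
grad = np.array([(f(theta + eps * e) - f(theta - eps * e)) / (2 * eps)
                 for e in np.eye(theta.size)])

L = 1                                    # one hidden layer
lhs, rhs = f(theta), grad @ theta / (L + 1)
```

Here `lhs` and `rhs` agree up to floating-point error, as the identity predicts; the check would fail if we added bias terms.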

<h4 id="expressionforfisherraonorm">Expression for Fisher-Rao norm</h4>

<p>Using the formula above, and the chain rule, we can simplify the expression for the Fisher-Rao norm as follows:</p>

<p>\begin{align}
\|\theta\|_{fr} &amp;= \mathbb{E} \theta^\top \nabla_\theta \ell(f_\theta(x),y) \nabla_\theta \ell(f_\theta(x),y)^\top \theta \\
&amp;= \mathbb{E} \left( \theta^\top \nabla_\theta \ell(f_\theta(x),y) \right)^2 \\
 &amp;= \mathbb{E} \left( \theta^\top \nabla_\theta f_\theta(x) \nabla_f \ell(f,y) \right)^2\\
 &amp;= (L+1)^2 \, \mathbb{E} \left( f_\theta(x)^\top \nabla_f \ell(f,y)\right)^2
\end{align}</p>

<p>It can be seen very clearly in this form that the Fisher-Rao norm only depends on the output of the function $f_\theta(x)$ and properties of the loss function. This means that if two parameters $\theta_1$ and $\theta_2$ implement the same input-output function $f$, their F-R norm will be the same.</p>

<h2 id="summary">Summary</h2>

<p>I think this paper presented a very interesting insight into the geometry of rectified linear neural networks, and highlighted some interesting connections between information geometry and norm-based generalization arguments.</p>

<p>What I think is still missing is the kind of insight which would explain why SGD finds solutions with low F-R norm, or how the F-R norm of a solution is effected by the batchsize of SGD, if at all it is. The other thing missing is whether the F-R norm can be an effective regularizer. It seems that for this particular class of networks which don't have any bias parameters, the model F-R norm could be calculated relatively cheaply and added to as a regularizer since we already calculate the forward-pass of the network anyway.</p>]]></content:encoded></item><item><title><![CDATA[The Generalization Mystery: Sharp vs Flat Minima]]></title><description><![CDATA[<p>I set out to write about the following paper I saw people talk about on twitter and reddit:</p>

<ul>
<li>Hao Li, Zheng Xu, Gavin Taylor, Tom Goldstein <a href="https://openreview.net/pdf?id=HkmaTz-0W">Visualizing the Loss Landscape of Neural Nets</a></li>
</ul>

<p>It's related to this pretty insightful paper:</p>

<ul>
<li>Laurent Dinh, Razvan Pascanu, Samy Bengio, Yoshua Bengio (2017) <a href="https://arxiv.org/abs/1703.04933">Sharp</a></li></ul>]]></description><link>http://www.inference.vc/sharp-vs-flat-minima-are-still-a-mystery-to-me/</link><guid isPermaLink="false">e7c273b0-69d8-4ab5-b39f-cedfb16f905e</guid><dc:creator><![CDATA[Ferenc Huszar]]></dc:creator><pubDate>Thu, 18 Jan 2018 22:55:52 GMT</pubDate><media:content url="http://www.inference.vc/content/images/2018/01/Screen-Shot-2018-01-18-at-2.57.18-PM.png" medium="image"/><content:encoded><![CDATA[<img src="http://www.inference.vc/content/images/2018/01/Screen-Shot-2018-01-18-at-2.57.18-PM.png" alt="The Generalization Mystery: Sharp vs Flat Minima"><p>I set out to write about the following paper I saw people talk about on twitter and reddit:</p>

<ul>
<li>Hao Li, Zheng Xu, Gavin Taylor, Tom Goldstein <a href="https://openreview.net/pdf?id=HkmaTz-0W">Visualizing the Loss Landscape of Neural Nets</a></li>
</ul>

<p>It's related to this pretty insightful paper:</p>

<ul>
<li>Laurent Dinh, Razvan Pascanu, Samy Bengio, Yoshua Bengio (2017) <a href="https://arxiv.org/abs/1703.04933">Sharp Minima Can Generalize For Deep Nets</a></li>
</ul>

<p>Inevitably, I started thinking more generally about flat and sharp minima and generalization, so rather than describing these papers in detail, I ended up dumping some thoughts of my own. Feedback and pointers to literature are welcome, as always.</p>

<h4 id="summaryofthispost">Summary of this post</h4>

<ul>
<li>Flatness of minima is hypothesized to have something to do with generalization in deep nets.</li>
<li>As Dinh et al (2017) show, flatness is sensitive to reparametrization and thus cannot predict generalization ability alone.</li>
<li>Li et al (2017) use a form of parameter normalization to make their method more robust to reparametrization and produce some fancy plots comparing deep net architectures.</li>
<li>While this analysis is now invariant to the particular type of reparametrizations considered by Dinh et al, it may still be sensitive to other types of invariances, so I'm not sure how much to trust these plots and conclusions.</li>
<li>Then I go back to square one and ask how one could construct indicators of generalization that are invariant by construction, for example by considering ratios of flatness measures.</li>
<li>Finally, I have a go at developing a local measure of generalization from first principles. The resulting metric depends on the data and statistical properties of gradients calculated from different minibatches.</li>
</ul>

<h2 id="flatnessgeneralizationandsgd">Flatness, Generalization and SGD</h2>

<p>The loss surface of deep nets tends to have many local minima. Many of these might be equally good in terms of training error, but they may have widely different generalization performance, i.e. a network with minimal training loss might perform very well, or very poorly, on a held-out test set. Interestingly, stochastic gradient descent (SGD) with small batch sizes appears to locate minima with better generalization properties than large-batch SGD. So the big question is: what measurable property of a local minimum can we use to predict generalization properties? And how does this relate to SGD?</p>

<p>There is speculation dating back to at least Hochreiter and Schmidhuber (1997) that the flatness of the minimum is a good measure to look at. However, as Dinh et al (2017) pointed out, flatness is sensitive to reparametrizations of the neural network: we can reparametrize a neural network without changing its outputs while making sharp minima look arbitrarily flat and vice versa. As a consequence, flatness alone cannot explain or predict good generalization.</p>
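<p>Dinh et al's construction is easy to reproduce numerically. For a two-layer ReLU network the function is unchanged under $W_1 \to \alpha W_1$, $W_2 \to W_2/\alpha$, but the curvature of the loss with respect to $W_1$ scales with $\alpha^2$. Here is a minimal sketch of that effect with scalar weights and made-up data (not code from either paper):</p>

```python
import numpy as np

def f(x, w1, w2):
    # tiny two-layer ReLU "network" with scalar weights
    return w2 * np.maximum(w1 * x, 0.0)

def loss(w1, w2, x, y):
    return np.mean((f(x, w1, w2) - y) ** 2)

def curvature_w1(w1, w2, x, y, h=1e-4):
    # finite-difference second derivative of the loss with respect to w1
    return (loss(w1 + h, w2, x, y) - 2 * loss(w1, w2, x, y)
            + loss(w1 - h, w2, x, y)) / h ** 2

rng = np.random.default_rng(0)
x = rng.uniform(0.1, 1.0, size=100)
y = 2.0 * x                      # any w1 * w2 = 2 fits the data exactly

w1, w2, alpha = 1.0, 2.0, 10.0
same_function = np.allclose(f(x, w1, w2), f(x, alpha * w1, w2 / alpha))
c_orig = curvature_w1(w1, w2, x, y)
c_repar = curvature_w1(alpha * w1, w2 / alpha, x, y)
print(same_function, c_orig / c_repar)   # True, ratio close to alpha**2
```

The reparametrized network computes exactly the same function, yet its minimum looks a hundred times flatter: flatness on its own cannot be the whole story.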

<p>Li et al (2017) proposed a normalization scheme which scales the space around a minimum in such a way that the apparent flatness in 1D and 2D plots is kind of invariant to the type of reparametrization Dinh et al used. This, they say, allows us to produce more faithful visualizations of the loss surfaces around a minimum. They even use 1D and 2D plots to illustrate differences between different architectures, such as a VGG and a ResNet. I personally do not buy the conclusions of this paper, and it seems <a href="https://openreview.net/forum?id=HkmaTz-0W">the reviewers of the ICLR submission</a> largely agreed on this. The proposed method is weakly motivated and only addresses one possible type of reparametrization.</p>

<h2 id="contrastiveflatnessmeasures">Contrastive Flatness measures</h2>

<p>Following the thinking by Dinh et al, if generalization is a property which is invariant under reparametrization, the quantity we use to predict generalization should also be invariant. My intuition is that a good way to achieve invariance is to consider the ratio between two quantities - maybe two flatness measures - which are affected by reparametrization in the same way.</p>

<p>One thing I think would make sense to look at is the average flatness of the loss in a single minibatch vs the flatness of the average loss. Why would this make sense? The average loss can be flat around a minimum in different ways: it can be flat because it is the average of flat functions which all look very similar and whose minima are very close to the same location; or it can be flat because it is the average of many sharp functions with minima at locations scattered around the minimum of the average.</p>

<p>Intuitively, the former solution is more stable with respect to subsampling of data, and therefore it should be more favourable from a generalization viewpoint. The latter solution is very sensitive to which particular minibatch we are looking at, so presumably it may give rise to worse generalization.</p>

<p>As a conclusion of this section: I don't think it makes sense to look only at the flatness of the average loss; how that flatness is affected by subsampling the data feels more key to generalization.</p>
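<p>Here is a toy illustration of this point, with one-parameter quadratic minibatch losses (entirely made up): two collections of minibatch losses whose average loss has exactly the same curvature, but whose minima are aligned in one case and scattered in the other. Flatness of the average loss cannot tell them apart, while a regret-style quantity that compares minibatches can:</p>

```python
import numpy as np

rng = np.random.default_rng(1)

# per-minibatch losses f_i(t) = (t - m_i)^2, with minibatch minima m_i
m_aligned = np.zeros(50)             # all minima coincide at 0
m_scatter = rng.normal(size=50)      # minima scattered around 0
m_scatter -= m_scatter.mean()        # keep the average minimum at 0

def avg_loss(t, m):
    return np.mean((t - m) ** 2)

def curvature(m, t=0.0, h=1e-3):
    # flatness of the *average* loss at its minimum t = 0
    return (avg_loss(t + h, m) - 2 * avg_loss(t, m) + avg_loss(t - h, m)) / h ** 2

def pairwise_regret(m):
    # f_j evaluated at the minimizer of f_i, minus min_t f_j, averaged over pairs
    mi, mj = np.meshgrid(m, m)
    return np.mean((mi - mj) ** 2)

# the flatness of the average loss is identical in the two cases...
print(curvature(m_aligned), curvature(m_scatter))      # both 2.0
# ...but the minibatch-to-minibatch regret separates them
print(pairwise_regret(m_aligned), pairwise_regret(m_scatter))
```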

<h2 id="alocalmeasureofgeneralization">A local measure of generalization</h2>

<p>After Jorge Nocedal's <a href="https://iclr.cc/archive/www/lib/exe/fetch.php%3Fmedia=iclr2017:nocedal_iclr2017.pdf">ICLR talk on large-batch SGD</a>, Léon Bottou had a comment which I think hit the nail on the head. The process of sampling minibatches from training data kind of simulates the effect of sampling the training set and the test set from some underlying data distribution. Therefore, you might think of generalization from one minibatch to another as a proxy for how well a method would generalize from a training set to a test set.</p>

<p>How can we use this insight to come up with some sort of measure of generalization based on minibatches, especially along the lines of sharpness or local derivatives?</p>

<p>First of all, let's consider the stochastic process $f(\theta)$ which we obtain by evaluating the loss function on a random minibatch. The randomness comes from subsampling the data. This is a probability distribution over loss functions over $\theta$. I think it's useful to seek an indicator of generalization ability as a local property of this stochastic process at any given $\theta$ value.</p>

<p>Let's pretend for a minute that each draw $f(\theta)$ from this process is convex or at least has a unique global minimum. How would one describe a model's generalization from one minibatch to another in terms of this stochastic process?</p>

<p>Let's draw two functions $f_1(\theta)$ and $f_2(\theta)$ independently (i.e. evaluate the loss on two separate minibatches). I propose that the following would be a meaningful measure:</p>

<p>$$
R = f_2 (\operatorname{argmin}_\theta f_1(\theta)) - \min_\theta f_2(\theta)
$$</p>

<p>Basically: you care about finding low error according to $f_2$ but all you have access to is $f_1$. You therefore look at what the value of $f_2$ is at the location of the minimum of $f_1$ and compare that to the global minimal value of $f_2$. This is a sort of regret expression, hence my use of $R$ to denote it.</p>

<p>Now, in deep learning the loss functions $f_1$ and $f_2$ are not convex and have many local minima, so this definition is not particularly useful in general. However, it makes sense to calculate this value locally, in a small neighbourhood of a particular parameter value $\theta$. Let's consider fitting a restricted neural network model, where only parameters within a certain $\epsilon$ distance from $\theta$ are allowed. If $\epsilon$ is small enough, we can assume the loss functions have a unique global minimum within this $\epsilon$-ball. Furthermore, if $\epsilon$ is small enough, one can use a first-order Taylor-approximation to $f_1$ and $f_2$ to analytically find approximate minima within the $\epsilon$-ball. To do this, we just need to evaluate the gradient at $\theta$. This is illustrated in the figure below:</p>

<p><img src="http://www.inference.vc/content/images/2018/01/download-11.png" alt="The Generalization Mystery: Sharp vs Flat Minima"></p>

<p>The left-hand panel shows an imaginary loss function evaluated on some minibatch $f_1$, restricted to the $\epsilon$-ball around $\theta$. We can assume $\epsilon$ is small enough so $f_1$ is linear within this local region. Unless the gradient is exactly $0$, the minimum will fall on the surface of the $\epsilon$-ball, exactly at $\theta - \epsilon \frac{g_1}{\|g_1\|}$ where $g_1$ is the gradient of $f_1$ at $\theta$. This is shown by the yellow star. On the right-hand panel I show $f_2$. This is also locally linear, but its gradient $g_2$ might be different. The minimum of $f_2$ within the $\epsilon$-ball is at $\theta - \epsilon \frac{g_2}{\|g_2\|}$, shown by the red star. We can consider the regret-type expression as above, by evaluating $f_2$ at the yellow star, and subtracting its value at the red star. This can be expressed as follows (I divided by $\epsilon$):</p>

<p>$$
\frac{R(\theta, f_1, f_2)}{\epsilon} \rightarrow - \frac{g_2^\top g_1}{\|g_1\|} + \frac{g_2^\top g_2}{\|g_2\|} = \|g_2\| - \frac{g_2^\top g_1}{\|g_1\|} = \|g_2\|\left(1 - \cos(g_1, g_2)\right)
$$</p>

<p>In practice one would consider taking an expectation with respect to the two minibatches to obtain an expression that depends on $\theta$. So, we have just come up with a <strong>local measure of generalization ability</strong>, which is expressed in terms of expectations involving gradients over different minibatches. The measure is local as it is specific for each value of $\theta$. It is data-dependent in that it depends on the distribution $p_\mathcal{D}$ from which we sample minibatches.</p>
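<p>A Monte Carlo sketch of estimating this quantity, on a made-up least-squares problem (the setup and all names here are mine, purely for illustration):</p>

```python
import numpy as np

rng = np.random.default_rng(2)

# toy least-squares problem with synthetic data
X = rng.normal(size=(1000, 5))
theta_star = rng.normal(size=5)
y = X @ theta_star + 0.5 * rng.normal(size=1000)

def minibatch_grad(theta, batch_size=32):
    # gradient of the mean squared error on a random minibatch
    idx = rng.choice(len(y), size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    return 2.0 * Xb.T @ (Xb @ theta - yb) / batch_size

def local_generalization_measure(theta, n_pairs=500):
    # Monte Carlo estimate of E[ ||g_2|| (1 - cos(g_1, g_2)) ]
    total = 0.0
    for _ in range(n_pairs):
        g1, g2 = minibatch_grad(theta), minibatch_grad(theta)
        cos = g1 @ g2 / (np.linalg.norm(g1) * np.linalg.norm(g2))
        total += np.linalg.norm(g2) * (1.0 - cos)
    return total / n_pairs

theta_min = np.linalg.lstsq(X, y, rcond=None)[0]
m_at_min = local_generalization_measure(theta_min)
m_away = local_generalization_measure(theta_min + 5.0)
print(m_at_min, m_away)
```

Each term in the sum is non-negative, and it vanishes only when the two minibatch gradients point in exactly the same direction, as one would hope from a regret.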

<p>This measure depends on two things:</p>

<ul>
<li>the expected similarity of gradients from different minibatches: $1 - \cos(g_1, g_2)$ looks at whether various minibatches of data push $\theta$ in similar directions. In regions where the gradients are sampled from a mostly spherically symmetric distribution, this term would be close to $1$ most of the time.</li>
<li>the magnitude of gradients $\|g_2\|$. Interestingly, one can express this as $\sqrt{\operatorname{trace}\left(g_2 g_2^\top\right)}$.</li>
</ul>

<p>When we take the expectation over this, assuming that the $1 - \cos$ term is mostly $1$, we end up with the expression $\mathbb{E}_g \sqrt{\operatorname{trace}\left(g g^\top\right)}$ where the expectation is taken over minibatches. Note that the trace-norm of the empirical Fisher information matrix $\sqrt{ \operatorname{trace} \mathbb{E}_g \left(g g^\top\right)}$ can be used as a measure of flatness of the average loss around minima, so there may be some interesting connections there. However, due to Jensen's inequality the two things are not actually the same.</p>
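<p>The Jensen gap is easy to check numerically: since $\operatorname{trace}(g g^\top) = \|g\|^2$, the two quantities are $\mathbb{E}\|g\|$ and $\sqrt{\mathbb{E}\|g\|^2}$, and the former is never larger (with a made-up gradient distribution for illustration):</p>

```python
import numpy as np

rng = np.random.default_rng(3)

# sample "minibatch gradients" g from some made-up distribution
g = rng.normal(size=(10000, 5)) * np.array([1.0, 2.0, 0.5, 1.5, 3.0])

# E ||g||: expectation of the square root
lhs = np.mean(np.linalg.norm(g, axis=1))
# sqrt(trace E[g g^T]) = sqrt(E ||g||^2): square root of the expectation
rhs = np.sqrt(np.mean(np.sum(g ** 2, axis=1)))

print(lhs, rhs)   # Jensen: lhs <= rhs, equal only if ||g|| is constant
```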

<p><em>Update - thanks to reddit user <a href="https://www.reddit.com/user/bbsome">bbsome</a> for pointing this out</em>:</p>

<p>Note that $R$ is not invariant under reparametrization either. The source of this sensitivity is the fact that I considered an $\epsilon$-ball in Euclidean norm around $\theta$. The right way to get rid of this is to consider an $\epsilon$-ball using the symmetrized KL divergence instead of the Euclidean norm, similarly to how natural gradient methods can be derived. If we do this, the formula becomes dependent only on the functions the neural network implements, not on the particular choice of parametrization. I leave it as homework for people to work out how this would change the formulae.</p>

<h1 id="summary">Summary</h1>

<p>This post started out as a paper review, but in the end I didn't find the paper too interesting and instead resorted to sharing ideas about tackling the generalization puzzle a bit differently. It's entirely possible that people have done this analysis before, or that it's completely useless. In any case, I welcome feedback.</p>

<p>The first observation here was that a good indicator may involve not just the flatness of the average loss around the minimum, but a ratio between two flatness indicators. Such metrics may end up invariant under reparametrization by construction.</p>

<p>Taking this idea further I attempted to develop a local indicator of generalization performance which goes beyond flatness. It also includes terms that measure the sensitivity of gradients to data subsampling.</p>

<p>Because data subsampling is something that occurs both in generalization (training vs test set) and in minibatch-SGD, these kinds of measures may shed some light on how SGD enables better generalization.</p>]]></content:encoded></item><item><title><![CDATA[Alchemy, Rigour and Engineering]]></title><description><![CDATA[<p>Like many of you, I thoroughly enjoyed <a href="https://www.youtube.com/watch?v=ORHFOnaEzPc">Ali Rahimi's NIPS talk</a> in response to winning the test-of-time award for their work on random kitchen sinks. I recommend everyone to watch it if you haven't already.</p>

<p>I also read <a href="https://www.facebook.com/yann.lecun/posts/10154938130592143">Yann LeCun's rebuttal to Ali's talk</a>. He says what Ali calls</p>]]></description><link>http://www.inference.vc/my-thoughts-on-alchemy/</link><guid isPermaLink="false">08cd6193-b261-4444-80c1-49d59332533b</guid><dc:creator><![CDATA[Ferenc Huszar]]></dc:creator><pubDate>Thu, 07 Dec 2017 12:56:21 GMT</pubDate><media:content url="http://www.inference.vc/content/images/2017/12/The-Alchemy-of-Beiong-Human-in-One-Image.jpg" medium="image"/><content:encoded><![CDATA[<img src="http://www.inference.vc/content/images/2017/12/The-Alchemy-of-Beiong-Human-in-One-Image.jpg" alt="Alchemy, Rigour and Engineering"><p>Like many of you, I thoroughly enjoyed <a href="https://www.youtube.com/watch?v=ORHFOnaEzPc">Ali Rahimi's NIPS talk</a> in response to winning the test-of-time award for their work on random kitchen sinks. I recommend everyone to watch it if you haven't already.</p>

<p>I also read <a href="https://www.facebook.com/yann.lecun/posts/10154938130592143">Yann LeCun's rebuttal to Ali's talk</a>. He says what Ali calls alchemy is in fact engineering. Although I think Yann ends up arguing against points Ali didn't make in his talk, he raises important and good points about the different roles that tricks, empirical evidence and theory can play in engineering.</p>

<p>I wanted to add my own thoughts and experiences to the mix.</p>

<h2 id="differentmodesofinnovation">Different modes of innovation</h2>

<p>We can think about machine learning knowledge as a graph, where methods are nodes and edges represent connections or analogies between the methods.</p>

<p>Innovation means growing this graph, which one can do in different ways:</p>

<ol>
<li><strong>add new nodes:</strong> This is about thinking outside the box: about discovering something completely new and quirky that, hopefully, works. Recent examples of this type of ML innovation include herding, noise-as-targets, batch-norm, dropout or GANs. You added a new node to the graph, maybe weakly connected to the rest of the graph, but your reasoning may not be very clear to most people at first.  </li>
<li><strong>discover connections between existing nodes:</strong> This is about finding explanations for new methods. Interpreting k-means as expectation-maximization, dropout as variational inference, stochastic gradient descent as variational inference, GANs as f-GANs, batchnorm as natural gradients, denoising autoencoders as score matching, etc. These are all <em>hugely influential</em> pieces of work, and we sorely need this kind of work in ML as these connections often allow us to improve or generalize techniques and thereby to make more predictable progress.  </li>
<li><strong>complete patterns:</strong> This is about analogical reasoning. You notice incomplete patterns in the graph, and infer that some nodes must exist by completing the pattern. At the <a href="https://nips.cc/Conferences/2010/SamRoweis">Sam Roweis Symposium at NIPS 2010</a> I remember Zoubin Ghahramani making a joke about how he and Sam wrote a bunch of papers together basically for free by completing hypercubes, just like the one below. Long time readers of my blog will know that Sam and Zoubin's <a href="http://mlg.eng.cam.ac.uk/zoubin/papers/lds.pdf">unifying review</a> is one of my all-time favourite papers. Note that this 'pattern completion' happens outside the Bayesian world, too: see e.g. bi-directional convolutional RNNs or PixelGAN Autoencoders.</li>
</ol>

<div class="image-div" style="width: 600px;">  
<img src="http://www.inference.vc/content/images/2017/12/F4.large.jpg" alt="Alchemy, Rigour and Engineering">
</div>  

<p>Each of these modes of innovation is important in the process of building knowledge and moving forward. Our community tends to celebrate and glorify #1 (as long as the paper has the reigning buzzword in the title). Individually, we are incentivized to write loads of papers, so we spend most of our time on #3, which is arguably the lowest-risk, most incremental kind of project. I don't think we reward or encourage #2 enough, and to me, this is one of the main messages in Ali's talk. By focussing too much on benchmarks, datasets, and empirical results, we are actually creating barriers for people whose main contribution would be explaining why methods work or don't work.</p>

<p>It's easy to forget what the original GAN paper's results looked like. These were really good back then and would be laughable today:</p>

<div class="image-div" style="width: 600px;">  
<img src="http://www.inference.vc/content/images/2017/12/Screen-Shot-2017-12-07-at-11.49.46-AM.png" alt="Alchemy, Rigour and Engineering">
</div>

<p>GANs were, arguably, a highly influential, great new idea. Today, that same paper would be unpublishable because the pictures don't look pretty enough. Wasserstein GANs were a great idea too, and frankly, I don't need to look at experimental results at all to recognize it as a good one.</p>

<p>The same way Yann argues neural nets were unfairly abandoned in the 90s because of the lack of convergence guarantees convex optimization methods enjoy, today we are unfairly dismissing any method or idea that does not produce state-of-the-art or near-SOTA results. I once reviewed a paper where another reviewer wrote "The work can be novel if the method turns out to be the most efficient method of [...] compared to existing methods". This is at least as wrong as dismissing novel ideas for lack of theory (which is, BTW, not what Ali suggested).</p>

<h2 id="itsokaytousenonrigorousmethodsbut">It's Okay to use non-rigorous methods, but...</h2>

<h4 id="itsnotokaytousenonrigorousevaluation">it's not okay to use non-rigorous evaluation</h4>

<p>As far as I'm concerned, I'm now comfortable with using methods that are non-rigorous or for which the theoretical framework is underdeveloped or does not exist. However,</p>

<blockquote>
  <p>nobody should feel good about any paper where the evaluation is non-rigorous</p>
</blockquote>

<p>In GAN papers we show pretty pictures, but we have absolutely no rigorous way to assess the diversity of samples, or whether any form of overfitting has occurred, at least not as far as I'm aware. I know from experience that getting a sufficiently novel deep learning idea to work is a fragile process: it starts with  nothing working, then maybe it starts to work but doesn't converge,  then it converges but to the wrong thing. The whole thing just doesn't work until it suddenly starts to, and very often it's unclear what specifically made it work. This process is akin to multiple hypothesis testing. You ran countless experiments and report the result that looks best and behaves the way you expected it to behave. The underlying problem is, we conflate the software development process required to implement a method with manual hyperparameter search and cherry-picking of results. As a result, our reported "empirical evidence" may be more biased and less reliable than one would hope.</p>

<h4 id="youshouldnotresisttheoryonceitbecomesavailable">you should not resist theory once it becomes available</h4>

<p>I agree with Yann that there is merit in starting to adopt techniques before theory or rigorous analysis becomes available. However, once theoretical insight becomes available, reasoning about empirical performance often continues to trump rigour.</p>

<p>Let me tell you about a pattern I encountered a few times (you might say I'm just bitter about this). The pattern: Someone comes up with an idea, they demonstrate it works very well on some large, complicated problem using a very large neural network and loads of tricks and presumably months of manual tweaking and hyper-parameter search. I find a theoretical problem with the method. People say: it still works well in practice, so I don't see a problem.</p>

<div class="image-div" style="width: 600px;">  
<img src="http://www.inference.vc/content/images/2017/12/GxTnK.gif" alt="Alchemy, Rigour and Engineering">
</div>

<p>This was the kind of response I got to <a href="http://www.inference.vc/scheduled-sampling-for-rnns-scoring-rule-interpretation/">my critique of scheduled sampling</a> and <a href="http://www.inference.vc/comment-on-overcoming-catastrophic-forgetting-in-nns-are-multiple-penalties-needed-2/">my critique of elastic weight consolidation</a>. In both of these cases reviewers pointed out the methods work just fine on "real-world problems", and in the case of scheduled sampling people commented "after all the method came first in a benchmark competition so it must be correct". No, if a method works, but works for the wrong reasons, or for different reasons the authors gave, we have a problem.</p>

<p>You can think of "making a deep learning method work on a dataset" as a statistical test. I would argue that the statistical power of experiments is very weak. We do a lot of things like early stopping, manual tweaking of hyperparameters, running multiple experiments and only reporting the best results. We probably all know we should not be doing these things when testing hypotheses. Yet, these practices are considered fine when reporting empirical results in ML papers. Many go on and consider these reported empirical results as "strong empirical evidence" in favour of a method.</p>
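<p>The selection effect is easy to simulate: if each run's measured score is a noisy draw around the method's true performance and we only report the best of many runs, the reported number is systematically optimistic. All numbers below are made up for illustration:</p>

```python
import numpy as np

rng = np.random.default_rng(4)

# each run's measured accuracy is a noisy draw around the true value
true_accuracy = 0.80
n_runs, noise = 20, 0.02

# 10000 repetitions of "run 20 experiments, report the best one"
runs = true_accuracy + noise * rng.normal(size=(10000, n_runs))
reported = runs.max(axis=1)

print(runs.mean(), reported.mean())  # honest average vs reported best-of-20
```

The honest average recovers the true performance, while the best-of-20 number is biased upwards by several points of "accuracy", without any change to the method.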

<h2 id="summary">Summary</h2>

<p>I want to thank Ali for giving this talk. Yes, it was confrontational. Insulting? I think provocative is a better word. It was at least somewhat controversial, apparently. It contains a few points I disagree with. But I do not think it was wrong.</p>

<p>It touched upon a lot of problems which I think should be recognized and appreciated by the community. Rigour is not about learning theory, convergence guarantees, bounds or theorem proving. Intellectual rigour applies and can be applied to all of machine learning whether or not we have fully developed mathematical tools for analysis.</p>

<p>Rigour means being thorough, exhaustive, meticulous. It includes good practices like honestly describing the potential weaknesses of a method; thinking about what might go wrong; designing experiments which highlight and analyze these weaknesses; making predictions about your algorithm's behaviour in certain toy cases and demonstrating empirically that it indeed behaves as expected; refusing to use unjustified evaluation methods; accepting and addressing criticism. All of these should apply to machine learning, deep or not, and indeed they apply to engineering as a whole.</p>]]></content:encoded></item><item><title><![CDATA[Grosse's challenge: duality and exponential families]]></title><description><![CDATA[<p>I wrote this post in response to a challenge by Roger Grosse:</p>

<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">Theorem 2 of my moment averaging paper is one of the most surprising mathematical results I&#39;ve seen in machine learning. The proof is very short, but I still have no intuition for why it&#39;s</p></blockquote>]]></description><link>http://www.inference.vc/grosses-challenge/</link><guid isPermaLink="false">0d0a8fac-f3df-4a1c-8f89-c6a3444725c0</guid><dc:creator><![CDATA[Ferenc Huszar]]></dc:creator><pubDate>Wed, 29 Nov 2017 13:41:20 GMT</pubDate><media:content url="http://www.inference.vc/content/images/2017/11/download-3-2.png" medium="image"/><content:encoded><![CDATA[<img src="http://www.inference.vc/content/images/2017/11/download-3-2.png" alt="Grosse's challenge: duality and exponential families"><p>I wrote this post in response to a challenge by Roger Grosse:</p>

<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">Theorem 2 of my moment averaging paper is one of the most surprising mathematical results I&#39;ve seen in machine learning. The proof is very short, but I still have no intuition for why it&#39;s true. $500 reward for a clear &amp; convincing intuitive explanation.<a href="https://t.co/WjAMg3qIRm">https://t.co/WjAMg3qIRm</a></p>&mdash; Roger Grosse (@RogerGrosse) <a href="https://twitter.com/RogerGrosse/status/935176163542126592?ref_src=twsrc%5Etfw">November 27, 2017</a></blockquote>  

<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

<p>As a consequence of responding to a question, this will be a bit of an obscure post, and it may not be very meaningful outside the context of annealed importance sampling. Nevertheless, the post gives me a chance to talk about some interesting aspects of Bregman divergences and exponential families, and hopefully teach you some new things you may not have come across before.</p>

<ul>
<li>Grosse, Maddison and Salakhutdinov (NIPS 2013) <a href="http://papers.nips.cc/paper/4879-annealing-between-distributions-by-averaging-moments.pdf">Annealing between distributions by averaging moments</a></li>
</ul>

<p>If you look further down the thread below Roger's tweet, it seems <a href="https://twitter.com/AIActorCritic">@AIActorCritic</a> started answering the question in terms of duality of parametrizations - my post also generalizes things along the lines of duality but goes into a bit more detail.</p>

<h3 id="summary">Summary</h3>

<ul>
<li>I start with a summary of the results Roger referred to in his tweet.</li>
<li>Roger found it surprising that Eqns (18) and (19) give the same result.</li>
<li>I first note that the RHS of Eqns. (18) and (19) is not just something arbitrary, but it is actually the Jeffreys divergence (symmetrized KL) between the endpoints of the path. This gives us hope that there's maybe a more general phenomenon at work here.</li>
<li>I will then generalize the results by replacing the KL divergence with a more general class of Bregman divergences, and show that the integral of infinitesimal divergences along a linear path will always yield the symmetrized divergence between the endpoints.</li>
<li>I still lack intuition as to why this actually happens, but I will draw some pretty pictures about Bregman divergences nevertheless.</li>
<li>Then I note the equivalence of Bregman divergences under convex duality, and show that Eqns. (18) and (19) are special cases of the general result for a dual pair of Bregman divergences, both equivalent to the KL divergence. This explains why the two paths give equivalent results.</li>
<li>Finally, I touch upon how this general Bregman divergence formulation might provide a way to analyze linear interpolation in other parameter spaces. </li>
</ul>

<h2 id="introduction">Introduction</h2>

<p>Context: the authors analyze the bias of an Annealed Importance Sampling (AIS) estimator, whose goal is to estimate the normalizing constant of an intractable distribution $p$. AIS constructs a path in the space of probability distributions starting from a tractable distribution $q$ and ending at $p$ via a sequence of intermediate distributions $q=p_0,p_1, \ldots, p_{K-1}, p_K=p$. This path is then used to construct an estimator of the log-partition function (normalizing constant) of $p$. Crucially, the bias of this estimator depends on, among other things, the particular path we take between the two distributions: there are multiple paths between $q$ and $p$, and these paths can result in lower or higher bias. Under some simplifying assumptions the bias is given by:</p>

<p>$$
\delta = \sum_{k=0}^{K-1} D_{KL}[p_k\|p_{k+1}].
$$</p>

<p>It is not hard to show that as $K\rightarrow \infty$, this bias vanishes, but what the paper looks at is its precise asymptotic behaviour, with the main insight being Eqn (4):</p>

<p>$$
K\delta \rightarrow \frac{1}{2}\int_0^1 \dot{\theta}(\beta)^\top G_\theta(\beta)\dot{\theta}(\beta) d\beta =:\mathcal{F}(\gamma),
$$</p>

<p>where $\gamma$ denotes the path as a whole, $\theta(\beta), \beta\in[0,1]$ is the path we trace in parameter space $\theta$, and $G_\theta(\beta)$ is the Fisher information matrix with respect to $\theta$ evaluated at $\theta(\beta)$.</p>

<p>Intriguingly, if you calculate $\mathcal{F}(\gamma)$ along two very different paths in distribution space - a straight line in natural parameter space or a straight line in moments space - you get exactly the same value. This is despite the fact that these paths clearly pass through very different distributions, as illustrated very nicely in Figure 1 of the paper.</p>
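<p>This equivalence is easy to check numerically for, say, Bernoulli distributions: discretize each path, compute $K\sum_k D_{KL}[p_k\|p_{k+1}]$ along the moment-space line and along the natural-parameter line, and compare to the symmetrized KL between the endpoints. The endpoints below are chosen arbitrarily:</p>

```python
import numpy as np

def kl_bernoulli(a, b):
    # KL divergence between Bernoulli(a) and Bernoulli(b)
    return a * np.log(a / b) + (1 - a) * np.log((1 - a) / (1 - b))

def path_cost(mus):
    # K * sum of KLs between consecutive distributions along the path
    K = len(mus) - 1
    return K * np.sum(kl_bernoulli(mus[:-1], mus[1:]))

mu0, mu1 = 0.2, 0.9                      # arbitrary endpoint distributions
eta0, eta1 = np.log(mu0 / (1 - mu0)), np.log(mu1 / (1 - mu1))

beta = np.linspace(0.0, 1.0, 50001)
path_ma = (1 - beta) * mu0 + beta * mu1                         # linear in moments
path_ga = 1 / (1 + np.exp(-((1 - beta) * eta0 + beta * eta1)))  # linear in eta

sym_kl = 0.5 * (kl_bernoulli(mu0, mu1) + kl_bernoulli(mu1, mu0))
print(path_cost(path_ma), path_cost(path_ga), sym_kl)   # all three agree
```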

<p>So the question is: why are these two paths equivalent in this sense? The answer might lie in convex conjugate duality.</p>

<h2 id="firstobservation">First observation</h2>

<p>The first observation we can make is that Eqns. (18) and (19) do not only show that the two paths are equivalent, they also show that the cost can be expressed in terms of the symmetrized KL divergence (sometimes called the Jeffreys divergence) between the endpoints:</p>

<p>$$
\mathcal{F}(\gamma_{MA}) = \mathcal{F}(\gamma_{GA}) = \frac{1}{2}\left( D_{KL}[p_{\beta=0}\|p_{\beta=1}] + D_{KL}[p_{\beta=1}\|p_{\beta=0}] \right).
$$</p>

<p>To prove this, one can simply use the following formula for the $KL$ divergence between two exponential family distributions with natural parameters $\eta'$ and $\eta$:</p>

<p>$$
D_{KL}[p_{\eta}\|p_{\eta'}] = A(\eta') - A(\eta) - s^\top(\eta' - \eta),
$$
where $A(\eta) = \log\mathcal{Z}(\eta)$ is the log-partition function, and $s = \nabla A(\eta)$ denotes the moments of $p_{\eta}$, just like in the paper. This observation is interesting as we may be looking at special cases of a more general result.</p>
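<p>For a concrete check, take Bernoulli distributions in natural form, where $A(\eta) = \log(1 + e^\eta)$ and $\nabla A(\eta) = \sigma(\eta)$ is the mean. The formula can be verified against a direct KL computation (note the direction: with $s = \nabla A(\eta)$ the right-hand side is the KL from $p_\eta$ to $p_{\eta'}$), and the symmetrized divergence takes the pleasingly direction-free form $(\eta' - \eta)^\top(s' - s)$. This little sketch is mine, not from the paper:</p>

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def log_partition(eta):
    # A(eta) = log(1 + e^eta) for a Bernoulli in natural form
    return np.log1p(np.exp(eta))

def kl_direct(mu_a, mu_b):
    # KL between Bernoulli(mu_a) and Bernoulli(mu_b), computed directly
    return mu_a * np.log(mu_a / mu_b) + (1 - mu_a) * np.log((1 - mu_a) / (1 - mu_b))

eta_a, eta_b = -1.0, 2.0                  # arbitrary natural parameters
mu_a, mu_b = sigmoid(eta_a), sigmoid(eta_b)

# with s = grad A(eta_a) = mu_a, the formula gives the KL from p_{eta_a} to p_{eta_b}
kl_formula = log_partition(eta_b) - log_partition(eta_a) - mu_a * (eta_b - eta_a)

# sum of the two KLs: the direction-free form (eta' - eta)(s' - s)
sym_kl = kl_direct(mu_a, mu_b) + kl_direct(mu_b, mu_a)

print(kl_formula, kl_direct(mu_a, mu_b))        # equal
print(sym_kl, (eta_b - eta_a) * (mu_b - mu_a))  # equal
```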

<h2 id="generalizationtobregmandivergences">Generalization to Bregman divergences</h2>

<p>Any convex functional $H$ over a convex domain induces a Bregman divergence, defined between two points $p$ and $q$ as:</p>

<p>$$
D_H(p\|q) = H(p) - H(q) - \langle\nabla H(q), p-q \rangle,
$$
where $\langle \cdot,\cdot \rangle$ denotes an inner product. I used this notation rather than vector products to emphasize the fact that $p$ and $q$ might be functions or probability distributions, not necessarily just finite dimensional vectors. In what follows I switch back to the usual vector notation.</p>

<p>Examples of Bregman divergences include the KL divergence between distributions, the squared Euclidean distance between points in a Euclidean space, the maximum mean discrepancy between probability distributions, and many more. To read more about them, I recommend <a href="http://mark.reid.name/blog/meet-the-bregman-divergences.html">Mark Reid's great summary</a>.</p>
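<p>In code, a Bregman divergence is a one-liner, and the two best-known special cases drop out immediately. A quick sketch with arbitrary points (the helper names are mine):</p>

```python
import numpy as np

def bregman(H, grad_H, p, q):
    # D_H(p||q) = H(p) - H(q) - <grad H(q), p - q>
    return H(p) - H(q) - grad_H(q) @ (p - q)

# H(p) = ||p||^2 induces the squared Euclidean distance
H_sq = lambda p: p @ p
grad_sq = lambda p: 2 * p

# negative entropy induces the KL divergence (for normalized p, q)
H_ent = lambda p: np.sum(p * np.log(p))
grad_ent = lambda p: np.log(p) + 1

p = np.array([0.2, 0.3, 0.5])
q = np.array([0.4, 0.4, 0.2])

print(bregman(H_sq, grad_sq, p, q), np.sum((p - q) ** 2))         # equal
print(bregman(H_ent, grad_ent, p, q), np.sum(p * np.log(p / q)))  # equal
```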

<p>We can replace the KL divergence in $\delta$ by an arbitrary Bregman divergence, which yields the following generalization of Eqn. (4):</p>

<p>\begin{align}
K \sum_{k=0}^{K-1}D_H(p_k\|p_{k+1}) &amp;\rightarrow \frac{1}{2}\int_0^1 \dot{p}(\beta)^\top \nabla^2 H(\beta) \dot{p}(\beta) d\beta\\
&amp;= \frac{1}{2}\int_0^1 \dot{\nabla H}(\beta)^\top \dot{p}(\beta) d\beta,
\end{align}
where $\dot{\nabla H}(\beta)$ is the derivative of $\nabla H(p(\beta))$ with respect to $\beta$, and the second line was obtained by applying the chain rule $\dot{\nabla H}(\beta) = \nabla^2 H(\beta)\,\dot{p}(\beta)$. $\nabla^2 H(\beta)$ is called the metric tensor, and it is a generalization of the Fisher information matrix $G_\theta$ from before.</p>

<p>It turns out, if we interpolate in $p$-space linearly, that is, trace a path where $p(\beta) = \beta p_1 + (1 - \beta) p_0$, we can express the above integral analytically in terms of the endpoints of the path $p_0$ and $p_1$:</p>

<p>\begin{align}
 \frac{1}{2}\int_0^1 \dot{\nabla H}(\beta)^\top\dot{p}(\beta) d\beta &amp;= \frac{1}{2}\int_0^1 \dot{\nabla H}(\beta)^\top d\beta (p_1 - p_0) \\
&amp;= \frac{1}{2} \left( \nabla H(p_1) - \nabla H(p_0) \right)^\top \left( p_1 - p_0 \right)\\
&amp;= \frac{1}{2}\left( D_H(p_1\|p_0) + D_H(p_0\|p_1) \right).
\end{align}</p>

<p>This final expression is the symmetrized Bregman divergence between the endpoints of the path $p_1$ and $p_0$, a generalization of the symmetrised KL that we found before.</p>
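<p>A numerical sanity check of this result, with an arbitrary made-up convex potential: discretizing the linear path and computing $K \sum_k D_H(p_k \| p_{k+1})$ indeed recovers the symmetrized divergence between the endpoints:</p>

```python
import numpy as np

def bregman(H, grad_H, p, q):
    # D_H(p||q) = H(p) - H(q) - <grad H(q), p - q>
    return H(p) - H(q) - grad_H(q) @ (p - q)

# an arbitrary smooth convex potential, made up for the check
H = lambda p: np.sum(np.exp(p))
grad_H = lambda p: np.exp(p)

p0 = np.array([0.0, 1.0, -0.5])
p1 = np.array([1.0, -1.0, 0.5])

K = 20000
path = [(1 - b) * p0 + b * p1 for b in np.linspace(0.0, 1.0, K + 1)]
cost = K * sum(bregman(H, grad_H, path[k], path[k + 1]) for k in range(K))

sym = 0.5 * (bregman(H, grad_H, p0, p1) + bregman(H, grad_H, p1, p0))
print(cost, sym)   # the discretized path cost approaches the symmetrized divergence
```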

<p>Before showing how this general result is useful to explain the equivalence of Eqns. (18) and (19), I want to take some time to draw some pretty pictures visualizing Bregman divergences, the quantities we are dealing with.</p>

<h3 id="abitofgraphicalintuition">A bit of graphical intuition</h3>

<p>Below is a graphical illustration of a Bregman divergence between two points $p$ and $q$ with respect to some convex potential $H$.</p>

<p><img src="http://www.inference.vc/content/images/2017/11/download-1.png" alt="Grosse's challenge: duality and exponential families"></p>

<p>Here's the story: Poppy and Quentin are about to get married. It is bad luck to see each other all dressed up before the wedding. Therefore, they decide to spend the morning on a convex hill (convex from the top). Poppy is at coordinates $p$ and Quentin at $q\neq p$. The surface of the hill is described by $-H$ where $H$ is a convex function. Precisely because the hill is convex, Quentin can't see Poppy, not unless she starts jumping. The Bregman divergence describes the safe height Poppy is allowed to jump up without Quentin seeing her. The hillier the hill between them - the higher the <em>curvature</em> - the higher Poppy can jump without being seen by Quentin.</p>

<p>For this figure I used a boring parabolic hill, $-H(p) = p(1-p)$. The resulting divergence actually ends up symmetric and is simply proportional to the squared Euclidean distance $(p-q)^2$:</p>

<p><img src="http://www.inference.vc/content/images/2017/11/download-2.png" alt="Grosse's challenge: duality and exponential families"></p>

<p>But this is the exception rather than the rule: most Bregman divergences are asymmetric. For the hill $-H(p) = p(1-p^3)$, this is what the picture looks like:</p>

<p><img src="http://www.inference.vc/content/images/2017/11/download-3.png" alt="Grosse's challenge: duality and exponential families"></p>

<p>OK, so what did we just prove before? Suppose we have a sequence of bridesmaids equally spaced between Poppy and Quentin... Actually, let's stop the wedding analogy before it goes too far.</p>

<p>Consider a sequence of points $p_0,\ldots,p_K$ equally spaced between $p=p_0$ and $q=p_K$. Now look at the sum of divergences between subsequent points, $\sum_k D[p_k\|p_{k+1}]$. The divergences say how high each of the $p_k$ has to jump to see $p_{k+1}$. For $K=3$ we are interested in the sum of the red line segments:</p>

<p><img src="http://www.inference.vc/content/images/2017/11/download-5.png" alt="Grosse's challenge: duality and exponential families"></p>

<p>For $K=5$:</p>

<p><img src="http://www.inference.vc/content/images/2017/11/download-7.png" alt="Grosse's challenge: duality and exponential families"></p>

<p>For $K=12$:</p>

<p><img src="http://www.inference.vc/content/images/2017/11/download-9.png" alt="Grosse's challenge: duality and exponential families"></p>

<p>It is pretty clear that the line segments get shorter and shorter, and indeed their sum converges to $0$. What we have proved is that asymptotically this sum behaves like $\frac{1}{2K}\left(D[p\|q] + D[q\|p]\right)$. Why this is the case is still a mystery to me, and I don't know how to even visualise this. But I wouldn't be surprised if there turned out to be a good reason for it.</p>
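<p>The asymptotic claim is easy to check numerically. Here is a small sketch; the choice of potential (negative binary entropy) and of the endpoints are arbitrary choices of mine:</p>

```python
import numpy as np

# Check that K * sum_k D_H[p_k || p_{k+1}] along a linear path with K
# segments approaches (D_H[p||q] + D_H[q||p]) / 2 as K grows.

def H(p):                  # negative binary entropy: a convex potential
    return p * np.log(p) + (1 - p) * np.log(1 - p)

def grad_H(p):
    return np.log(p) - np.log(1 - p)

def bregman(a, b):         # D_H[a || b] = H(a) - H(b) - H'(b) * (a - b)
    return H(a) - H(b) - grad_H(b) * (a - b)

p, q = 0.2, 0.7
target = 0.5 * (bregman(p, q) + bregman(q, p))

for K in [4, 16, 64, 256]:
    path = np.linspace(p, q, K + 1)
    total = sum(bregman(path[k], path[k + 1]) for k in range(K))
    print(K, K * total, target)  # K * total approaches target
```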

<h2 id="dualbregmandivergences">Dual Bregman divergences</h2>

<p>After a bit of an excursion into visualising Bregman divergences in one dimension, let's go back to the question of why linear interpolation in <em>natural parameter</em> space and in <em>moment</em> space gives the same $\mathcal{F}(\gamma)$. The answer lies in duality.</p>

<p>Any convex function $H$ defines a Bregman divergence on a convex domain. Any convex function $H$ also has a convex conjugate $H^\ast$. This convex conjugate, $H^\ast$, also defines a Bregman divergence on its own domain, which is generally different from the domain of $H$. Furthermore, this <em>dual</em> divergence is equivalent to the original in the following sense:</p>

<p>$$
D_{H^{\ast}}[q^{\ast}\|p^{\ast}] = D_{H}[p\|q], <br>
$$
where $p^{\ast} = \nabla H(p)$ and $q^{\ast} = \nabla H(q)$ are called the <em>dual parameters</em> corresponding to $p$ and $q$, respectively. Note that the order of the arguments is swapped in the dual divergence. The mapping between parameters and dual parameters is one-to-one, thanks to convexity. $p$ and $q$ are sometimes called the primal parameters, but this distinction is rather arbitrary as conjugate duality is a symmetric relationship.</p>

<p>With an understanding of duality, let's look at the formula that we obtained before:</p>

<p>\begin{align}
K \sum_{k=0}^{K-1}D_H(p_k\|p_{k+1}) &amp;\rightarrow \frac{1}{2}\int_0^1 \dot{\nabla H}(\beta)^\top \dot{p}(\beta) d\beta \\
&amp;= \frac{1}{2}\int_0^1 \dot{p^\ast}(\beta)^\top \dot{p}(\beta) d\beta\\
&amp;= \frac{1}{2}\left(p^\ast(1) - p^\ast(0)\right)^\top\left(p(1) - p(0)\right).
\end{align}
Observe the bilinearity of this formula with respect to the primal and dual parameters.</p>

<h4 id="conjugatedualityinexponentialfamilies">Conjugate duality in exponential families</h4>

<p>Let's now consider an exponential family of the form:</p>

<p>$$
p(x\vert \eta) = h(x)\exp(\eta^\top g(x) - A(\eta)) <br>
$$</p>

<p>where $A(\eta)=\log\mathcal{Z}(\eta)$ is the log partition function. As $A$ is a convex function of $\eta$, we can define a Bregman divergence in the coordinate system of natural parameters $\eta$ induced by the (convex) log-partition function $A$:</p>

<p>$$
D_A[\eta'\|\eta] = A(\eta') -  A(\eta) - \nabla A(\eta)^T\left( \eta' - \eta\right). <br>
$$</p>

<p>(Notice how similar this is to the KL divergence expression I used before.)</p>

<p>The dual parameter corresponding to $\eta$ turns out to be the moment parameter $s$:</p>

<p>$$
\eta^\ast = \nabla A (\eta) = \mathbb{E}_\eta[g(x)] = s.
$$</p>

<p>The equality in the middle is a well-known property of exponential families, the proof is similar to how you would obtain the REINFORCE gradient estimator, for example. Using this equality we can rewrite the divergence above as:</p>

<p>$$
D_A[\eta'\|\eta] = A(\eta') -  A(\eta) - s^T\left( \eta' - \eta\right). <br>
$$</p>

<p>If we consider the natural parameters $\eta$ the primal parameters, the moments $s$ are the dual parameters. As always, there is a one-to-one mapping between primal and dual parameters. Using the convex conjugate $A^{\ast}$, we can therefore define another Bregman divergence in the coordinate system of moments $s$ as follows:</p>

<p>\begin{align}
D_{A^\ast}[s'\|s] &amp;= A^\ast(s') -  A^\ast(s) - \nabla A^\ast(s)^T\left(s' - s\right)\\ <br>
      &amp;= A^\ast(s') -  A^\ast(s) - \eta^T\left(s' - s\right)
\end{align}</p>

<p>As it turns out, this convex conjugate $A^\ast$ is the negative Shannon entropy of the distribution, parametrized by its moments $s$ (it's beyond this post to show this; see e.g. <a href="http://onlinelibrary.wiley.com/doi/10.1002/9781118857281.ch9/summary">this book chapter</a>).</p>

<p>So now we have two Bregman divergences, one in the primal space of $\eta$ and one in the dual space of $s$. These two divergences are equivalent, and in this case they are also both equivalent to the Kullback-Leibler divergence between the corresponding distributions:</p>

<p>$$
D_{A}[\eta'\|\eta] = D_{A^\ast}[s\|s'] = D_{KL}[p\|p'] <br>
$$</p>

<p>So, putting these things together, we can now understand why Eqns. (18) and (19) give the same result. If we interpolate linearly in primal space between $\eta(0)$ and $\eta(1)$ we get that the loss of the path is:</p>

<p>$$
\mathcal{F}(\gamma_{GA}) = \frac{1}{2}\left( D_A[\eta(1)\|\eta(0)] + D_A[\eta(0)\|\eta(1)]\right)
$$</p>

<p>Similarly, interpolating between $s(0)$ and $s(1)$ in moment space we obtain:</p>

<p>$$
\mathcal{F}(\gamma_{MA}) = \frac{1}{2}\left( D_{A^\ast}[s(1)\|s(0)] + D_{A^\ast}[s(0)\|s(1)]\right).
$$</p>

<p>And by the duality of $A$ and $A^\ast$, the two are equal, and also equal to the symmetrized KL divergence between the distributions.</p>

<p>Interestingly, this equivalence also suggests that linear interpolation in distribution space, $p(\beta) = \beta p_1 + (1 - \beta) p_0$, would give the same result. This would interpolate between $p_0$ and $p_1$ via mixtures of the two distributions. This option may not be of practical relevance though.</p>

<h2 id="interpolationinotherparameterspaces">Interpolation in other parameter-spaces</h2>

<p>This framework might allow us to study linear interpolation in parameter spaces $\theta$ which are related to the natural parameters by some map $\eta = f(\theta)$. So long as $f$ is invertible and the composition $A \circ f$ is convex, we can define a Bregman divergence in $\theta$-space using $A \circ f$ as the potential. I'm not sure if this would actually work out or whether it would yield any insights.</p>

<h2 id="summary">Summary</h2>

<p>This was my attempt at providing more insight into Theorem 2 of Roger's paper, which he found very surprising. My post probably does little to make the result look less surprising; it merely shifts the surprise elsewhere. The story goes as follows:</p>

<ol>
<li>Integrating the Bregman divergence along a linear path, we obtain the symmetrized Bregman divergence between the endpoints of the path.  </li>
<li>For exponential families, the KL divergence can be equivalently expressed as a Bregman divergence in natural parameter space or as a Bregman divergence in moment space. The scalar potentials of these divergences are conjugate duals of each other, hence their equivalence.  </li>
<li>Therefore, integrating the KL divergence along a linear path in either $\eta$- or $s$-space yields the same result, due to the equivalence of the dual Bregman divergences.</li>
</ol>]]></content:encoded></item><item><title><![CDATA[Thanksgiving Special 🦃: GANs are Being Fixed in More than One Way]]></title><description><![CDATA[<p>In the spirit of thanksgiving, let me start by thanking all the active commenters on my blog: you are always very quick to point out typos, flaws and references to literature I overlooked.</p>

<p>This post is basically a follow-up my earlier post, "<a href="http://www.inference.vc/my-notes-on-the-numerics-of-gans/">GANs are broken in more than one way</a></p>]]></description><link>http://www.inference.vc/gans-are-being-fixed-in-more-than-one-way/</link><guid isPermaLink="false">bb4f311d-e633-49e2-b94e-3d69d1556c81</guid><dc:creator><![CDATA[Ferenc Huszar]]></dc:creator><pubDate>Thu, 23 Nov 2017 15:14:38 GMT</pubDate><media:content url="http://www.inference.vc/content/images/2017/11/Screen-Shot-2017-11-23-at-3.13.40-PM.png" medium="image"/><content:encoded><![CDATA[<img src="http://www.inference.vc/content/images/2017/11/Screen-Shot-2017-11-23-at-3.13.40-PM.png" alt="Thanksgiving Special 🦃: GANs are Being Fixed in More than One Way"><p>In the spirit of thanksgiving, let me start by thanking all the active commenters on my blog: you are always very quick to point out typos, flaws and references to literature I overlooked.</p>

<p>This post is basically a follow-up to my earlier post, "<a href="http://www.inference.vc/my-notes-on-the-numerics-of-gans/">GANs are broken in more than one way</a>". In this one, I review and recommend some additional references people pointed me to after the post was published. It turns out, unsurprisingly, that there is a lot more work on these questions than I was aware of. Although GANs are kind of broken, they are also actively being fixed, in more than one way.</p>

<h3 id="thiswastalkedaboutbefore">This was talked about before</h3>

<p>In my post I focussed on potential problems that arise when we apply simultaneous gradient descent on a non-conservative vector field, or equivalently, a non-cooperative game.</p>

<p>These problems have been pointed out by a number of authors before. As one example, <a href="https://arxiv.org/abs/1606.03498">(Salimans et al, 2016)</a> talked about this and in fact used the same pathological example I used in my plot: the one which gives rise to the constant curl field:</p>

<p><img src="http://www.inference.vc/content/images/2017/11/foo.gif" alt="Thanksgiving Special 🦃: GANs are Being Fixed in More than One Way"></p>

<p><a href="https://arxiv.org/abs/1610.01945">Pfau and Vinyals (2016)</a> pointed out connections between GANs and actor critic methods. GANs and actor-critic methods are both notoriously hard to optimize, and the authors argue that these difficulties arise because of the same reasons. This means that two sets of people have been struggling to solve the same underlying problems, so it is reasonable to hope that we can learn from each other's innovations.</p>

<h2 id="wedontalwaysusesimultaneousgradientdescent">We don't always use simultaneous gradient descent</h2>

<p>Many people pointed out that when using different variants of GANs, we don't actually use simultaneous gradient descent. Notably, in <a href="https://arxiv.org/abs/1701.07875">WGANs</a>, and also often in normal GANs when using instance noise, we always optimize the discriminator until convergence before taking a step with the generator. This nested-loop algorithm has very different behaviour from simultaneous gradient descent. Indeed, you can argue it converges, inasmuch as the outer loop which optimizes the generator can be interpreted as vanilla gradient descent on a lower bound. Of course, this convergence only happens in practice if the optimal discriminator is unique with respect to the generator, or at least if the solution is stable enough that the outer loop indeed evaluates gradients of a deterministic scalar function.</p>

<p><a href="https://arxiv.org/abs/1706.08500">(Heusel et al, 2017)</a> propose another optimization strategy relying on updating the discriminator and generator at different time scales/learning rates. The authors prove that this algorithm always converges to a local Nash equilibrium, and this property holds even when using the Adam optimizer on top. This algorithm was later used successfully for training <a href="https://arxiv.org/abs/1708.08819">Coulomb GANs</a>.</p>

<h2 id="gradientdescentganoptimizationislocallystable">Gradient Descent GAN optimization is locally stable</h2>

<p>In another great NIPS paper, <a href="https://arxiv.org/abs/1706.04156">(Nagarajan and Kolter, 2017)</a> study the convergence properties of simultaneous gradient descent for various GAN settings. Using similar apparatus to the <a href="https://arxiv.org/abs/1705.10461">Numerics of GANs</a> paper, they show that simultaneous GD is actually stable and convergent in the local neighborhood of the Nash equilibrium. On the other hand, it is also shown that WGANs have non-convergent limit cycles around the equilibrium (these are vector fields that make you go around in circles, like the example above). So, in WGANs we have a real problem with gradient descent. Based on these observations, the authors go on to define a regularization strategy that stabilizes gradient descent. Highly recommended paper.</p>]]></content:encoded></item><item><title><![CDATA[A Cookbook for Machine Learning: Vol 1]]></title><description><![CDATA[<p>This was a busy week, I had no time to read anything new, so I'm sharing a note that I wrote for myself, for no other reason than to understand things better. It's a kind of cookbook of various "transformations" you can apply to a machine learning problem to eventually</p>]]></description><link>http://www.inference.vc/design-patterns/</link><guid isPermaLink="false">51417300-aef2-4a03-8877-e760ccc46218</guid><dc:creator><![CDATA[Ferenc Huszar]]></dc:creator><pubDate>Thu, 16 Nov 2017 15:23:17 GMT</pubDate><content:encoded><![CDATA[<p>This was a busy week, I had no time to read anything new, so I'm sharing a note that I wrote for myself, for no other reason than to understand things better. It's a kind of cookbook of various "transformations" you can apply to a machine learning problem to eventually turn it into something we know how to solve: seeking stable attractors of a tractable vector field.</p>

<p>The typical setup is: you have some model parameters $\theta$. You seek to optimize some objective criterion, but the optimization problem is intractable or hard in one of the ways listed below. You then apply the corresponding transformation to your problem if you can. If your problem is now one you can efficiently optimize, great. If not, you can recursively apply the transformations until it is.</p>

<p>UPDATE: Although I called this post a cookbook, as readers so rightly pointed out, it is too light on details to be considered a cookbook. Consider it as a demonstration of a way of thinking about machine learning research as a compiler that compiles an abstract machine learning objective into the canonical optimization problem of finding stable attractors of a tractable vector field.</p>

<p>For the first batch, I have written up the following problem transformations:</p>

<ul>
<li><a href="http://www.inference.vc/design-patterns/#variationalupperbounds">Variational bounds</a></li>
<li><a href="http://www.inference.vc/design-patterns/#adversarialgames">Adversarial games</a></li>
<li><a href="http://www.inference.vc/design-patterns/#evolutionstrategies">Evolution Strategies</a></li>
<li><a href="http://www.inference.vc/design-patterns/#convexrelaxation">Convex relaxation</a></li>
</ul>

<p>There are many more transformations not included here, like the duality principle, half-quadratic splitting, Lagrangian multipliers, etc. Feel free to leave comments about what else I should include next.</p>

<h2 id="variationalbounds">Variational bounds</h2>

<h4 id="typicalproblem">Typical problem:</h4>

<p>My loss function $f(\theta)$ is intractable to compute, typically because it involves an intractable marginalization. I can't evaluate it, let alone minimize it.</p>

<h4 id="solution">Solution:</h4>

<p>Let's construct a family of - typically differentiable - upper-bounds: <br>
$$
f(\theta) \leq \inf_\psi g(\theta, \psi), <br>
$$
and solve the optimization problem <br>
$$
\theta^\ast, \psi^\ast \leftarrow \operatorname{argmin}_{\theta, \psi} g(\theta, \psi)
$$
instead. Technically, once optimization is finished, you can discard the auxiliary parameter $\psi^\ast$ - although it often turns out to be meaningful and useful in itself, for example for approximate inference, as in the recognition model of VAEs.</p>

<h4 id="tricksofthetrade">Tricks of the trade:</h4>

<p><em>Jensen's inequality:</em> The mean value of a convex function is never lower than the value of the convex function applied to the mean.  Generally appears in some variant of the standard evidence lower bound (ELBO) derivation below:</p>

<p>\begin{align}
- \log p(x) &amp;= - \log \int p(x,y) dy \\
&amp;= - \log \int q(y\vert x) \frac{p(y,x)}{q(y\vert x)}dy \\
&amp;\leq - \int q(y\vert x) \log \frac{p(y,x)}{q(y\vert x)} dy
\end{align}</p>
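<p>As a minimal numeric illustration of this bound (the joint probabilities below are made up), take a model with a discrete latent $y$ taking three values at a fixed observation $x$:</p>

```python
import numpy as np

# Check that -log p(x) <= -sum_y q(y|x) log(p(y, x) / q(y|x)) for any q.
p_joint = np.array([0.10, 0.25, 0.05])   # p(y, x) for y = 0, 1, 2
q = np.array([0.2, 0.5, 0.3])            # an arbitrary variational posterior

neg_log_px = -np.log(p_joint.sum())
neg_elbo = -np.sum(q * np.log(p_joint / q))
print(neg_log_px, neg_elbo)  # the first never exceeds the second
```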

<p><em>Reparametrization trick:</em> In variational inference we often encounter gradients of the form:
$$
\frac{\partial}{\partial \theta_i}\mathbb{E}_{x \sim q_\theta}\left[f(x, q_\theta(x))\right],
$$
where the pdf of the variable appears in the integrand. If we can find a function $h:(\mathcal{E}, \Theta)\mapsto \mathcal{X}$ which is differentiable w.r.t. its second argument, and probability distribution $p_\epsilon$ over $\mathcal{E}$ which is easy to sample from, such that the following holds: <br>
$$
x = h(\epsilon, \theta), \epsilon \sim p_\epsilon \iff x \sim q_{\theta}, <br>
$$
We can use the following reformulation of integrals we often encounter in variational upper bounds: <br>
$$
\frac{\partial}{\partial \theta_i}\mathbb{E}_{x \sim q_\theta}\left[f(x, q_\theta(x))\right] = \mathbb{E}_{\epsilon \sim p_\epsilon}\left[ \frac{\partial}{\partial \theta_i} f(h(\epsilon, \theta), q_\theta(h(\epsilon, \theta)))\right]
$$
Monte Carlo estimators of this expectation typically have substantially lower variance than REINFORCE estimators of the same quantity.</p>
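<p>To make the variance claim concrete, here is a toy comparison of the two estimators for $\frac{\partial}{\partial \theta}\mathbb{E}_{x \sim \mathcal{N}(\theta, 1)}[x^2]$, whose true value is $2\theta$ (the target function and all numbers are my own toy choices):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n = 1.5, 100_000
eps = rng.standard_normal(n)   # x = theta + eps reparametrizes N(theta, 1)

# Reparametrization estimator: d/dtheta (theta + eps)^2 = 2 * (theta + eps)
reparam = 2 * (theta + eps)

# REINFORCE estimator: f(x) * d/dtheta log N(x; theta, 1) = x^2 * (x - theta)
x = theta + eps
reinforce = x**2 * (x - theta)

print(reparam.mean(), reparam.var())      # mean close to 2*theta = 3
print(reinforce.mean(), reinforce.var())  # same mean, much larger variance
```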

<h2 id="adversarialgames">Adversarial games</h2>

<h4 id="typicalproblem">Typical problem:</h4>

<p>I can't estimate my loss function $f(\theta)$ directly from samples, typically because the loss function depends on the pdf of the data distribution or that of my model, or both.</p>

<h4 id="solution">Solution:</h4>

<p>We can sometimes construct an approximation so that <br>
$$
f(\theta) \approx h(\theta, \operatorname{argmin}_\psi g(\theta, \psi)), <br>
$$
and then we can solve the problem of finding a stable equilibrium of the two-player game where players minimize losses $g$ and $h$ with respect to $\psi$ and $\theta$, respectively. <br>
Sometimes, the approximations may come in the form of lower-bounds, which is the case when $h=-g$: <br>
$$
f(\theta) \geq \sup_\psi g(\theta, \psi), <br>
$$
in which case we can solve the following minimax problem instead: <br>
$$
\theta^\ast \leftarrow \operatorname{argmin}_{\theta} \max_{\psi} g(\theta, \psi)
$$</p>

<h4 id="tricksofthetrade">Tricks of the trade:</h4>

<p><em>Bayes optimality in auxiliary tasks:</em> When your loss function depends on densities of probability distributions you can easily sample from, often you can construct an auxiliary task, whose Bayes optimal solution depends on values of the densities in question. Examples of auxiliary tasks are: binary classification for likelihood ratio estimation, denoising or score matching for estimating the score function.</p>

<p><em>Convex conjugate:</em> In case your loss function involves a convex functional of the densities (such as in f-divergences), you can transform your problem by re-expressing it in terms of the convex conjugate. The expression for $f$ in terms of its convex conjugate $f^{\ast}$ is:
\begin{align}
f(u) &amp;= \sup_{v \in \operatorname{dom}(f^{\ast})} \{ \langle u, v \rangle - f^{\ast}(v) \} \\
&amp;\geq \sup_{\psi} \{ \langle u, v_\psi \rangle - f^{\ast}(v_\psi) \}.
\end{align}
Note that if $u$ is a density function, then the inner product $\langle u,v_\psi\rangle$ is an expectation of $v_\psi$, which can be approximated by Monte Carlo sampling.</p>

<h2 id="evolutionstrategies">Evolution strategies</h2>

<h4 id="typicalproblem">Typical problem</h4>

<p>My $f(\theta)$ is easy to evaluate but hard to optimize, perhaps because it contains discrete operations, or the function is piecewise constant. Backpropagation is not possible.</p>

<h4 id="solution">Solution</h4>

<p>Observe that for any probability distribution $p_\psi$ over $\theta$ the following holds: <br>
$$
\min_{\theta}f(\theta) \leq \min_{\psi} \mathbb{E}_{\theta\sim p_\psi} f(\theta)
$$
therefore in ES we concentrate on the following optimization problem instead: <br>
$$
\psi^{\ast} \leftarrow \operatorname{argmin}_{\psi} \mathbb{E}_{\theta\sim p_\psi} f(\theta)
$$
Often, depending on the function $f$ and the class of distributions $p_\psi$, a local minimum of $f$ can be recovered from a local minimum $\psi^{\ast}$ of the surrogate objective.</p>

<h4 id="tricksofthetrade">Tricks of the trade</h4>

<p><em>REINFORCE gradient estimator:</em> It relies on the following trick
$$
\frac{\partial}{\partial \psi_i} \mathbb{E}_{\theta\sim p_\psi}[f(\theta)] = \mathbb{E}_{\theta \sim p_\psi}\left[\frac{\partial}{\partial \psi_i}\log p_\psi(\theta)f(\theta)\right],
$$
where the RHS can be easily approximated by Monte Carlo. The variance of Monte Carlo REINFORCE estimators tends to be quite high.</p>
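<p>Putting the recipe together, here is a toy sketch of evolution strategies minimizing a piecewise constant objective using the REINFORCE estimator above (the objective, step sizes and sample counts are illustrative choices of mine):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def f(theta):
    # piecewise constant objective: its gradient is zero almost everywhere,
    # so backpropagation would be useless here
    return np.abs(np.round(theta) - 3.0)

mu, sigma, lr = 0.0, 1.0, 0.05     # p_psi is N(mu, sigma^2)
for _ in range(2000):
    eps = rng.standard_normal(200)
    theta = mu + sigma * eps
    # REINFORCE estimate of the gradient of E_{theta ~ p_psi}[f(theta)]
    # with respect to mu: E[f(theta) * (theta - mu) / sigma^2]
    grad_mu = np.mean(f(theta) * eps) / sigma
    mu -= lr * grad_mu

print(mu)  # ends up close to the minimizer at 3.0
```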

<h2 id="convexrelaxation">Convex relaxation</h2>

<h4 id="typicalproblem">Typical problem</h4>

<p>My $f(\theta)$ is hard to optimize, because it has non-differentiable and non-convex components like the $\ell_0$-norm of a vector in sparse methods, or the Heaviside step function in classification.</p>

<h4 id="solution">Solution</h4>

<p>Replace the non-convex component by a convex approximation, turning your objective into a now typically convex $g$ <br>
$$
f(\theta) \approx g(\theta) <br>
$$</p>

<h4 id="tricksofthetrade">Tricks of the trade:</h4>

<p><em>$\ell_1$ loss:</em> In many sparse learning situations we wish to minimize the number of non-zero entries in a vector, known as the $\ell_0$ loss. You can typically replace this by the $\ell_1$ norm of the vector.</p>
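<p>A tiny sketch of why the $\ell_1$ relaxation still promotes sparsity: the $\ell_1$-penalized scalar problem has a closed-form soft-thresholding solution, which sets small coefficients exactly to zero (the numbers are arbitrary):</p>

```python
import numpy as np

# argmin_theta 0.5 * (theta - y)^2 + lam * |theta| is given by
# soft-thresholding, which zeroes out small entries exactly.
def soft_threshold(y, lam):
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

y = np.array([3.0, 0.4, -2.0, 0.1, 0.0])
theta = soft_threshold(y, lam=0.5)
print(theta)  # [ 2.5  0.  -1.5  0.   0. ]
```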

<p><em>hinge loss and large margin methods:</em> The error rate of a binary classifier under the zero-one loss is typically a piecewise constant function of the parameters of the classifier, and is thus hard to optimize. We can replace the zero-one loss with the hinge loss, which can be understood as a convex upper bound on it. The resulting optimization problem will have a tendency to maximize the classifier's margin.</p>]]></content:encoded></item><item><title><![CDATA[Gaussian Distributions are Soap Bubbles]]></title><description><![CDATA[<p>This post is just a quick note on some of the pitfalls we encounter when dealing with high-dimensional problems, even when working with something as simple as a Gaussian distribution.</p>

<p>Last week I <a href="http://www.inference.vc/mixup-data-dependent-data-augmentation/">wrote about <em>mixup</em></a> a new data-augmentation scheme that achieves good performance in a few domains. I illustrated</p>]]></description><link>http://www.inference.vc/high-dimensional-gaussian-distributions-are-soap-bubble/</link><guid isPermaLink="false">43d6712e-d189-4811-9055-25b71b5d7c38</guid><dc:creator><![CDATA[Ferenc Huszar]]></dc:creator><pubDate>Thu, 09 Nov 2017 14:57:57 GMT</pubDate><media:content url="http://www.inference.vc/content/images/2017/11/3066-1.jpg" medium="image"/><content:encoded><![CDATA[<img src="http://www.inference.vc/content/images/2017/11/3066-1.jpg" alt="Gaussian Distributions are Soap Bubbles"><p>This post is just a quick note on some of the pitfalls we encounter when dealing with high-dimensional problems, even when working with something as simple as a Gaussian distribution.</p>

<p>Last week I <a href="http://www.inference.vc/mixup-data-dependent-data-augmentation/">wrote about <em>mixup</em></a>, a new data-augmentation scheme that achieves good performance in a few domains. I illustrated how the method works in a two-dimensional example. While I love understanding problems visually, it can also be problematic, as things work very, very differently in high dimensions. Even something as simple as a Gaussian distribution does.</p>

<h4 id="summaryofthispost">Summary of this post:</h4>

<ul>
<li>in high dimensions, Gaussian distributions are practically indistinguishable from uniform distributions on a sphere whose radius is the square root of the dimension.</li>
<li>when drawing figures with additive Gaussian noise, a useful sanity check is normalizing your Gaussian noise samples and see how that changes the conclusions you would draw.</li>
<li>you should be careful when using the mean or the mode to reason about a distribution in a high-dimensional space (e.g. images). The mode may look nothing like typical samples.</li>
<li>you should also be careful about linear interpolations in latent spaces where you assume a Gaussian distribution.</li>
</ul>

<h2 id="soapbubble">Soap bubble</h2>

<p>One of the properties of Gaussian distributions is that in high dimensions they are almost indistinguishable from a uniform distribution over a sphere. Try drawing samples from a 10k-dimensional standard normal and then calculating their Euclidean norms: they are going to be pretty close to 100, the square root of the number of dimensions. Here's a histogram of the norms of 1000 different samples from a 10k-dimensional standard normal:</p>

<p><img src="http://www.inference.vc/content/images/2017/11/download-90.png" alt="Gaussian Distributions are Soap Bubbles"></p>
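<p>This is easy to reproduce numerically; a small sketch (dimension, sample count and seed are arbitrary):</p>

```python
import numpy as np

# In d dimensions, the norm of a standard normal sample concentrates
# sharply around sqrt(d): for d = 10,000 that's 100, with a spread of
# only about 1/sqrt(2).
rng = np.random.default_rng(42)
d = 10_000
samples = rng.standard_normal((1000, d))
norms = np.linalg.norm(samples, axis=1)
print(norms.mean())  # close to 100
print(norms.std())   # tiny compared to the mean
```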

<p>If I show you a Gaussian distribution in 3D and ask you to find a physical metaphor for it, I think at least some of you would reasonably describe it as something like a ball of mold: it's kind of soft or fluffy but gets more and more dense in the middle. Something like this guy:</p>

<p><img src="http://www.inference.vc/content/images/2017/11/mold.jpg" alt="Gaussian Distributions are Soap Bubbles"></p>

<p>But in high-dimensional spaces, the image you should have in your mind is more like a soap bubble:</p>

<div class="image-div" style="width: 400px;">  
<img src="http://www.inference.vc/content/images/2017/11/3066.jpg" alt="Gaussian Distributions are Soap Bubbles">
</div>

<p>Here's a tool I sometimes use - not nearly often enough - to double check my intuitions: explicitly replace Gaussian distributions with a uniform over the unit ball, or in case of sampling: normalize my Gaussian samples so they all more or less have the same norm. For example, let's see what happens when we look at instance noise or Gaussian data augmentation through this lens:</p>

<p><img src="http://www.inference.vc/content/images/2017/11/download-92.png" alt="Gaussian Distributions are Soap Bubbles"></p>

<p>Oops. That doesn't look very nice, does it? So whenever you think about additive Gaussian noise as convolving a distribution with a smooth Gaussian kernel, try to also think about what happens if you convolved your distribution with a circle instead. In high-dimensional spaces the two are virtually indistinguishable from each other.</p>

<h2 id="modevstypicalsamples">Mode vs typical samples</h2>

<p>Another example of how reasoning can go wrong is looking at the mode of a distribution and expecting it to look like typical random samples drawn from the distribution. Consider images where each pixel is drawn from a Gaussian distribution:</p>

<p><img src="http://www.inference.vc/content/images/2017/11/download-94.png" alt="Gaussian Distributions are Soap Bubbles"></p>

<p>What does the mean or mode of this distribution look like? Well, it's a grey square. It looks nothing like the samples above. What's worse, if instead of white noise you considered correlated noise and then looked at the mode, it could still be the same grey square.</p>

<p>Looking at the mode of the distribution can therefore be misleading. The maximum of a distribution may look nothing like typical samples from it. Consider the beautiful images from the recent <a href="https://distill.pub/2017/feature-visualization/">feature visualization</a> article on Distill. I'm going to steal one of the beautiful pictures here, but please go read and look at the entire thing.</p>

<p><img src="http://www.inference.vc/content/images/2017/11/Screen-Shot-2017-11-09-at-2.12.44-PM.png" alt="Gaussian Distributions are Soap Bubbles"></p>

<p>The top panels show images sampled from the training dataset where the neuron's activations are high. The bottom images are obtained by optimizing the input to maximize the response. Therefore one of the bottom images is a little bit like the mode of a distribution, while the corresponding top images are a bit like random samples from the distribution (not quite, but in essence).</p>

<p>The fact that these don't look like each other should not be surprising in itself. If, say, a neuron responded perfectly to images of Gaussian noise, then the input which maximizes its activation may look like a grey square: nothing like the samples on which it typically activates, and nothing like samples from the Gaussian distribution it has learnt to detect. Therefore, while these images are really cool and beautiful, their diagnostic value is limited in my opinion. One should not jump to the conclusion that 'the neuron may not be detecting what you initially thought', because the mode of a distribution is not really a good summary in high dimensions.</p>

<p>This is complicated even more by the fact that when we look at an image, we use our own visual system. The visual system learns by being exposed to natural stimuli, which are usually modeled as random samples from a distribution. Our retina has never seen the mode of the natural image distribution, so our visual cortex may not even know how to respond to it, we might find it completely weird. If somebody showed the world the mode of their perfect image density model, we could probably still not tell if it is correct or not because no one has ever seen the original.</p>

<h2 id="interpolation">Interpolation</h2>

<p><em>UPDATE: Previously this section was incomplete: I failed to mention that my interpolation scheme only works if the two endpoints are orthogonal, which tends to happen with high probability in high dimensions. Thanks to <a href="https://twitter.com/Phantrosity/status/982487744285986822?s=20">@Phantrosity</a> for highlighting this.</em></p>

<p>The last thing I wanted to mention about Gaussians and high-dimensional spaces is linear interpolation. If you take two samples $x_1$ and $x_2$ from a Gaussian distribution in a high-dimensional space, they will both have roughly the same Euclidean norm, and they will be roughly orthogonal. When you consider a convex combination $p x_1 + (1 - p) x_2$, that convex combination has a norm that is proportional to $\sqrt{p^2 + (1 - p)^2} &lt; 1$. Therefore, interpolating this way takes you inside the soap bubble (hence the name convex combination, BTW). If you want to interpolate between two random samples without leaving the soap bubble, you should instead interpolate in polar coordinates, for example by doing $\sqrt{p} x_1 + \sqrt{1 - p}\, x_2$ as illustrated below:</p>

<p><img src="http://www.inference.vc/content/images/2017/11/download-95.png" alt="Gaussian Distributions are Soap Bubbles"></p>

<p>This figure shows the effect of interpolating linearly in cartesian space (top row) or linearly in polar space (bottom row) between two Gaussian images (left and right). The plot at the top shows how the norm changes as we interpolate linearly or in polar space; you can see that midway between the two images, the norm is quite a bit lower than the norm of either endpoint. You can actually see that linear interpolation results in 'fainter' interpolants.</p>

<p>So generally, when interpolating in latent spaces of something like a VAE or a GAN where you assume the distribution is Gaussian, always interpolate in polar coordinates, rather than in cartesian coordinates. As kindly pointed out by commenters, the scheme I proposed here only works well if the endpoints are independently sampled from a high-D Gaussian and are therefore approximately orthogonal. For a more robust spherical interpolation scheme you might want to opt for something like <a href="https://en.wikipedia.org/wiki/Slerp">SLERP</a>.</p>
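<p>Here is a quick numerical sketch of the claim (the dimension and seed are my own toy choices, not from the post): compare the norm of the midpoint under the two interpolation schemes.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10_000                       # a high-dimensional space
x1, x2 = rng.standard_normal(d), rng.standard_normal(d)

p = 0.5                          # midpoint
linear = p * x1 + (1 - p) * x2                  # cartesian interpolation
polar = np.sqrt(p) * x1 + np.sqrt(1 - p) * x2   # interpolation 'on the bubble'

# endpoints have norm ~ sqrt(d); the linear midpoint shrinks by ~sqrt(1/2)
print(np.linalg.norm(x1) / np.sqrt(d))      # ~1.0
print(np.linalg.norm(linear) / np.sqrt(d))  # ~0.71
print(np.linalg.norm(polar) / np.sqrt(d))   # ~1.0
```

<p>Linear interpolation shrinks the midpoint's norm by a factor of about $\sqrt{0.5}$, while the polar scheme keeps it on the shell of radius $\sqrt{d}$.</p>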

<h2 id="innerproducts">Inner products</h2>

<p>Did I mention two random samples drawn from a standard Normal in a high-dimensional space are orthogonal to each other with a very high probability? Well, they are.</p>]]></content:encoded></item><item><title><![CDATA[mixup: Data-Dependent Data Augmentation]]></title><description><![CDATA[<p>By popular demand, here is my post on <em>mixup</em>, a new data augmentation scheme that was shown to improve generalization and stabilize GAN performance.</p>

<ul>
<li>H Zhang, M Cisse, YN Dauphin and D Lopez-Paz (2017) <a href="https://arxiv.org/abs/1710.09412"><em>mixup</em>: Beyond Empirical Risk Minimization</a></li>
</ul>

<p>I have to say I have not seen this paper before</p>]]></description><link>http://www.inference.vc/mixup-data-dependent-data-augmentation/</link><guid isPermaLink="false">871c52f9-3a37-436f-a0b3-f62ff0d41e51</guid><dc:creator><![CDATA[Ferenc Huszar]]></dc:creator><pubDate>Thu, 02 Nov 2017 15:40:04 GMT</pubDate><media:content url="http://www.inference.vc/content/images/2017/11/download-87-1.png" medium="image"/><content:encoded><![CDATA[<img src="http://www.inference.vc/content/images/2017/11/download-87-1.png" alt="mixup: Data-Dependent Data Augmentation"><p>By popular demand, here is my post on <em>mixup</em>, a new data augmentation scheme that was shown to improve generalization and stabilize GAN performance.</p>

<ul>
<li>H Zhang, M Cisse, YN Dauphin and D Lopez-Paz (2017) <a href="https://arxiv.org/abs/1710.09412"><em>mixup</em>: Beyond Empirical Risk Minimization</a></li>
</ul>

<p>I have to say I have not seen this paper before people on twitter suggested I should write a post about this - which was yesterday. So these are all very fresh thoughts, warranty not included.</p>

<h3 id="summaryofthispost">Summary of this post</h3>

<ul>
<li>I start by introducing mixup, and immediately reformulate it  in such a way that does not involve convex combinations of labels anymore, making it more comparable to traditional data augmentation. This reformulation works as long as we use binary or multinomial cross-entropy loss.</li>
<li>This formulation immediately opens the door for using mixup in a <em>semi-supervised</em> fashion, and will help us later when thinking about its generalization properties.</li>
<li>I talk about a way to understand generalization of mixup in the context of binary classification, and express the ideal test error in a form that can be evaluated in closed form for classification. </li>
<li>I argue that the reason this works well for GANs is similar to why <a href="http://www.inference.vc/instance-noise-a-trick-for-stabilising-gan-training/">instance noise</a> works.</li>
</ul>

<h2 id="mixup">Mixup</h2>

<p>Let's jump right in the middle, here is how the <em>mixup</em> training loss is defined:</p>

<p>$$
\mathcal{L}(\theta) = \mathbb{E}_{x_1,y_1\sim p_{train}} \mathbb{E}_{x_2,y_2\sim p_{train}} \mathbb{E}_{\lambda\sim\beta(\alpha,\alpha)} \ell(\lambda x_1 + (1 - \lambda) x_2, \lambda y_1 + (1 - \lambda) y_2)
$$</p>

<p>Very simply, we take pairs of datapoints $(x_1, y_1)$ and $(x_2, y_2)$, then choose a random mixing proportion $\lambda$ from a Beta distribution, and create an artificial training example $(\lambda x_1 + (1 - \lambda) x_2, \lambda y_1 + (1 - \lambda) y_2)$. We train the network by minimizing the loss on mixed-up datapoints like this. This is all.</p>

<p>One intuition behind this is that by linearly interpolating between datapoints, we incentivize the network to act smoothly and kind of interpolate nicely between datapoints - without sharp transitions.</p>

<h2 id="reformulation">Reformulation</h2>

<p>Now, let's assume the loss $\ell$ is linear in its second argument, such that</p>

<p>$$
\ell(x, p y_1 + (1 - p) y_2) =  p \ell(x, y_1) + (1 - p) \ell(x, y_2)
$$</p>

<p>This is the case in classification, where the loss is the binary cross entropy $\ell(x,y) = - y \log p(x;\theta) - (1 - y) \log (1 - p(x;\theta))$. It also works the same way for one-hot-encoded categorical labels.</p>
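<p>This linearity is easy to sanity-check numerically (the numbers below are arbitrary, nothing from the paper):</p>

```python
import numpy as np

def bce(q, y):
    # binary cross-entropy between a (possibly soft) label y and probability q
    return -y * np.log(q) - (1 - y) * np.log(1 - q)

q = 0.3                        # some predicted probability p(x; theta)
y1, y2, lam = 1.0, 0.0, 0.7    # two labels and a mixing weight

blended = bce(q, lam * y1 + (1 - lam) * y2)
split = lam * bce(q, y1) + (1 - lam) * bce(q, y2)
print(np.isclose(blended, split))  # True: the loss is linear in the label
```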

<p>In these cases, we can rewrite the <em>mixup</em> objective as</p>

<p>\begin{align}
&amp;\mathbb{E}_{x_1,y_1\sim p_\mathcal{D}} \mathbb{E}_{x_2,y_2\sim p_\mathcal{D}} \mathbb{E}_{\lambda\sim\beta(\alpha, \alpha)} \ell(\lambda x_1 + (1 - \lambda) x_2, \lambda y_1 + (1 - \lambda) y_2) =\\
&amp;\mathbb{E}_{x_1,y_1\sim p_\mathcal{D}} \mathbb{E}_{x_2,y_2\sim p_\mathcal{D}} \mathbb{E}_{\lambda\sim\beta(\alpha, \alpha)} \lambda \ell(\lambda x_1 + (1 - \lambda) x_2, y_1) + (1 - \lambda) \ell(\lambda x_1 + (1 - \lambda) x_2, y_2)=\\
&amp;\mathbb{E}_{x_1,y_1\sim p_\mathcal{D}} \mathbb{E}_{x_2,y_2\sim p_\mathcal{D}} 
\mathbb{E}_{\lambda\sim\beta(\alpha, \alpha)}
\mathbb{E}_{z \sim Ber(\lambda)} z \ell(\lambda x_1 + (1 - \lambda) x_2, y_1) + (1 - z) \ell(\lambda x_1 + (1 - \lambda) x_2, y_2)=\\
&amp;\mathbb{E}_{x_1,y_1\sim p_\mathcal{D}} \mathbb{E}_{x_2,y_2\sim p_\mathcal{D}} 
\mathbb{E}_{z \sim Ber(0.5)}
\mathbb{E}_{\lambda \sim \beta(\alpha + z, \alpha + 1 - z)} z \ell(\lambda x_1 + (1 - \lambda) x_2, y_1) + (1 - z) \ell(\lambda x_1 + (1 - \lambda) x_2, y_2)=\\
&amp;\frac{1}{2}\mathbb{E}_{x_1,y_1\sim p_\mathcal{D}} \mathbb{E}_{x_2,y_2\sim p_\mathcal{D}}
\mathbb{E}_{\lambda \sim \beta(\alpha + 1, \alpha)} \ell(\lambda x_1 + (1 - \lambda) x_2, y_1) + 
\frac{1}{2}\mathbb{E}_{x_1,y_1\sim p_\mathcal{D}} \mathbb{E}_{x_2,y_2\sim p_\mathcal{D}}
\mathbb{E}_{\lambda \sim \beta(\alpha, \alpha + 1)} \ell(\lambda x_1 + (1 - \lambda) x_2, y_2) = \\
&amp;\mathbb{E}_{x,y\sim p_\mathcal{D}}
\mathbb{E}_{x'\sim p_\mathcal{D}}
\mathbb{E}_{\lambda \sim \beta(\alpha + 1, \alpha)} 
\ell(\lambda x + (1 - \lambda) x', y)
\end{align}</p>

<p>Line by line, I used the following tricks:</p>

<ol>
<li>linearity of the loss, as in assumption above</li>
<li>expectation of a Bernoulli($\lambda$) variable plus linearity of expectation</li>
<li>Bayes' rule $p(z\vert \lambda)p(\lambda) = p(\lambda\vert z)p(z)$ and the fact that the Beta distribution is the conjugate prior for the Bernoulli</li>
<li>expectation of a Bernoulli(0.5) plus linearity of expectation</li>
<li>symmetry of the Beta distribution in the sense that $\lambda \sim Beta(a,b)$ implies $1-\lambda \sim Beta(b,a)$, plus changing variable names in the expectation so the two terms become the same</li>
</ol>

<p>So, here is what we are left with:</p>

<p>$$
\mathcal{L}(\theta) = \mathbb{E}_{(x,y)\sim p_\mathcal{D}}
\mathbb{E}_{\lambda \sim \beta(\alpha + 1, \alpha)} 
\mathbb{E}_{x'\sim p_\mathcal{D}}
\ell(\lambda x + (1 - \lambda) x', y)
$$</p>

<p>I think this formulation is much nicer, because:</p>

<ul>
<li>you no longer have blending of labels, which to me was the weirdest thing to begin with. This is a lot like the usual data augmentation, which only manipulates the input $x$, but keeps the label $y$ the same. The only thing that changes compared to data augmentation is that the vicinal distribution (augmentation distribution) applied to $x$ is dependent on $p_{train}(x)$, the distribution of inputs.</li>
<li>we no longer use the joint distribution $p_\mathcal{D}(x,y)$, but only $p_\mathcal{D}(x)$ to define the vicinal distribution. This means that, technically, this form can be used in a <strong>semi-supervised</strong> setting, as follows:</li>
</ul>

<p>$$
\mathcal{L}(\theta) = 
\mathbb{E}_{x,y\sim p_{\text{labelled}}}
\mathbb{E}_{\lambda \sim \beta(\alpha + 1, \alpha)} 
\mathbb{E}_{x'\sim p_{\text{unlabelled}}}
\ell(\lambda x + (1 - \lambda) x', y)
$$</p>
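<p>We can sanity-check the reformulation with a quick Monte Carlo estimate on made-up toy data (the 1-D 'dataset' and the fixed classifier below are arbitrary choices of mine; note that <code>numpy.random.beta(a, b)</code> draws from the standard $Beta(a, b)$, so the reformulated objective uses <code>beta(alpha + 1, alpha)</code>, the asymmetric Beta biased towards $1$):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, N = 0.4, 400_000

def bce(x, y):
    # binary cross-entropy of an arbitrary fixed classifier sigmoid(2x) at (x, y)
    p = np.clip(1.0 / (1.0 + np.exp(-2.0 * x)), 1e-7, 1 - 1e-7)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# toy 1-D 'dataset': pairs of independent draws from the data distribution
x1, x2 = rng.normal(size=N), rng.normal(size=N)
y1 = rng.integers(0, 2, size=N).astype(float)
y2 = rng.integers(0, 2, size=N).astype(float)

# original mixup objective: lambda ~ Beta(alpha, alpha), blended labels
lam = rng.beta(alpha, alpha, size=N)
loss_mixup = bce(lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2).mean()

# reformulated objective: lambda ~ Beta(alpha + 1, alpha), keep the first label
lam = rng.beta(alpha + 1, alpha, size=N)
loss_reform = bce(lam * x1 + (1 - lam) * x2, y1).mean()

print(loss_mixup, loss_reform)  # agree up to Monte Carlo error
```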

<h4 id="pytorchcode">Pytorch code</h4>

<p>Here's how you'd modify the pytorch code from the paper to make this work (I draw different lambdas per datapoint in the minibatch, they draw one lambda per minibatch, not sure which one works better):</p>

<pre><code>for (x1, y1), (x2, _) in zip(loader1, loader2):
    # one lambda per datapoint; numpy's beta(a, b) is the standard Beta(a, b),
    # so this draws lambda from Beta(alpha + 1, alpha), biased towards 1
    lam = numpy.random.beta(alpha + 1., alpha, batchsize)
    lam = torch.from_numpy(lam).float().view(-1, 1, 1, 1)  # broadcast over C, H, W
    x = Variable(lam * x1 + (1. - lam) * x2)
    y = Variable(y1)
    optimizer.zero_grad()
    loss(net(x), y).backward()
    optimizer.step()
</code></pre>

<p>Only three changes: </p>

<ul>
<li>you ignore the labels <code>y2</code> from the second minibatch - hence the <code>_</code></li>
<li>lambda is now drawn from an asymmetric $Beta(\alpha+1, \alpha)$, which is biased towards $1$ - this is exactly what <code>numpy.random.beta(alpha + 1., alpha, batchsize)</code> draws</li>
<li>you just use the label from the first minibatch - no blending there</li>
</ul>

<p>This should do roughly the same thing.</p>

<h2 id="letsvisualizethis">Let's visualize this</h2>

<p>Let's look at what this data-dependent augmentation looks like for a single datapoint on the two-moons dataset:</p>

<p><img src="http://www.inference.vc/content/images/2017/11/download-82.png" alt="mixup: Data-Dependent Data Augmentation"></p>

<p>The white and black crosses are positive and negative examples respectively. The mixup data augmentation doesn't care about the labels, just the distribution of the data. I applied mixup to the datapoint at roughly $x=(-0.7, 0.6)$. The transparent blue dots show random samples from the vicinal distribution. Each blue point was obtained by picking another training datapoint $x'$ at random, then picking a random $\lambda$ from a $Beta(1.1, 0.1)$ and then interpolating between $x$ and $x'$ accordingly. Note that $Beta(1.1, 0.1)$ is not symmetric: roughly 80% of sampled $\lambda$ will be higher than $0.9$, so we tend to end up much closer to $x$ than $x'$. All these blue dots would be added to the training data with a 'white' label.</p>
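<p>The sampling behind these blue dots can be sketched as follows (here <code>X</code> is a stand-in array of training inputs; I use Gaussian points rather than the actual two-moons data):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.1

# stand-in for the training inputs (in the post: the two-moons dataset)
X = rng.normal(size=(1000, 2))
x = np.array([-0.7, 0.6])            # the datapoint we augment

# draw vicinal samples: mix x with random other datapoints, lambda biased to 1
n = 500
partners = X[rng.integers(0, len(X), size=n)]
lam = rng.beta(alpha + 1, alpha, size=(n, 1))
vicinal = lam * x + (1 - lam) * partners

# most vicinal samples land very close to x
frac_near = (np.linalg.norm(vicinal - x, axis=1) < 0.5).mean()
print(frac_near)
```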

<p>This is what the picture looks like when we apply data augmentation to two training examples, one positive and one negative, using 10k unlabelled samples (I am plotting a few of those unlabelled samples for reference):</p>

<p><img src="http://www.inference.vc/content/images/2017/11/download-85.png" alt="mixup: Data-Dependent Data Augmentation"></p>

<p>You can see that the vicinal distribution of mixup does not really follow the manifold - which one would hope a good semi-supervised data augmentation scheme would do in this particular example. It does something rather weird. Things don't look that much cleaner when we have more labelled examples:</p>

<p><img src="http://www.inference.vc/content/images/2017/11/download-87.png" alt="mixup: Data-Dependent Data Augmentation"></p>

<p>Finally, with all datapoints labelled:</p>

<p><img src="http://www.inference.vc/content/images/2017/11/download-88.png" alt="mixup: Data-Dependent Data Augmentation"></p>

<h2 id="whyshoulddataaugmentationgeneralizebetter">Why should data augmentation generalize better?</h2>

<p>Why should a data augmentation scheme - or vicinal risk minimization (VRM) - generalize better in classification? The generalization gap is the difference between training and validation losses, which exists because the training and test distributions differ (in practice, these are both empirical distributions concentrated on a number of samples). VRM, or data augmentation, trains by minimizing risk on the augmented training distribution:</p>

<p>$$
p_{\text{aug}}(x',y') = \frac{1}{N_{train}}\sum_{x,y\in \mathcal{D}_{train}} \nu(x', y'\vert x, y),
$$</p>

<p>for some vicinal distribution $\nu(x', y'\vert x, y)$. This should be successful if the augmented distribution $p_{aug}$ actually ends up closer to $p_{test}$ than the original training distribution $p_{train}$. Following the ideas of <a href="http://proceedings.mlr.press/v15/lacoste_julien11a/lacoste_julien11a.pdf">(Lacoste-Julien et al, 2011)</a>, or section 1.3.7 of <a href="https://github.com/fhuszar/thesis/blob/master/submitted/thesis.pdf">my PhD thesis</a>, we can even find the divergence measure we should use to measure the difference between these distributions, based on Bayes decision theory. If we use the binary cross-entropy both for training and testing, the Bayes optimal decision function on the augmented data will be (assuming perfectly balanced classes):</p>

<p>$$
q_{aug}(x) = \frac{p_{aug}(x\vert 1)}{p_{aug}(x\vert 1) + p_{aug}(x\vert 0)}
$$</p>

<p>When <em>mixup</em> mixes up the empirical distribution of the two classes, it turns them into continuous distributions with perfectly overlapping support. Therefore, the Bayes optimal decision function is more or less unique, so we should be able to find $q_{aug}(x)$, or something very close to it quite consistently during training.</p>

<p>The loss of this optimal training classifier $q_{aug}(x)$ on the test data is as follows (up to additive constants that don't depend on $p_{aug}$):</p>

<p>\begin{align}
\mathcal{L}_{test}(q_{aug}) &amp;= \mathbb{E}_{y\sim p_{test}} \mathbb{E}_{x\sim p_{test}(x\vert y)} - \log(p_{aug}(x\vert y)) + \mathbb{E}_{x \sim p_{test}} \log(p_{aug}(x)) \\
 &amp;= \mathbb{E}_{y\sim p_{test}} \operatorname{KL}[p_{test}(x\vert y) \| p_{aug}(x\vert y)] - \operatorname{KL}[p_{test}(x) \| p_{aug}(x)]
\end{align}</p>

<p>Interestingly, if we use $p_{test}(x)$ in the mixup, we can actually evaluate these quantities exactly, which gives us a lower bound on the training error.</p>

<p>So, let's answer the question: why might data augmentation generalize better:</p>

<ol>
<li>because the training loss becomes better defined such that the Bayes-optimal solution is unique and easier to find consistently. In the case of mixup this happens because the class-conditional distributions end up having fully overlapping support.</li>
<li>because data augmentation turns the training distribution into a distribution that is closer to the test distribution. The ideal generalization gap can be seen as Bregman divergence between $p_{test}$ and $p_{aug}$, up to a constant. In the case of mixup, we can actually calculate this divergence if we use $p_{train}(x)$ in the vicinal distribution.</li>
</ol>

<h2 id="whyshoulditworkwellforgans">Why should it work well for GANs?</h2>

<p>I suspect reason number 1 above also explains why mixup works well for training GANs. One of the issues with the usual GAN setup is that the training and synthetic distributions have widely different support, and are often concentrated on lower-dimensional manifolds. </p>

<p><a href="http://www.inference.vc/instance-noise-a-trick-for-stabilising-gan-training/">Instance noise</a> was initially introduced to alleviate this problem by making the support of the two distributions overlap. See also <a href="https://arxiv.org/abs/1701.04862">(Arjovsky and Bottou, 2017)</a>. Mixup achieves the same thing, and I would imagine it does so even better in practice.</p>

<h2 id="whyshoulditworkwellagainstadversarialexamples">Why should it work well against adversarial examples?</h2>

<p>Resilience to adversarial examples is somewhat related to, but crucially different from generalization. Firstly, the test data in the adversarial setting $p_{test}$ is generated specifically to fool our classification function - it's no longer independent of the training set or the training algorithm. Secondly, and crucially, adversarial examples are created on the basis of the gradient of the decision function. However, the training loss only depends on values of the decision function at certain training points. For a decision function to achieve minimal training loss, it doesn't even have to be differentiable, or continuous. In fact, we could even implement it as a look-up-table, which would be memorization. So adversarial examples are created by a property of the decision function that the empirical loss is perfectly insensitive to.</p>

<p>The way mixup addresses this is twofold:</p>

<ol>
<li>due to the vicinal distribution, the training loss function suddenly starts to care about a local neighbourhood of the decision function around a training datapoint - and the behaviour in that local neighbourhood will depend on the gradients.</li>
<li>it is evident from the figures above that mixup tends to produce data augmentations that cover the likely locations where adversarial examples might end up.</li>
</ol>

<h2 id="summary">Summary</h2>

<p><em>note that I have updated the summary as I reevaluated my conclusions.</em></p>

<p>Mixup is an interesting technique, but I don't believe it's the final version that we will see in the wild. My initial instinct was to somehow combine this idea with nearest-neighbour-like things which tend to work quite well for manifold-like data. For example: restrict mixup only between nearest neighbours, or make $\alpha$ depend on distance. After another day of thinking, I'm not so sure that's a great idea generally. Although incorporating nearest-neighbour into mixup would probably look very good on the 2D two-moons toy example, real data doesn't look like that. More importantly, the real data people want to apply this to is high-dimensional. Nearest neighbour methods in pixel space probably wouldn't work quite as well as they do on the nice 2D manifold for the same reason Gaussian noise doesn't work very well for high-D problems.</p>]]></content:encoded></item><item><title><![CDATA[AlphaGo Zero: Minimal Policy Improvement, Expectation Propagation and other Connections]]></title><description><![CDATA[<p>This is a post about the new reinforcement learning technique that enables AlphaGo Zero to learn Go from scratch via self-play. The paper has been out for a week, so I guess it's now considered old - sorry for the latency.</p>

<ul>
<li>D Silver, J Schrittwieser, K Simonyan, I Antonoglou, A Huang,</li></ul>]]></description><link>http://www.inference.vc/alphago-zero-policy-improvement-and-vector-fields/</link><guid isPermaLink="false">950c2c06-0a86-45e0-ac8d-d47bfd7910a6</guid><dc:creator><![CDATA[Ferenc Huszar]]></dc:creator><pubDate>Thu, 26 Oct 2017 14:12:19 GMT</pubDate><content:encoded><![CDATA[<p>This is a post about the new reinforcement learning technique that enables AlphaGo Zero to learn Go from scratch via self-play. The paper has been out for a week, so I guess it's now considered old - sorry for the latency.</p>

<ul>
<li>D Silver, J Schrittwieser, K Simonyan, I Antonoglou, A Huang,    Arthur Guez, T Hubert, L Baker, M Lai,  Adrian Bolton, Y Chen, T Lillicrap, F Hui, L Sifre, G van den Driessche, T Graepel  &amp; D Hassabis (2017) <a href="https://www.nature.com/articles/nature24270.epdf?author_access_token=VJXbVjaSHxFoctQQ4p2k4tRgN0jAjWel9jnR3ZoTv0PVW4gB86EEpGqTRDtpIz-2rmo8-KG06gqVobU5NSCFeHILHcVFUeMsbvwS-lxjqQGg98faovwjxeTUgZAUMnRQ">Mastering the Game of Go without Human Knowledge</a></li>
</ul>

<p>I'm no expert in RL, so I'm pretty sure many of you are going to come at me with pitchforks shouting "this is all trivial" or "this has been done before" or "this is no different from X". Please do, I'm here to learn.</p>

<h4 id="summaryofthispost">Summary of this post</h4>

<ul>
<li>the key idea behind AlphaGo Zero is interpreting Monte Carlo Tree Search (MCTS) as a policy improvement operator: you can think of MCTS as an operator that takes your current policy and turns it into a better policy.</li>
<li>you can formulate the goal of AlphaGo Zero as seeking a fixed point with respect to the policy improvement operator: the best policy is one that cannot be further improved by MCTS.</li>
<li>once such a policy is found, you no longer need MCTS when you play a game. As a result, AlphaGo Zero is a lot more efficient to run, once trained.</li>
<li>I look at this idea a bit more generally, and connect it to the previous post on non-conservative vector fields and the Numerics of GANs.</li>
<li>Finally, I highlight connections to other learning or inference algorithms that are similar in nature: expectation-propagation and contrastive divergence.</li>
</ul>

<h1 id="theminimalpolicyimprovementtechnique">The Minimal Policy Improvement Technique</h1>

<p><strong>Background:</strong> The original AlphaGo used a combination of two neural networks - the policy and value networks - and a Monte Carlo Tree Search (MCTS) algorithm to play Go. For each move, the policy network is first evaluated to give an initial strategy $\pmb{p}$. This is then used to prime MCTS, which then performs rollouts - simulations of possible future game trajectories - evaluating the policy network and the value network multiple times in these imagined future scenarios. MCTS refines AlphaGo's initial intuitive strategy $\pmb{p}$, and outputs a better strategy $\pmb{\pi}$ which is usually concentrated on stronger moves in the current situation.</p>

<p>The key idea of AlphaGo Zero is this: you can think of MCTS generally as a policy improvement operator, $\pmb{\pi}(\pmb{p})$, which takes $\pmb{p}$ as input and outputs a better policy $\pmb{\pi}$. If you have such a policy improvement operator, it makes sense to define the goal of learning as finding a policy $\pmb{p}^{\ast}$ which the policy improvement operator cannot improve any further, in other words, we seek the following fixed point:</p>

<p>$$
\pmb{\pi}(\pmb{p}^{\ast}) = \pmb{p}^{\ast}.
$$</p>

<p>For now, let's assume two things:</p>

<ul>
<li>the policy improvement operator $\pmb{\pi}$ depends only on the output of the policy $\pmb{p}$, and</li>
<li>the state of the game $s$ is fixed, in other words I am concerned with what the network does in a particular fixed state, say what it chooses as the starting move. I only assume this to avoid having to spell out expectations over states, etc.</li>
</ul>

<p>Under these assumptions, we can write the policy improvement algorithm as seeking stationary points of the following vector field $\pmb{v}$:</p>

<p>$$
\pmb{v}(\pmb{p}) = \pmb{\pi}(\pmb{p}) - \pmb{p}
$$</p>

<p>Here we go again, looking for stationary points $\pmb{v}(\pmb{p}) = \pmb{0}$ of a vector field, just like in the previous post about GANs.</p>

<p>The most natural approach to find such stationary points is Euler's method - which would be gradient descent if the vector field happened to correspond to the gradients of a scalar loss. Of course, there is no guarantee that $\pmb{v}$ is conservative, and therefore that Euler's method will converge. But hey, it's worth a try. What would Euler's method do on this vector field?</p>

<p>\begin{align}
\theta_{t+1} &amp;\leftarrow \theta_{t} + h\pmb{v}(\pmb{p})\frac{\partial \pmb{p}}{\partial \theta}\\
&amp;= \theta_{t} + h \left(\pmb{\pi} - \pmb{p}\right)\frac{\partial \pmb{p}}{\partial \theta}
\end{align}</p>
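<p>To make this concrete, here is a toy sketch of Euler's method on such a vector field. Nothing here is AlphaGo's actual setup: the improvement operator is a made-up one-step soft improvement $\pmb{\pi}(\pmb{p}) \propto \pmb{p} \odot e^{Q}$ over three actions with fixed, invented action values $Q$, and $\theta$ are simply the logits of $\pmb{p}$:</p>

```python
import numpy as np

def softmax(t):
    e = np.exp(t - t.max())
    return e / e.sum()

Q = np.array([1.0, 0.5, 0.0])      # made-up action values; action 0 is best

def improve(p):
    # toy policy improvement operator: reweight p by exp(Q) and renormalize
    pi = p * np.exp(Q)
    return pi / pi.sum()

theta = np.zeros(3)                # policy parameters: logits, p = softmax(theta)
h = 1.0                            # Euler step size
for _ in range(5000):
    p = softmax(theta)
    v = improve(p) - p                    # the vector field pi(p) - p
    J = np.diag(p) - np.outer(p, p)       # Jacobian dp/dtheta for logits
    theta = theta + h * (J @ v)           # one Euler step

p = softmax(theta)
print(p.round(3))  # the policy concentrates on the highest-value action
```

<p>Iterating the update drives $\pmb{p}$ towards a fixed point $\pmb{\pi}(\pmb{p}^{\ast}) = \pmb{p}^{\ast}$, which for this particular operator means putting almost all mass on the action with the highest value.</p>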

<p>This is almost, but not quite, what AlphaGo Zero does. It differs in two significant ways:</p>

<h2 id="dependenceofmctsonparameters">Dependence of MCTS on parameters</h2>

<p>There is a catch, which only becomes clear once we make the dependence on parameters explicit:</p>

<p>$$
\pmb{v}(\pmb{p}_\theta, \theta) = \pmb{\pi}(\pmb{p}_\theta, \theta) - \pmb{p}_\theta
$$</p>

<p>MCTS itself depends not only on the initial guess $\pmb{p_\theta}$ but also on the parameters $\theta$ directly. This is because the policy-value network $f_\theta$ is evaluated multiple times within the MCTS algorithm. What this means is that the vector field $\pmb{v}$ also depends on the model parameters $\theta$ beyond its dependence on $\pmb{p}_\theta$.</p>

<p>Thus, our interpretation as seeking the stationary point of a single vector field in $\pmb{p}$-space isn't quite accurate. The vector field changes each time we update the parameters $\theta$. In fact we are searching for $\theta^{\ast}$ such that</p>

<p>$$
\pmb{\pi}(\pmb{p}_{\theta^{\ast}}, \theta^{\ast}) = \pmb{p}_{\theta^{\ast}}
$$</p>

<p>If we want to solve this directly with gradient descent, we would have to take the gradient of MCTS with respect to $\theta$. Good luck with that.</p>

<p>The way AlphaGo sidesteps this problem is by removing/ignoring the dependence of $\pmb{\pi}$ on $\pmb{p}$ and $\theta$. We simply calculate MCTS with the current value of the parameters, and then note down the value of the resulting $\pmb{\pi}$ and treat it as a regression target. AGZ's learning converges when</p>

<p>$$
\pmb{\pi}(\pmb{p}_{\theta^{\text{recent}}}, \theta^{\text{recent}}) = \pmb{p}_{\theta},
$$</p>

<p>where $\theta^{\text{recent}}$ is some recent value of the parameter $\theta$ from an earlier iteration of training.</p>

<h2 id="crossentropyloss">Cross-entropy loss</h2>

<p>AlphaGo differs from the Euler method on the vector field in another way. Interestingly, this, too, is similar to something from <a href="http://www.inference.vc/my-notes-on-the-numerics-of-gans/">the last post</a>. Instead of following the <em>improvement gradient</em> $\pmb{\pi} - \pmb{p}$, the method instead performs gradient descent on a scalar loss function </p>

<p>$$
\mathcal{L}(\theta) = KL(\pmb{\pi}_\theta \| \pmb{p}_\theta)
$$</p>

<p>This is a generalization of trying to minimize the norm of $\pmb{\pi} - \pmb{p}$, $L(\theta) = \| v(\theta)\|^2_2$, which is something that the Numerics of GANs paper recommended in the context of GANs. But it replaces the Euclidean distance with the KL divergence, which makes more sense between probability distributions.</p>

<p>Again, the algorithm ignores the dependence of $\pmb{\pi}_\theta$ on $\theta$, and instead minimizes
$$
\mathcal{L}(\theta) = KL(\pmb{\pi} \| \pmb{p}_\theta) = -\pmb{\pi}^T\log \pmb{p}_\theta + c,
$$</p>

<p>where $\pmb{\pi}$ was obtained with a previous value $\theta^{\text{recent}}$ of the parameters, and $c$ is constant with respect to $\theta$. This is what AlphaGo Zero optimizes in Eqn. (1) of the paper.</p>

<h2 id="connectionstoexpectationpropagation">Connections to Expectation-Propagation</h2>

<p>The Expectation-Propagation (EP) algorithm can be understood from the Minimal Policy Improvement viewpoint. There, instead of an RL policy, we wish to optimize an approximate posterior $q$ which is factorized as $q(\pmb{x}) = \prod q_i(\pmb{x}_{S_i})$. The real posterior $p(\pmb{x}) \propto \prod p_i(\pmb{x}_{S_i})$ shares the same factorized structure, but its normalizing constant is intractable.</p>

<p>For EP, there are in fact multiple different improvement operators, one for each factor. For factor $j$, the improvement operator replaces the approximate factor $q_j$ with the real factor $p_j$ as follows:</p>

<p>$$
\pmb{\pi}^{j}(q) \propto \frac{p_j(\pmb{x}_{S_j})}{q_j(\pmb{x}_{S_j})}\prod_i q_i(\pmb{x}_{S_i})
$$</p>

<p>Within the class of distributions $q$ which share the same factor-graph structure as $p$, the following is true:</p>

<p>$$
\forall j: \pmb{\pi}^{j}(q) = p \iff p=q
$$</p>

<p>Therefore, it makes sense to seek stationary points with respect to the improvement operators, or in other words, to seek an approximate posterior which cannot be further improved by the above improvement operators.</p>

<p>When updating parameters of the $j^{\text{th}}$ factor $q_j$, we use the KL divergence to measure the size of the improvement - just like in AlphaGo Zero:</p>

<p>$$
\operatorname{KL}\left[\pmb{\pi}^{j}(q)\|q\right]
$$</p>

<p>Of course there is more to EP than this, but I think it's an interesting view on it. Thinking about it this way already opens up a whole new set of questions: what if we consider other improvement operators, for example, Markov-Chain Monte Carlo kernels?</p>

<p>Which brings us to the next connection I wanted to highlight: contrastive divergence.</p>

<h2 id="contrastivedivergence">Contrastive divergence</h2>

<p>Contrastive divergence also uses something like this minimal improvement principle, but somehow upside-down, in the context of generative modelling.</p>

<p>Let's say we have some training data $x_i$ drawn i.i.d. from some true data distribution $p$, and we want to fit to it an intractable generative model $q_\theta$. We can't evaluate the likelihood of the model $q_\theta$, but we can derive an MCMC update such that, irrespective of the starting state, applying the MCMC kernel several times makes the distribution of samples converge to $q_\theta$.</p>

<p>We can therefore interpret each step of an MCMC sampling algorithm, too, as an improvement operator: Given samples from an arbitrary distribution $p$, if we apply one step of MCMC to the samples, the resulting distribution will look a bit more like samples drawn from the stationary distribution of the MCMC algorithm. Contrastive divergence exploits this and minimizes a kind of divergence between the data distribution $p$ and the distribution we obtain by running a fixed number of MCMC steps on the data samples.</p>
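<p>A minimal illustration of this 'MCMC step as improvement operator' view (a toy sketch of my own, not contrastive divergence itself: the model is a standard normal, the 'data distribution' starts far away from it, and each Metropolis step moves the sample distribution a bit closer to the model):</p>

```python
import numpy as np

rng = np.random.default_rng(1)

def metropolis_step(x, step=1.0):
    # one Metropolis kernel step whose stationary distribution is N(0, 1)
    prop = x + step * rng.normal(size=x.shape)
    log_accept = 0.5 * (x ** 2 - prop ** 2)     # log-density ratio of the target
    accept = np.log(rng.uniform(size=x.shape)) < log_accept
    return np.where(accept, prop, x)

x = rng.normal(loc=5.0, scale=1.0, size=20_000)  # 'data', far from the model
means = [x.mean()]
for _ in range(10):
    x = metropolis_step(x)                       # each application improves the fit
    means.append(x.mean())
print([round(m, 2) for m in means])  # the mean drifts from 5 towards 0
```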

<p>For more details (and also an excellent hand-drawn Euler-method illustration) I recommend <a href="http://www.gatsby.ucl.ac.uk/~turner/Notes/ContrastiveDivergence/CDv3.pdf">Rich Turner's excellent notes</a> from, well, probably at least ten years ago.</p>

<h1 id="summary">Summary</h1>

<p>I really liked the idea of thinking about MCTS as an improvement operator, and of training the policy network to minimize the improvement achievable by MCTS. Although the authors demonstrate that other ideas - like using ResNets or using a joint network for policy and value - are important for good performance, I think this is the key idea that enables learning from pure self-play.</p>

<p>Sadly, I can't see this method being easy to apply beyond the scope of self-play though. Firstly, in Go the states are fully observed, which is not usually the case for RL. Secondly, MCTS only works so well because we can use our policy as a model of the opponent's policy, so we can do pretty accurate rollouts.</p>

<p>I'm not sure what an improvement operator would be for other RL tasks, even, say, for Atari games. If you have ideas, please share.</p>]]></content:encoded></item><item><title><![CDATA[GANs are Broken in More than One Way: The Numerics of GANs]]></title><description><![CDATA[<p>Last year, when I was on a mission to "fix GANs" I had a tendency to focus only on what the loss function is, and completely disregard the issue of how we actually find a minimum. Here is the paper that has finally challenged that attitude:</p>

<ul>
<li>Mescheder, Nowozin, Geiger</li></ul>]]></description><link>http://www.inference.vc/my-notes-on-the-numerics-of-gans/</link><guid isPermaLink="false">e318c34e-2aa3-4c8d-9036-dc3bcbfd730b</guid><dc:creator><![CDATA[Ferenc Huszar]]></dc:creator><pubDate>Thu, 05 Oct 2017 13:18:23 GMT</pubDate><media:content url="http://www.inference.vc/content/images/2017/10/ascending-and-descending-escher-1.jpg" medium="image"/><content:encoded><![CDATA[<img src="http://www.inference.vc/content/images/2017/10/ascending-and-descending-escher-1.jpg" alt="GANs are Broken in More than One Way: The Numerics of GANs"><p>Last year, when I was on a mission to "fix GANs" I had a tendency to focus only on what the loss function is, and completely disregard the issue of how we actually find a minimum. Here is the paper that has finally challenged that attitude:</p>

<ul>
<li>Mescheder, Nowozin, Geiger (2017): <a href="https://arxiv.org/abs/1705.10461">The Numerics of GANs</a></li>
</ul>

<p>I reference <a href="http://blog.shakirm.com/2013/04/marrs-levels-of-analysis/">Marr's three layers of analysis</a> a lot, and I enjoy thinking about problems at the computational level: what is the ultimate goal we do this for? I was convinced GANs were broken at this level: they were trying to optimize for the wrong thing or seek equilibria that don't exist, etc. This is why I enjoyed f-GANs, Wasserstein GANs, instance noise, etc, while being mostly dismissive of attempts to fix things at the optimization level, like DCGAN or improved techniques (Salimans et al. 2016). In my defense, in most of deep learning, the algorithmic level is sorted: stochastic gradient descent. You can improve on it, but it's not broken, it doesn't usually need fixing.</p>

<p>But after reading this paper I had a revelation:</p>

<blockquote>
  <p>GANs are broken at both the computational and algorithmic levels.</p>
</blockquote>

<p>Even if we fix the objectives, we don't have algorithmic tools to actually find solutions.😱</p>

<h4 id="summaryofthispost">Summary of this post</h4>

<ul>
<li>as it is now expected of my posts, I am going to present the paper through a slightly different lens</li>
<li>I introduce the concept of conservative vs non-conservative vector fields, and highlight some of their properties </li>
<li>I then explain how consensus optimization proposed by Mescheder et al can be viewed in this context: it interpolates between a nasty non-conservative field and its well-behaved conservative relaxation</li>
<li>Finally, as it is also expected of my posts, I also highlight an important little detail, which was not discussed in the paper: how should we do all this in the minibatch setting?</li>
</ul>

<h2 id="introfromganstovectorfields">Intro: from GANs to vector fields</h2>

<p>GANs can be thought of as a two-player non-cooperative game. One player controls $\theta$ and wants to maximize its payoff $f(\theta, \phi)$, the other controls $\phi$ and seeks to maximize $g(\theta,\phi)$. The game is at a Nash equilibrium when neither player can improve its payoff by changing their parameters slightly. Now we have to design an algorithm to find such a Nash equilibrium.</p>

<p>It seems that there are two separate issues with GANs:</p>

<ol>
<li><strong>computational:</strong> no Nash equilibrium exists or it may be degenerate (<a href="https://arxiv.org/abs/1701.04862">Arjovsky and Bottou 2017</a>, <a href="https://arxiv.org/abs/1610.04490">Sønderby et al 2017</a>, <a href="http://www.inference.vc/instance-noise-a-trick-for-stabilising-gan-training/">blog post</a>).</li>
<li><strong>algorithmic:</strong> we don't have reliable tools for finding Nash equilibria (even though we're now pretty good at finding local optima)</li>
</ol>

<p><a href="https://arxiv.org/abs/1705.10461">Mescheder et al (2017)</a> address the second issue, and do it very elegantly. To find Nash equilibria, our best tool is simultaneous gradient ascent, an iterative algorithm defined by the following recursion:
$$
x_{t+1} \leftarrow x_t + h v(x_t), <br>
$$
where $x_t = (\theta_t, \phi_t)$ is the state of the game, $h$ is a scalar stepsize and $v(x)$ is the following vector field: <br>
$$
v(x) = \left(\begin{matrix}\frac{\partial}{\partial\theta}f(\theta, \phi)\\\frac{\partial}{\partial\phi}g(\theta, \phi)\end{matrix}\right). <br>
$$ </p>
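<p>To make the recursion concrete, here is a minimal numpy sketch (my own illustration, not code from the paper); the toy field and step size are arbitrary choices:</p>

```python
import numpy as np

def simultaneous_gradient_step(x, v, h=0.1):
    """One step of the recursion x_{t+1} = x_t + h * v(x_t)."""
    return x + h * v(x)

# Smoke test on a hypothetical *conservative* field: v(x) = -x is the
# gradient of -||x||^2 / 2, so the iteration is plain gradient descent
# on ||x||^2 / 2 and converges to the unique fixed point at the origin.
v = lambda x: -x
x = np.array([1.0, -2.0])
for _ in range(200):
    x = simultaneous_gradient_step(x, v, h=0.1)
print(np.linalg.norm(x))  # shrinks towards 0
```

For well-behaved conservative fields like this one, the recursion behaves exactly like gradient descent; the rest of the post is about what changes when $v$ is not conservative.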

<p>Here is an important observation that may look paradoxical at first: it is natural to think of GAN training as a special case of neural network training, but in fact, it is the other way around:</p>

<blockquote>
  <p>simultaneous gradient descent is a generalization, rather than a special case, of gradient descent</p>
</blockquote>

<h2 id="nonconservativevectorfields">Non-conservative vector fields</h2>

<p>One key difference between ordinary gradient descent and simultaneous gradient descent is that while the former finds fixed points of a conservative vector field, the latter has to deal with non-conservative fields. Therefore, I want to spend most of this blog post talking about this difference and what these terms mean.</p>

<p>A vector field on $\mathbb{R}^n$ is simply a function, $v:\mathbb{R}^n \rightarrow \mathbb{R}^n$, which takes a vector $x$ and outputs another vector $v(x)$ of the same dimensionality.</p>

<p>A vector field we often work with is the gradient of a scalar function, say $v(x) = \frac{\partial}{\partial x} \phi(x)$, where $\phi:\mathbb{R}^n \rightarrow \mathbb{R}$ can be a training objective, energy or loss function. These types of vector fields are very special. They are called conservative vector fields, which essentially translates to "nothing particularly crazy happens". Conservative fields and gradients of scalar functions are in one-to-one correspondence: a vector field $v$ is conservative if and only if there is a scalar potential $\phi$ whose gradient equals $v$.</p>

<p>Another vector field we often encounter in machine learning (but don't often think of it as a vector field) is the one defined by an autoencoder. An AE, too, takes as input some vector $x$ and returns another vector $v(x)$ which is the same size. Figure 5 of (Alain and Bengio, 2014) illustrates the vector field of denoising autoencoders trained on 2D data pretty nicely:</p>

<div class="image-div" style="width: 800px;">  
<img src="http://www.inference.vc/content/images/2017/10/Screen-Shot-2017-10-04-at-8.39.38-AM.png" alt="GANs are Broken in More than One Way: The Numerics of GANs">
</div>

<p>A vector field defined by an AE is not necessarily conservative, which basically means that weird things can happen. What kind of weird things? Let's have a look at an extreme example: the constant curl vector field $v(\theta,\phi) = (-\phi, \theta)$, a very typical example of a non-conservative vector field:</p>

<div class="image-div" style="width: 600px;">  
<img src="http://www.inference.vc/content/images/2017/10/animation1-3.gif" alt="GANs are Broken in More than One Way: The Numerics of GANs">
</div>

<p>This vector field arises naturally in the zero-sum game where $f(\theta, \phi)=-g(\theta, \phi)=-\theta\phi$. <a href="https://arxiv.org/abs/1606.03498">Salimans et al. (2016, Section 3)</a> mention the exact same toy example in the context of GANs. Notice how the vectors point around in circles, and the rotation in the field is visually apparent. Indeed, if you follow the arrows on this vector field (which is what simultaneous gradient descent does), you just end up going in circles, as shown.</p>

<p>Compare this vector field to Escher's impossible tower below. The servants think they are ascending, or descending, but in reality all they do is go around in circles. It is of course impossible to build Escher's tower as a real 3D object. Analogously, it is impossible to express the curl vector field as the gradient of a scalar function.</p>

<div class="image-div" style="width: 600px;">  
<img src="http://www.inference.vc/content/images/2017/10/ascending-and-descending-escher.jpg" alt="GANs are Broken in More than One Way: The Numerics of GANs">
</div>

<p>Even though the curl field has an equilibrium at $(\theta, \phi)=(0, 0)$, simultaneous gradient descent will never find it, which is bad news. While we got used to gradient descent always converging to a local minimum, simultaneous gradient descent does not generally converge to equilibria. It can just get stuck in a cycle, and momentum-based variants can even accumulate unbounded momentum and go completely berserk.</p>
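<p>A few lines of numpy are enough to see this (a sketch with my own choice of step size): each Euler step along the curl field multiplies the distance to the equilibrium by exactly $\sqrt{1+h^2}$, so the iterates spiral outwards rather than converge:</p>

```python
import numpy as np

# The curl field v(theta, phi) = (-phi, theta): simultaneous gradient
# descent (Euler steps along v) never reaches the equilibrium at the
# origin -- each step multiplies the distance to it by sqrt(1 + h^2),
# so the iterates slowly spiral *outwards*.
def v(x):
    theta, phi = x
    return np.array([-phi, theta])

h = 0.1
x = np.array([1.0, 0.0])
norms = [np.linalg.norm(x)]
for _ in range(100):
    x = x + h * v(x)
    norms.append(np.linalg.norm(x))

print(norms[-1] / norms[0])  # (1 + h^2)^(100/2), roughly 1.64 -- not 0
```

Shrinking $h$ slows the outward drift but never eliminates it; only the continuous-time limit $h \to 0$ cycles exactly.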

<h2 id="consensusoptimizationtaminganonconvectorfield">Consensus optimization: taming a non-con vector field</h2>

<p>The solution proposed by Mescheder et al is to construct a conservative vector field from the original as follows: $-\nabla L(x) := -\frac{\partial}{\partial x}\| v(x) \|_2^2$. This is clearly conservative since we defined it as the gradient of a scalar function $L$. It is easy to see that this new vector field $-\nabla L$ has the same fixed points as $v$. Below I plotted the conservative vector field $-\nabla L$ that corresponds to the curl field above:</p>

<div class="image-div" style="width: 600px;">  
<img src="http://www.inference.vc/content/images/2017/10/animation2-1.gif" alt="GANs are Broken in More than One Way: The Numerics of GANs">
</div>

<p>This is great. Familiar gradient descent on $L$ converges to a local minimum, which is a fixed point of $v$. The problem now is, we have no control over what kind of fixed point we converge to. We seek positive equilibria, but $-\nabla L$ can't differentiate between saddle points and equilibria, or between negative and positive equilibria. This is illustrated on this slightly more bonkers vector field $v(\theta, \phi) = (\cos(\phi), \sin(\theta))$:</p>

<p><img src="http://www.inference.vc/content/images/2017/10/download-79-1.png" alt="GANs are Broken in More than One Way: The Numerics of GANs"></p>

<p>On the left-hand plot I annotated equilibria and saddle points. The middle plot shows the conservative relaxation $-\nabla L$ where both saddle points and equilibria are turned to local minima.</p>

<p>So what can we do? We can simply take a linear combination of the original $v$ and its conservative cousin $-\nabla L$. This is what the combined (still non-conservative) vector field looks like for the curl field (see also the third panel of the figure above):</p>

<div class="image-div" style="width: 600px;">  
<img src="http://www.inference.vc/content/images/2017/10/animation3-2.gif" alt="GANs are Broken in More than One Way: The Numerics of GANs">
</div>

<p>By combining the two fields, we may end up with a slightly better-behaved, but still non-conservative vector field. One way to quantify how well a vector field behaves is to look at eigenvalues of its Jacobian $v'(x)$. The Jacobian is the derivative of the vector field; for a conservative field it is the Hessian of the underlying potential, i.e. a second derivative. Unlike the Hessian, which is always symmetric, the Jacobian of a non-conservative field is non-symmetric, and it can have complex eigenvalues. For example, the Jacobian of the curl field is <br>
$$
v'(x) = \left[\begin{matrix}0&amp;-1\\1&amp;0\end{matrix}\right], <br>
$$
whose eigenvalues, $+i$ and $-i$, are purely imaginary.</p>
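<p>We can check this numerically (a trivial verification, included for completeness):</p>

```python
import numpy as np

# Jacobian of the curl field v(theta, phi) = (-phi, theta).
# It is antisymmetric (J.T == -J), hence certainly not a Hessian.
J = np.array([[0.0, -1.0],
              [1.0,  0.0]])

eigvals = np.linalg.eigvals(J)
print(eigvals)  # purely imaginary pair: +1j and -1j
```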

<p>Mescheder et al show that by combining $v$ with $-\nabla L$ into $v - \gamma \nabla L$, the eigenvalues of the combined field can be controlled (see paper for details), and if we choose a large enough $\gamma$, simultaneous gradient descent will converge to an equilibrium. Hooray!</p>

<p>Sadly, as we increase $\gamma$, we also reintroduce the spurious equilibria discussed above. These are equilibria of $v - \gamma\nabla L$ which are in fact only saddle points of $v$. So we can't just crank $\gamma$ up all the way, we have to find a reasonable middle-ground. This is a limitation of the approach, and it is unclear how much of a practical limitation it is.</p>
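<p>Continuing the curl-field example, here is a numpy sketch of consensus optimization (the step size $h$ and weight $\gamma$ are my own illustrative choices, not values from the paper):</p>

```python
import numpy as np

# Consensus optimization on the curl field: follow v - gamma * grad(L),
# where L(x) = ||v(x)||^2. For v(theta, phi) = (-phi, theta) we have
# L = theta^2 + phi^2, so grad(L) = (2*theta, 2*phi).
def v(x):
    theta, phi = x
    return np.array([-phi, theta])

def grad_L(x):
    return 2.0 * x

h, gamma = 0.1, 0.5
x = np.array([1.0, 0.0])
for _ in range(200):
    x = x + h * (v(x) - gamma * grad_L(x))

print(np.linalg.norm(x))  # spirals *inwards* to the equilibrium at 0
```

With these values each step multiplies the distance to the origin by $\sqrt{(1-2h\gamma)^2 + h^2} \approx 0.906 < 1$, so the same iteration that diverged on the raw curl field now converges.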

<h2 id="onemorethingstochasticminibatchvariant">One more thing: stochastic/minibatch variant</h2>

<p>I wanted to mention one more thing which is not discussed in the paper, but I think is important for anyone who would like to play with consensus optimization: how do we implement this with stochastic optimisation and minibatches? Although the paper deals with general games, in the case of a GAN, both $f$ and $g$ are in fact the average of some payoff over datapoints, e.g. $f(\theta, \phi) = \mathbb{E}_x f(\theta, \phi, x)$. Thus, for $L$ we have:</p>

<p>\begin{align}
L(x) &amp;= \left\| \frac{\partial f(\theta, \phi)}{\partial \theta} \right\|_2^2 + \left\| \frac{\partial g(\theta, \phi)}{\partial \phi} \right\|_2^2 \\ <br>
&amp;= \left\| \mathbb{E}_{x}\frac{\partial f(\theta, \phi, x)}{\partial \theta} \right\|_2^2 + \left\| \mathbb{E}_{x}\frac{\partial g(\theta, \phi, x)}{\partial \phi} \right\|_2^2
\end{align}</p>

<p>The problem with this is that it is not obvious how to construct an unbiased estimator for this from a minibatch of samples. Simply taking the squared norm of the gradient for each datapoint and then averaging will give us a biased estimate, an upper bound by Jensen's inequality. Averaging gradients in the minibatch, and then calculating the squared norm is still biased, but it is a slightly tighter upper bound.</p>

<p>To construct an unbiased estimator, one can use the following identity: <br>
$$
\mathbb{E}[X]^2 = \mathbb{E}[X^2] - \mathbb{V}[X],
$$
where both the second moment $\mathbb{E}[X^2]$ and the variance can be estimated in an unbiased fashion. I have discussed this with the authors, and I will ask them to leave a comment as to how exactly they did this in the experiment. They also promised to include more details about this in the camera ready version of the paper.</p>
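<p>Here is one way such an unbiased estimator can be implemented in numpy (my own sketch; I don't know whether it matches the authors' implementation). Applying $\mathbb{E}[X]^2 = \mathbb{E}[X^2] - \mathbb{V}[X]$ coordinate-wise, we subtract the unbiased sample variance over the minibatch, divided by the minibatch size, from the squared norm of the averaged gradient. The check below exploits the fact that sampling minibatches with replacement from a finite set of per-example gradients is i.i.d. sampling from the empirical distribution, so averaging the estimator over all possible minibatches must recover the true value exactly:</p>

```python
import itertools
import numpy as np

def biased_estimate(G):
    """||mean of minibatch gradients||^2 -- upward biased by Var / m."""
    return float(np.sum(G.mean(axis=0) ** 2))

def unbiased_estimate(G):
    """Unbiased estimate of ||E[gradient]||^2 from an i.i.d. minibatch
    G of shape (m, d): squared mean minus (sample variance) / m,
    applied coordinate-wise and summed."""
    m = G.shape[0]
    return biased_estimate(G) - float(G.var(axis=0, ddof=1).sum()) / m

# Tiny hypothetical "population" of per-example gradients (3 examples,
# 2 parameters); the quantity we want is ||population mean||^2 = 1.
pop = np.array([[1.0, 0.0], [0.0, 2.0], [-1.0, 1.0]])
true_value = float(np.sum(pop.mean(axis=0) ** 2))

# Enumerate all size-2 minibatches drawn with replacement.
batches = [np.array(b) for b in itertools.product(pop, repeat=2)]
avg_unbiased = np.mean([unbiased_estimate(G) for G in batches])
avg_biased = np.mean([biased_estimate(G) for G in batches])

print(true_value, avg_unbiased, avg_biased)  # avg_unbiased == true_value
```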

<h2 id="summary">Summary</h2>

<p>I really liked this paper; it was a real eye-opener. I always thought of the gradient descent-like algorithm we use in GANs as a special case of gradient descent, whereas in fact it is a generalisation. Therefore, the nice properties of gradient descent can't be taken for granted. The paper proposes a general and very elegant solution. Good job. Let's hope this will fix GANs once and for all.</p>]]></content:encoded></item><item><title><![CDATA[From Instance Noise to Gradient Regularisation]]></title><description><![CDATA[<p>This is just a short note to highlight a new paper I read recently:</p>

<ul>
<li>Roth, Lucchi, Nowozin and Hofmann (2017) <a href="https://arxiv.org/abs/1705.09367">Stabilizing Training of Generative Adversarial Networks through Regularization</a></li>
</ul>

<h2 id="instancenoise">Instance noise</h2>

<p>Last year, <a href="https://arxiv.org/abs/1610.04490">we</a> and <a href="https://openreview.net/pdf?id=Hk4_qw5xe">Arjovsky &amp; Bottou (2017)</a> independently noticed a fundamental problem with the way GANs are trained, and</p>]]></description><link>http://www.inference.vc/from-instance-noise-to-gradient-regularisation/</link><guid isPermaLink="false">66d80393-1763-4ae6-8871-2ff823102a8f</guid><dc:creator><![CDATA[Ferenc Huszar]]></dc:creator><pubDate>Thu, 01 Jun 2017 12:54:24 GMT</pubDate><media:content url="http://www.inference.vc/content/images/2017/06/Screen-Shot-2017-06-01-at-1.43.40-PM.png" medium="image"/><content:encoded><![CDATA[<img src="http://www.inference.vc/content/images/2017/06/Screen-Shot-2017-06-01-at-1.43.40-PM.png" alt="From Instance Noise to Gradient Regularisation"><p>This is just a short note to highlight a new paper I read recently:</p>

<ul>
<li>Roth, Lucchi, Nowozin and Hofmann (2017) <a href="https://arxiv.org/abs/1705.09367">Stabilizing Training of Generative Adversarial Networks through Regularization</a></li>
</ul>

<h2 id="instancenoise">Instance noise</h2>

<p>Last year, <a href="https://arxiv.org/abs/1610.04490">we</a> and <a href="https://openreview.net/pdf?id=Hk4_qw5xe">Arjovsky &amp; Bottou (2017)</a> independently noticed a fundamental problem with the way GANs are trained, and proposed the same solution. The problem is that the supports of the real data distribution $P$ and the generative model $Q$ are often non-overlapping; in such situations the divergences we think GANs minimise are meaningless, and the discriminator can be perfect, which can lead to instabilities.</p>

<p>Both papers proposed adding Gaussian <a href="http://www.inference.vc/instance-noise-a-trick-for-stabilising-gan-training/">instance noise</a> to both real and fake samples, essentially to broaden the two distributions, making sure there's meaningful overlap between their supports. Another way to think about this is that we're making the discriminator's job harder. Instance noise seems to stabilize GAN training, but it is also a bit dissatisfying in that sampling Gaussian noise in a very high-dimensional space introduces additional Monte Carlo noise which might slow down training (although arguably I'd prefer slow but stable training to a quicker but quasi-random process that never converges).</p>
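<p>A toy 1-D illustration of the problem and the fix (my own sketch; the noise level and sample sizes are arbitrary choices): with disjoint supports a simple threshold discriminator is perfect, while adding the same Gaussian noise to both sides makes the smoothed distributions overlap, so perfect discrimination becomes impossible:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Disjoint supports: all "real" samples sit at -1, all "fake" samples
# at +1, so the threshold rule "x < 0 => real" classifies perfectly.
real = -np.ones(1000)
fake = +np.ones(1000)

def accuracy(real, fake):
    """Accuracy of the fixed threshold discriminator at 0."""
    return 0.5 * ((real < 0).mean() + (fake >= 0).mean())

print(accuracy(real, fake))  # 1.0 -- a perfect discriminator

# Instance noise: add noise from the same Gaussian to both sets of
# samples; the smoothed distributions now overlap around 0.
sigma = 2.0
real_noisy = real + sigma * rng.standard_normal(real.shape)
fake_noisy = fake + sigma * rng.standard_normal(fake.shape)

print(accuracy(real_noisy, fake_noisy))  # strictly between 0.5 and 1.0
```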

<h2 id="analyticinstancenoise">Analytic instance noise</h2>

<p>What <a href="https://arxiv.org/abs/1705.09367">(Roth et al, 2017)</a> propose is essentially an analytical treatment of instance noise in the f-GAN framework. As I explained in <a href="http://www.inference.vc/instance-noise-a-trick-for-stabilising-gan-training/">this post</a>, instance noise can be understood as changing the divergence you use to quantify mismatch between $P$ and $Q$. If $d$ is your original divergence, instance noise changes the divergence to $d_\sigma$:</p>

<p>$$
d_\sigma[Q\|P] = d[Q \ast\mathcal{N}_\sigma \| P\ast\mathcal{N}_\sigma] <br>
$$</p>

<p><a href="https://arxiv.org/abs/1705.09367">(Roth et al, 2017)</a> showed that, if $d$ is an f-divergence, it is possible to approximately minimise $d_\sigma$ directly without adding noise to the samples. Instead, you have to regularise the norm of certain discriminator gradients at the fake datapoints. The relationship between additive noise and gradient regularisation is not entirely surprising: it's very similar to the connection between denoising and score matching, and it can be derived similarly from a Taylor approximation of the discriminator around the datapoint.</p>
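<p>To get some intuition for the Taylor argument (a generic illustration with a stand-in smooth function, not the paper's derivation): smoothing a function with small Gaussian noise shifts its value by roughly $\frac{\sigma^2}{2}$ times its second derivative, which is how an additive-noise objective turns into a derivative penalty. For $D = \sin$ the smoothed value is known in closed form, so we can compare:</p>

```python
import numpy as np

# Second-order Taylor expansion: E[D(x + sigma * Z)] is approximately
# D(x) + (sigma^2 / 2) * D''(x) for small sigma, Z ~ N(0, 1).
# For D = sin the smoothing is exact:
# E[sin(x + sigma * Z)] = exp(-sigma^2 / 2) * sin(x).
sigma = 0.1
x = 0.7

exact_smoothed = np.exp(-sigma**2 / 2) * np.sin(x)
taylor_smoothed = np.sin(x) + (sigma**2 / 2) * (-np.sin(x))  # D'' = -sin

print(exact_smoothed, taylor_smoothed)  # agree to about 1e-5
```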

<p>For details, and there are plenty of details I glossed over here, I recommend you read the paper. Though the maths is a bit dense, I think it's worth it.</p>

<p>Here I just took the toy example from Figure 1, which illustrates how the regularization helps the network converge to a good solution more quickly, and then stay there in a stable manner:</p>

<p><img src="http://www.inference.vc/content/images/2017/06/Screen-Shot-2017-06-01-at-1.41.21-PM.png" alt="From Instance Noise to Gradient Regularisation"></p>]]></content:encoded></item></channel></rss>