<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[inFERENCe]]></title><description><![CDATA[posts on machine learning, statistics, opinions on things I'm reading in the space]]></description><link>http://www.inference.vc/</link><generator>Ghost 0.11</generator><lastBuildDate>Sat, 23 Jun 2018 02:51:17 GMT</lastBuildDate><atom:link href="http://www.inference.vc/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[ML beyond Curve Fitting: An Intro to Causal Inference and do-Calculus]]></title><description><![CDATA[<p>You might have come across <a href="https://www.basicbooks.com/titles/judea-pearl/the-book-of-why/9780465097609/">Judea Pearl's new book</a>, and a <a href="https://www.quantamagazine.org/to-build-truly-intelligent-machines-teach-them-cause-and-effect-20180515/">related interview</a> which was widely shared in my social bubble. In the interview, Pearl dismisses most of what we do in ML as curve fitting. 
While I believe that's an overstatement (conveniently ignores RL for example), it's a nice</p>]]></description><link>http://www.inference.vc/untitled/</link><guid isPermaLink="false">f01e8d56-5117-482e-8a87-82dec692f14d</guid><dc:creator><![CDATA[Ferenc Huszar]]></dc:creator><pubDate>Thu, 24 May 2018 14:07:33 GMT</pubDate><media:content url="http://www.inference.vc/content/images/2018/05/Causality_-building-a-bridge-1.png" medium="image"/><content:encoded><![CDATA[<img src="http://www.inference.vc/content/images/2018/05/Causality_-building-a-bridge-1.png" alt="ML beyond Curve Fitting: An Intro to Causal Inference and do-Calculus"><p>You might have come across <a href="https://www.basicbooks.com/titles/judea-pearl/the-book-of-why/9780465097609/">Judea Pearl's new book</a>, and a <a href="https://www.quantamagazine.org/to-build-truly-intelligent-machines-teach-them-cause-and-effect-20180515/">related interview</a> which was widely shared in my social bubble. In the interview, Pearl dismisses most of what we do in ML as curve fitting. While I believe that's an overstatement (conveniently ignores RL for example), it's a nice reminder that most productive debates are often triggered by controversial or outright arrogant comments. Calling machine learning alchemy was a great recent example. After reading the article, I decided to look into his famous do-calculus and the topic causal inference once <em>again</em>.</p>

<p>Again, because this has happened to me semi-periodically. I first learned do-calculus in a (very unpopular but advanced) undergraduate course on Bayesian networks. Since then, I have re-encountered it every 2-3 years in various contexts, but somehow it never really struck a chord. I always just thought "this stuff is difficult and/or impractical" and eventually forgot about it and moved on. I never realized how fundamental this stuff was, until now.</p>

<p>This time around, I think I fully grasped the significance of causal reasoning and I turned into a full-on believer. I know I'm late to the game but I almost think it's basic hygiene for people working with data and conditional probabilities to understand the basics of this toolkit, and I feel embarrassed for completely ignoring this throughout my career.</p>

<p>In this post I'll try to explain the basics, and convince you why you should think about this, too. If you work on deep learning, that's an even better reason to understand this. Pearl's comments may be unhelpful if interpreted as contrasting deep learning with causal inference. Rather, you should interpret them as highlighting causal inference as a huge, relatively underexplored application of deep learning. Don't get discouraged by causal diagrams looking a lot like Bayesian networks (not a coincidence, seeing as both were pioneered by Pearl): they don't compete with deep learning, they complement it.</p>

<h2 id="basics">Basics</h2>

<p>First of all, causal calculus differentiates between two types of conditional distributions one might want to estimate. <strong>tldr</strong>: in ML we usually estimate only one of them, but in some applications we should actually try to, or even have to, estimate the other one.</p>

<p>To set things up, let's say we have i.i.d. data sampled from some joint $p(x,y,z,\ldots)$. Let's assume we have lots of data and the best tools (say, deep networks) to fully estimate this joint distribution, or any property, conditional or marginal distribution thereof. In other words, let's assume $p$ is known and tractable. Say we are ultimately interested in how variable $y$ behaves given $x$. At a high level, one can ask this question in two ways:</p>

<ul>
<li><p>observational $p(y\vert x)$: What is the distribution of $Y$ given that I <strong>observe</strong> variable $X$ takes value $x$. This is what we usually estimate in supervised machine learning. It is a conditional distribution which can be calculated from $p(x,y,z,\ldots)$ as a ratio of two of its marginals. $p(y\vert x) = \frac{p(x,y)}{p(x)}$. We're all very familiar with this object and also know how to estimate this from data.</p></li>
<li><p>interventional $p(y\vert do(x))$: What is the distribution of $Y$ if I were to <strong>set</strong> the value of $X$ to $x$. This describes the distribution of $Y$ I would observe if I intervened in the data generating process by artificially forcing the variable $X$ to take value $x$, but otherwise simulating the rest of the variables according to the original process that generated the data. (note that the data generating procedure is NOT the same as the joint distribution $p(x,y,z,\ldots)$ and this is an important detail).</p></li>
</ul>
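<p>To make the observational case concrete, here is a minimal numpy sketch (the joint table is made up for illustration) computing $p(y\vert x)$ from a joint over two binary variables, exactly as the ratio of marginals above:</p>

```python
import numpy as np

# A toy joint distribution p(x, y) over two binary variables,
# stored as a 2x2 table: rows index x, columns index y.
# (Numbers are invented for illustration.)
p_xy = np.array([[0.30, 0.10],
                 [0.15, 0.45]])

p_x = p_xy.sum(axis=1, keepdims=True)   # marginal p(x)
p_y_given_x = p_xy / p_x                # p(y|x) = p(x,y) / p(x)

print(p_y_given_x)                      # each row sums to 1
```

<p>Estimating this object from samples, rather than computing it from a known table, is exactly what supervised learning does.</p>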

<h2 id="arenttheythesamething">Aren't they the same thing?</h2>

<p>No. $p(y\vert do(x))$ and $p(y\vert x)$ are not generally the same, and you can verify this with several simple thought experiments. Say, $Y$ is the pressure in my espresso machine's boiler, which ranges roughly between $0$ and $1.1$ bar depending on how long it's been turned on. Let $X$ be the reading of the built-in barometer. Let's say we jointly observe $X$ and $Y$ at random times. Assuming the barometer functions properly, $p(y|x)$ should be a unimodal distribution centered around $x$, with randomness due to measurement noise. However, $p(y|do(x))$ won't actually depend on the value of $x$ and is generally the same as $p(y)$, the marginal distribution of boiler pressure. This is because artificially setting my barometer to a value (say, by moving the needle) won't actually cause the pressure in the tank to go up or down.</p>
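<p>This thought experiment is easy to simulate. In the sketch below (all numbers invented for illustration), conditioning on an observed barometer reading pins down the pressure, while setting the needle by hand leaves the pressure at its marginal distribution:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Assumed data-generating process: boiler pressure y varies with how
# long the machine has been on; the barometer reads y plus small noise.
y = rng.uniform(0.0, 1.1, size=n)
x = y + rng.normal(0.0, 0.02, size=n)

# Observational p(y | x ~ 0.9): conditioning on the reading is informative.
sel = np.abs(x - 0.9) < 0.01
print(y[sel].mean())          # close to 0.9

# Interventional p(y | do(x = 0.9)): we move the needle by hand; the
# pressure is generated exactly as before, so y keeps its marginal.
y_do = rng.uniform(0.0, 1.1, size=n)
x_do = np.full(n, 0.9)        # the intervention on the barometer
print(y_do.mean())            # close to 0.55, the marginal mean
```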

<p>In summary, $y$ and $x$ are correlated or statistically dependent and therefore seeing $x$ allows me to predict the value of $y$, but $y$ is not caused by $x$, so setting the value of $x$ won't affect the distribution of $y$. Hence, $p(y\vert x)$ and $p(y\vert do(x))$ behave very differently. This simple example is just the tip of the iceberg. The differences between interventional and observational conditionals can be a lot more nuanced and hard to characterize when there are lots of variables with complex interactions.</p>

<h2 id="whichonedoiwant">Which one do I want?</h2>

<p>Depending on the application you want to solve, you should seek to estimate one of these conditionals. If your ultimate goal is diagnosis or forecasting (i.e. observing a naturally occurring $x$ and inferring the probable values of $y$) you want the observational conditional $p(y\vert x)$. This is what we already do in supervised learning, this is what Judea Pearl called curve fitting. This is all good for a range of important applications such as classification, image segmentation, super-resolution, voice transcription, machine translation, and many more.</p>

<p>In applications where you ultimately want to control or choose $x$ based on the conditional you estimated, you should seek to estimate $p(y\vert do(x))$ instead. For example, if $x$ is a medical treatment and $y$ is the outcome, you are not merely interested in observing a naturally occurring treatment $x$ and predicting the outcome; you want to <em>proactively choose</em> the treatment $x$ given your understanding of how it affects the outcome $y$. Similar situations occur in system identification, control and online recommender systems.</p>

<h2 id="whatexactlyisdpyvertdoxd">What exactly is $p(y\vert do(x))$?</h2>

<p>This is perhaps the main concept I hadn't grasped before. $p(y\vert do(x))$ is in fact a vanilla conditional distribution, but it is computed not from $p(x,z,y,\ldots)$, but from a different joint $p_{do(X=x)}(x,z,y,\ldots)$. This $p_{do(X=x)}$ is the joint distribution of data which we would observe if we actually carried out the intervention in question. $p(y\vert do(x))$ is the conditional distribution we would learn from data collected in <a href="https://en.wikipedia.org/wiki/Randomized_controlled_trial">randomized controlled trials</a> or A/B tests where the experimenter controls $x$. Note that actually carrying out the intervention or randomized trials may be impossible, impractical or unethical in many situations. You can't run an A/B test forcing half your subjects to smoke weed and the other half to smoke a placebo in order to understand the effect of marijuana on their health. Even if you can't directly estimate $p(y\vert do(x))$ from randomized experiments, the object still exists. The main question of causal inference and do-calculus is:</p>

<blockquote>
  <p>If I cannot measure $p(y\vert do(x))$ directly in a randomized controlled trial, can I estimate it based on data I observed outside of a controlled experiment?</p>
</blockquote>

<h2 id="howareallthesethingsrelated">How are all these things related?</h2>

<p>Let's start with a diagram that shows what's going on if we only care about $p(y\vert x)$, i.e. the simple supervised learning case:</p>

<p><img src="http://www.inference.vc/content/images/2018/05/Causality-0_-just-observational.png" alt="ML beyond Curve Fitting: An Intro to Causal Inference and do-Calculus"></p>

<p>Let's say we observe 3 variables, $x, z, y$, in this order. Data is sampled i.i.d. from some observable joint distribution over 3 variables, denoted by the blue factor graph labelled 'observable joint'. If you don't know what a factor graph is, it's not important, the circles represent random variables, the little square represents a joint distribution of the variables it's connected to. We are interested in predicting $y$ from $x$, and say that $z$ is a third variable which we do not want to infer but we can also measure (I included this for completeness). The observational conditional $p(y\vert x)$ is calculated from this joint via simple conditioning. From the training data we can build a model $q(y\vert x;\theta)$ to approximate this conditional, for example using a deep net minimizing cross-entropy or whatever.</p>
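<p>As a stand-in for the deep net, here is a hedged one-parameter sketch of fitting $q(y\vert x;\theta)$ by minimizing cross-entropy on samples from a made-up joint (logistic regression trained with plain gradient descent):</p>

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy samples from a joint p(x, y): binary y depends stochastically on x.
n = 5000
x = rng.normal(size=n)
p = 1.0 / (1.0 + np.exp(-(2.0 * x - 0.5)))   # true p(y=1|x)
y = rng.binomial(1, p)

# Fit q(y|x; theta) by minimizing cross-entropy with gradient descent
# (a two-parameter stand-in for the deep net in the text).
w, b = 0.0, 0.0
for _ in range(2000):
    q = 1.0 / (1.0 + np.exp(-(w * x + b)))
    w -= 0.5 * np.mean((q - y) * x)   # gradient of cross-entropy wrt w
    b -= 0.5 * np.mean(q - y)         # gradient of cross-entropy wrt b

print(w, b)   # should approach the true parameters (2.0, -0.5)
```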

<p>Now, what if we're actually interested in $p(y\vert do(x))$ rather than $p(y\vert x)$? This is what it looks like:</p>

<p><img src="http://www.inference.vc/content/images/2018/05/Causality-2_-two-distros.png" alt="ML beyond Curve Fitting: An Intro to Causal Inference and do-Calculus"></p>

<p>So, we still have the blue observed joint and data is still sampled from this joint. However, the object we wish to estimate is on the bottom right, the red intervention conditional $p(y\vert do(x))$. This is related to the intervention joint which is denoted by the red factor graph above it. It's a joint distribution over the same domain as $p$ but it's a different distribution. If we could sample from this red distribution (e.g. actually run a randomized controlled trial where we get to pick $x$), the problem would be solved by simple supervised learning. We could generate data from the red joint, and estimate a model directly from there. However, we assume this is not possible, and all we have is data sampled from the blue joint. We have to see if we can somehow estimate the red conditional $p(y\vert do(x))$ from the blue joint.</p>

<h3 id="causalmodels">Causal models</h3>

<p>If we want to establish a connection between the blue and the red joints, <em>we must</em> introduce additional assumptions about the causal structure of the data generating mechanism. The only way we can make predictions about how our distribution changes as a consequence of an intervention is if we know how the variables are causally related. This information about causal relationships is not captured in the joint distribution alone; we have to introduce something more expressive than that. Here is what this looks like:</p>

<p><img src="http://www.inference.vc/content/images/2018/05/Causality_-building-a-bridge--1-.png" alt="ML beyond Curve Fitting: An Intro to Causal Inference and do-Calculus"></p>

<p>In addition to the observable joint, we now also have a causal model of the world (top left). This causal model contains more detail than the joint distribution: it knows not only that pressure and barometer readings are dependent, but also that pressure causes the barometer to go up and not the other way around. The arrows in this model correspond to the assumed direction of causation, and the absence of an arrow represents the absence of direct causal influence between variables. The mapping from causal diagrams to joint distributions is many-to-one: several causal diagrams are compatible with the same joint distribution. Thus, it is generally impossible to conclusively choose between different causal explanations by looking at observed data alone.</p>

<p>Coming up with a causal model is a modeling step where we have to consider assumptions about how the world works, what causes what. Once we have a causal diagram, we can emulate the effect of intervention by mutilating the causal network: deleting all edges that lead into nodes in a $do$ operator. This is shown on the middle-top panel. The mutilated causal model then gives rise to a joint distribution denoted by the green factor graph. This joint has a corresponding conditional distribution $\tilde{p}(y\vert do(x))$, which we can use as our approximation of $p(y\vert do(x))$. If we got the causal structure qualitatively right (i.e. there are no missing nodes and we got the direction of arrows all correct), this approximation is exact and $\tilde{p}(y\vert do(x)) = p(y\vert do(x))$. If our causal assumptions are wrong, the approximation may be bogus.</p>
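<p>The mutilation step is easy to spell out on a small discrete example. Below is a sketch with a made-up model $z \rightarrow x$, $z \rightarrow y$, $x \rightarrow y$: cutting the edge into $x$ means $z$ keeps its marginal, which gives $\tilde{p}(y\vert do(x)) = \sum_z p(z)p(y\vert x,z)$ rather than the observational $\sum_z p(z\vert x)p(y\vert x,z)$:</p>

```python
import numpy as np

# A made-up causal model over binary variables with structure
# z -> x, z -> y, x -> y (z confounds the effect of x on y).
p_z = np.array([0.6, 0.4])                       # p(z)
p_x_z = np.array([[0.8, 0.2],                    # p(x|z): rows z, cols x
                  [0.3, 0.7]])
p_y1_xz = np.array([[0.9, 0.4],                  # p(y=1|x,z): rows x, cols z
                    [0.6, 0.1]])

# Observational p(y=1|x): weight by p(z|x), obtained via Bayes' rule.
p_zx = p_z[:, None] * p_x_z                      # joint p(z, x)
p_z_given_x = p_zx / p_zx.sum(axis=0)            # p(z|x), columns index x
p_y1_obs = (p_y1_xz.T * p_z_given_x).sum(axis=0) # p(y=1|x) for x = 0, 1

# Interventional p(y=1|do(x)): mutilate the graph by cutting z -> x,
# so z keeps its marginal p(z) regardless of the value we set x to.
p_y1_do = p_y1_xz @ p_z                          # sum_z p(z) p(y=1|x,z)

print(p_y1_obs, p_y1_do)                         # generally different
```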

<p>Critically, to get to this green stuff, and thereby to establish the bridge between observational data and interventional distributions, we had to combine data with additional assumptions, prior knowledge if you wish. Data alone would not enable us to do this.</p>

<h3 id="docalculus">Do-calculus</h3>

<p>Now the question is: how can we say anything about the green conditional when we only have data from the blue distribution? We are in a better situation than before, as we have the causal model relating the two. To cut a long story short, this is what the so-called <em>do-calculus</em> is for. Do-calculus allows us to massage the green conditional distribution until we can express it in terms of various marginals, conditionals and expectations under the blue distribution. It extends our toolkit for working with conditional probability distributions with four additional rules we can apply to conditionals containing the $do$ operator. These rules take into account the properties of the causal diagram. The details can't be compressed into a single blog post, but here is <a href="https://arxiv.org/abs/1305.5506">an introductory paper on them</a>.</p>

<p>Ideally, as a result of a do-calculus derivation you end up with an equivalent formula for $\tilde{p}(y\vert do(x))$ which no longer has any $do$ operators in it, so you can estimate it from observational data alone. If this is the case, we say that the causal query $\tilde{p}(y\vert do(x))$ is <em>identifiable</em>. Conversely, if this is not possible no matter how hard we try applying do-calculus, we call the causal query <em>non-identifiable</em>, which means that we won't be able to estimate it from the data we have. The diagram below summarizes this causal inference machinery in its full glory.</p>

<p><img src="http://www.inference.vc/content/images/2018/05/Causality_-do-calculus-estimand--1-.png" alt="ML beyond Curve Fitting: An Intro to Causal Inference and do-Calculus"></p>

<p>The new panel called "estimable formula" shows the equivalent expression for $\tilde{p}(y\vert do(x))$ obtained as a result of the derivation including several do-calculus rules. Notice how the variable $z$ which is completely irrelevant if you only care about $p(y\vert x)$ is now needed to perform causal inference. If we can't observe $z$ we can still do supervised learning, but we won't be able to answer causal inference queries $p(y\vert do(x))$.</p>
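<p>Here is a hedged numerical illustration of that point (a made-up confounded model, not the diagram in the figure): the naive conditional computed without $z$ is biased, while the back-door adjustment $\sum_z p(z)p(y\vert x,z)$, estimated from observational samples alone, recovers the interventional quantity:</p>

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

# Observational samples from an invented confounded model
# z -> x, z -> y, x -> y.
z = rng.binomial(1, 0.4, size=n)
x = rng.binomial(1, np.where(z == 1, 0.7, 0.2))
y = rng.binomial(1, 0.2 + 0.4 * x + 0.3 * z)

# Naive estimate of p(y=1|x=1): ignores the confounder z.
naive = y[x == 1].mean()

# Back-door adjustment: sum_z p(z) p(y=1|x=1, z). By construction the
# true p(y=1|do(x=1)) is 0.6 * 0.6 + 0.4 * 0.9 = 0.72.
adjusted = sum((z == v).mean() * y[(x == 1) & (z == v)].mean()
               for v in (0, 1))

print(naive, adjusted)   # naive is biased upward by the confounder
```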

<h2 id="howdoiknowmycausalmodelisrightorwrong">How do I know my causal model is right or wrong?</h2>

<p>You can never fully verify the validity and completeness of your causal diagram based on observed data alone. However, there are certain aspects of the causal model which are empirically testable. In particular, the causal diagram implies certain conditional independence or dependence relationships between sets of variables. These dependencies or independencies can be empirically tested, and if they are not present in the data, that is an indication that your causal model is wrong. Taking this idea forward you can attempt to perform full causal discovery: attempting to infer the causal model or at least aspects of it, from empirical data.</p>
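<p>One such testable implication is easy to check by simulation. In this made-up example the assumed chain $a \rightarrow b \rightarrow c$ implies that $a$ and $c$ are independent given $b$; if stratifying on $b$ did not remove the dependence between $a$ and $c$, the chain model would be refuted:</p>

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500_000

# Hypothetical causal chain a -> b -> c (all probabilities invented).
a = rng.binomial(1, 0.5, size=n)
b = rng.binomial(1, np.where(a == 1, 0.8, 0.3))
c = rng.binomial(1, np.where(b == 1, 0.7, 0.2))

# The chain implies a is independent of c given b: within each stratum
# of b, the value of a should carry no information about c.
gaps = []
for bv in (0, 1):
    p0 = c[(b == bv) & (a == 0)].mean()
    p1 = c[(b == bv) & (a == 1)].mean()
    gaps.append(abs(p0 - p1))

print(gaps)   # both close to zero: the implied independence holds
```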

<p>But the bottom line is: a full causal model is a form of prior knowledge that you have to add to your analysis in order to get answers to causal questions without actually carrying out interventions. Reasoning with data alone won't be able to give you this. Unlike priors in Bayesian analysis - which are a nice-to-have and can improve data-efficiency -  causal diagrams in causal inference are a must-have. With a few exceptions, all you can do without them is running randomized controlled experiments.</p>

<h2 id="summary">Summary</h2>

<p>Causal inference is indeed something fundamental. It allows us to answer "what-if-we-did-x" type questions that would normally require controlled experiments and explicit interventions to answer. And I haven't even touched on counterfactuals, which are even more powerful.</p>

<p>You can live without this in some cases. Often, you really just want to do normal inference. In other applications such as model-free RL, the ability to explicitly control certain variables may allow you to sidestep answering causal questions explicitly. But there are several situations, and very important applications, where causal inference offers the only method to solve the problem in a principled way.</p>

<p>I wanted to emphasize again that this is not a question of whether you work on deep learning or causal inference. You can, and in many cases you should, do both. Causal inference and do-calculus allow you to understand a problem and establish what needs to be estimated from data, based on your assumptions captured in a causal diagram. But once you've done that, you still need powerful tools to actually estimate that thing from data. Here, you can still use deep learning, SGD, variational bounds, etc. It is this intersection, deep learning applied to causal inference, which the recent article with Pearl claimed was under-explored.</p>

<p>UPDATE: In the comments below people actually pointed out some relevant papers (thanks!). If you are aware of any work, please add them there.</p>]]></content:encoded></item><item><title><![CDATA[The Lottery Ticket Hypothesis - Paper Recommendation]]></title><description><![CDATA[<p>I wanted to highlight a recent paper I came across, which is also a nice follow-up to my <a href="http://www.inference.vc/pruning-neural-networks-two-recent-papers/">earlier post</a> on pruning neural networks:</p>

<ul>
<li>J Frankle and M Carbin (2018) <a href="https://arxiv.org/abs/1803.03635">The Lottery Ticket Hypothesis: Training Pruned Neural Networks</a></li>
</ul>

<p>Many people working on network pruning observed that, starting from a wide</p>]]></description><link>http://www.inference.vc/the-lottery-ticket-hypothesis/</link><guid isPermaLink="false">bbb9aefb-1036-40e7-bdb9-012e1d21695f</guid><dc:creator><![CDATA[Ferenc Huszar]]></dc:creator><pubDate>Thu, 10 May 2018 13:42:38 GMT</pubDate><media:content url="http://www.inference.vc/content/images/2018/05/lotteryticket.jpg" medium="image"/><content:encoded><![CDATA[<img src="http://www.inference.vc/content/images/2018/05/lotteryticket.jpg" alt="The Lottery Ticket Hypothesis - Paper Recommendation"><p>I wanted to highlight a recent paper I came across, which is also a nice follow-up to my <a href="http://www.inference.vc/pruning-neural-networks-two-recent-papers/">earlier post</a> on pruning neural networks:</p>

<ul>
<li>J Frankle and M Carbin (2018) <a href="https://arxiv.org/abs/1803.03635">The Lottery Ticket Hypothesis: Training Pruned Neural Networks</a></li>
</ul>

<p>Many people working on network pruning observed that, starting from a wide network and pruning it, one obtains better performance than training the slim, pruned architecture from scratch (random initialization). This suggests that the added redundancy of an overly wide network is somehow useful in training.</p>

<p>In this paper, Frankle and Carbin present the <em>lottery ticket hypothesis</em> to explain these observations. According to the hypothesis, good performance results from lucky initialization of a subnetwork or subcomponent of the original network. Since fat networks have exponentially more component subnetworks, starting from a fatter network increases the effective number of lottery tickets, thereby increasing the chances of containing a winning ticket. According to this hypothesis, pruning effectively identifies the subcomponent which is the winning ticket.</p>

<p>It's important to note that good initialization is a somewhat underrated but extremely important component of SGD-based deep learning. Indeed, I often find that if you use non-trivial architectures, training is essentially stuck until you tweak the off-the-shelf initialization schemes so that it somehow starts to work.</p>

<p>To test the hypothesis the authors have designed a set of cool experiments. The experiments basically go like this:</p>

<ol>
<li>start from a fat network  </li>
<li>train it  </li>
<li>prune it  </li>
<li>take the resulting skinny network and reset weights to their initial values (they call this the winning ticket)  </li>
<li>train the winning ticket network  </li>
<li>as a control, reinitialize the weights of the winning ticket randomly, and train that</li>
</ol>
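<p>Steps 3-4, the heart of the recipe, can be sketched in a few lines. This is my own simplified stand-in, not the authors' code: magnitude pruning on a flat weight vector, followed by resetting the surviving weights to their initial values:</p>

```python
import numpy as np

def winning_ticket(init_w, trained_w, sparsity):
    """Prune the smallest-magnitude *trained* weights, then reset the
    survivors to their *initial* values (steps 3-4 of the recipe)."""
    threshold = np.quantile(np.abs(trained_w), sparsity)
    mask = np.abs(trained_w) >= threshold      # keep the large weights
    return init_w * mask, mask

rng = np.random.default_rng(4)
w_init = rng.normal(size=1000)
w_trained = w_init + rng.normal(scale=0.5, size=1000)  # stand-in for SGD

ticket, mask = winning_ticket(w_init, w_trained, sparsity=0.8)
print(mask.mean())   # roughly 0.2 of the weights survive
```

<p>The control in step 6 would replace <code>init_w</code> with a fresh random draw before applying the same mask.</p>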

<p>Here are some of the results from Figure 4.</p>

<p><img src="http://www.inference.vc/content/images/2018/05/Screen-Shot-2018-05-10-at-2.08.57-PM.png" alt="The Lottery Ticket Hypothesis - Paper Recommendation"></p>

<p>Observe two main things in these results. The winning tickets, without the remaining redundancy of the wide network, train faster than the wide network. In fact, the skinnier they are, the faster they train (within reason). However, if you reinitialize the networks' weights randomly (control), the resulting nets now train slower than the full network. Therefore, pruning is not just about finding the right architecture, it's also about identifying the 'winning ticket', which is a particularly luckily initialized subcomponent of the network.</p>

<p>I thought this paper was pretty cool. It underscores the role that (random) initialization plays in the successful training of neural networks, and gives an interesting justification for why redundancy and over-parametrization might actually be a desirable property when combined with SGD from random init.</p>

<p>There are various ways this line of research can be extended and built upon. For a starter, it would be great to see how the results hold up when better pruning schemes and more general classes of initialization approaches are used. It would be interesting to see if the training+pruning method consistently identifies the same winning ticket, or if it finds different winning tickets each time. One could look at various properties of randomly sampled tickets and compare them to the winning tickets. For example, I would be very curious to see what these winning tickets look like on the <a href="https://arxiv.org/abs/1703.00810">information plane</a>. Similarly, it would be great to find cheaper ways or heuristics to identify winning tickets without having to train and prune the fat network.</p>]]></content:encoded></item><item><title><![CDATA[Goals and Principles of Representation Learning]]></title><description><![CDATA[<p>This is a post about my takeaways from the DALI workshop on the <a href="http://dalimeeting.org/dali2018/workshopRepresentationLearning.html">Goals and Principles of Representation Learning</a> which we co-organized with DeepMinders <a href="http://shakirm.com/">Shakir Mohamed</a> and <a href="https://www.cs.toronto.edu/~amnih/">Andriy Mnih</a> and Twitter colleague <a href="http://theis.io/">Lucas Theis</a>. We had an amazing set of presentations, videos are <a href="https://www.youtube.com/playlist?list=PL-tWvTpyd1VAlbzhCpljlREd76Nlo1pOo">available here</a>.</p>

<p>DALI is my favourite meeting</p>]]></description><link>http://www.inference.vc/goals-and-principles-of-representation-learning/</link><guid isPermaLink="false">5ef782e6-8c14-4e59-a604-a23351e29bf9</guid><dc:creator><![CDATA[Ferenc Huszar]]></dc:creator><pubDate>Thu, 12 Apr 2018 13:51:41 GMT</pubDate><media:content url="http://www.inference.vc/content/images/2018/04/288kum-1.jpg" medium="image"/><content:encoded><![CDATA[<img src="http://www.inference.vc/content/images/2018/04/288kum-1.jpg" alt="Goals and Principles of Representation Learning"><p>This is a post about my takeaways from the DALI workshop on the <a href="http://dalimeeting.org/dali2018/workshopRepresentationLearning.html">Goals and Principles of Representation Learning</a> which we co-organized with DeepMinders <a href="http://shakirm.com/">Shakir Mohamed</a> and <a href="https://www.cs.toronto.edu/~amnih/">Andriy Mnih</a> and Twitter colleague <a href="http://theis.io/">Lucas Theis</a>. We had an amazing set of presentations, videos are <a href="https://www.youtube.com/playlist?list=PL-tWvTpyd1VAlbzhCpljlREd76Nlo1pOo">available here</a>.</p>

<p>DALI is my favourite meeting of the year. It's small, and <a href="http://dalimeeting.org/participants">participants</a> are all very experienced and engaged. To make the most of the high-quality audience we decided to focus on slightly controversial topics, and asked speakers to present their general thoughts, their unique perspectives and opinions rather than talk about their latest research project. Broadly, we asked three types of questions:</p>

<ol>
<li>What tasks do we use unsupervised representation learning for?  </li>
<li>What principles can help us do well on these tasks?  </li>
<li>How can we turn principles into actual objective functions and algorithms?</li>
</ol>

<h3 id="taskprinciplelossfunctonalgorithm">Task → Principle → Loss function → … → Algorithm</h3>

<p>From my perspective, the end-goal of any area of ML research should be for each algorithm to have a justification from first principles. Like the chain above starting from the Task all the way down to the precise algorithm we're going to run. </p>

<p>In my <a href="http://www.inference.vc/design-patterns/">machine learning cookbook post</a> I reviewed a number of fairly well understood problem transformations which one can apply to the <em>bottom</em> of this chain, typically to turn an intractable optimisation problem into a tractable one: variational inference, evolution strategies, convex relaxations, etc. In the workshop we focussed on the <em>top</em> of the chain, the <a href="http://blog.shakirm.com/2013/04/marrs-levels-of-analysis/">computational level understanding</a>: Do we actually have clarity on the ultimate task we are solving? If the task is given, is the principle we're applying actually justified? Is the loss function consistent with the principle?</p>

<h3 id="transferlearning">Transfer learning</h3>

<p>One thing that most speakers and attendees seemed to agree on is that the most interesting and most important task we use unsupervised representation learning for is to transfer knowledge from large datasets to new tasks. I'll try to capture one possible formulation of what this means:</p>

<ul>
<li>we have some large auxiliary dataset $\mathcal{D}$.</li>
<li>we have a future ML task $T$ which we have to learn to solve in the future. We may not have complete knowledge as to what this task is.</li>
<li>The goal of transfer learning is to <strong>increase our data-efficiency</strong> in solving $T$, using side information derived from the auxiliary dataset $\mathcal{D}$.</li>
</ul>

<p>There are many possible extensions of this basic setup. For example we may have a distribution of tasks, rather than a single task.</p>

<h3 id="disentanglement">Disentanglement</h3>

<p>One of the principles we discussed was disentanglement. It is somewhat unclear what exactly disentanglement is, but one way to capture this is to say "each coordinate in the representation corresponds to one meaningful factor of variation". Of course this is kind of a circular definition inasmuch as it's not easy to define a meaningful factor of variation. Irina defined disentangled representations as "factorized and interpretable", which, as a definition, has its own issues, too.</p>

<p>In the context of transfer learning though, I can see a vague argument for why seeking disentanglement might be a useful principle. One way to improve data efficiency of a ML algorithm is to endow it with inductive biases or to reduce the complexity of the function class in some sense. If we believe that future downstream tasks we may want to solve can be solved with simple, say linear, models on top of a disentangled representation, then seeking disentanglement makes sense. It may be possible to use this argument itself to formulate a more precise definition of or objective function for disentanglement.</p>

<h3 id="selfsupervision">Self-supervision</h3>

<p>The idea of self-supervision also featured prominently in talks. Self-supervision is the practice of defining auxiliary supervised learning tasks involving the unlabelled data, with hopes that solving these auxiliary tasks will require us to build representations which are also useful in the eventual downstream task we'd like to solve. Autoencoders, pseudolikelihood, denoising, <a href="http://www.inference.vc/notes-on-unsupervised-learning-of-visual-representations-by-solving-jigsaw-puzzles/">jigsaw-puzzles</a>, <a href="http://www.inference.vc/temporal-contrastive-learning-for-latent-variable-models/">temporal-contrastive learning</a> and so on are examples of this approach.</p>
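<p>As a toy illustration of the idea (my own sketch, not taken from any of the linked papers): give a model a denoising task on unlabelled data and, to solve it, the model must discover the data's underlying structure, here a 1-D subspace, with only the data itself as supervision:</p>

```python
import numpy as np

rng = np.random.default_rng(5)

# Unlabelled data living near a line in 2-D; no labels anywhere in sight.
n = 2000
t = rng.normal(size=(n, 1))
X = t @ np.array([[2.0, 1.0]])                      # clean data
X_noisy = X + rng.normal(scale=0.3, size=X.shape)   # corrupted inputs

# Self-supervised objective: reconstruct clean from noisy. A linear
# denoiser has a closed form; a deep net + SGD would play this role.
W = np.linalg.lstsq(X_noisy, X, rcond=None)[0]

# The learned W approximately projects onto the data direction: solving
# the auxiliary task recovered the underlying 1-D structure.
err_before = np.mean((X_noisy - X) ** 2)
err_after = np.mean((X_noisy @ W - X) ** 2)
print(err_before, err_after)
```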

<p>One very interesting aspect Harri talked about was the question of relevance: how can we make sure that an auxiliary learning task is relevant to the primary inference task we need to solve? In a semi-supervised learning setup Harri showed how you can adapt an auxiliary denoising task to be more relevant: you can calculate a sort of saliency map to determine which input dimensions the primary inference network pays attention to, and then use saliency-weighted denoising as the auxiliary task.</p>

<h3 id="arerepresentationsthesolution">Are representations the solution?</h3>

<p>After the talks we had an open discussion session, hoping for a heated debate with people representing opposing viewpoints. And the audience really delivered on this. 10 minutes or so into the discussion Zoubin walked in and stated "I don't believe in representation learning". I tried to capture the main points in the argument (Zoubin being anti-representation) below:</p>

<p><img src="http://www.inference.vc/content/images/2018/04/288kum.jpg" alt="Goals and Principles of Representation Learning"></p>

<p>Retrospectively, I would summarise Zoubin's (and others') criticism as follows: If we identified transfer learning as the primary task representation learning is supposed to solve, are we actually sure that representation learning is the way to solve it? Indeed, this is not one of the questions I asked in my slides, and it's a very good one. One can argue that there may be many ways to transfer information from some dataset over to a novel task. Learning a representation and transferring that is just one approach. Meta-learning, for example, might provide another approach.</p>

<h3 id="summary">Summary</h3>

<p>I have really enjoyed this workshop. The talks were great, participants were engaged, and the discussion at the end was a fantastic ending for DALI itself. I am hoping that, just like me, others have left the workshop with many things to talk about, and that the debates we had will inspire future work on representation learning (or anti-representation learning).</p>]]></content:encoded></item><item><title><![CDATA[Pruning Neural Networks: Two Recent Papers]]></title><description><![CDATA[<p>I wanted to briefly highlight two recent papers on pruning neural networks (disclaimer, one of them is ours):</p>

<ul>
<li>Christos Louizos, Max Welling, Diederik P. Kingma (2018) <a href="https://arxiv.org/abs/1712.01312">Learning Sparse Neural Networks through $L_0$ Regularization</a></li>
<li>Lucas Theis, Iryna Korshunova, Alykhan Tejani, Ferenc Huszár (2018) <a href="https://arxiv.org/abs/1801.05787">Faster gaze prediction with dense networks and</a></li></ul>]]></description><link>http://www.inference.vc/pruning-neural-networks-two-recent-papers/</link><guid isPermaLink="false">dc88fa1f-79e8-4a84-850f-4e5123ea80f8</guid><dc:creator><![CDATA[Ferenc Huszar]]></dc:creator><pubDate>Tue, 06 Feb 2018 15:41:46 GMT</pubDate><media:content url="http://www.inference.vc/content/images/2018/02/Screen-Shot-2018-02-05-at-5.05.39-PM-1.png" medium="image"/><content:encoded><![CDATA[<img src="http://www.inference.vc/content/images/2018/02/Screen-Shot-2018-02-05-at-5.05.39-PM-1.png" alt="Pruning Neural Networks: Two Recent Papers"><p>I wanted to briefly highlight two recent papers on pruning neural networks (disclaimer, one of them is ours):</p>

<ul>
<li>Christos Louizos, Max Welling, Diederik P. Kingma (2018) <a href="https://arxiv.org/abs/1712.01312">Learning Sparse Neural Networks through $L_0$ Regularization</a></li>
<li>Lucas Theis, Iryna Korshunova, Alykhan Tejani, Ferenc Huszár (2018) <a href="https://arxiv.org/abs/1801.05787">Faster gaze prediction with dense networks and Fisher pruning</a></li>
</ul>

<p>What I generally refer to as pruning in the title of this post is reducing or controlling the number of non-zero parameters, or the number of featuremaps actively used, in a neural network. At a high level, there are at least three ways one can go about this; pruning is really only one of them:</p>

<ul>
<li><em>regularization</em> modifies the objective function/learning problem so that the optimization is likely to find a neural network with a small number of parameters. Louizos et al (2018) choose this approach.</li>
<li><em>pruning</em> takes a large network and deletes features or parameters that are in some sense redundant; Theis et al (2018) is an example of this.</li>
<li><em>growing</em>: although less widespread, you can take a third approach where, starting from a small network, you incrementally add new units according to some growth criterion.</li>
</ul>

<h2 id="whydothis">Why do this?</h2>

<p>There are different reasons for pruning your network. The most obvious, perhaps, is to reduce computational cost while keeping the same performance. Removing features which aren't really used in your deep network architecture can speed up inference as well as training. You can also think of pruning as a form of architecture search: figuring out how many features you need in each layer for best performance.</p>

<p>The second argument is to improve generalization by reducing the number of parameters, and thus the redundancy in the parameter space. As we have seen in recent work on generalization in deep networks, the raw number of parameters ($L_0$ norm) is not actually a sufficient predictor of their generalization ability. That said, we empirically find that pruning a network tends to help generalization. Meanwhile, the community is developing (or, putting my Schmidhuber-hat on: maybe in some cases rediscovering) new parameter-dependent quantities to predict/describe generalization. The <a href="http://www.inference.vc/generalization-and-the-fisher-rao-norm-2/">Fisher-Rao norm</a> is a great example of these. Interestingly, Fisher pruning <a href="https://arxiv.org/abs/1801.05787">(Theis et al, 2018)</a> turns out to have a nice connection to the Fisher-Rao norm, and this may hint at a deeper relationship between pruning, parameter redundancy and generalization.</p>

<h2 id="dl_0dregularization">$L_0$ regularization</h2>

<p>I found the $L_0$ paper by <a href="https://arxiv.org/abs/1712.01312">Louizos et al, (2018)</a> very interesting in that it can be seen as a straightforward application of the machine learning problem transformations I wrote up in the <a href="http://www.inference.vc/design-patterns/">machine learning cookbook</a> a few months ago. It's a good illustration of how you can use these general ideas to go from formulating an intractable ML optimization problem to something practical you can run SGD on.</p>

<p>So I will summarize the paper as a series of steps, each changing the optimization problem:</p>

<ol>
<li>start from an ideal loss function which may be intractable to optimize: the usual training loss plus the $L_0$ norm of parameters, combined linearly. The $L_0$ norm simply counts non-zero entries in a vector, a non-differentiable piecewise constant function. This is a difficult, combinatorial optimization problem.  </li>
<li>apply <a href="http://www.inference.vc/evolution-strategies-variational-optimisation-and-natural-es-2/">variational optimization</a> to turn the non-differentiable function into a differentiable one. This generally works by introducing a probability distribution $p_{\psi}(\theta)$ over parameters $\theta$. Even if the objective is non-differentiable with respect to any $\theta$, the average loss under $p_{\psi}$ may be differentiable w.r.t. $\psi$. To find the optimal $\psi$, one can generally use a REINFORCE gradient estimator, which results in evolution strategies. But ES generally has very high variance, so we  </li>
<li>apply the <strong>reparametrization trick</strong> to $p_\psi$ to construct a lower-variance gradient estimator. This, however, only works for continuous variables. To deal with the discreteness, we turn to a  </li>
<li><strong>concrete relaxation</strong>, which approximates the discrete random variable by a continuous approximation. Now we have a lower-variance (compared to REINFORCE) gradient estimator which one can calculate via backprop and simple Monte Carlo sampling. You can use these gradients in SGD (Adam), which is what the paper does.</li>
</ol>
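Steps 2 to 4 can be sketched in a few lines. This is my own rough numpy sketch of the "hard concrete" gate following the formulas in Louizos et al; the constants are their suggested defaults, not something I've tuned:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hard_concrete_gate(log_alpha, beta=2.0 / 3.0, gamma=-0.1, zeta=1.1, rng=None):
    """Sample a 'hard concrete' gate: a reparametrized, concrete-relaxed
    Bernoulli mask, stretched and clipped so it can be exactly 0 or 1."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.uniform(1e-6, 1.0 - 1e-6, size=np.shape(log_alpha))
    s = sigmoid((np.log(u) - np.log(1.0 - u) + log_alpha) / beta)  # concrete sample
    return np.clip(s * (zeta - gamma) + gamma, 0.0, 1.0)           # stretch and clip

def expected_l0(log_alpha, beta=2.0 / 3.0, gamma=-0.1, zeta=1.1):
    """Differentiable surrogate for the expected number of non-zero gates."""
    return sigmoid(log_alpha - beta * np.log(-gamma / zeta))
```

The surrogate `expected_l0` stands in for the raw parameter count in the loss, and gradients can flow into `log_alpha` thanks to the reparametrization in step 3.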

<p>Interestingly, the connection between Eq. (3) and evolution strategies or variational optimization is not mentioned. Instead, a motivation based on a different connection to spike-and-slab priors is given. I recommend reading the paper, perhaps with this connection in mind.</p>

<p>The authors then show that this indeed works, and that it compares favourably to other methods designed to reduce the number of parameters. <br>
Thinking about the paper in terms of these steps converting from one problem to another allows you to generalize or improve the idea. For example, the <a href="https://arxiv.org/abs/1703.07370">REBAR</a> or <a href="https://arxiv.org/abs/1711.00123">RELAX</a> gradient estimators provide an unbiased and lower-variance alternative to the concrete relaxation, which may work very well here, too.</p>

<h2 id="fisherpruning">Fisher pruning</h2>

<p>The second paper I wanted to talk about is something from our own lab. Rather than being a purely methods paper, <a href="https://arxiv.org/abs/1801.05787">(Theis et al, 2018)</a> focusses on the specific application of building speedy neural networks to predict saliency in an image. The pruned network now powers the logic behind <a href="https://blog.twitter.com/engineering/en_us/topics/infrastructure/2018/Smart-Auto-Cropping-of-Images.html">cropping photos on Twitter</a>.</p>

<p>Our goal, too, was to reduce the computational cost of the network, specifically in the transfer learning setting: when building on top of a pre-trained neural network, you inherit a lot of complexity required to solve the original <em>source</em> task, much of which may be redundant for solving your <em>target</em> task. There is a difference in our high-level pruning objective: unlike the $L_0$ norm or group sparsity, we used a slightly more complicated formula to directly estimate the forward-pass runtime of the network. This is a quadratic function of the number of parameters at each layer, with interactions between neighbouring layers. Interestingly, this results in architectures which tend to alternate thick and thin layers, like the one below:</p>

<p><img src="http://www.inference.vc/content/images/2018/02/Screen-Shot-2018-02-05-at-5.05.39-PM.png" alt="Pruning Neural Networks: Two Recent Papers"></p>
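To make the runtime objective concrete, here is a toy cost model of my own (not the exact formula from the paper): the forward pass of a convolution stack costs roughly $n_{l-1} n_l k^2 H W$ multiply-adds per layer, so the total is quadratic in the layer widths, with terms coupling neighbouring layers only:

```python
def forward_cost(channels, hw=32 * 32, ksize=3):
    """Rough multiply-add count of a conv stack: each layer costs
    in_channels * out_channels * k^2 * H * W, so the total is quadratic
    in the layer widths, with neighbouring-layer interactions."""
    return sum(cin * cout * ksize ** 2 * hw
               for cin, cout in zip(channels[:-1], channels[1:]))

# alternating thick and thin layers can be cheaper than uniform widths,
# even with the same total number of hidden channels:
uniform = forward_cost([3, 64, 64, 64, 64])
alternating = forward_cost([3, 96, 32, 96, 32])
```

Under this toy model the alternating architecture is cheaper than the uniform one, which is consistent with the thick/thin pattern in the figure above.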

<p>We prune the trained network greedily by removing convolutional featuremaps one at a time. A meaningful principle for selecting the next feature map to prune is to minimize the resulting increase in training loss. Starting from this criterion, using a second order Taylor-expansion of the loss, making some more assumptions, we obtain the following <em>pruning signal</em> for keeping a parameter $\theta_i$:</p>

<p>$$
\Delta_i \propto F_i \theta_i^2,
$$</p>

<p>where $F_i$ denotes the $i^{th}$ diagonal entry of the Fisher information matrix. The above formula deals with removing a single parameter, but we can generalize this to removing entire featuremaps. Pruning proceeds by removing the parameter or featuremap with the smallest $\Delta$ in each iteration, and retraining the network between iterations. For more details, please see the paper.</p>
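As a concrete illustration, the pruning signal can be computed from per-example gradients. This is a minimal numpy sketch under the diagonal-Fisher assumption, where the empirical Fisher diagonal is just the mean squared per-example gradient:

```python
import numpy as np

def fisher_pruning_signal(per_example_grads, theta):
    """Delta_i = F_i * theta_i^2, with the diagonal empirical Fisher
    F_i estimated as the mean squared per-example gradient."""
    F = np.mean(per_example_grads ** 2, axis=0)  # shape: (n_params,)
    return F * theta ** 2

# two training examples, two parameters (toy numbers)
grads = np.array([[0.1, 2.0],
                  [-0.2, 1.5]])
theta = np.array([3.0, 0.1])
delta = fisher_pruning_signal(grads, theta)
prune_first = int(np.argmin(delta))  # remove the parameter with the smallest signal
```

In the paper this is done per featuremap rather than per individual parameter, with retraining between pruning iterations.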

<p>Adding to what's presented in the paper, I wanted to point out a few connections between Fisher pruning and ideas I discussed on this blog before.</p>

<h4 id="fisherraonorm">Fisher-Rao norm</h4>

<p>The first connection is to the Fisher-Rao norm. Assume for a minute that the Fisher information is diagonal - a big and unreasonable assumption in theory, but a pragmatic simplification resulting in useful algorithms in practice. With this assumption, the Fisher-Rao norm of $\theta$ becomes:</p>

<p>$$
\|\theta\|_{fr} = \sum_{i=1}^{I} F_i \theta_i^2
$$</p>

<p>Written in this form, you can hopefully see the connection between the FR-norm and the Fisher pruning criterion. Depending on the particular definition of Fisher information used, you can interpret the FR-norm, approximately, as</p>

<ul>
<li>the expected drop in training log likelihood (empirical Fisher info) as you remove a random parameter, or as</li>
<li>the approximate change in the conditional distribution defined by the model (model Fisher info) as you remove a parameter</li>
</ul>

<p>In the real world, the Fisher info matrix is not diagonal, and this is actually an important aspect of understanding generalization. For one, considering only diagonal entries makes Fisher pruning sensitive to certain reparametrizations (ones with non-diagonal Jacobian) of the network. But maybe there is a deeper connection to be observed here between Fisher-Rao norms and the redundancy of parameters.</p>

<h4 id="elasticweightconsolidation">Elastic Weight Consolidation</h4>

<p>Using the diagonal Fisher information values to guide pruning also bears resemblance to <a href="http://www.inference.vc/comment-on-overcoming-catastrophic-forgetting-in-nns-are-multiple-penalties-needed-2/">elastic weight consolidation</a> by <a href="http://www.pnas.org/content/114/13/3521.full">(Kirkpatrick et al, 2017)</a>. In EWC, the Fisher information values are used to establish which weights are more or less important for solving previous tasks. There, the algorithm was derived from the perspective of Bayesian on-line learning, but you can also motivate it from <a href="https://arxiv.org/abs/1712.03847">a Taylor expansion perspective</a>, just like Fisher pruning.</p>

<p>The metaphor I use to understand and explain EWC is that of a shared hard drive. (WARNING: like all metaphors, this may be completely missing the point). The parameters of a neural network are like a hard drive or storage volume of some sort. Training the NN on a task involves compressing the training data and <em>saving the information</em> to the hard drive. If you have no mechanism to keep data from being overwritten, it's going to be overwritten: in neural networks, catastrophic forgetting occurs the same way. EWC is like a protocol for sharing the hard-drive between multiple users, without the users overwriting each other's data. The Fisher information values in EWC can be seen as soft do-not-overwrite flags. After training on the first task, we calculate the Fisher information values which say which parameters store crucial information for the task. The ones with low Fisher value are redundant and can be reused to store new info. In this metaphor, it is satisfying to think about the sum of Fisher information values as measuring how full the hard-drive is, and pruning as throwing away parts of the drive not actually used to store anything.</p>

<h2 id="summary">Summary</h2>

<p>I wrote about two recent methods for automatically learning neural network architecture by figuring out which parameters/features to throw away. In my mind, both methods/papers are interesting in their own right. The $L_0$ approach seems like a simpler optimization algorithm that may be preferable to the iterative, remove-one-feature-at-a-time nature of Fisher pruning. However, Fisher pruning is more applicable to the scenario when you start from a large pretrained model in a transfer learning setting.</p>]]></content:encoded></item><item><title><![CDATA[My notes on (Liang et al., 2017): Generalization and the Fisher-Rao norm]]></title><description><![CDATA[<p>After last week's post on the generalization mystery, people have pointed me to recent work connecting the Fisher-Rao norm to generalization (thanks!):</p>

<ul>
<li>Tengyuan Liang, Tomaso Poggio, Alexander Rakhlin, James Stokes (2017) <a href="https://arxiv.org/abs/1711.01530">Fisher-Rao Metric, Geometry, and Complexity of Neural
Networks</a></li>
</ul>

<p>This <a href="http://www.mit.edu/~rakhlin/papers/myths.pdf">short presentation on generalization</a> by the coauthors Sasha Rakhlin is</p>]]></description><link>http://www.inference.vc/generalization-and-the-fisher-rao-norm-2/</link><guid isPermaLink="false">b8067af7-72bc-49fe-bb16-71c9bba34e29</guid><dc:creator><![CDATA[Ferenc Huszar]]></dc:creator><pubDate>Thu, 25 Jan 2018 23:43:49 GMT</pubDate><media:content url="http://www.inference.vc/content/images/2018/01/download-14-1.png" medium="image"/><content:encoded><![CDATA[<img src="http://www.inference.vc/content/images/2018/01/download-14-1.png" alt="My notes on (Liang et al., 2017): Generalization and the Fisher-Rao norm"><p>After last week's post on the generalization mystery, people have pointed me to recent work connecting the Fisher-Rao norm to generalization (thanks!):</p>

<ul>
<li>Tengyuan Liang, Tomaso Poggio, Alexander Rakhlin, James Stokes (2017) <a href="https://arxiv.org/abs/1711.01530">Fisher-Rao Metric, Geometry, and Complexity of Neural
Networks</a></li>
</ul>

<p>This <a href="http://www.mit.edu/~rakhlin/papers/myths.pdf">short presentation on generalization</a> by coauthor Sasha Rakhlin is also worth looking at - though I have to confess many of the references to learning theory are lost on me.</p>

<p>While I can't claim to have understood all the bounding and proofs going on in Section 4, I think I got the big picture, so I will try to summarize the main points below. In addition, I wanted to add some figures I made, which helped me understand the restricted model class the authors worked with and the "gradient structure" this restriction gives rise to. Feel free to point out if anything I say here is wrong or incomplete.</p>

<h2 id="normbasedcapacitycontrol">Norm-based capacity control</h2>

<p>The main mantra of this paper is along the lines of <a href="http://ieeexplore.ieee.org/abstract/document/661502/">results by Bartlett (1998)</a> who observed that in neural networks, generalization is about the size of the weights, not the number of weights. This theory underlies the use of techniques such as weight decay and even early stopping, since both can be seen as ways to keep the neural network's weight vector small. Reasoning about a neural network's generalization ability in terms of the size, or norm, of its weight vector is called <em>norm-based capacity control</em>.</p>

<p>The main contribution of Liang et al (2017) is proposing the Fisher-Rao norm as a measure of how big the networks' weights are, and hence as an indicator of a trained network's generalization ability. It is defined as follows:</p>

<p>$$
\|\theta\|_{fr} = \theta^\top I(\theta) \theta
$$</p>

<p>where $I$ is the Fisher information matrix:</p>

<p>$$
I(\theta) = \mathbb{E}_{x,y} \left[ \nabla_\theta \ell(f_\theta(x),y) \nabla_\theta \ell(f_\theta(x),y)^\top \right] <br>
$$</p>

<p>There are various versions of the Fisher information matrix, and therefore of the Fisher-Rao norm, depending on which distribution the expectation is taken under. The empirical form samples both $x$ and $y$ from the empirical data distribution. The model form samples $x$ from the data, but assumes that the loss is a log-loss of a probabilistic model, and we sample $y$ from this model.</p>
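For a model where everything is tractable, such as logistic regression, the two flavours are easy to write down. A sketch, computing the diagonal of each Fisher matrix:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fisher_diagonals(X, y, theta):
    """Diagonal Fisher information for logistic regression, in both flavours."""
    p = sigmoid(X @ theta)
    # empirical: gradients of -log p(y|x) evaluated at the *observed* labels y
    g = (p - y)[:, None] * X
    empirical = np.mean(g ** 2, axis=0)
    # model: expectation over y ~ Bernoulli(p); E[(p - y)^2] = p(1 - p)
    model = np.mean((p * (1 - p))[:, None] * X ** 2, axis=0)
    return empirical, model

X = np.array([[1.0, 0.0], [1.0, 2.0], [0.0, 1.0]])
y = np.array([1.0, 0.0, 1.0])
emp, mod = fisher_diagonals(X, y, theta=np.zeros(2))
```

At $\theta = 0$ the two coincide, since $(p-y)^2 = p(1-p) = 1/4$ for binary labels when $p = 1/2$; away from that point they generally differ.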

<p>Importantly, the Fisher-Rao norm is something which depends on the data distribution (at least the distribution of $x$). It is also invariant under reparametrization, which means that if there are two parameters $\theta_1$ and $\theta_2$ which implement the same function, their FR-norm is the same. Finally, it is a measure related to flatness inasmuch as the Fisher-information matrix approximates the Hessian at a minimum of the loss under certain conditions.</p>

<h2 id="summaryofthepapersmainpoints">Summary of the paper's main points:</h2>

<ul>
<li>in rectified linear networks without bias, the Fisher-Rao norm has an equivalent form which is easier to calculate and analyze (more on this below)</li>
<li>various other norms previously proposed to bound model complexity can be bounded by the Fisher-Rao norm</li>
<li>for linear networks without rectification (which are in fact just linear functions) the Rademacher complexity can be bounded by the FR-norm, and the bound does not depend on the number of layers or the number of units per layer. This suggests that the Fisher-Rao norm may be a good, model-size-independent proxy measure of generalization.</li>
<li>the authors do not prove bounds for more general classes of models, but provide intuitive arguments based on its relationship to other norms.</li>
<li>the authors also did a bunch of experiments to show how the FR-norm correlates with generalization performance. They looked at both vanilla SGD and the second-order stochastic method K-FAC. They looked at what happens if we mix random labels into the training set, and found that the FR-norm of the final solution seems to track the generalization gap.</li>
<li>there are still open questions, for example explaining what, specifically, makes SGD converge to better minima, and how this changes with increasing batchsize.</li>
</ul>

<h2 id="rectifiedneuralnetworkswithoutbias">Rectified neural networks without bias</h2>

<p>The one thing I wanted to add to this paper, is a little bit more detail on the particular model class - rectified linear networks <strong>without bias</strong> - that the authors studied here. This restriction turns out to guarantee some very interesting properties, without hurting the empirical performance of the networks (so the authors claim and to some degree demonstrate).</p>

<p>Let's first visualize what the output of a rectified multilayer perceptron <em>with biases</em> looks like. Here I used 3 hidden layers with 15 ReLUs in each and PyTorch-default random initialization. The network's input is 2D, and the output is 1D so I can easily plot contour surfaces:</p>

<h5 id="figure1rectifiednetworkwithbiases">Figure 1: rectified network with biases</h5>

<p><img src="http://www.inference.vc/content/images/2018/01/download-12.png" alt="My notes on (Liang et al., 2017): Generalization and the Fisher-Rao norm"></p>

<p>The left-hand panel shows the function itself. The panels next to it show the gradients with respect to $x_1$ and $x_2$ respectively. The function is piecewise linear (which is hard to see because there are many, many linear pieces), which means that the gradients are piecewise constant (which is more visually apparent).</p>

<p>The piecewise linear structure of $f$ becomes more apparent when we superimpose the contour plot of the gradients (black) on top of the contour plot of $f$ itself (red-blue):</p>

<p><img src="http://www.inference.vc/content/images/2018/01/download-14.png" alt="My notes on (Liang et al., 2017): Generalization and the Fisher-Rao norm"></p>

<p>These functions are clearly very flexible, and as we add more layers, the number of linear pieces grows exponentially.</p>

<p>Importantly, the above plot would look very similar had I plotted the function's output as a function of two components of $\theta$, keeping $x$ fixed. This is significantly more difficult to plot though, so I'm hoping you'll just believe me.</p>

<p>Now let's look at what happens when we remove all biases from the network, keeping only the weight matrices:</p>

<h5 id="figure2rectifiednetworkwithoutbiases">Figure 2: rectified network without biases</h5>

<p><img src="http://www.inference.vc/content/images/2018/01/download-15.png" alt="My notes on (Liang et al., 2017): Generalization and the Fisher-Rao norm"></p>

<p>Wow, the function looks very different now, doesn't it? At $x=0$ it always takes the value $0$. It is composed of wedge-shaped (or, in higher dimensions, generalized pyramid-shaped) regions within which the function is linear, but the slope in each wedge is different. Yet the surface is still continuous. Let's do the superimposed plot again:</p>

<p><img src="http://www.inference.vc/content/images/2018/01/download-17.png" alt="My notes on (Liang et al., 2017): Generalization and the Fisher-Rao norm"></p>

<p>It's less clear from these plots why a function like this can model data just as well as the more general piecewise linear one we get if we enable biases. One thing that helps is dimensionality: in high dimensions, the probability that two randomly sampled datapoints fall into the same "pyramid", i.e. share the same linear region, is extremely small. Unless your data has some structure that makes this likely to happen for many datapoints at once, you don't really have to worry about it, I guess.</p>
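The wedge structure is easy to verify numerically: a bias-free rectified network is positively homogeneous, so it is linear along every ray through the origin. A small numpy sketch (not the exact network from the figures):

```python
import numpy as np

rng = np.random.default_rng(0)
# a bias-free ReLU network: 2 -> 15 -> 15 -> 15 -> 1
Ws = [rng.normal(size=(m, n)) / np.sqrt(m)
      for m, n in zip([2, 15, 15, 15], [15, 15, 15, 1])]

def f(x):
    h = np.asarray(x, dtype=float)
    for W in Ws[:-1]:
        h = np.maximum(h @ W, 0.0)  # ReLU, no bias term
    return float(h @ Ws[-1])        # linear output layer

x = np.array([0.7, -1.3])
# scaling the input scales the output: f(c x) = c f(x) for c >= 0,
# which is exactly why the linear regions are wedges meeting at x = 0
```

Positive homogeneity means the whole function is determined by its values on the unit circle, matching the radial pattern in the plots above.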

<p>Furthermore, if my network had three input dimensions, but I only use two dimensions $x_1$ and $x_2$ to encode data and fix the third coordinate $x_3=1$, I can implement the same kind of functions over my inputs. This is called using homogeneous coordinates, and a bias-less network with homogeneous coordinates can be nearly as powerful as one with biases in terms of the functions it can model. Below is an example of a function a rectified network with no biases can implement when using homogeneous coordinates.</p>

<h4 id="figure3rectifiednetworkwithoutbiaseshomogeneouscoordinates">Figure 3: rectified network without biases, homogeneous coordinates</h4>

<p><img src="http://www.inference.vc/content/images/2018/01/download-18.png" alt="My notes on (Liang et al., 2017): Generalization and the Fisher-Rao norm"></p>

<p>This is because the third variable $x_3=1$ multiplied by its weights practically becomes a bias for the first hidden layer.</p>

<p>A second observation: if we consider $f_\theta(x)$ as a function of the weight matrix of a particular layer, keeping all other weights and the input fixed, the function behaves exactly the same way as it does with respect to the input $x$. The same radial pattern would be observed in $f$ if I plotted it as a function of a weight matrix (though weight matrices are rarely 2-D, so I can't really plot that).</p>

<h4 id="structureingradients">Structure in gradients</h4>

<p>The authors note that these functions satisfy the following formula: <br>
$$
f_\theta(x) = \nabla_x f_\theta(x)^\top x <br>
$$</p>

<p>(Moreover, I think these are the only continuous functions for which the above equality holds, but I leave this for keen readers to prove or disprove.)</p>

<p>Noting the symmetry between the network's inputs and weight matrices, a similar equality can be established with respect to parameters $\theta$:</p>

<p>$$
f_\theta(x) = \frac{1}{L+1}\nabla_\theta f_\theta(x)^\top \theta, <br>
$$</p>

<p>where $L$ is the number of layers.</p>

<h4 id="whywouldthisbethecase">Why would this be the case?</h4>

<p>Here's my explanation which differs slightly from the simple proof the authors give. A general rectified network is piecewise linear with respect to $x$, as discussed. The boundaries of the linear pieces, and the slope, changes as we change $\theta$. Let's fix $\theta$. Now, so long as $x$ and some $x_0$ fall within the same linear region, the function at $x$ equals its Taylor expansion around $x_0$:</p>

<p>\begin{align}
f_\theta(x) &amp;= \nabla_{x} f_\theta(x_0)^\top (x- x_0) + f_{\theta}(x_0) \\ <br>
   &amp;= \nabla_x f_\theta(x)^\top (x - x_0) + f_{\theta}(x_0)
\end{align}</p>

<p>Now, if we have no biases, all the linear segments are wedge-shaped, and they all meet at the origin $x=0$. So we can take the limit of the above Taylor expansion as $x_0\rightarrow 0$ (the limit is only needed technically, as the function is non-differentiable exactly at $x=0$). As $f_{\theta}(0)=0$, we find that</p>

<p>$$
f_\theta(x) = \nabla_x f_\theta(x)^\top x, <br>
$$</p>

<p>just as we wanted. Now, treating layer $l$'s weights $\theta^{(l)}$ as if they were the input to the network consisting of the subsequent layers, and the previous layer's activations as if they were the weight multiplying these inputs, we can derive a similar formula in terms of $\theta^{(l)}$:</p>

<p>$$
f_\theta(x) = \nabla_{\theta^{(l)}} f_\theta(x)^\top \theta^{(l)}, <br>
$$</p>

<p>Applying this formula to all layers $l=1,\ldots,L+1$ and averaging, we obtain:</p>

<p>$$
f_\theta(x) = \frac{1}{L+1}\nabla_\theta f_\theta(x)^\top \theta <br>
$$</p>

<p>We got the $L+1$ from the $L$ hidden layers plus the output layer.</p>
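The identity is easy to check numerically on a tiny bias-free network. A sketch, where central finite differences stand in for backprop and the seed is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=2)
theta = rng.normal(size=2 * 3 + 3 * 1)  # weights of a 2 -> 3 -> 1 bias-free ReLU net

def f(th):
    W1, W2 = th[:6].reshape(2, 3), th[6:].reshape(3, 1)
    return float(np.maximum(x @ W1, 0.0) @ W2)

# finite-difference gradient with respect to all parameters
eps = 1e-6
grad = np.array([(f(theta + eps * e) - f(theta - eps * e)) / (2 * eps)
                 for e in np.eye(theta.size)])

L = 1                                    # one hidden layer
lhs, rhs = f(theta), grad @ theta / (L + 1)
```

Here `lhs` and `rhs` agree up to floating-point error, as the identity predicts; the check would fail if we added bias terms.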

<h4 id="expressionforfisherraonorm">Expression for Fisher-Rao norm</h4>

<p>Using the formula above, and the chain rule, we can simplify the expression for the Fisher-Rao norm as follows:</p>

<p>\begin{align}
\|\theta\|_{fr} &amp;= \mathbb{E} \theta^\top \nabla_\theta \ell(f_\theta(x),y) \nabla_\theta \ell(f_\theta(x),y)^\top \theta \\
&amp;= \mathbb{E} \left( \theta^\top \nabla_\theta \ell(f_\theta(x),y) \right)^2 \\
 &amp;= \mathbb{E} \left( \theta^\top \nabla_\theta f_\theta(x) \nabla_f \ell(f,y) \right)^2\\
 &amp;= (L+1)^2 \, \mathbb{E} \left( f_\theta(x)^\top \nabla_f \ell(f,y)\right)^2
\end{align}</p>

<p>It can be seen very clearly in this form that the Fisher-Rao norm only depends on the output of the function $f_\theta(x)$ and properties of the loss function. This means that if two parameters $\theta_1$ and $\theta_2$ implement the same input-output function $f$, their F-R norm will be the same.</p>

<h2 id="summary">Summary</h2>

<p>I think this paper presented a very interesting insight into the geometry of rectified linear neural networks, and highlighted some interesting connections between information geometry and norm-based generalization arguments.</p>

<p>What I think is still missing is the kind of insight which would explain why SGD finds solutions with low F-R norm, or how the F-R norm of a solution is effected by the batchsize of SGD, if at all it is. The other thing missing is whether the F-R norm can be an effective regularizer. It seems that for this particular class of networks which don't have any bias parameters, the model F-R norm could be calculated relatively cheaply and added to as a regularizer since we already calculate the forward-pass of the network anyway.</p>]]></content:encoded></item><item><title><![CDATA[The Generalization Mystery: Sharp vs Flat Minima]]></title><description><![CDATA[<p>I set out to write about the following paper I saw people talk about on twitter and reddit:</p>

<ul>
<li>Hao Li, Zheng Xu, Gavin Taylor, Tom Goldstein <a href="https://openreview.net/pdf?id=HkmaTz-0W">Visualizing the Loss Landscape of Neural Nets</a></li>
</ul>

<p>It's related to this pretty insightful paper:</p>

<ul>
<li>Laurent Dinh, Razvan Pascanu, Samy Bengio, Yoshua Bengio (2017) <a href="https://arxiv.org/abs/1703.04933">Sharp</a></li></ul>]]></description><link>http://www.inference.vc/sharp-vs-flat-minima-are-still-a-mystery-to-me/</link><guid isPermaLink="false">e7c273b0-69d8-4ab5-b39f-cedfb16f905e</guid><dc:creator><![CDATA[Ferenc Huszar]]></dc:creator><pubDate>Thu, 18 Jan 2018 22:55:52 GMT</pubDate><media:content url="http://www.inference.vc/content/images/2018/01/Screen-Shot-2018-01-18-at-2.57.18-PM.png" medium="image"/><content:encoded><![CDATA[<img src="http://www.inference.vc/content/images/2018/01/Screen-Shot-2018-01-18-at-2.57.18-PM.png" alt="The Generalization Mystery: Sharp vs Flat Minima"><p>I set out to write about the following paper I saw people talk about on twitter and reddit:</p>

<ul>
<li>Hao Li, Zheng Xu, Gavin Taylor, Tom Goldstein <a href="https://openreview.net/pdf?id=HkmaTz-0W">Visualizing the Loss Landscape of Neural Nets</a></li>
</ul>

<p>It's related to this pretty insightful paper:</p>

<ul>
<li>Laurent Dinh, Razvan Pascanu, Samy Bengio, Yoshua Bengio (2017) <a href="https://arxiv.org/abs/1703.04933">Sharp Minima Can Generalize For Deep Nets</a></li>
</ul>

<p>Inevitably, I started thinking more generally about flat and sharp minima and generalization, so rather than describing these papers in detail, I ended up dumping some thoughts of my own. Feedback and pointers to literature are welcome, as always.</p>

<h4 id="summaryofthispost">Summary of this post</h4>

<ul>
<li>Flatness of minima is hypothesized to have something to do with generalization in deep nets.</li>
<li>As Dinh et al (2017) show, flatness is sensitive to reparametrization and thus cannot predict generalization ability alone.</li>
<li>Li et al (2017) use a form of parameter normalization to make their method more robust to reparametrization and produce some fancy plots comparing deep net architectures.</li>
<li>While this analysis is now invariant to the particular type of reparametrizations considered by Dinh et al, it may still be sensitive to other types of invariances, so I'm not sure how much to trust these plots and conclusions.</li>
<li>Then I go back to square one and ask how one could construct indicators of generalization that are invariant by construction, for example by considering ratios of flatness measures.</li>
<li>Finally, I have a go at developing a local measure of generalization from first principles. The resulting metric depends on the data and statistical properties of gradients calculated from different minibatches.</li>
</ul>

<h2 id="flatnessgeneralizationandsgd">Flatness, Generalization and SGD</h2>

<p>The loss surface of deep nets tends to have many local minima. Many of these might be equally good in terms of training error, but they may have widely different generalization performance, i.e. a network with minimal training loss might perform very well, or very poorly, on a held-out test set. Interestingly, stochastic gradient descent (SGD) with small batch sizes appears to locate minima with better generalization properties than large-batch SGD. So the big question is: what measurable property of a local minimum can we use to predict generalization properties? And how does this relate to SGD?</p>

<p>There is speculation dating back to at least Hochreiter and Schmidhuber (1997) that the flatness of the minimum is a good measure to look at. However, as Dinh et al (2017) pointed out, flatness is sensitive to reparametrizations of the neural network: we can reparametrize a neural network without changing its outputs while making sharp minima look arbitrarily flat and vice versa. As a consequence, flatness alone cannot explain or predict good generalization.</p>
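<p>Dinh et al's construction is easy to reproduce numerically. For a two-layer ReLU network the function is unchanged under $W_1 \to \alpha W_1$, $W_2 \to W_2/\alpha$, but the curvature of the loss with respect to $W_1$ scales with $\alpha^2$. Here is a minimal sketch of that effect with scalar weights and made-up data (not code from either paper):</p>

```python
import numpy as np

def f(x, w1, w2):
    # tiny two-layer ReLU "network" with scalar weights
    return w2 * np.maximum(w1 * x, 0.0)

def loss(w1, w2, x, y):
    return np.mean((f(x, w1, w2) - y) ** 2)

def curvature_w1(w1, w2, x, y, h=1e-4):
    # finite-difference second derivative of the loss with respect to w1
    return (loss(w1 + h, w2, x, y) - 2 * loss(w1, w2, x, y)
            + loss(w1 - h, w2, x, y)) / h ** 2

rng = np.random.default_rng(0)
x = rng.uniform(0.1, 1.0, size=100)
y = 2.0 * x                      # any w1 * w2 = 2 fits the data exactly

w1, w2, alpha = 1.0, 2.0, 10.0
same_function = np.allclose(f(x, w1, w2), f(x, alpha * w1, w2 / alpha))
c_orig = curvature_w1(w1, w2, x, y)
c_repar = curvature_w1(alpha * w1, w2 / alpha, x, y)
print(same_function, c_orig / c_repar)   # True, ratio close to alpha**2
```

The reparametrized network computes exactly the same function, yet its minimum looks a hundred times flatter: flatness on its own cannot be the whole story.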

<p>Li et al (2017) proposed a normalization scheme which scales the space around a minimum in such a way that the apparent flatness in 1D and 2D plots is kind of invariant to the type of reparametrization Dinh et al used. This, they say, allows us to produce more faithful visualizations of the loss surfaces around a minimum. They even use 1D and 2D plots to illustrate differences between different architectures, such as a VGG and a ResNet. I personally do not buy the conclusions of this paper, and it seems <a href="https://openreview.net/forum?id=HkmaTz-0W">the reviewers of the ICLR submission</a> largely agreed on this. The proposed method is weakly motivated and only addresses one possible type of reparametrization.</p>

<h2 id="contrastiveflatnessmeasures">Contrastive Flatness measures</h2>

<p>Following the thinking by Dinh et al, if generalization is a property which is invariant under reparametrization, the quantity we use to predict generalization should also be invariant. My intuition is that a good way to achieve invariance is to consider the ratio between two quantities - maybe two flatness measures - which are affected by reparametrization in the same way.</p>

<p>One thing I think would make sense to look at is the average flatness of the loss in a single minibatch vs the flatness of the average loss. Why would this make sense? The average loss can be flat around a minimum in different ways: it can be flat because it is the average of flat functions which all look very similar and whose minima are very close to the same location; or it can be flat because it is the average of many sharp functions with minima at locations scattered around the minimum of the average.</p>

<p>Intuitively, the former solution is more stable with respect to subsampling of data, and therefore it should be more favourable from a generalization viewpoint. The latter solution is very sensitive to which particular minibatch we are looking at, so presumably it may give rise to worse generalization.</p>

<p>As a conclusion of this section: I don't think it makes sense to look only at the flatness of the average loss; how that flatness is affected by subsampling the data feels more key to generalization.</p>
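<p>Here is a toy illustration of this point, with one-parameter quadratic minibatch losses (entirely made up): two collections of minibatch losses whose average loss has exactly the same curvature, but whose minima are aligned in one case and scattered in the other. Flatness of the average loss cannot tell them apart, while a regret-style quantity that compares minibatches can:</p>

```python
import numpy as np

rng = np.random.default_rng(1)

# per-minibatch losses f_i(t) = (t - m_i)^2, with minibatch minima m_i
m_aligned = np.zeros(50)             # all minima coincide at 0
m_scatter = rng.normal(size=50)      # minima scattered around 0
m_scatter -= m_scatter.mean()        # keep the average minimum at 0

def avg_loss(t, m):
    return np.mean((t - m) ** 2)

def curvature(m, t=0.0, h=1e-3):
    # flatness of the *average* loss at its minimum t = 0
    return (avg_loss(t + h, m) - 2 * avg_loss(t, m) + avg_loss(t - h, m)) / h ** 2

def pairwise_regret(m):
    # f_j evaluated at the minimizer of f_i, minus min_t f_j, averaged over pairs
    mi, mj = np.meshgrid(m, m)
    return np.mean((mi - mj) ** 2)

# the flatness of the average loss is identical in the two cases...
print(curvature(m_aligned), curvature(m_scatter))      # both 2.0
# ...but the minibatch-to-minibatch regret separates them
print(pairwise_regret(m_aligned), pairwise_regret(m_scatter))
```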

<h2 id="alocalmeasureofgeneralization">A local measure of generalization</h2>

<p>After Jorge Nocedal's <a href="https://iclr.cc/archive/www/lib/exe/fetch.php%3Fmedia=iclr2017:nocedal_iclr2017.pdf">ICLR talk on large-batch SGD</a>, Léon Bottou had a comment which I think hit the nail on the head. The process of sampling minibatches from training data kind of simulates the effect of sampling the training set and the test set from some underlying data distribution. Therefore, you might think of generalization from one minibatch to another as a proxy for how well a method would generalize from a training set to a test set.</p>

<p>How can we use this insight to come up with some sort of measure of generalization based on minibatches, especially along the lines of sharpness or local derivatives?</p>

<p>First of all, let's consider the stochastic process $f(\theta)$ which we obtain by evaluating the loss function on a random minibatch. The randomness comes from subsampling the data. This is a probability distribution over loss functions over $\theta$. I think it's useful to seek an indicator of generalization ability as a local property of this stochastic process at any given $\theta$ value.</p>

<p>Let's pretend for a minute that each draw $f(\theta)$ from this process is convex or at least has a unique global minimum. How would one describe a model's generalization from one minibatch to another in terms of this stochastic process?</p>

<p>Let's draw two functions $f_1(\theta)$ and $f_2(\theta)$ independently (i.e. evaluate the loss on two separate minibatches). I propose that the following would be a meaningful measure:</p>

<p>$$
R = f_2 (\operatorname{argmin}_\theta f_1(\theta)) - \min_\theta f_2(\theta)
$$</p>

<p>Basically: you care about finding low error according to $f_2$ but all you have access to is $f_1$. You therefore look at what the value of $f_2$ is at the location of the minimum of $f_1$ and compare that to the global minimal value of $f_2$. This is a sort of regret expression, hence my use of $R$ to denote it.</p>

<p>Now, in deep learning the loss functions $f_1$ and $f_2$ are not convex and have many local minima, so this definition is not particularly useful in general. However, it makes sense to calculate this value locally, in a small neighbourhood of a particular parameter value $\theta$. Let's consider fitting a restricted neural network model, where only parameters within a certain $\epsilon$ distance from $\theta$ are allowed. If $\epsilon$ is small enough, we can assume the loss functions have a unique global minimum within this $\epsilon$-ball. Furthermore, if $\epsilon$ is small enough, one can use a first-order Taylor-approximation to $f_1$ and $f_2$ to analytically find approximate minima within the $\epsilon$-ball. To do this, we just need to evaluate the gradient at $\theta$. This is illustrated in the figure below:</p>

<p><img src="http://www.inference.vc/content/images/2018/01/download-11.png" alt="The Generalization Mystery: Sharp vs Flat Minima"></p>

<p>The left-hand panel shows an imaginary loss function evaluated on some minibatch $f_1$, restricted to the $\epsilon$-ball around $\theta$. We can assume $\epsilon$ is small enough so $f_1$ is linear within this local region. Unless the gradient is exactly $0$, the minimum will fall on the surface of the $\epsilon$-ball, exactly at $\theta - \epsilon \frac{g_1}{\|g_1\|}$ where $g_1$ is the gradient of $f_1$ at $\theta$. This is shown by the yellow star. On the right-hand panel I show $f_2$. This is also locally linear, but its gradient $g_2$ might be different. The minimum of $f_2$ within the $\epsilon$-ball is at $\theta - \epsilon \frac{g_2}{\|g_2\|}$, shown by the red star. We can consider the regret-type expression as above, by evaluating $f_2$ at the yellow star, and subtracting its value at the red star. This can be expressed as follows (I divided by $\epsilon$):</p>

<p>$$
\frac{R(\theta, f_1, f_2)}{\epsilon} \rightarrow - \frac{g_2^\top g_1}{\|g_1\|} + \frac{g_2^\top g_2}{\|g_2\|} = \|g_2\| - \frac{g_2^\top g_1}{\|g_1\|} = \|g_2\|\left(1 - \cos(g_1, g_2)\right)
$$</p>

<p>In practice one would consider taking an expectation with respect to the two minibatches to obtain an expression that depends on $\theta$. So, we have just come up with a <strong>local measure of generalization ability</strong>, which is expressed in terms of expectations involving gradients over different minibatches. The measure is local as it is specific for each value of $\theta$. It is data-dependent in that it depends on the distribution $p_\mathcal{D}$ from which we sample minibatches.</p>
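<p>A Monte Carlo sketch of estimating this quantity, on a made-up least-squares problem (the setup and all names here are mine, purely for illustration):</p>

```python
import numpy as np

rng = np.random.default_rng(2)

# toy least-squares problem with synthetic data
X = rng.normal(size=(1000, 5))
theta_star = rng.normal(size=5)
y = X @ theta_star + 0.5 * rng.normal(size=1000)

def minibatch_grad(theta, batch_size=32):
    # gradient of the mean squared error on a random minibatch
    idx = rng.choice(len(y), size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    return 2.0 * Xb.T @ (Xb @ theta - yb) / batch_size

def local_generalization_measure(theta, n_pairs=500):
    # Monte Carlo estimate of E[ ||g_2|| (1 - cos(g_1, g_2)) ]
    total = 0.0
    for _ in range(n_pairs):
        g1, g2 = minibatch_grad(theta), minibatch_grad(theta)
        cos = g1 @ g2 / (np.linalg.norm(g1) * np.linalg.norm(g2))
        total += np.linalg.norm(g2) * (1.0 - cos)
    return total / n_pairs

theta_min = np.linalg.lstsq(X, y, rcond=None)[0]
m_at_min = local_generalization_measure(theta_min)
m_away = local_generalization_measure(theta_min + 5.0)
print(m_at_min, m_away)
```

Each term in the sum is non-negative, and it vanishes only when the two minibatch gradients point in exactly the same direction, as one would hope from a regret.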

<p>This measure depends on two things:</p>

<ul>
<li>the expected similarity of gradients from different minibatches: $1 - \cos(g_1, g_2)$ looks at whether various minibatches of data push $\theta$ in similar directions. In regions where the gradients are sampled from a mostly spherically symmetric distribution, this term would be close to $1$ most of the time.</li>
<li>the magnitude of gradients $\|g_2\|$. Interestingly, one can express this as $\sqrt{\operatorname{trace}\left(g_2 g_2^\top\right)}$.</li>
</ul>

<p>When we take the expectation over this, assuming that the $1 - \cos$ term is mostly $1$, we end up with the expression $\mathbb{E}_g \sqrt{\operatorname{trace}\left(g g^\top\right)}$ where the expectation is taken over minibatches. Note that the trace-norm of the empirical Fisher information matrix $\sqrt{ \operatorname{trace} \mathbb{E}_g \left(g g^\top\right)}$ can be used as a measure of flatness of the average loss around minima, so there may be some interesting connections there. However, due to Jensen's inequality the two things are not actually the same.</p>
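<p>The Jensen gap is easy to check numerically: since $\operatorname{trace}(g g^\top) = \|g\|^2$, the two quantities are $\mathbb{E}\|g\|$ and $\sqrt{\mathbb{E}\|g\|^2}$, and the former is never larger (with a made-up gradient distribution for illustration):</p>

```python
import numpy as np

rng = np.random.default_rng(3)

# sample "minibatch gradients" g from some made-up distribution
g = rng.normal(size=(10000, 5)) * np.array([1.0, 2.0, 0.5, 1.5, 3.0])

# E ||g||: expectation of the square root
lhs = np.mean(np.linalg.norm(g, axis=1))
# sqrt(trace E[g g^T]) = sqrt(E ||g||^2): square root of the expectation
rhs = np.sqrt(np.mean(np.sum(g ** 2, axis=1)))

print(lhs, rhs)   # Jensen: lhs <= rhs, equal only if ||g|| is constant
```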

<p><em>Update - thanks to reddit user <a href="https://www.reddit.com/user/bbsome">bbsome</a> for pointing this out</em>:</p>

<p>Note that $R$ is not invariant under reparametrization either. The source of this sensitivity is the fact that I considered an $\epsilon$-ball in Euclidean norm around $\theta$. The right way to get rid of this is to consider an $\epsilon$-ball using the symmetrized KL divergence instead of the Euclidean norm, similarly to how natural gradient methods can be derived. If we do this, the formula becomes dependent only on the functions the neural network implements, not on the particular choice of parametrization. I leave it as homework for people to work out how this would change the formulae.</p>

<h1 id="summary">Summary</h1>

<p>This post started out as a paper review, but in the end I didn't find the paper too interesting and instead resorted to sharing ideas about tackling the generalization puzzle a bit differently. It's entirely possible that people have done this analysis before, or that it's completely useless. In any case, I welcome feedback.</p>

<p>The first observation here was that a good indicator may involve not just the flatness of the average loss around the minimum, but a ratio between two flatness indicators. Such metrics may end up invariant under reparametrization by construction.</p>

<p>Taking this idea further I attempted to develop a local indicator of generalization performance which goes beyond flatness. It also includes terms that measure the sensitivity of gradients to data subsampling.</p>

<p>Because data subsampling is something that occurs both in generalization (training vs test set) and in minibatch-SGD, these kinds of measures may shed some light on how SGD enables better generalization.</p>]]></content:encoded></item><item><title><![CDATA[Alchemy, Rigour and Engineering]]></title><description><![CDATA[<p>Like many of you, I thoroughly enjoyed <a href="https://www.youtube.com/watch?v=ORHFOnaEzPc">Ali Rahimi's NIPS talk</a> in response to winning the test-of-time award for their work on random kitchen sinks. I recommend everyone to watch it if you haven't already.</p>

<p>I also read <a href="https://www.facebook.com/yann.lecun/posts/10154938130592143">Yann LeCun's rebuttal to Ali's talk</a>. He says what Ali calls</p>]]></description><link>http://www.inference.vc/my-thoughts-on-alchemy/</link><guid isPermaLink="false">08cd6193-b261-4444-80c1-49d59332533b</guid><dc:creator><![CDATA[Ferenc Huszar]]></dc:creator><pubDate>Thu, 07 Dec 2017 12:56:21 GMT</pubDate><media:content url="http://www.inference.vc/content/images/2017/12/The-Alchemy-of-Beiong-Human-in-One-Image.jpg" medium="image"/><content:encoded><![CDATA[<img src="http://www.inference.vc/content/images/2017/12/The-Alchemy-of-Beiong-Human-in-One-Image.jpg" alt="Alchemy, Rigour and Engineering"><p>Like many of you, I thoroughly enjoyed <a href="https://www.youtube.com/watch?v=ORHFOnaEzPc">Ali Rahimi's NIPS talk</a> in response to winning the test-of-time award for their work on random kitchen sinks. I recommend everyone to watch it if you haven't already.</p>

<p>I also read <a href="https://www.facebook.com/yann.lecun/posts/10154938130592143">Yann LeCun's rebuttal to Ali's talk</a>. He says what Ali calls alchemy is in fact engineering. Although I think Yann ends up arguing against points Ali didn't make in his talk, he raises important and good points about the different roles that tricks, empirical evidence and theory can play in engineering.</p>

<p>I wanted to add my own thoughts and experiences to the mix.</p>

<h2 id="differentmodesofinnovation">Different modes of innovation</h2>

<p>We can think about machine learning knowledge as a graph, where methods are nodes and edges represent connections or analogies between the methods.</p>

<p>Innovation means growing this graph, which one can do in different ways:</p>

<ol>
<li><strong>add new nodes:</strong> This is about thinking outside the box: about discovering something completely new and quirky that, hopefully, works. Recent examples of this type of ML innovation include herding, noise-as-targets, batch-norm, dropout or GANs. You added a new node to the graph, maybe weakly connected to the rest of the graph, but your reasoning may not be very clear to most people at first.  </li>
<li><strong>discover connections between existing nodes:</strong> This is about finding explanations for new methods. Interpreting k-means as expectation-maximization, dropout as variational inference, stochastic gradient descent as variational inference, GANs as f-GANs, batchnorm as natural gradients, denoising autoencoders as score matching, etc. These are all <em>hugely influential</em> pieces of work, and we sorely need this kind of work in ML as these connections often allow us to improve or generalize techniques and thereby to make more predictable progress.  </li>
<li><strong>complete patterns:</strong> This is about analogical reasoning. You notice incomplete patterns in the graph, and infer that some nodes must exist by completing the pattern. At the <a href="https://nips.cc/Conferences/2010/SamRoweis">Sam Roweis Symposium at NIPS 2010</a> I remember Zoubin Ghahramani making a joke about how he and Sam wrote a bunch of papers together basically for free by completing hypercubes, just like the one below. Long time readers of my blog will know that Sam and Zoubin's <a href="http://mlg.eng.cam.ac.uk/zoubin/papers/lds.pdf">unifying review</a> is one of my all-time favourite papers. Note that this 'pattern completion' happens outside the Bayesian world, too: see e.g. bi-directional convolutional RNNs or PixelGAN Autoencoders.</li>
</ol>

<div class="image-div" style="width: 600px;">  
<img src="http://www.inference.vc/content/images/2017/12/F4.large.jpg" alt="Alchemy, Rigour and Engineering">
</div>  

<p>Each of these modes of innovation is important in the process of building knowledge and moving forward. Our community tends to celebrate and glorify #1 (as long as the paper has the reigning buzzword in the title). Individually, we are incentivized to write loads of papers, so we spend most of our time on #3, which is arguably the lowest-risk, most incremental kind of project. I don't think we reward or encourage #2 enough, and to me, this is one of the main messages in Ali's talk. By focussing too much on benchmarks, datasets, and empirical results, we are actually creating barriers for people whose main contribution would be explaining why methods work or don't work.</p>

<p>It's easy to forget what the original GAN paper's results looked like. These were really good back then and would be laughable today:</p>

<div class="image-div" style="width: 600px;">  
<img src="http://www.inference.vc/content/images/2017/12/Screen-Shot-2017-12-07-at-11.49.46-AM.png" alt="Alchemy, Rigour and Engineering">
</div>

<p>GANs were, arguably, a highly influential, great new idea. Today, that same paper would be unpublishable because the pictures don't look pretty enough. Wasserstein GANs were a great idea too, and frankly, I don't need to look at experimental results at all to recognize it as a good one.</p>

<p>The same way Yann argues neural nets were unfairly abandoned in the 90s because of the lack of convergence guarantees convex optimization methods enjoy, today we are unfairly dismissing any method or idea that does not produce state-of-the-art or near-SOTA results. I once reviewed a paper where another reviewer wrote "The work can be novel if the method turns out to be the most efficient method of [...] compared to existing methods". This is at least as wrong as dismissing novel ideas for lack of theory (which is, BTW, not what Ali suggested).</p>

<h2 id="itsokaytousenonrigorousmethodsbut">It's Okay to use non-rigorous methods, but...</h2>

<h4 id="itsnotokaytousenonrigorousevaluation">it's not okay to use non-rigorous evaluation</h4>

<p>As far as I'm concerned, I'm now comfortable with using methods that are non-rigorous or for which the theoretical framework is underdeveloped or does not exist. However,</p>

<blockquote>
  <p>nobody should feel good about any paper where the evaluation is non-rigorous</p>
</blockquote>

<p>In GAN papers we show pretty pictures, but we have absolutely no rigorous way to assess the diversity of samples, or whether any form of overfitting has occurred, at least not as far as I'm aware. I know from experience that getting a sufficiently novel deep learning idea to work is a fragile process: it starts with  nothing working, then maybe it starts to work but doesn't converge,  then it converges but to the wrong thing. The whole thing just doesn't work until it suddenly starts to, and very often it's unclear what specifically made it work. This process is akin to multiple hypothesis testing. You ran countless experiments and report the result that looks best and behaves the way you expected it to behave. The underlying problem is, we conflate the software development process required to implement a method with manual hyperparameter search and cherry-picking of results. As a result, our reported "empirical evidence" may be more biased and less reliable than one would hope.</p>

<h4 id="youshouldnotresisttheoryonceitbecomesavailable">you should not resist theory once it becomes available</h4>

<p>I agree with Yann that there is merit in starting to adopt techniques before theory or rigorous analysis becomes available. However, once theoretical insight becomes available, reasoning about empirical performance often continues to trump rigour.</p>

<p>Let me tell you about a pattern I encountered a few times (you might say I'm just bitter about this). The pattern: Someone comes up with an idea, they demonstrate it works very well on some large, complicated problem using a very large neural network and loads of tricks and presumably months of manual tweaking and hyper-parameter search. I find a theoretical problem with the method. People say: it still works well in practice, so I don't see a problem.</p>

<div class="image-div" style="width: 600px;">  
<img src="http://www.inference.vc/content/images/2017/12/GxTnK.gif" alt="Alchemy, Rigour and Engineering">
</div>

<p>This was the kind of response I got to <a href="http://www.inference.vc/scheduled-sampling-for-rnns-scoring-rule-interpretation/">my critique of scheduled sampling</a> and <a href="http://www.inference.vc/comment-on-overcoming-catastrophic-forgetting-in-nns-are-multiple-penalties-needed-2/">my critique of elastic weight consolidation</a>. In both of these cases reviewers pointed out the methods work just fine on "real-world problems", and in the case of scheduled sampling people commented "after all the method came first in a benchmark competition so it must be correct". No, if a method works, but works for the wrong reasons, or for different reasons the authors gave, we have a problem.</p>

<p>You can think of "making a deep learning method work on a dataset" as a statistical test. I would argue that the statistical power of experiments is very weak. We do a lot of things like early stopping, manual tweaking of hyperparameters, running multiple experiments and only reporting the best results. We probably all know we should not be doing these things when testing hypotheses. Yet, these practices are considered fine when reporting empirical results in ML papers. Many go on and consider these reported empirical results as "strong empirical evidence" in favour of a method.</p>
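<p>The selection effect is easy to simulate: if each run's measured score is a noisy draw around the method's true performance and we only report the best of many runs, the reported number is systematically optimistic. All numbers below are made up for illustration:</p>

```python
import numpy as np

rng = np.random.default_rng(4)

# each run's measured accuracy is a noisy draw around the true value
true_accuracy = 0.80
n_runs, noise = 20, 0.02

# 10000 repetitions of "run 20 experiments, report the best one"
runs = true_accuracy + noise * rng.normal(size=(10000, n_runs))
reported = runs.max(axis=1)

print(runs.mean(), reported.mean())  # honest average vs reported best-of-20
```

The honest average recovers the true performance, while the best-of-20 number is biased upwards by several points of "accuracy", without any change to the method.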

<h2 id="summary">Summary</h2>

<p>I want to thank Ali for giving this talk. Yes, it was confrontational. Insulting? I think provocative is a better word. It was at least somewhat controversial, apparently. It contains a few points I disagree with. But I do not think it was wrong.</p>

<p>It touched upon a lot of problems which I think should be recognized and appreciated by the community. Rigour is not about learning theory, convergence guarantees, bounds or theorem proving. Intellectual rigour applies and can be applied to all of machine learning whether or not we have fully developed mathematical tools for analysis.</p>

<p>Rigour means being thorough, exhaustive, meticulous. It includes good practices like honestly describing the potential weaknesses of a method; thinking about what might go wrong; designing experiments which highlight and analyze these weaknesses; making predictions about your algorithm's behaviour in certain toy cases and demonstrating empirically that it indeed behaves as expected; refusing to use unjustified evaluation methods; accepting and addressing criticism. All of these should apply to machine learning, deep or not, and indeed they apply to engineering as a whole.</p>]]></content:encoded></item><item><title><![CDATA[Grosse's challenge: duality and exponential families]]></title><description><![CDATA[<p>I wrote this post in response to a challenge by Roger Grosse:</p>

<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">Theorem 2 of my moment averaging paper is one of the most surprising mathematical results I&#39;ve seen in machine learning. The proof is very short, but I still have no intuition for why it&#39;s</p></blockquote>]]></description><link>http://www.inference.vc/grosses-challenge/</link><guid isPermaLink="false">0d0a8fac-f3df-4a1c-8f89-c6a3444725c0</guid><dc:creator><![CDATA[Ferenc Huszar]]></dc:creator><pubDate>Wed, 29 Nov 2017 13:41:20 GMT</pubDate><media:content url="http://www.inference.vc/content/images/2017/11/download-3-2.png" medium="image"/><content:encoded><![CDATA[<img src="http://www.inference.vc/content/images/2017/11/download-3-2.png" alt="Grosse's challenge: duality and exponential families"><p>I wrote this post in response to a challenge by Roger Grosse:</p>

<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">Theorem 2 of my moment averaging paper is one of the most surprising mathematical results I&#39;ve seen in machine learning. The proof is very short, but I still have no intuition for why it&#39;s true. $500 reward for a clear &amp; convincing intuitive explanation.<a href="https://t.co/WjAMg3qIRm">https://t.co/WjAMg3qIRm</a></p>&mdash; Roger Grosse (@RogerGrosse) <a href="https://twitter.com/RogerGrosse/status/935176163542126592?ref_src=twsrc%5Etfw">November 27, 2017</a></blockquote>  

<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

<p>As a consequence of responding to a question, this will be a bit of an obscure post, and it may not be very meaningful outside the context of annealed importance sampling. Nevertheless, the post gives me a chance to talk about some interesting aspects of Bregman divergences and exponential families, and hopefully teach you some new things you may not have come across before.</p>

<ul>
<li>Grosse, Maddison and Salakhutdinov (NIPS 2013) <a href="http://papers.nips.cc/paper/4879-annealing-between-distributions-by-averaging-moments.pdf">Annealing between distributions by averaging moments</a></li>
</ul>

<p>If you look further down the thread below Roger's tweet, it seems <a href="https://twitter.com/AIActorCritic">@AIActorCritic</a> started answering the question in terms of duality of parametrizations - my post also generalizes things along the lines of duality but goes into a bit more detail.</p>

<h3 id="summary">Summary</h3>

<ul>
<li>I start with a summary of the results Roger referred to in his tweet.</li>
<li>Roger found it surprising that Eqns (18) and (19) give the same result.</li>
<li>I first note that the RHS of Eqns. (18) and (19) is not just something arbitrary, but it is actually the Jeffreys divergence (symmetrized KL) between the endpoints of the path. This gives us hope that there's maybe a more general phenomenon at work here.</li>
<li>I will then generalize the results by replacing the KL divergence with a more general class of Bregman divergences, and show that the integral of infinitesimal divergences along a linear path will always yield the symmetrized divergence between the endpoints.</li>
<li>I still lack intuition as to why this actually happens, but I will draw some pretty pictures about Bregman divergences nevertheless.</li>
<li>Then I note the equivalence of Bregman divergences under convex duality, and show that Eqns. (18) and (19) are special cases of the general result for a dual pair of Bregman divergences, both equivalent to the KL divergence. This explains why the two paths give equivalent results.</li>
<li>Finally, I touch upon how this general Bregman divergence formulation might provide a way to analyze linear interpolation in other parameter spaces. </li>
</ul>

<h2 id="introduction">Introduction</h2>

<p>Context: the authors analyze the bias of an Annealed Importance Sampling (AIS) estimator, whose goal is to estimate the normalizing constant of an intractable distribution $p$. AIS constructs a path in the space of probability distributions starting from a tractable distribution $q$ and ending at $p$ via a sequence of intermediate distributions $q=p_0,p_1, \ldots, p_{K-1}, p_K=p$. This path is then used to construct an estimator of the log-partition function (normalizing constant) of $p$. Crucially, the bias of this estimator depends on, among other things, the particular path we take between the two distributions: there are multiple paths between $q$ and $p$, and these paths can result in lower or higher bias. Under some simplifying assumptions the bias is given by:</p>

<p>$$
\delta = \sum_{k=0}^{K-1} D_{KL}[p_k\|p_{k+1}].
$$</p>

<p>It is not hard to show that as $K\rightarrow \infty$, this bias vanishes, but what the paper looks at is its precise asymptotic behaviour, with the main insight being Eqn (4):</p>

<p>$$
K\delta \rightarrow \frac{1}{2}\int_0^1 \dot{\theta}(\beta)^\top G_\theta(\beta)\dot{\theta}(\beta) d\beta =:\mathcal{F}(\gamma),
$$</p>

<p>where $\gamma$ denotes the path as a whole, $\theta(\beta), \beta\in[0,1]$ is the path we trace in parameter space $\theta$, and $G_\theta(\beta)$ is the Fisher information matrix with respect to $\theta$ evaluated at $\theta(\beta)$.</p>

<p>Intriguingly, if you calculate $\mathcal{F}(\gamma)$ along two very different paths in distribution space - a straight line in natural parameter space or a straight line in moments space - you get exactly the same value. This is despite the fact that these paths clearly pass through very different distributions, as illustrated very nicely in Figure 1 of the paper.</p>
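<p>This equivalence is easy to check numerically for, say, Bernoulli distributions: discretize each path, compute $K\sum_k D_{KL}[p_k\|p_{k+1}]$ along the moment-space line and along the natural-parameter line, and compare to the symmetrized KL between the endpoints. The endpoints below are chosen arbitrarily:</p>

```python
import numpy as np

def kl_bernoulli(a, b):
    # KL divergence between Bernoulli(a) and Bernoulli(b)
    return a * np.log(a / b) + (1 - a) * np.log((1 - a) / (1 - b))

def path_cost(mus):
    # K * sum of KLs between consecutive distributions along the path
    K = len(mus) - 1
    return K * np.sum(kl_bernoulli(mus[:-1], mus[1:]))

mu0, mu1 = 0.2, 0.9                      # arbitrary endpoint distributions
eta0, eta1 = np.log(mu0 / (1 - mu0)), np.log(mu1 / (1 - mu1))

beta = np.linspace(0.0, 1.0, 50001)
path_ma = (1 - beta) * mu0 + beta * mu1                         # linear in moments
path_ga = 1 / (1 + np.exp(-((1 - beta) * eta0 + beta * eta1)))  # linear in eta

sym_kl = 0.5 * (kl_bernoulli(mu0, mu1) + kl_bernoulli(mu1, mu0))
print(path_cost(path_ma), path_cost(path_ga), sym_kl)   # all three agree
```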

<p>So the question is: why are these two paths equivalent in this sense? The answer might lie in convex conjugate duality.</p>

<h2 id="firstobservation">First observation</h2>

<p>The first observation we can make is that Eqns. (18) and (19) do not only show that the two paths are equivalent, they also show that the cost can be expressed in terms of the symmetrized KL divergence (sometimes called the Jeffreys divergence) between the endpoints:</p>

<p>$$
\mathcal{F}(\gamma_{MA}) = \mathcal{F}(\gamma_{GA}) = \frac{1}{2}\left( D_{KL}[p_{\beta=0}\|p_{\beta=1}] + D_{KL}[p_{\beta=1}\|p_{\beta=0}] \right).
$$</p>

<p>To prove this, one can simply use the following formula for the $KL$ divergence between two exponential family distributions with natural parameters $\eta'$ and $\eta$:</p>

<p>$$
D_{KL}[p_{\eta}\|p_{\eta'}] = A(\eta') - A(\eta) - s^\top(\eta' - \eta),
$$
where $A(\eta) = \log\mathcal{Z}(\eta)$ is the log-partition function, and $s = \nabla A(\eta)$ denotes the moments of $p_{\eta}$, just like in the paper. This observation is interesting as we may be looking at special cases of a more general result.</p>
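<p>For a concrete check, take Bernoulli distributions in natural form, where $A(\eta) = \log(1 + e^\eta)$ and $\nabla A(\eta) = \sigma(\eta)$ is the mean. The formula can be verified against a direct KL computation (note the direction: with $s = \nabla A(\eta)$ the right-hand side is the KL from $p_\eta$ to $p_{\eta'}$), and the symmetrized divergence takes the pleasingly direction-free form $(\eta' - \eta)^\top(s' - s)$. This little sketch is mine, not from the paper:</p>

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def log_partition(eta):
    # A(eta) = log(1 + e^eta) for a Bernoulli in natural form
    return np.log1p(np.exp(eta))

def kl_direct(mu_a, mu_b):
    # KL between Bernoulli(mu_a) and Bernoulli(mu_b), computed directly
    return mu_a * np.log(mu_a / mu_b) + (1 - mu_a) * np.log((1 - mu_a) / (1 - mu_b))

eta_a, eta_b = -1.0, 2.0                  # arbitrary natural parameters
mu_a, mu_b = sigmoid(eta_a), sigmoid(eta_b)

# with s = grad A(eta_a) = mu_a, the formula gives the KL from p_{eta_a} to p_{eta_b}
kl_formula = log_partition(eta_b) - log_partition(eta_a) - mu_a * (eta_b - eta_a)

# sum of the two KLs: the direction-free form (eta' - eta)(s' - s)
sym_kl = kl_direct(mu_a, mu_b) + kl_direct(mu_b, mu_a)

print(kl_formula, kl_direct(mu_a, mu_b))        # equal
print(sym_kl, (eta_b - eta_a) * (mu_b - mu_a))  # equal
```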

<h2 id="generalizationtobregmandivergences">Generalization to Bregman divergences</h2>

<p>Any convex functional $H$ over a convex domain induces a Bregman divergence, defined between two points $p$ and $q$ as:</p>

<p>$$
D_H(p\|q) = H(p) - H(q) - \langle\nabla H(q), p-q \rangle,
$$
where $\langle \cdot,\cdot \rangle$ denotes an inner product. I used this notation rather than vector products to emphasize the fact that $p$ and $q$ might be functions or probability distributions, not necessarily just finite dimensional vectors. In what follows I switch back to the usual vector notation.</p>

<p>Examples of Bregman divergences include the KL divergence between distributions, the squared Euclidean distance between points in a Euclidean space, the maximum mean discrepancy between probability distributions, and many more. To read more about them, I recommend <a href="http://mark.reid.name/blog/meet-the-bregman-divergences.html">Mark Reid's great summary</a>.</p>
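<p>In code, a Bregman divergence is a one-liner, and the two best-known special cases drop out immediately. A quick sketch with arbitrary points (the helper names are mine):</p>

```python
import numpy as np

def bregman(H, grad_H, p, q):
    # D_H(p||q) = H(p) - H(q) - <grad H(q), p - q>
    return H(p) - H(q) - grad_H(q) @ (p - q)

# H(p) = ||p||^2 induces the squared Euclidean distance
H_sq = lambda p: p @ p
grad_sq = lambda p: 2 * p

# negative entropy induces the KL divergence (for normalized p, q)
H_ent = lambda p: np.sum(p * np.log(p))
grad_ent = lambda p: np.log(p) + 1

p = np.array([0.2, 0.3, 0.5])
q = np.array([0.4, 0.4, 0.2])

print(bregman(H_sq, grad_sq, p, q), np.sum((p - q) ** 2))         # equal
print(bregman(H_ent, grad_ent, p, q), np.sum(p * np.log(p / q)))  # equal
```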

<p>We can replace the KL divergence in $\delta$ by an arbitrary Bregman divergence, which yields the following generalization of Eqn. (4):</p>

<p>\begin{align}
K \sum_{k=0}^{K-1}D_H(p_k\|p_{k+1}) &amp;\rightarrow \frac{1}{2}\int_0^1 \dot{p}(\beta)^\top \nabla^2 H(\beta) \dot{p}(\beta) d\beta\\
&amp;= \frac{1}{2}\int_0^1 \dot{\nabla H}(\beta)^\top \dot{p}(\beta) d\beta,
\end{align}
where $\dot{\nabla H}(\beta)$ is the derivative of $\nabla H(p(\beta))$ with respect to $\beta$, and the second line was obtained by applying the chain rule $\dot{\nabla H}(\beta) = \nabla^2 H(\beta)\,\dot{p}(\beta)$. $\nabla^2 H(\beta)$ is called the metric tensor, and it is a generalization of the Fisher information matrix $G_\theta$ from before.</p>

<p>It turns out, if we interpolate in $p$-space linearly, that is, trace a path where $p(\beta) = \beta p_1 + (1 - \beta) p_0$, we can express the above integral analytically in terms of the endpoints of the path $p_0$ and $p_1$:</p>

<p>\begin{align}
 \frac{1}{2}\int_0^1 \dot{\nabla H}(\beta)^\top\dot{p}(\beta) d\beta &amp;= \frac{1}{2}\int_0^1 \dot{\nabla H}(\beta)^\top d\beta (p_1 - p_0) \\
&amp;= \frac{1}{2} \left( \nabla H(p_1) - \nabla H(p_0) \right)^\top \left( p_1 - p_0 \right)\\
&amp;= \frac{1}{2}\left( D_H(p_1\|p_0) + D_H(p_0\|p_1) \right).
\end{align}</p>

<p>This final expression is the symmetrized Bregman divergence between the endpoints of the path $p_1$ and $p_0$, a generalization of the symmetrised KL that we found before.</p>
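<p>A numerical sanity check of this result, with an arbitrary made-up convex potential: discretizing the linear path and computing $K \sum_k D_H(p_k \| p_{k+1})$ indeed recovers the symmetrized divergence between the endpoints:</p>

```python
import numpy as np

def bregman(H, grad_H, p, q):
    # D_H(p||q) = H(p) - H(q) - <grad H(q), p - q>
    return H(p) - H(q) - grad_H(q) @ (p - q)

# an arbitrary smooth convex potential, made up for the check
H = lambda p: np.sum(np.exp(p))
grad_H = lambda p: np.exp(p)

p0 = np.array([0.0, 1.0, -0.5])
p1 = np.array([1.0, -1.0, 0.5])

K = 20000
path = [(1 - b) * p0 + b * p1 for b in np.linspace(0.0, 1.0, K + 1)]
cost = K * sum(bregman(H, grad_H, path[k], path[k + 1]) for k in range(K))

sym = 0.5 * (bregman(H, grad_H, p0, p1) + bregman(H, grad_H, p1, p0))
print(cost, sym)   # the discretized path cost approaches the symmetrized divergence
```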

<p>Before showing how this general result is useful to explain the equivalence of Eqns. (18) and (19), I want to take some time to draw some pretty pictures visualizing Bregman divergences, the quantities we are dealing with.</p>

<h3 id="abitofgraphicalintuition">A bit of graphical intuition</h3>

<p>Below is a graphical illustration of a Bregman divergence between two points $p$ and $q$ with respect to some convex potential $H$.</p>

<p><img src="http://www.inference.vc/content/images/2017/11/download-1.png" alt="Grosse's challenge: duality and exponential families"></p>

<p>Here's the story: Poppy and Quentin are about to get married. It is bad luck to see each other all dressed up before the wedding. Therefore, they decide to spend the morning on a convex hill (convex from the top). Poppy is at coordinates $p$ and Quentin at $q\neq p$. The surface of the hill is described by $-H$ where $H$ is a convex function. Precisely because the hill is convex, Quentin can't see Poppy, not unless she starts jumping. The Bregman divergence describes the safe height Poppy is allowed to jump up without Quentin seeing her. The hillier the hill between them - the higher the <em>curvature</em> - the higher Poppy can jump without being seen by Quentin.</p>

<p>For this figure I used a boring parabolic hill, $-H(p) = p(1-p)$. The resulting divergence actually ends up symmetric and is simply proportional to the squared Euclidean distance $(p-q)^2$:</p>

<p><img src="http://www.inference.vc/content/images/2017/11/download-2.png" alt="Grosse's challenge: duality and exponential families"></p>

<p>But this is the exception rather than the rule: most Bregman divergences are asymmetric. For the hill $-H(p) = p(1-p^3)$, this is what the picture looks like:</p>

<p><img src="http://www.inference.vc/content/images/2017/11/download-3.png" alt="Grosse's challenge: duality and exponential families"></p>

<p>OK, so what did we just prove before? Suppose we have a sequence of bridesmaids equally spaced between Poppy and Quentin... Actually, let's stop the wedding analogy before it goes too far.</p>

<p>Consider a sequence of points $p_0,\ldots,p_K$ equally spaced between $p=p_0$ and $q=p_K$. Now look at the sum of divergences between subsequent points, $\sum_k D[p_k\|p_{k+1}]$. The divergences say how high each of the $p_k$ has to jump to see $p_{k+1}$. For $K=3$ we are interested in the sum of the red line segments:</p>

<p><img src="http://www.inference.vc/content/images/2017/11/download-5.png" alt="Grosse's challenge: duality and exponential families"></p>

<p>For $K=5$:</p>

<p><img src="http://www.inference.vc/content/images/2017/11/download-7.png" alt="Grosse's challenge: duality and exponential families"></p>

<p>For $K=12$:</p>

<p><img src="http://www.inference.vc/content/images/2017/11/download-9.png" alt="Grosse's challenge: duality and exponential families"></p>

<p>It is pretty clear that the line segments get shorter and shorter, and indeed their sum converges to $0$. What we have proved is that asymptotically this sum behaves like $\frac{1}{2K}\left(D[p\|q] + D[q\|p]\right)$. Why this is the case is still a mystery to me, and I don't know how to even visualise this. But I wouldn't be surprised if there turned out to be a good reason for it.</p>
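<p>The asymptotic claim is easy to check numerically. Here is a small sketch; the choice of potential (negative binary entropy) and of the endpoints are arbitrary choices of mine:</p>

```python
import numpy as np

# Check that K * sum_k D_H[p_k || p_{k+1}] along a linear path with K
# segments approaches (D_H[p||q] + D_H[q||p]) / 2 as K grows.

def H(p):                  # negative binary entropy: a convex potential
    return p * np.log(p) + (1 - p) * np.log(1 - p)

def grad_H(p):
    return np.log(p) - np.log(1 - p)

def bregman(a, b):         # D_H[a || b] = H(a) - H(b) - H'(b) * (a - b)
    return H(a) - H(b) - grad_H(b) * (a - b)

p, q = 0.2, 0.7
target = 0.5 * (bregman(p, q) + bregman(q, p))

for K in [4, 16, 64, 256]:
    path = np.linspace(p, q, K + 1)
    total = sum(bregman(path[k], path[k + 1]) for k in range(K))
    print(K, K * total, target)  # K * total approaches target
```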

<h2 id="dualbregmandivergences">Dual Bregman divergences</h2>

<p>After a bit of an excursion into visualising Bregman divergences in one dimension, let's go back to the question of why linear interpolation in <em>natural parameter</em> space and in <em>moment</em> space gives the same $\mathcal{F}(\gamma)$. The answer lies in duality.</p>

<p>Any convex function $H$ defines a Bregman divergence on a convex domain. Any convex function $H$ also has a convex conjugate $H^\ast$. This convex conjugate, $H^\ast$, also defines a Bregman divergence on its own domain, which is generally different from the domain of $H$. Furthermore, this <em>dual</em> divergence is equivalent to the original in the following sense:</p>

<p>$$
D_{H^{\ast}}[q^{\ast}\|p^{\ast}] = D_{H}[p\|q], <br>
$$
where $p^{\ast} = \nabla H(p)$ and $q^{\ast} = \nabla H(q)$ are called the <em>dual parameters</em> corresponding to $p$ and $q$, respectively. Note that the order of the arguments is swapped in the dual divergence. The mapping between parameters and dual parameters is one-to-one, thanks to convexity. $p$ and $q$ are sometimes called the primal parameters, but this distinction is rather arbitrary as conjugate duality is a symmetric relationship.</p>

<p>With an understanding of duality, let's look at the formula that we obtained before:</p>

<p>\begin{align}
K \sum_{k=0}^{K-1}D_H(p_k\|p_{k+1}) &amp;\rightarrow \frac{1}{2}\int_0^1 \dot{\nabla H}(\beta)^\top \dot{p}(\beta) d\beta \\
&amp;= \frac{1}{2}\int_0^1 \dot{p^\ast}(\beta)^\top \dot{p}(\beta) d\beta\\
&amp;= \frac{1}{2}\left(p^\ast(1) - p^\ast(0)\right)^\top\left(p(1) - p(0)\right).
\end{align}
Observe the bilinearity of this formula with respect to the primal and dual parameters.</p>

<h4 id="conjugatedualityinexponentialfamilies">Conjugate duality in exponential families</h4>

<p>Let's now consider an exponential family of the form:</p>

<p>$$
p(x\vert \eta) = h(x)\exp(\eta^\top g(x) - A(\eta)) <br>
$$</p>

<p>where $A(\eta)=\log\mathcal{Z}(\eta)$ is the log partition function. As $A$ is a convex function of $\eta$, we can define a Bregman divergence in the coordinate system of natural parameters $\eta$ induced by the (convex) log-partition function $A$:</p>

<p>$$
D_A[\eta'\|\eta] = A(\eta') -  A(\eta) - \nabla A(\eta)^T\left( \eta' - \eta\right). <br>
$$</p>

<p>(Notice how similar this is to the KL divergence expression I used before.)</p>

<p>The dual parameter corresponding to $\eta$ turns out to be the moment parameter $s$:</p>

<p>$$
\eta^\ast = \nabla A (\eta) = \mathbb{E}_\eta[g(x)] = s.
$$</p>

<p>The equality in the middle is a well-known property of exponential families, the proof is similar to how you would obtain the REINFORCE gradient estimator, for example. Using this equality we can rewrite the divergence above as:</p>

<p>$$
D_A[\eta'\|\eta] = A(\eta') -  A(\eta) - s^T\left( \eta' - \eta\right). <br>
$$</p>

<p>If we consider the natural parameters $\eta$ the primal parameters, the moments $s$ are the dual parameters. As always, there is a one-to-one mapping between primal and dual parameters. Using the convex conjugate $A^{\ast}$, we can therefore define another Bregman divergence in the coordinate system of moments $s$ as follows:</p>

<p>\begin{align}
D_{A^\ast}[s'\|s] &amp;= A^\ast(s') -  A^\ast(s) - \nabla A^\ast(s)^T\left(s' - s\right)\\ <br>
      &amp;= A^\ast(s') -  A^\ast(s) - \eta^T\left(s' - s\right)
\end{align}</p>

<p>As it turns out, this convex conjugate $A^\ast$ is the negative Shannon entropy of the distribution, parametrized by its moments $s$ (it's beyond this post to show this; see e.g. <a href="http://onlinelibrary.wiley.com/doi/10.1002/9781118857281.ch9/summary">this book chapter</a>).</p>

<p>So now we have two Bregman divergences, one in the primal space of $\eta$ and one in the dual space of $s$. These two divergences are equivalent, and in this case they are also both equivalent to the Kullback-Leibler divergence between the corresponding distributions:</p>

<p>$$
D_{A}[\eta'\|\eta] = D_{A^\ast}[s\|s'] = D_{KL}[p\|p'] <br>
$$</p>

<p>So, putting these things together, we can now understand why Eqns. (18) and (19) give the same result. If we interpolate linearly in primal space between $\eta(0)$ and $\eta(1)$ we get that the loss of the path is:</p>

<p>$$
\mathcal{F}(\gamma_{GA}) = \frac{1}{2}\left( D_A[\eta(1)\|\eta(0)] + D_A[\eta(0)\|\eta(1)]\right)
$$</p>

<p>Similarly, interpolating between $s(0)$ and $s(1)$ in moment space we obtain:</p>

<p>$$
\mathcal{F}(\gamma_{MA}) = \frac{1}{2}\left( D_{A^\ast}[s(1)\|s(0)] + D_{A^\ast}[s(0)\|s(1)]\right).
$$</p>

<p>And by the duality of $A$ and $A^\ast$, the two are equal, and also equal to the symmetrized KL divergence between the distributions.</p>

<p>Interestingly, this equivalence also suggests that linear interpolation in distribution space, $p(\beta) = \beta p_1 + (1 - \beta) p_0$, would give the same result. This would interpolate between $p_0$ and $p_1$ via mixtures of the two distributions. This option may not be of practical relevance though.</p>

<h2 id="interpolationinotherparameterspaces">Interpolation in other parameter-spaces</h2>

<p>This framework might allow us to study linear interpolation in parameter spaces $\theta$ which are related to the natural parameters by some map $\eta = f(\theta)$. So long as $f$ is invertible and the composition $A \circ f$ is convex, we can define a Bregman divergence in $\theta$-space using $A \circ f$ as the potential. I'm not sure if this would actually work out or whether it would yield any insights.</p>

<h2 id="summary">Summary</h2>

<p>This was my attempt at providing more insight into Theorem 2 of Roger's paper, which he found very surprising. My post probably does little to make the result look less surprising; it merely shifts the surprise elsewhere. The story goes as follows:</p>

<ol>
<li>Integrating the Bregman divergence along a linear path, we obtain the symmetrized Bregman divergence between the endpoints of the path.  </li>
<li>For exponential families, the KL divergence can be equivalently expressed as a Bregman divergence in natural parameter space or as a Bregman divergence in moment space. The scalar potentials of these divergences are conjugate duals of each other, hence their equivalence.  </li>
<li>Therefore, integrating the KL divergence along a linear path in either $\eta$- or $s$-space yields the same result, due to the equivalence of the dual Bregman divergences.</li>
</ol>]]></content:encoded></item><item><title><![CDATA[Thanksgiving Special 🦃: GANs are Being Fixed in More than One Way]]></title><description><![CDATA[<p>In the spirit of thanksgiving, let me start by thanking all the active commenters on my blog: you are always very quick to point out typos, flaws and references to literature I overlooked.</p>

<p>This post is basically a follow-up my earlier post, "<a href="http://www.inference.vc/my-notes-on-the-numerics-of-gans/">GANs are broken in more than one way</a></p>]]></description><link>http://www.inference.vc/gans-are-being-fixed-in-more-than-one-way/</link><guid isPermaLink="false">bb4f311d-e633-49e2-b94e-3d69d1556c81</guid><dc:creator><![CDATA[Ferenc Huszar]]></dc:creator><pubDate>Thu, 23 Nov 2017 15:14:38 GMT</pubDate><media:content url="http://www.inference.vc/content/images/2017/11/Screen-Shot-2017-11-23-at-3.13.40-PM.png" medium="image"/><content:encoded><![CDATA[<img src="http://www.inference.vc/content/images/2017/11/Screen-Shot-2017-11-23-at-3.13.40-PM.png" alt="Thanksgiving Special 🦃: GANs are Being Fixed in More than One Way"><p>In the spirit of thanksgiving, let me start by thanking all the active commenters on my blog: you are always very quick to point out typos, flaws and references to literature I overlooked.</p>

<p>This post is basically a follow-up to my earlier post, "<a href="http://www.inference.vc/my-notes-on-the-numerics-of-gans/">GANs are broken in more than one way</a>". In this one, I review and recommend some additional references people pointed me to after the post was published. It turns out, unsurprisingly, that there is a lot more work on these questions than I was aware of. Although GANs are kind of broken, they are also actively being fixed, in more than one way.</p>

<h3 id="thiswastalkedaboutbefore">This was talked about before</h3>

<p>In my post I focussed on potential problems that arise when we apply simultaneous gradient descent on a non-conservative vector field, or equivalently, a non-cooperative game.</p>

<p>These problems have been pointed out by a number of authors before. As one example, <a href="https://arxiv.org/abs/1606.03498">(Salimans et al, 2016)</a> talked about this and in fact used the same pathological example I used in my plot: the one which gives rise to the constant curl field:</p>

<p><img src="http://www.inference.vc/content/images/2017/11/foo.gif" alt="Thanksgiving Special 🦃: GANs are Being Fixed in More than One Way"></p>

<p><a href="https://arxiv.org/abs/1610.01945">Pfau and Vinyals (2016)</a> pointed out connections between GANs and actor critic methods. GANs and actor-critic methods are both notoriously hard to optimize, and the authors argue that these difficulties arise because of the same reasons. This means that two sets of people have been struggling to solve the same underlying problems, so it is reasonable to hope that we can learn from each other's innovations.</p>

<h2 id="wedontalwaysusesimultaneousgradientdescent">We don't always use simultaneous gradient descent</h2>

<p>Many people pointed out that when using different variants of GANs, we don't actually use simultaneous gradient descent. Notably, in <a href="https://arxiv.org/abs/1701.07875">WGANs</a>, and also often in normal GANs when using instance noise, we always optimize the discriminator until convergence before taking a step with the generator. This nested-loop algorithm has very different behaviour from simultaneous gradient descent. Indeed, you can argue it converges, inasmuch as the outer loop which optimizes the generator can be interpreted as vanilla gradient descent on a lower bound. Of course, this convergence only happens in practice if the optimal discriminator is unique with respect to the generator, or at least if the solution is stable enough that the outer loop indeed evaluates gradients of a deterministic scalar function.</p>

<p><a href="https://arxiv.org/abs/1706.08500">(Heusel et al, 2017)</a> propose another optimization strategy relying on updating the discriminator and generator at different time scales/learning rates. The authors prove that this algorithm always converges to a local Nash equilibrium, and this property holds even when using the Adam optimizer on top. This algorithm was later used successfully for training <a href="https://arxiv.org/abs/1708.08819">Coulomb GANs</a>.</p>

<h2 id="gradientdescentganoptimizationislocallystable">Gradient Descent GAN optimization is locally stable</h2>

<p>In another great NIPS paper, <a href="https://arxiv.org/abs/1706.04156">(Nagarajan and Kolter, 2017)</a> study the convergence properties of simultaneous gradient descent for various GAN settings. Using similar apparatus to the <a href="https://arxiv.org/abs/1705.10461">Numerics of GANs</a> paper, they show that simultaneous GD is actually stable and convergent in the local neighborhood of the Nash equilibrium. On the other hand, it is also shown that WGANs have non-convergent limit cycles around the equilibrium (these are vector fields that make you go around in circles, like the example above). So, in WGANs we have a real problem with gradient descent. Based on these observations, the authors go on to define a regularization strategy that stabilizes gradient descent. Highly recommended paper.</p>]]></content:encoded></item><item><title><![CDATA[A Cookbook for Machine Learning: Vol 1]]></title><description><![CDATA[<p>This was a busy week, I had no time to read anything new, so I'm sharing a note that I wrote for myself, for no other reason than to understand things better. It's a kind of cookbook of various "transformations" you can apply to a machine learning problem to eventually</p>]]></description><link>http://www.inference.vc/design-patterns/</link><guid isPermaLink="false">51417300-aef2-4a03-8877-e760ccc46218</guid><dc:creator><![CDATA[Ferenc Huszar]]></dc:creator><pubDate>Thu, 16 Nov 2017 15:23:17 GMT</pubDate><content:encoded><![CDATA[<p>This was a busy week, I had no time to read anything new, so I'm sharing a note that I wrote for myself, for no other reason than to understand things better. It's a kind of cookbook of various "transformations" you can apply to a machine learning problem to eventually turn it into something we know how to solve: seeking stable attractors of a tractable vector field.</p>

<p>The typical setup is: you have some model parameters $\theta$. You seek to optimize some objective criterion, but the optimization problem is intractable or hard in one of the ways listed below. You then apply the corresponding transformation to your problem if you can. If your problem is now one you can efficiently optimize, great. If not, you can recursively apply the transformations until it is.</p>

<p>UPDATE: Although I called this post a cookbook, as readers so rightly pointed out, it is too light on details to be considered a cookbook. Consider it as a demonstration of a way of thinking about machine learning research as a compiler that compiles an abstract machine learning objective into the canonical optimization problem of finding stable attractors of a tractable vector field.</p>

<p>For the first batch, I have written up the following problem transformations:</p>

<ul>
<li><a href="http://www.inference.vc/design-patterns/#variationalupperbounds">Variational bounds</a></li>
<li><a href="http://www.inference.vc/design-patterns/#adversarialgames">Adversarial games</a></li>
<li><a href="http://www.inference.vc/design-patterns/#evolutionstrategies">Evolution Strategies</a></li>
<li><a href="http://www.inference.vc/design-patterns/#convexrelaxation">Convex relaxation</a></li>
</ul>

<p>There are many more transformations not included here, like the duality principle, half-quadratic splitting, Lagrangian multipliers, etc. Feel free to leave comments about what else I should include next.</p>

<h2 id="variationalbounds">Variational bounds</h2>

<h4 id="typicalproblem">Typical problem:</h4>

<p>My loss function $f(\theta)$ is intractable to compute, typically because it involves an intractable marginalization. I can't evaluate it, let alone minimize it.</p>

<h4 id="solution">Solution:</h4>

<p>Let's construct a family of - typically differentiable - upper-bounds: <br>
$$
f(\theta) \leq \inf_\psi g(\theta, \psi), <br>
$$
and solve the optimization problem <br>
$$
\theta^\ast, \psi^\ast \leftarrow \operatorname{argmin}_{\theta, \psi} g(\theta, \psi)
$$
instead. Technically, once optimization is finished, you can discard the auxiliary parameter $\psi^\ast$ - although it often turns out to be meaningful and useful in itself, for example for approximate inference, as in the recognition model of VAEs.</p>

<h4 id="tricksofthetrade">Tricks of the trade:</h4>

<p><em>Jensen's inequality:</em> The mean value of a convex function is never lower than the value of the convex function applied to the mean.  Generally appears in some variant of the standard evidence lower bound (ELBO) derivation below:</p>

<p>\begin{align}
- \log p(x) &amp;= - \log \int p(x,y) dy \\
&amp;= - \log \int q(y\vert x) \frac{p(y,x)}{q(y\vert x)}dy \\
&amp;\leq - \int q(y\vert x) \log \frac{p(y,x)}{q(y\vert x)} dy
\end{align}</p>
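<p>As a minimal numeric illustration of this bound (the joint probabilities below are made up), take a model with a discrete latent $y$ taking three values at a fixed observation $x$:</p>

```python
import numpy as np

# Check that -log p(x) <= -sum_y q(y|x) log(p(y, x) / q(y|x)) for any q.
p_joint = np.array([0.10, 0.25, 0.05])   # p(y, x) for y = 0, 1, 2
q = np.array([0.2, 0.5, 0.3])            # an arbitrary variational posterior

neg_log_px = -np.log(p_joint.sum())
neg_elbo = -np.sum(q * np.log(p_joint / q))
print(neg_log_px, neg_elbo)  # the first never exceeds the second
```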

<p><em>Reparametrization trick:</em> In variational inference we often encounter gradients of the form:
$$
\frac{\partial}{\partial \theta_i}\mathbb{E}_{x \sim q_\theta}\left[f(x, q_\theta(x))\right],
$$
where the pdf of the variable appears in the integrand. If we can find a function $h:(\mathcal{E}, \Theta)\mapsto \mathcal{X}$ which is differentiable w.r.t. its second argument, and probability distribution $p_\epsilon$ over $\mathcal{E}$ which is easy to sample from, such that the following holds: <br>
$$
x = h(\epsilon, \theta), \epsilon \sim p_\epsilon \iff x \sim q_{\theta}, <br>
$$
We can use the following reformulation of integrals we often encounter in variational upper bounds: <br>
$$
\frac{\partial}{\partial \theta_i}\mathbb{E}_{x \sim q_\theta}\left[f(x, q_\theta(x))\right] = \mathbb{E}_{\epsilon \sim p_\epsilon}\left[ \frac{\partial}{\partial \theta_i} f(h(\epsilon, \theta), q_\theta(h(\epsilon, \theta)))\right]
$$
Monte Carlo estimators of this expectation typically have substantially lower variance than REINFORCE estimators of the same quantity.</p>
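<p>To make the variance claim concrete, here is a toy comparison of the two estimators for $\frac{\partial}{\partial \theta}\mathbb{E}_{x \sim \mathcal{N}(\theta, 1)}[x^2]$, whose true value is $2\theta$ (the target function and all numbers are my own toy choices):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n = 1.5, 100_000
eps = rng.standard_normal(n)   # x = theta + eps reparametrizes N(theta, 1)

# Reparametrization estimator: d/dtheta (theta + eps)^2 = 2 * (theta + eps)
reparam = 2 * (theta + eps)

# REINFORCE estimator: f(x) * d/dtheta log N(x; theta, 1) = x^2 * (x - theta)
x = theta + eps
reinforce = x**2 * (x - theta)

print(reparam.mean(), reparam.var())      # mean close to 2*theta = 3
print(reinforce.mean(), reinforce.var())  # same mean, much larger variance
```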

<h2 id="adversarialgames">Adversarial games</h2>

<h4 id="typicalproblem">Typical problem:</h4>

<p>I can't estimate my loss function $f(\theta)$ directly from samples, typically because the loss function depends on the pdf of the data distribution or that of my model, or both.</p>

<h4 id="solution">Solution:</h4>

<p>We can sometimes construct an approximation so that <br>
$$
f(\theta) \approx h(\theta, \operatorname{argmin}_\psi g(\theta, \psi)), <br>
$$
and then we can solve the problem of finding a stable equilibrium of the two-player game where players minimize losses $g$ and $h$ with respect to $\psi$ and $\theta$, respectively. <br>
Sometimes, the approximations may come in the form of lower-bounds, which is the case when $h=-g$: <br>
$$
f(\theta) \geq \sup_\psi g(\theta, \psi), <br>
$$
in which case we can solve the following minimax problem instead: <br>
$$
\theta^\ast \leftarrow \operatorname{argmin}_{\theta} \max_{\psi} g(\theta, \psi)
$$</p>

<h4 id="tricksofthetrade">Tricks of the trade:</h4>

<p><em>Bayes optimality in auxiliary tasks:</em> When your loss function depends on densities of probability distributions you can easily sample from, often you can construct an auxiliary task, whose Bayes optimal solution depends on values of the densities in question. Examples of auxiliary tasks are: binary classification for likelihood ratio estimation, denoising or score matching for estimating the score function.</p>

<p><em>Convex conjugate:</em> In case your loss function involves a convex functional of the densities (such as in f-divergences), you can transform your problem by re-expressing it in terms of the convex conjugate. The expression for $f$ in terms of its convex conjugate $f^{\ast}$ is:
\begin{align}
f(u) &amp;= \sup_{v \in \operatorname{dom}(f^{\ast})} \{ \langle u, v \rangle - f^{\ast}(v) \} \\
&amp;\geq \sup_{\psi} \{ \langle u, v_\psi \rangle - f^{\ast}(v_\psi) \}.
\end{align}
Note that if $u$ is a density function, then the inner product $\langle u,v_\psi\rangle$ is an expectation of $v_\psi$, which can be approximated by Monte Carlo sampling.</p>

<h2 id="evolutionstrategies">Evolution strategies</h2>

<h4 id="typicalproblem">Typical problem</h4>

<p>My $f(\theta)$ is easy to evaluate but hard to optimize, perhaps because it contains discrete operations, or the function is piecewise constant. Backpropagation is not possible.</p>

<h4 id="solution">Solution</h4>

<p>Observe that for any probability distribution $p_\psi$ over $\theta$ the following holds: <br>
$$
\min_{\theta}f(\theta) \leq \min_{\psi} \mathbb{E}_{\theta\sim p_\psi} f(\theta)
$$
therefore in ES we concentrate on the following optimization problem instead: <br>
$$
\psi^{\ast} \leftarrow \operatorname{argmin}_{\psi} \mathbb{E}_{\theta\sim p_\psi} f(\theta)
$$
Often, depending on the function $f$ and the class of distributions $p_\psi$, a local minimum of $f$ can be recovered from a local minimum $\psi^{\ast}$ of the surrogate objective.</p>

<h4 id="tricksofthetrade">Tricks of the trade</h4>

<p><em>REINFORCE gradient estimator:</em> It relies on the following trick
$$
\frac{\partial}{\partial \psi_i} \mathbb{E}_{\theta\sim p_\psi}[f(\theta)] = \mathbb{E}_{\theta \sim p_\psi}\left[\frac{\partial}{\partial \psi_i}\log p_\psi(\theta)f(\theta)\right],
$$
where the RHS can be easily approximated by Monte Carlo. The variance of Monte Carlo REINFORCE estimators tends to be quite high.</p>
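<p>Putting the recipe together, here is a toy sketch of evolution strategies minimizing a piecewise constant objective using the REINFORCE estimator above (the objective, step sizes and sample counts are illustrative choices of mine):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def f(theta):
    # piecewise constant objective: its gradient is zero almost everywhere,
    # so backpropagation would be useless here
    return np.abs(np.round(theta) - 3.0)

mu, sigma, lr = 0.0, 1.0, 0.05     # p_psi is N(mu, sigma^2)
for _ in range(2000):
    eps = rng.standard_normal(200)
    theta = mu + sigma * eps
    # REINFORCE estimate of the gradient of E_{theta ~ p_psi}[f(theta)]
    # with respect to mu: E[f(theta) * (theta - mu) / sigma^2]
    grad_mu = np.mean(f(theta) * eps) / sigma
    mu -= lr * grad_mu

print(mu)  # ends up close to the minimizer at 3.0
```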

<h2 id="convexrelaxation">Convex relaxation</h2>

<h4 id="typicalproblem">Typical problem</h4>

<p>My $f(\theta)$ is hard to optimize, because it has non-differentiable and non-convex components like the $\ell_0$-norm of a vector in sparse methods, or the Heaviside step function in classification.</p>

<h4 id="solution">Solution</h4>

<p>Replace the non-convex component by a convex approximation, turning your objective into a now typically convex $g$ <br>
$$
f(\theta) \approx g(\theta) <br>
$$</p>

<h4 id="tricksofthetrade">Tricks of the trade:</h4>

<p><em>$\ell_1$ loss:</em> In many sparse learning situations we wish to minimize the number of non-zero entries in a vector, known as the $\ell_0$ loss. You can typically replace this by the $\ell_1$ norm of the vector.</p>
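<p>A tiny sketch of why the $\ell_1$ relaxation still promotes sparsity: the $\ell_1$-penalized scalar problem has a closed-form soft-thresholding solution, which sets small coefficients exactly to zero (the numbers are arbitrary):</p>

```python
import numpy as np

# argmin_theta 0.5 * (theta - y)^2 + lam * |theta| is given by
# soft-thresholding, which zeroes out small entries exactly.
def soft_threshold(y, lam):
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

y = np.array([3.0, 0.4, -2.0, 0.1, 0.0])
theta = soft_threshold(y, lam=0.5)
print(theta)  # [ 2.5  0.  -1.5  0.   0. ]
```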

<p><em>hinge loss and large margin methods:</em> The error rate of a binary classifier under the zero-one loss is typically a piecewise constant function of the parameters of the classifier, and is thus hard to optimize. We can replace the zero-one loss with the hinge loss, which can be understood as a convex upper bound on it. The resulting optimization problem will have a tendency to maximize the classifier's margin.</p>]]></content:encoded></item><item><title><![CDATA[Gaussian Distributions are Soap Bubbles]]></title><description><![CDATA[<p>This post is just a quick note on some of the pitfalls we encounter when dealing with high-dimensional problems, even when working with something as simple as a Gaussian distribution.</p>

<p>Last week I <a href="http://www.inference.vc/mixup-data-dependent-data-augmentation/">wrote about <em>mixup</em></a> a new data-augmentation scheme that achieves good performance in a few domains. I illustrated</p>]]></description><link>http://www.inference.vc/high-dimensional-gaussian-distributions-are-soap-bubble/</link><guid isPermaLink="false">43d6712e-d189-4811-9055-25b71b5d7c38</guid><dc:creator><![CDATA[Ferenc Huszar]]></dc:creator><pubDate>Thu, 09 Nov 2017 14:57:57 GMT</pubDate><media:content url="http://www.inference.vc/content/images/2017/11/3066-1.jpg" medium="image"/><content:encoded><![CDATA[<img src="http://www.inference.vc/content/images/2017/11/3066-1.jpg" alt="Gaussian Distributions are Soap Bubbles"><p>This post is just a quick note on some of the pitfalls we encounter when dealing with high-dimensional problems, even when working with something as simple as a Gaussian distribution.</p>

<p>Last week I <a href="http://www.inference.vc/mixup-data-dependent-data-augmentation/">wrote about <em>mixup</em></a>, a new data-augmentation scheme that achieves good performance in a few domains. I illustrated how the method works in a two-dimensional example. While I love understanding problems visually, it can also be problematic, as things work very, very differently in high dimensions. Even something as simple as a Gaussian distribution does.</p>

<h4 id="summaryofthispost">Summary of this post:</h4>

<ul>
<li>in high dimensions, Gaussian distributions are practically indistinguishable from uniform distributions on a sphere whose radius is the square root of the dimension.</li>
<li>when drawing figures with additive Gaussian noise, a useful sanity check is normalizing your Gaussian noise samples and see how that changes the conclusions you would draw.</li>
<li>you should be careful when using the mean or the mode to reason about a distribution in a high-dimensional space (e.g. images). The mode may look nothing like typical samples.</li>
<li>you should also be careful about linear interpolations in latent spaces where you assume a Gaussian distribution.</li>
</ul>

<h2 id="soapbubble">Soap bubble</h2>

<p>One of the properties of Gaussian distributions is that in high dimensions they are almost indistinguishable from a uniform distribution over a sphere. Try drawing samples from a 10k-dimensional standard normal and then calculating their Euclidean norms: they are going to be pretty close to 100, the square root of the number of dimensions. Here's a histogram of the norms of 1000 different samples from a 10k-dimensional standard normal:</p>

<p><img src="http://www.inference.vc/content/images/2017/11/download-90.png" alt="Gaussian Distributions are Soap Bubbles"></p>
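<p>This is easy to reproduce numerically; a small sketch (dimension, sample count and seed are arbitrary):</p>

```python
import numpy as np

# In d dimensions, the norm of a standard normal sample concentrates
# sharply around sqrt(d): for d = 10,000 that's 100, with a spread of
# only about 1/sqrt(2).
rng = np.random.default_rng(42)
d = 10_000
samples = rng.standard_normal((1000, d))
norms = np.linalg.norm(samples, axis=1)
print(norms.mean())  # close to 100
print(norms.std())   # tiny compared to the mean
```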

<p>If I show you a Gaussian distribution in 3D and ask you to find a physical metaphor for it, I think at least some of you would reasonably describe it as something like a ball of mold: it's kind of soft or fluffy but gets more and more dense in the middle. Something like this guy:</p>

<p><img src="http://www.inference.vc/content/images/2017/11/mold.jpg" alt="Gaussian Distributions are Soap Bubbles"></p>

<p>But in high-dimensional spaces, the image you should have in your mind is more like a soap bubble:</p>

<div class="image-div" style="width: 400px;">  
<img src="http://www.inference.vc/content/images/2017/11/3066.jpg" alt="Gaussian Distributions are Soap Bubbles">
</div>

<p>Here's a tool I sometimes use - not nearly often enough - to double check my intuitions: explicitly replace Gaussian distributions with a uniform over the unit ball, or in case of sampling: normalize my Gaussian samples so they all more or less have the same norm. For example, let's see what happens when we look at instance noise or Gaussian data augmentation through this lens:</p>

<p><img src="http://www.inference.vc/content/images/2017/11/download-92.png" alt="Gaussian Distributions are Soap Bubbles"></p>

<p>Oops. That doesn't look very nice, does it? So whenever you think about additive Gaussian noise as convolving a distribution with a smooth Gaussian kernel, try to also think about what happens if you convolved your distribution with a circle instead. In high-dimensional spaces the two are virtually indistinguishable from each other.</p>

<h2 id="modevstypicalsamples">Mode vs typical samples</h2>

<p>Another example of how reasoning can go wrong is looking at the mode of a distribution and expecting it to look like typical random samples drawn from the distribution. Consider images where each pixel is drawn from a Gaussian distribution:</p>

<p><img src="http://www.inference.vc/content/images/2017/11/download-94.png" alt="Gaussian Distributions are Soap Bubbles"></p>

<p>What does the mean or mode of this distribution look like? Well, it's a grey square. It looks nothing like the samples above. What's worse, if instead of white noise you considered correlated noise and then looked at the mode, it could still be the same grey square.</p>

<p>Looking at the mode of the distribution can therefore be misleading. The maximum of a distribution may look nothing like typical samples from it. Consider the beautiful images from the recent <a href="https://distill.pub/2017/feature-visualization/">feature visualization</a> article on Distill. I'm going to steal one of the beautiful pictures here, but please go read and look at the entire thing.</p>

<p><img src="http://www.inference.vc/content/images/2017/11/Screen-Shot-2017-11-09-at-2.12.44-PM.png" alt="Gaussian Distributions are Soap Bubbles"></p>

<p>The top panels show images sampled from the training dataset where the neuron's activations are high. The bottom images are obtained by optimizing the input to maximize the response. Therefore one of the bottom images is a little bit like the mode of a distribution, while the corresponding top images are a bit like random samples from the distribution (not quite, but in essence).</p>

<p>The fact that these don't look like each other should not be surprising in itself. If, say, a neuron responded perfectly to images of Gaussian noise, then the input which maximizes its activation may look like a grey square: nothing like the samples on which it typically activates, and nothing like samples from the Gaussian distribution it has learnt to detect. Therefore, while these images are really cool and beautiful, their diagnostic value is limited in my opinion. One should not jump to the conclusion that 'the neuron may not be detecting what you initially thought', because the mode of a distribution is not really a good summary in high dimensions.</p>

<p>This is complicated even more by the fact that when we look at an image, we use our own visual system. The visual system learns by being exposed to natural stimuli, which are usually modeled as random samples from a distribution. Our retina has never seen the mode of the natural image distribution, so our visual cortex may not even know how to respond to it, we might find it completely weird. If somebody showed the world the mode of their perfect image density model, we could probably still not tell if it is correct or not because no one has ever seen the original.</p>

<h2 id="interpolation">Interpolation</h2>

<p><em>UPDATE: Previously this section was incomplete: I failed to mention that my interpolation scheme only works if the two endpoints are orthogonal, which tends to happen with high probability in high dimensions. Thanks to <a href="https://twitter.com/Phantrosity/status/982487744285986822?s=20">@Phantrosity</a> for highlighting this.</em></p>

<p>The last thing I wanted to mention about Gaussians and high-dimensional spaces is linear interpolation. If you take two samples $x_1$ and $x_2$ from a Gaussian distribution in a high-dimensional space, they will both have roughly the same Euclidean norm, and they will be roughly orthogonal. When you consider a convex combination $p x_1 + (1 - p) x_2$, that convex combination has a norm that is proportional to $\sqrt{p^2 + (1 - p)^2} &lt; 1$. Therefore, interpolating this way takes you inside the soap bubble (hence the name convex combination, BTW). If you want to interpolate between two random samples without leaving the soap bubble, you should instead interpolate in polar coordinates, for example by doing $\sqrt{p} x_1 + \sqrt{1 - p}\, x_2$ as illustrated below:</p>

<p><img src="http://www.inference.vc/content/images/2017/11/download-95.png" alt="Gaussian Distributions are Soap Bubbles"></p>

<p>This figure shows the effect of interpolating linearly in cartesian space (top row) or linearly in polar space (bottom row) between two Gaussian images (left and right). The plot at the top shows how the norm changes as we interpolate linearly or in polar space; you can see that midway between the two images, the norm is quite a bit lower than the norm of either endpoint. You can actually see that linear interpolation results in 'fainter' interpolants.</p>

<p>So generally, when interpolating in latent spaces of something like a VAE or a GAN where you assume the distribution is Gaussian, always interpolate in polar coordinates, rather than in cartesian coordinates. As kindly pointed out by commenters, the scheme I proposed here only works well if the endpoints are independently sampled from a high-D Gaussian and are therefore approximately orthogonal. For a more robust spherical interpolation scheme you might want to opt for something like <a href="https://en.wikipedia.org/wiki/Slerp">SLERP</a>.</p>
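<p>Here is a quick numerical sketch of the claim (the dimension and seed are my own toy choices, not from the post): compare the norm of the midpoint under the two interpolation schemes.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10_000                       # a high-dimensional space
x1, x2 = rng.standard_normal(d), rng.standard_normal(d)

p = 0.5                          # midpoint
linear = p * x1 + (1 - p) * x2                  # cartesian interpolation
polar = np.sqrt(p) * x1 + np.sqrt(1 - p) * x2   # interpolation 'on the bubble'

# endpoints have norm ~ sqrt(d); the linear midpoint shrinks by ~sqrt(1/2)
print(np.linalg.norm(x1) / np.sqrt(d))      # ~1.0
print(np.linalg.norm(linear) / np.sqrt(d))  # ~0.71
print(np.linalg.norm(polar) / np.sqrt(d))   # ~1.0
```

<p>Linear interpolation shrinks the midpoint's norm by a factor of about $\sqrt{0.5}$, while the polar scheme keeps it on the shell of radius $\sqrt{d}$.</p>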

<h2 id="innerproducts">Inner products</h2>

<p>Did I mention two random samples drawn from a standard Normal in a high-dimensional space are orthogonal to each other with a very high probability? Well, they are.</p>]]></content:encoded></item><item><title><![CDATA[mixup: Data-Dependent Data Augmentation]]></title><description><![CDATA[<p>By popular demand, here is my post on <em>mixup</em>, a new data augmentation scheme that was shown to improve generalization and stabilize GAN performance.</p>

<ul>
<li>H Zhang, M Cisse, YN Dauphin and D Lopez-Paz (2017) <a href="https://arxiv.org/abs/1710.09412"><em>mixup</em>: Beyond Empirical Risk Minimization</a></li>
</ul>

<p>I have to say I have not seen this paper before</p>]]></description><link>http://www.inference.vc/mixup-data-dependent-data-augmentation/</link><guid isPermaLink="false">871c52f9-3a37-436f-a0b3-f62ff0d41e51</guid><dc:creator><![CDATA[Ferenc Huszar]]></dc:creator><pubDate>Thu, 02 Nov 2017 15:40:04 GMT</pubDate><media:content url="http://www.inference.vc/content/images/2017/11/download-87-1.png" medium="image"/><content:encoded><![CDATA[<img src="http://www.inference.vc/content/images/2017/11/download-87-1.png" alt="mixup: Data-Dependent Data Augmentation"><p>By popular demand, here is my post on <em>mixup</em>, a new data augmentation scheme that was shown to improve generalization and stabilize GAN performance.</p>

<ul>
<li>H Zhang, M Cisse, YN Dauphin and D Lopez-Paz (2017) <a href="https://arxiv.org/abs/1710.09412"><em>mixup</em>: Beyond Empirical Risk Minimization</a></li>
</ul>

<p>I have to say I have not seen this paper before people on twitter suggested I should write a post about this - which was yesterday. So these are all very fresh thoughts, warranty not included.</p>

<h3 id="summaryofthispost">Summary of this post</h3>

<ul>
<li>I start by introducing mixup, and immediately reformulate it  in such a way that does not involve convex combinations of labels anymore, making it more comparable to traditional data augmentation. This reformulation works as long as we use binary or multinomial cross-entropy loss.</li>
<li>This formulation immediately opens the door for using mixup in a <em>semi-supervised</em> fashion, and will help us later when thinking about its generalization properties.</li>
<li>I talk about a way to understand generalization of mixup in the context of binary classification, and express the ideal test error in a form that can be evaluated in closed form for classification. </li>
<li>I argue that the reason this works well for GANs is similar to why <a href="http://www.inference.vc/instance-noise-a-trick-for-stabilising-gan-training/">instance noise</a> works.</li>
</ul>

<h2 id="mixup">Mixup</h2>

<p>Let's jump right in the middle, here is how the <em>mixup</em> training loss is defined:</p>

<p>$$
\mathcal{L}(\theta) = \mathbb{E}_{x_1,y_1\sim p_{train}} \mathbb{E}_{x_2,y_2\sim p_{train}} \mathbb{E}_{\lambda\sim\beta(\alpha,\alpha)} \ell(\lambda x_1 + (1 - \lambda) x_2, \lambda y_1 + (1 - \lambda) y_2)
$$</p>

<p>Very simply, we take pairs of datapoints $(x_1, y_1)$ and $(x_2, y_2)$, then choose a random mixing proportion $\lambda$ from a Beta distribution, and create an artificial training example $(\lambda x_1 + (1 - \lambda) x_2, \lambda y_1 + (1 - \lambda) y_2)$. We train the network by minimizing the loss on mixed-up datapoints like this. This is all.</p>

<p>One intuition behind this is that by linearly interpolating between datapoints, we incentivize the network to act smoothly and kind of interpolate nicely between datapoints - without sharp transitions.</p>

<h2 id="reformulation">Reformulation</h2>

<p>Now, let's assume the loss $\ell$ is linear in its second argument, such that</p>

<p>$$
\ell(x, p y_1 + (1 - p) y_2) =  p \ell(x, y_1) + (1 - p) \ell(x, y_2)
$$</p>

<p>This is the case in classification, where the loss is the binary cross entropy $\ell(x,y) = - y \log p(x;\theta) - (1 - y) \log (1 - p(x;\theta))$. It also works the same way for one-hot-encoded categorical labels.</p>
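<p>This linearity is easy to sanity-check numerically (the numbers below are arbitrary, nothing from the paper):</p>

```python
import numpy as np

def bce(q, y):
    # binary cross-entropy between a (possibly soft) label y and probability q
    return -y * np.log(q) - (1 - y) * np.log(1 - q)

q = 0.3                        # some predicted probability p(x; theta)
y1, y2, lam = 1.0, 0.0, 0.7    # two labels and a mixing weight

blended = bce(q, lam * y1 + (1 - lam) * y2)
split = lam * bce(q, y1) + (1 - lam) * bce(q, y2)
print(np.isclose(blended, split))  # True: the loss is linear in the label
```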

<p>In these cases, we can rewrite the <em>mixup</em> objective as</p>

<p>\begin{align}
&amp;\mathbb{E}_{x_1,y_1\sim p_\mathcal{D}} \mathbb{E}_{x_2,y_2\sim p_\mathcal{D}} \mathbb{E}_{\lambda\sim\beta(\alpha, \alpha)} \ell(\lambda x_1 + (1 - \lambda) x_2, \lambda y_1 + (1 - \lambda) y_2) =\\
&amp;\mathbb{E}_{x_1,y_1\sim p_\mathcal{D}} \mathbb{E}_{x_2,y_2\sim p_\mathcal{D}} \mathbb{E}_{\lambda\sim\beta(\alpha, \alpha)} \lambda \ell(\lambda x_1 + (1 - \lambda) x_2, y_1) + (1 - \lambda) \ell(\lambda x_1 + (1 - \lambda) x_2, y_2)=\\
&amp;\mathbb{E}_{x_1,y_1\sim p_\mathcal{D}} \mathbb{E}_{x_2,y_2\sim p_\mathcal{D}} 
\mathbb{E}_{\lambda\sim\beta(\alpha, \alpha)}
\mathbb{E}_{z \sim Ber(\lambda)} z \ell(\lambda x_1 + (1 - \lambda) x_2, y_1) + (1 - z) \ell(\lambda x_1 + (1 - \lambda) x_2, y_2)=\\
&amp;\mathbb{E}_{x_1,y_1\sim p_\mathcal{D}} \mathbb{E}_{x_2,y_2\sim p_\mathcal{D}} 
\mathbb{E}_{z \sim Ber(0.5)}
\mathbb{E}_{\lambda \sim \beta(\alpha + z, \alpha + 1 - z)} z \ell(\lambda x_1 + (1 - \lambda) x_2, y_1) + (1 - z) \ell(\lambda x_1 + (1 - \lambda) x_2, y_2)=\\
&amp;\frac{1}{2}\mathbb{E}_{x_1,y_1\sim p_\mathcal{D}} \mathbb{E}_{x_2,y_2\sim p_\mathcal{D}}
\mathbb{E}_{\lambda \sim \beta(\alpha + 1, \alpha)} \ell(\lambda x_1 + (1 - \lambda) x_2, y_1) + 
\frac{1}{2}\mathbb{E}_{x_1,y_1\sim p_\mathcal{D}} \mathbb{E}_{x_2,y_2\sim p_\mathcal{D}}
\mathbb{E}_{\lambda \sim \beta(\alpha, \alpha + 1)} \ell(\lambda x_1 + (1 - \lambda) x_2, y_2) = \\
&amp;\mathbb{E}_{x,y\sim p_\mathcal{D}}
\mathbb{E}_{x'\sim p_\mathcal{D}}
\mathbb{E}_{\lambda \sim \beta(\alpha + 1, \alpha)} 
\ell(\lambda x + (1 - \lambda) x', y)
\end{align}</p>

<p>Line by line, I used the following tricks:</p>

<ol>
<li>linearity of the loss, as in assumption above</li>
<li>expectation of a Bernoulli($\lambda$) variable plus linearity of expectation</li>
<li>Bayes' rule $p(z\vert \lambda)p(\lambda) = p(\lambda\vert z)p(z)$ and the fact that the Beta distribution is the conjugate prior for the Bernoulli</li>
<li>expectation of a Bernoulli(0.5) plus linearity of expectation</li>
<li>symmetry of the Beta distribution in the sense that $\lambda \sim Beta(a,b)$ implies $1-\lambda \sim Beta(b,a)$, plus changing variable names in the expectation so the two terms become the same</li>
</ol>

<p>So, here is what we are left with:</p>

<p>$$
\mathcal{L}(\theta) = \mathbb{E}_{(x,y)\sim p_\mathcal{D}}
\mathbb{E}_{\lambda \sim \beta(\alpha + 1, \alpha)} 
\mathbb{E}_{x'\sim p_\mathcal{D}}
\ell(\lambda x + (1 - \lambda) x', y)
$$</p>

<p>I think this formulation is much nicer, because:</p>

<ul>
<li>you no longer have blending of labels, which to me was the weirdest thing to begin with. This is a lot like the usual data augmentation, which only manipulates the input $x$, but keeps the label $y$ the same. The only thing that changes compared to data augmentation is that the vicinal distribution (augmentation distribution) applied to $x$ is dependent on $p_{train}(x)$, the distribution of inputs.</li>
<li>we no longer use the joint distribution $p_\mathcal{D}(x,y)$, but only $p_\mathcal{D}(x)$ to define the vicinal distribution. This means that, technically, this form can be used in a <strong>semi-supervised</strong> setting, as follows:</li>
</ul>

<p>$$
\mathcal{L}(\theta) = 
\mathbb{E}_{x,y\sim p_{\text{labelled}}}
\mathbb{E}_{\lambda \sim \beta(\alpha + 1, \alpha)} 
\mathbb{E}_{x'\sim p_{\text{unlabelled}}}
\ell(\lambda x + (1 - \lambda) x', y)
$$</p>
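<p>We can sanity-check the reformulation with a quick Monte Carlo estimate on made-up toy data (the 1-D 'dataset' and the fixed classifier below are arbitrary choices of mine; note that <code>numpy.random.beta(a, b)</code> draws from the standard $Beta(a, b)$, so the reformulated objective uses <code>beta(alpha + 1, alpha)</code>, the asymmetric Beta biased towards $1$):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, N = 0.4, 400_000

def bce(x, y):
    # binary cross-entropy of an arbitrary fixed classifier sigmoid(2x) at (x, y)
    p = np.clip(1.0 / (1.0 + np.exp(-2.0 * x)), 1e-7, 1 - 1e-7)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# toy 1-D 'dataset': pairs of independent draws from the data distribution
x1, x2 = rng.normal(size=N), rng.normal(size=N)
y1 = rng.integers(0, 2, size=N).astype(float)
y2 = rng.integers(0, 2, size=N).astype(float)

# original mixup objective: lambda ~ Beta(alpha, alpha), blended labels
lam = rng.beta(alpha, alpha, size=N)
loss_mixup = bce(lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2).mean()

# reformulated objective: lambda ~ Beta(alpha + 1, alpha), keep the first label
lam = rng.beta(alpha + 1, alpha, size=N)
loss_reform = bce(lam * x1 + (1 - lam) * x2, y1).mean()

print(loss_mixup, loss_reform)  # agree up to Monte Carlo error
```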

<h4 id="pytorchcode">Pytorch code</h4>

<p>Here's how you'd modify the pytorch code from the paper to make this work (I draw different lambdas per datapoint in the minibatch, they draw one lambda per minibatch, not sure which one works better):</p>

<pre><code>for (x1, y1), (x2, _) in zip(loader1, loader2):
    # one lambda per datapoint; numpy's beta(a, b) is the standard Beta(a, b),
    # so this draws lambda from Beta(alpha + 1, alpha), biased towards 1
    lam = numpy.random.beta(alpha + 1., alpha, batchsize)
    lam = torch.from_numpy(lam).float().view(-1, 1, 1, 1)  # broadcast over C, H, W
    x = Variable(lam * x1 + (1. - lam) * x2)
    y = Variable(y1)
    optimizer.zero_grad()
    loss(net(x), y).backward()
    optimizer.step()
</code></pre>

<p>Only three changes: </p>

<ul>
<li>you ignore the labels <code>y2</code> from the second minibatch - hence the <code>_</code></li>
<li>lambda is now drawn from an asymmetric $Beta(\alpha+1, \alpha)$, which is biased towards $1$ - this is exactly what <code>numpy.random.beta(alpha + 1., alpha, batchsize)</code> draws</li>
<li>you just use the label from the first minibatch - no blending there</li>
</ul>

<p>This should do roughly the same thing.</p>

<h2 id="letsvisualizethis">Let's visualize this</h2>

<p>Let's look at what this data-dependent augmentation looks like for a single datapoint on the two-moons dataset:</p>

<p><img src="http://www.inference.vc/content/images/2017/11/download-82.png" alt="mixup: Data-Dependent Data Augmentation"></p>

<p>The white and black crosses are positive and negative examples respectively. The mixup data augmentation doesn't care about the labels, just the distribution of the data. I applied mixup to the datapoint at roughly $x=(-0.7, 0.6)$. The transparent blue dots show random samples from the vicinal distribution. Each blue point was obtained by picking another training datapoint $x'$ at random, then picking a random $\lambda$ from a $Beta(1.1, 0.1)$ and then interpolating between $x$ and $x'$ accordingly. Note that $Beta(1.1, 0.1)$ is not symmetric: roughly 80% of sampled $\lambda$ will be higher than $0.9$, so we tend to end up much closer to $x$ than $x'$. All these blue dots would be added to the training data with a 'white' label.</p>
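<p>The sampling behind these blue dots can be sketched as follows (here <code>X</code> is a stand-in array of training inputs; I use Gaussian points rather than the actual two-moons data):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.1

# stand-in for the training inputs (in the post: the two-moons dataset)
X = rng.normal(size=(1000, 2))
x = np.array([-0.7, 0.6])            # the datapoint we augment

# draw vicinal samples: mix x with random other datapoints, lambda biased to 1
n = 500
partners = X[rng.integers(0, len(X), size=n)]
lam = rng.beta(alpha + 1, alpha, size=(n, 1))
vicinal = lam * x + (1 - lam) * partners

# most vicinal samples land very close to x
frac_near = (np.linalg.norm(vicinal - x, axis=1) < 0.5).mean()
print(frac_near)
```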

<p>This is what the picture looks like when we apply data augmentation to two training examples, one positive and one negative, using 10k unlabelled samples (I am plotting a few of those unlabelled samples for reference):</p>

<p><img src="http://www.inference.vc/content/images/2017/11/download-85.png" alt="mixup: Data-Dependent Data Augmentation"></p>

<p>You can see that the vicinal distribution of mixup does not really follow the manifold - which one would hope a good semi-supervised data augmentation scheme would do in this particular example. It does something rather weird. Things don't look that much cleaner when we have more labelled examples:</p>

<p><img src="http://www.inference.vc/content/images/2017/11/download-87.png" alt="mixup: Data-Dependent Data Augmentation"></p>

<p>Finally, with all datapoints labelled:</p>

<p><img src="http://www.inference.vc/content/images/2017/11/download-88.png" alt="mixup: Data-Dependent Data Augmentation"></p>

<h2 id="whyshoulddataaugmentationgeneralizebetter">Why should data augmentation generalize better?</h2>

<p>Why should a data augmentation scheme - or vicinal risk minimization (VRM) - generalize better in classification? The generalization gap is the difference between training and validation losses, which exists because the training and test distributions differ (in practice, these are both empirical distributions concentrated on a number of samples). VRM, or data augmentation, trains by minimizing risk on the augmented training distribution:</p>

<p>$$
p_{\text{aug}}(x',y') = \frac{1}{N_{train}}\sum_{x,y\in \mathcal{D}_{train}} \nu(x', y'\vert x, y),
$$</p>

<p>for some vicinal distribution $\nu(x', y'\vert x, y)$. This should be successful if the augmented distribution $p_{aug}$ actually ends up closer to $p_{test}$ than the original training distribution $p_{train}$. Following the ideas of <a href="http://proceedings.mlr.press/v15/lacoste_julien11a/lacoste_julien11a.pdf">(Lacoste-Julien et al, 2011)</a>, or section 1.3.7 of <a href="https://github.com/fhuszar/thesis/blob/master/submitted/thesis.pdf">my PhD thesis</a>, we can even find the divergence measure we should use to measure the difference between these distributions, based on Bayes decision theory. If we use the binary cross-entropy both for training and testing, the Bayes optimal decision function on the augmented data will be (assuming perfectly balanced classes):</p>

<p>$$
q_{aug}(x) = \frac{p_{aug}(x\vert 1)}{p_{aug}(x\vert 1) + p_{aug}(x\vert 0)}
$$</p>

<p>When <em>mixup</em> mixes up the empirical distribution of the two classes, it turns them into continuous distributions with perfectly overlapping support. Therefore, the Bayes optimal decision function is more or less unique, so we should be able to find $q_{aug}(x)$, or something very close to it quite consistently during training.</p>

<p>The loss of this optimal training classifier $q_{aug}(x)$ on the test data is as follows (up to additive constants that don't depend on $p_{aug}$):</p>

<p>\begin{align}
\mathcal{L}_{test}(q_{aug}) &amp;= \mathbb{E}_{y\sim p_{test}} \mathbb{E}_{x\sim p_{test}(x\vert y)} - \log(p_{aug}(x\vert y)) + \mathbb{E}_{x \sim p_{test}} \log(p_{aug}(x)) \\
 &amp;= \mathbb{E}_{y\sim p_{test}} \operatorname{KL}[p_{test}(x\vert y) \| p_{aug}(x\vert y)] - \operatorname{KL}[p_{test}(x) \| p_{aug}(x)]
\end{align}</p>

<p>Interestingly, if we use $p_{test}(x)$ in the mixup, we can actually evaluate these quantities exactly, which gives us a lower bound on the training error.</p>

<p>So, let's answer the question: why might data augmentation generalize better:</p>

<ol>
<li>because the training loss becomes better defined such that the Bayes-optimal solution is unique and easier to find consistently. In the case of mixup this happens because the class-conditional distributions end up having fully overlapping support.</li>
<li>because data augmentation turns the training distribution into a distribution that is closer to the test distribution. The ideal generalization gap can be seen as Bregman divergence between $p_{test}$ and $p_{aug}$, up to a constant. In the case of mixup, we can actually calculate this divergence if we use $p_{train}(x)$ in the vicinal distribution.</li>
</ol>

<h2 id="whyshoulditworkwellforgans">Why should it work well for GANs?</h2>

<p>I suspect reason number 1 above also explains why mixup works well for training GANs. One of the issues with the usual GAN setup is that the training and synthetic distributions have widely different support, and are often concentrated on lower-dimensional manifolds. </p>

<p><a href="http://www.inference.vc/instance-noise-a-trick-for-stabilising-gan-training/">Instance noise</a> was initially introduced to alleviate this problem by making the support of the two distributions overlap. See also <a href="https://arxiv.org/abs/1701.04862">(Arjovsky and Bottou, 2017)</a>. Mixup achieves the same thing, and I would imagine it does so even better in practice.</p>

<h2 id="whyshoulditworkwellagainstadversarialexamples">Why should it work well against adversarial examples?</h2>

<p>Resilience to adversarial examples is somewhat related to, but crucially different from generalization. Firstly, the test data in the adversarial setting $p_{test}$ is generated specifically to fool our classification function - it's no longer independent of the training set or the training algorithm. Secondly, and crucially, adversarial examples are created on the basis of the gradient of the decision function. However, the training loss only depends on values of the decision function at certain training points. For a decision function to achieve minimal training loss, it doesn't even have to be differentiable, or continuous. In fact, we could even implement it as a look-up-table, which would be memorization. So adversarial examples are created by a property of the decision function that the empirical loss is perfectly insensitive to.</p>

<p>The way mixup addresses this is twofold:</p>

<ol>
<li>due to the vicinal distribution, the training loss function suddenly starts to care about a local neighbourhood of the decision function around a training datapoint - and the behaviour in that local neighbourhood will depend on the gradients.</li>
<li>it is evident from the figures above that mixup tends to produce data augmentations that cover the likely locations where adversarial examples might end up.</li>
</ol>

<h2 id="summary">Summary</h2>

<p><em>note that I have updated the summary as I reevaluated my conclusions.</em></p>

<p>Mixup is an interesting technique, but I don't believe it's the final version that we will see in the wild. My initial instinct was to somehow combine this idea with nearest-neighbour-like things which tend to work quite well for manifold-like data. For example: restrict mixup only between nearest neighbours, or make $\alpha$ depend on distance. After another day of thinking, I'm not so sure that's a great idea generally. Although incorporating nearest-neighbour into mixup would probably look very good on the 2D two-moons toy example, real data doesn't look like that. More importantly, the real data people want to apply this to is high-dimensional. Nearest neighbour methods in pixel space probably wouldn't work quite as well as they do on the nice 2D manifold for the same reason Gaussian noise doesn't work very well for high-D problems.</p>]]></content:encoded></item><item><title><![CDATA[AlphaGo Zero: Minimal Policy Improvement, Expectation Propagation and other Connections]]></title><description><![CDATA[<p>This is a post about the new reinforcement learning technique that enables AlphaGo Zero to learn Go from scratch via self-play. The paper has been out for a week, so I guess it's now considered old - sorry for the latency.</p>

<ul>
<li>D Silver, J Schrittwieser, K Simonyan, I Antonoglou, A Huang,</li></ul>]]></description><link>http://www.inference.vc/alphago-zero-policy-improvement-and-vector-fields/</link><guid isPermaLink="false">950c2c06-0a86-45e0-ac8d-d47bfd7910a6</guid><dc:creator><![CDATA[Ferenc Huszar]]></dc:creator><pubDate>Thu, 26 Oct 2017 14:12:19 GMT</pubDate><content:encoded><![CDATA[<p>This is a post about the new reinforcement learning technique that enables AlphaGo Zero to learn Go from scratch via self-play. The paper has been out for a week, so I guess it's now considered old - sorry for the latency.</p>

<ul>
<li>D Silver, J Schrittwieser, K Simonyan, I Antonoglou, A Huang,    Arthur Guez, T Hubert, L Baker, M Lai,  Adrian Bolton, Y Chen, T Lillicrap, F Hui, L Sifre, G van den Driessche, T Graepel  &amp; D Hassabis (2017) <a href="https://www.nature.com/articles/nature24270.epdf?author_access_token=VJXbVjaSHxFoctQQ4p2k4tRgN0jAjWel9jnR3ZoTv0PVW4gB86EEpGqTRDtpIz-2rmo8-KG06gqVobU5NSCFeHILHcVFUeMsbvwS-lxjqQGg98faovwjxeTUgZAUMnRQ">Mastering the Game of Go without Human Knowledge</a></li>
</ul>

<p>I'm no expert in RL, so I'm pretty sure many of you are going to come at me with pitchforks shouting "this is all trivial" or "this has been done before" or "this is no different from X". Please do, I'm here to learn.</p>

<h4 id="summaryofthispost">Summary of this post</h4>

<ul>
<li>the key idea behind AlphaGo Zero is interpreting Monte Carlo Tree Search (MCTS) as a policy improvement operator: you can think of MCTS as an operator that takes your current policy and turns it into a better policy.</li>
<li>you can formulate the goal of AlphaGo Zero as seeking a fixed point with respect to the policy improvement operator: the best policy is one that cannot be further improved by MCTS.</li>
<li>once such a policy is found, you no longer need MCTS when you play a game. As a result, AlphaGo Zero is a lot more efficient to run, once trained.</li>
<li>I look at this idea a bit more generally, and connect it to the previous post on non-conservative vector fields and the Numerics of GANs.</li>
<li>Finally, I highlight connections to other learning or inference algorithms that are similar in nature: expectation-propagation and contrastive divergence.</li>
</ul>

<h1 id="theminimalpolicyimprovementtechnique">The Minimal Policy Improvement Technique</h1>

<p><strong>Background:</strong> The original AlphaGo used a combination of two neural networks - the policy and value networks - and a Monte Carlo Tree Search (MCTS) algorithm to play Go. For each move, the policy network is first evaluated to give an initial strategy $\pmb{p}$. This is then used to prime MCTS, which then performs rollouts - simulations of possible future game trajectories - evaluating the policy network and the value network multiple times in these imagined future scenarios. MCTS refines AlphaGo's initial intuitive strategy $\pmb{p}$, and outputs a better strategy $\pmb{\pi}$ which is usually concentrated on stronger moves in the current situation.</p>

<p>The key idea of AlphaGo Zero is this: you can think of MCTS generally as a policy improvement operator, $\pmb{\pi}(\pmb{p})$, which takes $\pmb{p}$ as input and outputs a better policy $\pmb{\pi}$. If you have such a policy improvement operator, it makes sense to define the goal of learning as finding a policy $\pmb{p}^{\ast}$ which the policy improvement operator cannot improve any further, in other words, we seek the following fixed point:</p>

<p>$$
\pmb{\pi}(\pmb{p}^{\ast}) = \pmb{p}^{\ast}.
$$</p>

<p>For now, let's assume two things:</p>

<ul>
<li>the policy improvement operator $\pmb{\pi}$ depends only on the output of the policy $\pmb{p}$, and</li>
<li>the state of the game $s$ is fixed, in other words I am concerned with what the network does in a particular fixed state, say what it chooses as the starting move. I only assume this to avoid having to spell out expectations over states, etc.</li>
</ul>

<p>Under these assumptions, we can write the policy improvement algorithm as seeking stationary points of the following vector field $\pmb{v}$:</p>

<p>$$
\pmb{v}(\pmb{p}) = \pmb{\pi}(\pmb{p}) - \pmb{p}
$$</p>

<p>Here we go again, looking for stationary points $\pmb{v}(\pmb{p}) = \pmb{0}$ of a vector field, just like in the previous post about GANs.</p>

<p>The most natural approach to find such stationary points is Euler's method - which would be gradient descent if the vector field happened to correspond to the gradients of a scalar loss. Of course, there is no guarantee that $\pmb{v}$ is conservative, and therefore that Euler's method will converge. But hey, it's worth a try. What would Euler's method do on this vector field?</p>

<p>\begin{align}
\theta_{t+1} &amp;\leftarrow \theta_{t} + h\pmb{v}(\pmb{p})\frac{\partial \pmb{p}}{\partial \theta}\\
&amp;= \theta_{t} + h \left(\pmb{\pi} - \pmb{p}\right)\frac{\partial \pmb{p}}{\partial \theta}
\end{align}</p>
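<p>To make this concrete, here is a toy sketch of Euler's method on such a vector field. Nothing here is AlphaGo's actual setup: the improvement operator is a made-up one-step soft improvement $\pmb{\pi}(\pmb{p}) \propto \pmb{p} \odot e^{Q}$ over three actions with fixed, invented action values $Q$, and $\theta$ are simply the logits of $\pmb{p}$:</p>

```python
import numpy as np

def softmax(t):
    e = np.exp(t - t.max())
    return e / e.sum()

Q = np.array([1.0, 0.5, 0.0])      # made-up action values; action 0 is best

def improve(p):
    # toy policy improvement operator: reweight p by exp(Q) and renormalize
    pi = p * np.exp(Q)
    return pi / pi.sum()

theta = np.zeros(3)                # policy parameters: logits, p = softmax(theta)
h = 1.0                            # Euler step size
for _ in range(5000):
    p = softmax(theta)
    v = improve(p) - p                    # the vector field pi(p) - p
    J = np.diag(p) - np.outer(p, p)       # Jacobian dp/dtheta for logits
    theta = theta + h * (J @ v)           # one Euler step

p = softmax(theta)
print(p.round(3))  # the policy concentrates on the highest-value action
```

<p>Iterating the update drives $\pmb{p}$ towards a fixed point $\pmb{\pi}(\pmb{p}^{\ast}) = \pmb{p}^{\ast}$, which for this particular operator means putting almost all mass on the action with the highest value.</p>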

<p>This is almost, but not quite, what AlphaGo Zero does. It differs in two significant ways:</p>

<h2 id="dependenceofmctsonparameters">Dependence of MCTS on parameters</h2>

<p>There is a catch, which only becomes clear once we make the dependence on parameters explicit:</p>

<p>$$
\pmb{v}(\pmb{p}_\theta, \theta) = \pmb{\pi}(\pmb{p}_\theta, \theta) - \pmb{p}_\theta
$$</p>

<p>MCTS itself depends not only on the initial guess $\pmb{p_\theta}$ but also on the parameters $\theta$ directly. This is because the policy-value network $f_\theta$ is evaluated multiple times within the MCTS algorithm. What this means is that the vector field $\pmb{v}$ also depends on the model parameters $\theta$ beyond its dependence on $\pmb{p}_\theta$.</p>

<p>Thus, our interpretation as seeking the stationary point of a single vector field in $\pmb{p}$-space isn't quite accurate. The vector field changes each time we update the parameters $\theta$. In fact we are searching for $\theta^{\ast}$ such that</p>

<p>$$
\pmb{\pi}(\pmb{p}_{\theta^{\ast}}, \theta^{\ast}) = \pmb{p}_{\theta^{\ast}}
$$</p>

<p>If we want to solve this directly with gradient descent, we would have to take the gradient of MCTS with respect to $\theta$. Good luck with that.</p>

<p>The way AlphaGo sidesteps this problem is by removing/ignoring the dependence of $\pmb{\pi}$ on $\pmb{p}$ and $\theta$. We simply calculate MCTS with the current value of the parameters, and then note down the value of the resulting $\pmb{\pi}$ and treat it as a regression target. AGZ's learning converges when</p>

<p>$$
\pmb{\pi}(\pmb{p}_{\theta^{\text{recent}}}, \theta^{\text{recent}}) = \pmb{p}_{\theta},
$$</p>

<p>where $\theta^{\text{recent}}$ is some recent value of the parameter $\theta$ from an earlier iteration of training.</p>

<h2 id="crossentropyloss">Cross-entropy loss</h2>

<p>AlphaGo differs from the Euler method on the vector field in another way. Interestingly, this, too, is similar to something from <a href="http://www.inference.vc/my-notes-on-the-numerics-of-gans/">the last post</a>. Instead of following the <em>improvement gradient</em> $\pmb{\pi} - \pmb{p}$, the method instead performs gradient descent on a scalar loss function </p>

<p>$$
\mathcal{L}(\theta) = KL(\pmb{\pi}_\theta \| \pmb{p}_\theta)
$$</p>

<p>This is a generalization of trying to minimize the norm of $\pmb{\pi} - \pmb{p}$, $L(\theta) = \| v(\theta)\|^2_2$, which is something that the Numerics of GANs paper recommended in the context of GANs. But it replaces the Euclidean distance with the KL divergence, which makes more sense between probability distributions.</p>

<p>Again, the algorithm ignores the dependence of $\pmb{\pi}_\theta$ on $\theta$, and instead minimizes
$$
\mathcal{L}(\theta) = KL(\pmb{\pi} \| \pmb{p}_\theta) = -\pmb{\pi}^T\log \pmb{p}_\theta + c,
$$</p>

<p>where $\pmb{\pi}$ was obtained with a previous value $\theta^{\text{recent}}$ of the parameters, and $c$ is constant with respect to $\theta$. This is what AlphaGo Zero optimizes in Eqn. (1) of the paper.</p>

<h2 id="connectionstoexpectationpropagation">Connections to Expectation-Propagation</h2>

<p>The Expectation-Propagation (EP) algorithm can be understood from the Minimal Policy Improvement viewpoint. There, instead of an RL policy, we wish to optimize an approximate posterior $q$ which is factorized as $q(\pmb{x}) = \prod q_i(\pmb{x}_{S_i})$. The real posterior $p(\pmb{x}) \propto \prod p_i(\pmb{x}_{S_i})$ shares the same factorized structure, but its normalizing constant is intractable.</p>

<p>For EP, there are in fact multiple different improvement operators, one for each factor. For factor $j$, the improvement operator replaces the approximate factor $q_j$ with the real factor $p_j$ as follows:</p>

<p>$$
\pmb{\pi}^{j}(q) \propto \frac{p_j(\pmb{x}_{S_j})}{q_j(\pmb{x}_{S_j})}\prod_i q_i(\pmb{x}_{S_i})
$$</p>

<p>Within the class of distributions $q$ which share the same factor-graph structure as $p$, the following is true:</p>

<p>$$
\forall j: \pmb{\pi}^{j}(q) = p \iff p=q
$$</p>

<p>Therefore, it makes sense to seek stationary points with respect to the improvement operators, or in other words, to seek an approximate posterior which cannot be further improved by the above improvement operators.</p>

<p>When updating parameters of the $j^{\text{th}}$ factor $q_j$, we use the KL divergence to measure the size of the improvement - just like in AlphaGo Zero:</p>

<p>$$
\operatorname{KL}\left[\pmb{\pi}^{j}(q)\|q\right]
$$</p>

<p>Of course there is more to EP than this, but I think it's an interesting view on it. Thinking about it this way already opens up a whole new set of questions: what if we consider other improvement operators, for example, Markov-Chain Monte Carlo kernels?</p>

<p>Which brings us to the next connection I wanted to highlight: contrastive divergence.</p>

<h2 id="contrastivedivergence">Contrastive divergence</h2>

<p>Contrastive divergence also uses something like this minimal improvement principle, but somehow upside-down, in the context of generative modelling.</p>

<p>Let's say we have some training data $x_i$ drawn i.i.d. from some true data distribution $p$, and we want to fit to it an intractable generative model $q_\theta$. We can't evaluate the likelihood of the model $q_\theta$, but we can derive an MCMC update such that, irrespective of the starting state, applying the MCMC kernel several times makes the distribution of samples converge to $q_\theta$.</p>

<p>We can therefore interpret each step of an MCMC sampling algorithm, too, as an improvement operator: Given samples from an arbitrary distribution $p$, if we apply one step of MCMC to the samples, the resulting distribution will look a bit more like samples drawn from the stationary distribution of the MCMC algorithm. Contrastive divergence exploits this and minimizes a kind of divergence between the data distribution $p$ and the distribution we obtain by running a fixed number of MCMC steps on the data samples.</p>
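<p>A minimal illustration of this 'MCMC step as improvement operator' view (a toy sketch of my own, not contrastive divergence itself: the model is a standard normal, the 'data distribution' starts far away from it, and each Metropolis step moves the sample distribution a bit closer to the model):</p>

```python
import numpy as np

rng = np.random.default_rng(1)

def metropolis_step(x, step=1.0):
    # one Metropolis kernel step whose stationary distribution is N(0, 1)
    prop = x + step * rng.normal(size=x.shape)
    log_accept = 0.5 * (x ** 2 - prop ** 2)     # log-density ratio of the target
    accept = np.log(rng.uniform(size=x.shape)) < log_accept
    return np.where(accept, prop, x)

x = rng.normal(loc=5.0, scale=1.0, size=20_000)  # 'data', far from the model
means = [x.mean()]
for _ in range(10):
    x = metropolis_step(x)                       # each application improves the fit
    means.append(x.mean())
print([round(m, 2) for m in means])  # the mean drifts from 5 towards 0
```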

<p>For more details (and also an excellent hand-drawn Euler-method illustration) I recommend <a href="http://www.gatsby.ucl.ac.uk/~turner/Notes/ContrastiveDivergence/CDv3.pdf">Rich Turner's excellent notes</a> from, well, probably at least ten years ago.</p>

<h1 id="summary">Summary</h1>

<p>I really liked the idea of thinking about MCTS as an improvement operator, and of training the policy network to minimize the improvement achievable by MCTS. Although the authors demonstrate that other ideas - like using ResNets or using a joint network for policy and value - are important for good performance, I think this is the key idea that enables learning from pure self-play.</p>

<p>Sadly, I can't see this method being easy to apply beyond the scope of self-play though. Firstly, in Go the states are fully observed, which is not usually the case for RL. Secondly, MCTS only works so well because we can use our policy as a model of the opponent's policy, so we can do pretty accurate rollouts.</p>

<p>I'm not sure what an improvement operator would be for other RL tasks, even, say, for Atari games. If you have ideas, please share.</p>]]></content:encoded></item><item><title><![CDATA[GANs are Broken in More than One Way: The Numerics of GANs]]></title><description><![CDATA[<p>Last year, when I was on a mission to "fix GANs" I had a tendency to focus only on what the loss function is, and completely disregard the issue of how we actually find a minimum. Here is the paper that has finally challenged that attitude:</p>

<ul>
<li>Mescheder, Nowozin, Geiger</li></ul>]]></description><link>http://www.inference.vc/my-notes-on-the-numerics-of-gans/</link><guid isPermaLink="false">e318c34e-2aa3-4c8d-9036-dc3bcbfd730b</guid><dc:creator><![CDATA[Ferenc Huszar]]></dc:creator><pubDate>Thu, 05 Oct 2017 13:18:23 GMT</pubDate><media:content url="http://www.inference.vc/content/images/2017/10/ascending-and-descending-escher-1.jpg" medium="image"/><content:encoded><![CDATA[<img src="http://www.inference.vc/content/images/2017/10/ascending-and-descending-escher-1.jpg" alt="GANs are Broken in More than One Way: The Numerics of GANs"><p>Last year, when I was on a mission to "fix GANs" I had a tendency to focus only on what the loss function is, and completely disregard the issue of how we actually find a minimum. Here is the paper that has finally challenged that attitude:</p>

<ul>
<li>Mescheder, Nowozin, Geiger (2017): <a href="https://arxiv.org/abs/1705.10461">The Numerics of GANs</a></li>
</ul>

<p>I reference <a href="http://blog.shakirm.com/2013/04/marrs-levels-of-analysis/">Marr's three layers of analysis</a> a lot, and I enjoy thinking about problems at the computational level: what is the ultimate goal we do this for? I was convinced GANs were broken at this level: they were trying to optimize for the wrong thing or seek equilibria that don't exist, etc. This is why I enjoyed f-GANs, Wasserstein GANs, instance noise, etc, while being mostly dismissive of attempts to fix things at the optimization level, like DCGAN or improved techniques (Salimans et al. 2016). In my defense, in most of deep learning, the algorithmic level is sorted: stochastic gradient descent. You can improve on it, but it's not broken, it doesn't usually need fixing.</p>

<p>But after reading this paper I had a revelation:</p>

<blockquote>
  <p>GANs are broken at both the computational and algorithmic levels.</p>
</blockquote>

<p>Even if we fix the objectives, we don't have algorithmic tools to actually find solutions.😱</p>

<h4 id="summaryofthispost">Summary of this post</h4>

<ul>
<li>as it is now expected of my posts, I am going to present the paper through a slightly different lens</li>
<li>I introduce the concept of conservative vs non-conservative vector fields, and highlight some of their properties </li>
<li>I then explain how consensus optimization proposed by Mescheder et al can be viewed in this context: it interpolates between a nasty non-conservative field and its well-behaved conservative relaxation</li>
<li>Finally, as it is also expected of my posts, I also highlight an important little detail, which was not discussed in the paper: how should we do all this in the minibatch setting?</li>
</ul>

<h2 id="introfromganstovectorfields">Intro: from GANs to vector fields</h2>

<p>GANs can be thought of as a two-player non-cooperative game. One player controls $\theta$ and wants to maximize its payoff $f(\theta, \phi)$, the other controls $\phi$ and seeks to maximize $g(\theta,\phi)$. The game is at a Nash equilibrium when neither player can improve its payoff by changing their parameters slightly. Now we have to design an algorithm to find such a Nash equilibrium.</p>

<p>It seems that there are two separate issues with GANs:</p>

<ol>
<li><strong>computational:</strong> no Nash equilibrium exists or it may be degenerate (<a href="https://arxiv.org/abs/1701.04862">Arjovsky and Bottou 2017</a>, <a href="https://arxiv.org/abs/1610.04490">Sønderby et al 2017</a>, <a href="http://www.inference.vc/instance-noise-a-trick-for-stabilising-gan-training/">blog post</a>).</li>
<li><strong>algorithmic:</strong> we don't have reliable tools for finding Nash equilibria (even though we're now pretty good at finding local optima)</li>
</ol>

<p><a href="https://arxiv.org/abs/1705.10461">Mescheder et al (2017)</a> address the second issue, and do it very elegantly. To find Nash equilibria, our best tool is simultaneous gradient ascent, an iterative algorithm defined by the following recursion:
$$
x_{t+1} \leftarrow x_t + h v(x_t), <br>
$$
where $x_t = (\theta_t, \phi_t)$ is the state of the game, $h$ is a scalar stepsize and $v(x)$ is the following vector field: <br>
$$
v(x) = \left(\begin{matrix}\frac{\partial}{\partial\theta}f(\theta, \phi)\\\frac{\partial}{\partial\phi}g(\theta, \phi)\end{matrix}\right). <br>
$$ </p>
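<p>To make the recursion concrete, here is a minimal numpy sketch (my own illustration, not code from the paper); the toy field and step size are arbitrary choices:</p>

```python
import numpy as np

def simultaneous_gradient_step(x, v, h=0.1):
    """One step of the recursion x_{t+1} = x_t + h * v(x_t)."""
    return x + h * v(x)

# Smoke test on a hypothetical *conservative* field: v(x) = -x is the
# gradient of -||x||^2 / 2, so the iteration is plain gradient descent
# on ||x||^2 / 2 and converges to the unique fixed point at the origin.
v = lambda x: -x
x = np.array([1.0, -2.0])
for _ in range(200):
    x = simultaneous_gradient_step(x, v, h=0.1)
print(np.linalg.norm(x))  # shrinks towards 0
```

For well-behaved conservative fields like this one, the recursion behaves exactly like gradient descent; the rest of the post is about what changes when $v$ is not conservative.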

<p>Here is an important observation that may look paradoxical at first: it is natural to think of GAN training as a special case of neural network training, but in fact, it is the other way around:</p>

<blockquote>
  <p>simultaneous gradient descent is a generalization, rather than a special case, of gradient descent</p>
</blockquote>

<h2 id="nonconservativevectorfields">Non-conservative vector fields</h2>

<p>One key difference between ordinary gradient descent and simultaneous gradient descent is that while the former finds fixed points of a conservative vector field, the latter has to deal with non-conservative fields. Therefore, I want to spend most of this blog post talking about this difference and what these terms mean.</p>

<p>A vector field on $\mathbb{R}^n$ is simply a function, $v:\mathbb{R}^n \rightarrow \mathbb{R}^n$, which takes a vector $x$ and outputs another vector $v(x)$ of the same dimensionality.</p>

<p>A vector field we often work with is the gradient of a scalar function, say $v(x) = \frac{\partial}{\partial x} \phi(x)$, where $\phi:\mathbb{R}^n \rightarrow \mathbb{R}$ can be a training objective, energy or loss function. These types of vector fields are very special. They are called conservative vector fields, which essentially translates to "nothing particularly crazy happens". Conservative fields and gradients of scalar functions are in one-to-one correspondence: a vector field $v$ is conservative if and only if there is a scalar potential $\phi$ whose gradient equals $v$.</p>

<p>Another vector field we often encounter in machine learning (but don't often think of it as a vector field) is the one defined by an autoencoder. An AE, too, takes as input some vector $x$ and returns another vector $v(x)$ which is the same size. Figure 5 of (Alain and Bengio, 2014) illustrates the vector field of denoising autoencoders trained on 2D data pretty nicely:</p>

<div class="image-div" style="width: 800px;">  
<img src="http://www.inference.vc/content/images/2017/10/Screen-Shot-2017-10-04-at-8.39.38-AM.png" alt="GANs are Broken in More than One Way: The Numerics of GANs">
</div>

<p>A vector field defined by an AE is not necessarily conservative, which basically means that weird things can happen. What kind of weird things? Let's have a look at an extreme example: the constant curl vector field $v(\theta,\phi) = (-\phi, \theta)$, a very typical example of a non-conservative vector field:</p>

<div class="image-div" style="width: 600px;">  
<img src="http://www.inference.vc/content/images/2017/10/animation1-3.gif" alt="GANs are Broken in More than One Way: The Numerics of GANs">
</div>

<p>This vector field arises naturally in the zero-sum game where $f(\theta, \phi)=-g(\theta, \phi)=-\theta\phi$. <a href="https://arxiv.org/abs/1606.03498">Salimans et al. (2016, Section 3)</a> mention the exact same toy example in the context of GANs. Notice how the vectors point around in circles, and the rotation in the field is visually apparent. Indeed, if you follow the arrows on this vector field (which is what simultaneous gradient descent does), you just end up going in circles, as shown.</p>

<p>Compare this vector field to Escher's impossible tower below. The servants think they are ascending, or descending, but in reality all they do is go around in circles. It is of course impossible to build Escher's tower as a real 3D object. Analogously, it is impossible to express the curl vector field as the gradient of a scalar function.</p>

<div class="image-div" style="width: 600px;">  
<img src="http://www.inference.vc/content/images/2017/10/ascending-and-descending-escher.jpg" alt="GANs are Broken in More than One Way: The Numerics of GANs">
</div>

<p>Even though the curl field has an equilibrium at $(\theta, \phi)=(0, 0)$, simultaneous gradient descent will never find it, which is bad news. While we got used to gradient descent always converging to a local minimum, simultaneous gradient descent does not generally converge to equilibria. It can just get stuck in a cycle, and momentum-based variants can even accumulate unbounded momentum and go completely berserk.</p>
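<p>A few lines of numpy are enough to see this (a sketch with my own choice of step size): each Euler step along the curl field multiplies the distance to the equilibrium by exactly $\sqrt{1+h^2}$, so the iterates spiral outwards rather than converge:</p>

```python
import numpy as np

# The curl field v(theta, phi) = (-phi, theta): simultaneous gradient
# descent (Euler steps along v) never reaches the equilibrium at the
# origin -- each step multiplies the distance to it by sqrt(1 + h^2),
# so the iterates slowly spiral *outwards*.
def v(x):
    theta, phi = x
    return np.array([-phi, theta])

h = 0.1
x = np.array([1.0, 0.0])
norms = [np.linalg.norm(x)]
for _ in range(100):
    x = x + h * v(x)
    norms.append(np.linalg.norm(x))

print(norms[-1] / norms[0])  # (1 + h^2)^(100/2), roughly 1.64 -- not 0
```

Shrinking $h$ slows the outward drift but never eliminates it; only the continuous-time limit $h \to 0$ cycles exactly.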

<h2 id="consensusoptimizationtaminganonconvectorfield">Consensus optimization: taming a non-con vector field</h2>

<p>The solution proposed by Mescheder et al is to construct a conservative vector field from the original as follows: $-\nabla L(x) := -\frac{\partial}{\partial x}\| v(x) \|_2^2$. This is clearly conservative since we defined it as the gradient of a scalar function $L$. It is easy to see that this new vector field $-\nabla L$ has the same fixed points as $v$. Below I plotted the conservative vector field $-\nabla L$ that corresponds to the curl field above:</p>

<div class="image-div" style="width: 600px;">  
<img src="http://www.inference.vc/content/images/2017/10/animation2-1.gif" alt="GANs are Broken in More than One Way: The Numerics of GANs">
</div>

<p>This is great. Familiar gradient descent on $L$ converges to a local minimum, which is a fixed point of $v$. The problem now is, we have no control over what kind of fixed point we converge to. We seek positive equilibria, but $-\nabla L$ can't differentiate between saddle points and equilibria, or between negative and positive equilibria. This is illustrated on this slightly more bonkers vector field $v(\theta, \phi) = (\cos(\phi), \sin(\theta))$:</p>

<p><img src="http://www.inference.vc/content/images/2017/10/download-79-1.png" alt="GANs are Broken in More than One Way: The Numerics of GANs"></p>

<p>On the left-hand plot I annotated equilibria and saddle points. The middle plot shows the conservative relaxation $-\nabla L$ where both saddle points and equilibria are turned to local minima.</p>

<p>So what can we do? We can simply take a linear combination of the original $v$ and its conservative cousin $-\nabla L$. This is what the combined (still non-conservative) vector field looks like for the curl field (see also the third panel of the figure above):</p>

<div class="image-div" style="width: 600px;">  
<img src="http://www.inference.vc/content/images/2017/10/animation3-2.gif" alt="GANs are Broken in More than One Way: The Numerics of GANs">
</div>

<p>By combining the two fields, we may end up with a slightly better-behaved, but still non-conservative vector field. One way to quantify how well a vector field behaves is to look at eigenvalues of its Jacobian $v'(x)$. The Jacobian is the derivative of the vector field; for a conservative field it is the Hessian of the underlying potential, i.e. a second derivative. Unlike the Hessian, which is always symmetric, the Jacobian of a non-conservative field is non-symmetric, and it can have complex eigenvalues. For example, the Jacobian of the curl field is <br>
$$
v'(x) = \left[\begin{matrix}0&amp;-1\\1&amp;0\end{matrix}\right], <br>
$$
whose eigenvalues, $+i$ and $-i$, are purely imaginary.</p>
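<p>We can check this numerically (a trivial verification, included for completeness):</p>

```python
import numpy as np

# Jacobian of the curl field v(theta, phi) = (-phi, theta).
# It is antisymmetric (J.T == -J), hence certainly not a Hessian.
J = np.array([[0.0, -1.0],
              [1.0,  0.0]])

eigvals = np.linalg.eigvals(J)
print(eigvals)  # purely imaginary pair: +1j and -1j
```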

<p>Mescheder et al show that by combining $v$ with $-\nabla L$ into $v - \gamma \nabla L$, the eigenvalues of the combined field can be controlled (see paper for details), and if we choose a large enough $\gamma$, simultaneous gradient descent will converge to an equilibrium. Hooray!</p>

<p>Sadly, as we increase $\gamma$, we also reintroduce the spurious equilibria discussed above. These are equilibria of $v - \gamma\nabla L$ which are in fact only saddle points of $v$. So we can't just crank $\gamma$ up all the way, we have to find a reasonable middle-ground. This is a limitation of the approach, and it is unclear how much of a practical limitation it is.</p>
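<p>Continuing the curl-field example, here is a numpy sketch of consensus optimization (the step size $h$ and weight $\gamma$ are my own illustrative choices, not values from the paper):</p>

```python
import numpy as np

# Consensus optimization on the curl field: follow v - gamma * grad(L),
# where L(x) = ||v(x)||^2. For v(theta, phi) = (-phi, theta) we have
# L = theta^2 + phi^2, so grad(L) = (2*theta, 2*phi).
def v(x):
    theta, phi = x
    return np.array([-phi, theta])

def grad_L(x):
    return 2.0 * x

h, gamma = 0.1, 0.5
x = np.array([1.0, 0.0])
for _ in range(200):
    x = x + h * (v(x) - gamma * grad_L(x))

print(np.linalg.norm(x))  # spirals *inwards* to the equilibrium at 0
```

With these values each step multiplies the distance to the origin by $\sqrt{(1-2h\gamma)^2 + h^2} \approx 0.906 < 1$, so the same iteration that diverged on the raw curl field now converges.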

<h2 id="onemorethingstochasticminibatchvariant">One more thing: stochastic/minibatch variant</h2>

<p>I wanted to mention one more thing which is not discussed in the paper, but I think is important for anyone who would like to play with consensus optimization: how do we implement this with stochastic optimisation and minibatches? Although the paper deals with general games, in the case of a GAN, both $f$ and $g$ are in fact the average of some payoff over datapoints, e.g. $f(\theta, \phi) = \mathbb{E}_x f(\theta, \phi, x)$. Thus, for $L$ we have:</p>

<p>\begin{align}
L(x) &amp;= \left\| \frac{\partial f(\theta, \phi)}{\partial \theta} \right\|_2^2 + \left\| \frac{\partial g(\theta, \phi)}{\partial \phi} \right\|_2^2 \\ <br>
&amp;= \left\| \mathbb{E}_{x}\frac{\partial f(\theta, \phi, x)}{\partial \theta} \right\|_2^2 + \left\| \mathbb{E}_{x}\frac{\partial g(\theta, \phi, x)}{\partial \phi} \right\|_2^2
\end{align}</p>

<p>The problem with this is that it is not obvious how to construct an unbiased estimator for this from a minibatch of samples. Simply taking the squared norm of the gradient for each datapoint and then averaging will give us a biased estimate, an upper bound by Jensen's inequality. Averaging gradients in the minibatch, and then calculating the squared norm is still biased, but it is a slightly tighter upper bound.</p>

<p>To construct an unbiased estimator, one can use the following identity: <br>
$$
\mathbb{E}[X]^2 = \mathbb{E}[X^2] - \mathbb{V}[X],
$$
where both the second moment $\mathbb{E}[X^2]$ and the variance can be estimated in an unbiased fashion. I have discussed this with the authors, and I will ask them to leave a comment as to how exactly they did this in the experiment. They also promised to include more details about this in the camera ready version of the paper.</p>
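<p>Here is one way such an unbiased estimator can be implemented in numpy (my own sketch; I don't know whether it matches the authors' implementation). Applying $\mathbb{E}[X]^2 = \mathbb{E}[X^2] - \mathbb{V}[X]$ coordinate-wise, we subtract the unbiased sample variance over the minibatch, divided by the minibatch size, from the squared norm of the averaged gradient. The check below exploits the fact that sampling minibatches with replacement from a finite set of per-example gradients is i.i.d. sampling from the empirical distribution, so averaging the estimator over all possible minibatches must recover the true value exactly:</p>

```python
import itertools
import numpy as np

def biased_estimate(G):
    """||mean of minibatch gradients||^2 -- upward biased by Var / m."""
    return float(np.sum(G.mean(axis=0) ** 2))

def unbiased_estimate(G):
    """Unbiased estimate of ||E[gradient]||^2 from an i.i.d. minibatch
    G of shape (m, d): squared mean minus (sample variance) / m,
    applied coordinate-wise and summed."""
    m = G.shape[0]
    return biased_estimate(G) - float(G.var(axis=0, ddof=1).sum()) / m

# Tiny hypothetical "population" of per-example gradients (3 examples,
# 2 parameters); the quantity we want is ||population mean||^2 = 1.
pop = np.array([[1.0, 0.0], [0.0, 2.0], [-1.0, 1.0]])
true_value = float(np.sum(pop.mean(axis=0) ** 2))

# Enumerate all size-2 minibatches drawn with replacement.
batches = [np.array(b) for b in itertools.product(pop, repeat=2)]
avg_unbiased = np.mean([unbiased_estimate(G) for G in batches])
avg_biased = np.mean([biased_estimate(G) for G in batches])

print(true_value, avg_unbiased, avg_biased)  # avg_unbiased == true_value
```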

<h2 id="summary">Summary</h2>

<p>I really liked this paper; it was a real eye-opener. I always thought of the gradient descent-like algorithm we use in GANs as a special case of gradient descent, whereas in fact it is a generalisation. Therefore, the nice properties of gradient descent can't be taken for granted. The paper proposes a general and very elegant solution. Good job. Let's hope this will fix GANs once and for all.</p>]]></content:encoded></item><item><title><![CDATA[From Instance Noise to Gradient Regularisation]]></title><description><![CDATA[<p>This is just a short note to highlight a new paper I read recently:</p>

<ul>
<li>Roth, Lucchi, Nowozin and Hofmann (2017) <a href="https://arxiv.org/abs/1705.09367">Stabilizing Training of Generative Adversarial Networks through Regularization</a></li>
</ul>

<h2 id="instancenoise">Instance noise</h2>

<p>Last year, <a href="https://arxiv.org/abs/1610.04490">we</a> and <a href="https://openreview.net/pdf?id=Hk4_qw5xe">Arjovsky &amp; Bottou (2017)</a> independently noticed a fundamental problem with the way GANs are trained, and</p>]]></description><link>http://www.inference.vc/from-instance-noise-to-gradient-regularisation/</link><guid isPermaLink="false">66d80393-1763-4ae6-8871-2ff823102a8f</guid><dc:creator><![CDATA[Ferenc Huszar]]></dc:creator><pubDate>Thu, 01 Jun 2017 12:54:24 GMT</pubDate><media:content url="http://www.inference.vc/content/images/2017/06/Screen-Shot-2017-06-01-at-1.43.40-PM.png" medium="image"/><content:encoded><![CDATA[<img src="http://www.inference.vc/content/images/2017/06/Screen-Shot-2017-06-01-at-1.43.40-PM.png" alt="From Instance Noise to Gradient Regularisation"><p>This is just a short note to highlight a new paper I read recently:</p>

<ul>
<li>Roth, Lucchi, Nowozin and Hofmann (2017) <a href="https://arxiv.org/abs/1705.09367">Stabilizing Training of Generative Adversarial Networks through Regularization</a></li>
</ul>

<h2 id="instancenoise">Instance noise</h2>

<p>Last year, <a href="https://arxiv.org/abs/1610.04490">we</a> and <a href="https://openreview.net/pdf?id=Hk4_qw5xe">Arjovsky &amp; Bottou (2017)</a> independently noticed a fundamental problem with the way GANs are trained, and proposed the same solution. The problem is that the supports of the real data distribution $P$ and the generative model $Q$ are often non-overlapping; in such situations the divergences we think GANs minimise are meaningless, and the discriminator can be perfect, which can lead to instabilities.</p>

<p>Both papers proposed adding Gaussian <a href="http://www.inference.vc/instance-noise-a-trick-for-stabilising-gan-training/">instance noise</a> to both real and fake samples, essentially to broaden the two distributions, making sure there's meaningful overlap between their supports. Another way to think about this is that we're making the discriminator's job harder. Instance noise seems to stabilize GAN training, but it is also a bit dissatisfying in that sampling Gaussian noise in a very high-dimensional space introduces additional Monte Carlo noise which might slow down training (although arguably I'd prefer slow but stable training to a quicker but quasi-random process that never converges).</p>
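<p>A toy 1-D illustration of the problem and the fix (my own sketch; the noise level and sample sizes are arbitrary choices): with disjoint supports a simple threshold discriminator is perfect, while adding the same Gaussian noise to both sides makes the smoothed distributions overlap, so perfect discrimination becomes impossible:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Disjoint supports: all "real" samples sit at -1, all "fake" samples
# at +1, so the threshold rule "x < 0 => real" classifies perfectly.
real = -np.ones(1000)
fake = +np.ones(1000)

def accuracy(real, fake):
    """Accuracy of the fixed threshold discriminator at 0."""
    return 0.5 * ((real < 0).mean() + (fake >= 0).mean())

print(accuracy(real, fake))  # 1.0 -- a perfect discriminator

# Instance noise: add noise from the same Gaussian to both sets of
# samples; the smoothed distributions now overlap around 0.
sigma = 2.0
real_noisy = real + sigma * rng.standard_normal(real.shape)
fake_noisy = fake + sigma * rng.standard_normal(fake.shape)

print(accuracy(real_noisy, fake_noisy))  # strictly between 0.5 and 1.0
```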

<h2 id="analyticinstancenoise">Analytic instance noise</h2>

<p>What <a href="https://arxiv.org/abs/1705.09367">(Roth et al, 2017)</a> propose is essentially an analytical treatment of instance noise in the f-GAN framework. As I explained in <a href="http://www.inference.vc/instance-noise-a-trick-for-stabilising-gan-training/">this post</a>, instance noise can be understood as changing the divergence you use to quantify mismatch between $P$ and $Q$. If $d$ is your original divergence, instance noise changes the divergence to $d_\sigma$:</p>

<p>$$
d_\sigma[Q\|P] = d[Q \ast\mathcal{N}_\sigma \| P\ast\mathcal{N}_\sigma] <br>
$$</p>

<p><a href="https://arxiv.org/abs/1705.09367">(Roth et al, 2017)</a> showed that, if $d$ is an f-divergence, it is possible to approximately minimise $d_\sigma$ directly without adding noise to the samples. Instead, you have to regularise the norm of certain discriminator gradients at the fake datapoints. The relationship between additive noise and gradient regularisation is not entirely surprising: it's very similar to the connection between denoising and score matching, and it can be derived similarly from a Taylor approximation of the discriminator around the datapoint.</p>
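<p>To get some intuition for the Taylor argument (a generic illustration with a stand-in smooth function, not the paper's derivation): smoothing a function with small Gaussian noise shifts its value by roughly $\frac{\sigma^2}{2}$ times its second derivative, which is how an additive-noise objective turns into a derivative penalty. For $D = \sin$ the smoothed value is known in closed form, so we can compare:</p>

```python
import numpy as np

# Second-order Taylor expansion: E[D(x + sigma * Z)] is approximately
# D(x) + (sigma^2 / 2) * D''(x) for small sigma, Z ~ N(0, 1).
# For D = sin the smoothing is exact:
# E[sin(x + sigma * Z)] = exp(-sigma^2 / 2) * sin(x).
sigma = 0.1
x = 0.7

exact_smoothed = np.exp(-sigma**2 / 2) * np.sin(x)
taylor_smoothed = np.sin(x) + (sigma**2 / 2) * (-np.sin(x))  # D'' = -sin

print(exact_smoothed, taylor_smoothed)  # agree to about 1e-5
```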

<p>For details, and there are plenty of details I glossed over here, I recommend you read the paper. Though the maths is a bit dense, I think it's worth it.</p>

<p>Here I just took the toy example from Figure 1, which illustrates how the regularization helps the network converge to a good solution more quickly, and then stay there in a stable manner:</p>

<p><img src="http://www.inference.vc/content/images/2017/06/Screen-Shot-2017-06-01-at-1.41.21-PM.png" alt="From Instance Noise to Gradient Regularisation"></p>]]></content:encoded></item></channel></rss>