We May be Surprised Again: Why I take LLMs seriously.
"Deep Learning is Easy, Learn something Harder" - I proclaimed in one of my early and provocative blog posts from 2016. While some observations were fair, that post is now evidence that I clearly underestimated the impact simple techniques will have, and probably gave counterproductive advice.
I wasn't alone in my deep learning skepticism; in fact, I'm far from being the most extreme deep learning skeptic. Many of us who grew up working in Bayesian ML, convex optimization, kernels and statistical learning theory confidently predicted the inevitable failure of deep learning and continued to claim deep nets did nothing more than memorize training data, ignoring all evidence to the contrary.
What was behind this? Beyond the unappealing hackiness of early DL, a key reason is that we misapplied some intuitions from key work in statistical learning theory. Well-known findings from learning theory (of the Vapnik–Chervonenkis or Rademacher flavour) gave guarantees for generalisation when the model class was sufficiently small. Many of us informally misused these results to imply that "you can't have generalisation unless your model class is simple". And deep learning is the opposite of a simple model class. Ergo, it won't/can't work. Any evidence to the contrary was therefore dismissed as cherry picking, random seed hacking or overfitting dressed up as success.
To be clear, there has been a lot of poorly motivated or poorly reproducible research, especially in RL, and thus some skepticism was well justified. To alleviate doubts, I still think theoretically motivated skepticism is a good thing, and rigour - which much of early deep learning lacked - is important. But some of deep learning is clearly more than overfitting dressed up as success, and many of us resisted this general idea for too long.
What changed this? An important change-point in many of our attitudes was the 2016 paper Understanding deep learning requires rethinking generalization. The way I remember this paper is this: "deep nets have maximal Rademacher complexity, generalisation theory thus predicts deep learning shouldn't work, but it is clear it does, therefore our theory is insufficient." This seems almost trivial now, but back then, it represented a massive shift. It's the theory that needed fixing, not deep learning. It opened up a massive opportunity for people to come up with new theory, develop new intuitions. We're nowhere near a modern theory of deep learning, but we know a lot more about the components at play.
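For concreteness, here is one textbook form of the kind of guarantee in question. The statement and notation are my choice for illustration, not a quote from the paper. Assume a loss $\ell$ taking values in $[0, 1]$, write $R(h)$ for the expected loss of a hypothesis $h$, $\hat{R}_n(h)$ for its average loss on $n$ i.i.d. training examples, and $\mathfrak{R}_n(\ell \circ \mathcal{H})$ for the Rademacher complexity of the loss class. Then, with probability at least $1 - \delta$, every $h \in \mathcal{H}$ satisfies

$$R(h) \;\le\; \hat{R}_n(h) + 2\,\mathfrak{R}_n(\ell \circ \mathcal{H}) + \sqrt{\frac{\ln(1/\delta)}{2n}}$$

When the model class is rich enough to fit arbitrary labels, the complexity term is close to its maximum and the right-hand side is no better than the trivial bound. A vacuous upper bound says nothing either way about the actual test error, which is exactly the gap between "the theory gives no guarantee" and "it can't work".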
It may have been alchemy, but some actual gold was produced.
Are we wrong about LLMs?
Today, I see a similar pattern of resistance to taking LLM results seriously. Many of my colleagues' views on LLMs have not changed at all over the last couple of years. Some describe them as being good only at "regurgitating training data". I also see similar patterns of misusing and overgeneralising theoretical arguments.
For example, the field of causal inference established the impossibility of inferring causal structure from i.i.d. observations. I wrote about this in my post on causal inference. Many people then overgeneralise this important but narrow finding to mean "an ML model can absolutely never learn causal reasoning unless you add some extra causal model". But what exactly does this result have to do with whether LLMs can (appear to) correctly reason about causal structure in the world when they complete prompts? LLMs, indeed, are statistical models fit to i.i.d. data. But the non-identifiability result is only relevant if we apply it to learning causal relationships between consecutive tokens in text. Those are not the causal relationships we need. We want the LLM to "understand" or appear to understand that kicking a ball results in the ball moving. That causal relationship is encoded in language; it doesn't have to be inferred from i.i.d. observational data.
My own attempts at dismissing LLMs
My instinct has been to reject the idea of pre-trained LLMs as the general-purpose inference machines they are branded as. I visited OpenAI in 2018, where I was shown a preview of what turned out to be key GPT-2 and GPT-3 results: the ability for a pre-trained LM to solve problems it wasn't explicitly trained on, via zero-shot prompt engineering. My immediate reaction was that this can't possibly work. Sure, you'll be able to do it better than chance, but this approach will never be even nearly competitive with specialised solutions. It's interesting to try and formalize why I thought this. I had two kinds of thoughts about this:
The No Free Lunch flavoured argument
Even though I personally never thought very much of no free lunch theorems in ML, in retrospect my reason for dismissing the 'one model can do it all' approach of GPT was essentially a no free lunch argument in disguise.
I tried to formalize my argument along these lines:
- Consider a distribution of tasks we want to solve. For simplicity, let's assume each task is a supervised prediction problem, where we have a joint distribution over some input $x$ and corresponding desired output $y$. Each task $\mathcal{T}$ would then be a joint distribution over $(x, y)$ pairs, and perhaps an associated loss function.
- I considered the language model as being nothing more than a stochastic process over character sequences. I didn't really care that it was trained on natural language. I just thought of it as a distribution of completions given prompts. I was happy to assume that the LLM could be the best such distribution there is.
- We use the LLM to solve a task $\mathcal{T}$ in an encoder-LLM-decoder sandwich. The encoder would be a mapping from some input $x$ to a character sequence or prompt (a.k.a. the prompt engineering part). The decoder would then take the LLM's completion and return a label or prediction $\hat{y}$ of some sort.
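The setup in the list above can be sketched in a few lines of Python. This is only an illustration under loud assumptions: `Sandwich`, `toy_llm` and the prompt templates are hypothetical names I made up, and `toy_llm` is a fixed lookup table standing in for a real language model, so nothing here reflects any actual LLM API.

```python
from dataclasses import dataclass
from typing import Callable, Generic, TypeVar

X = TypeVar("X")  # task input type
Y = TypeVar("Y")  # task prediction type


@dataclass
class Sandwich(Generic[X, Y]):
    """Hypothetical encoder-LLM-decoder wrapper for a single task."""
    encode: Callable[[X], str]   # task input x -> prompt (prompt engineering)
    llm: Callable[[str], str]    # prompt -> completion (shared across tasks)
    decode: Callable[[str], Y]   # completion -> prediction y_hat

    def predict(self, x: X) -> Y:
        return self.decode(self.llm(self.encode(x)))


def toy_llm(prompt: str) -> str:
    """Stand-in for a language model: a fixed table of completions."""
    table = {
        "Translate to Spanish: hello": " hola",
        "Sentiment (positive/negative): I loved it": " positive",
    }
    return table.get(prompt, "")


# Two tasks share the same "LLM"; only the encoder and decoder differ.
translate = Sandwich(
    encode=lambda s: f"Translate to Spanish: {s}",
    llm=toy_llm,
    decode=lambda c: c.strip(),
)
sentiment = Sandwich(
    encode=lambda s: f"Sentiment (positive/negative): {s}",
    llm=toy_llm,
    decode=lambda c: 1 if c.strip() == "positive" else 0,
)

print(translate.predict("hello"))
print(sentiment.predict("I loved it"))
```

Note that the only thing telling the shared model which task it is solving is the extra text the encoder puts in front of the input.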
In this setting, it feels intuitively true that the more distinct tasks $\mathcal{T}_i$ we want to be able to solve satisfactorily, the more likely it will be that conflicts emerge between the different tasks. Mathematically, this could be resolved by increasing the average length of prompts needed to solve all tasks. For example, if you only want your model to be able to solve a single task, like English-Spanish translation, your prompt only has to contain the source sentence. If you now also want to do English-French translation, or sentiment prediction, your prompts have to be longer to indicate to the model what task it should solve and how it should interpret the prompt. My hypothesis was, informally, that if we want a single encoder-LLM-decoder model to be able to solve all tasks we care about, the prompt lengths in that model would have to be exponentially long at the very least. I think it likely that one could prove a formal result along these lines for a sufficiently rich set of target tasks. This kind of thinking is very similar to Turing machine/universal grammar thinking, and there are a lot of parallels to Chomsky's objections to LLMs.
However, I no longer subscribe to this no free lunch argument, for the same reason I actually never really felt my work was limited by other no free lunch theorems.
It's very difficult to describe all tasks we ever want an agent to solve before we label it 'intelligent' or 'capable', or 'a good Bing'. Let's call these tasks useful tasks. I now suspect that useful tasks are a tiny sliver of "all possible tasks" one would probably encounter in a formally stated no free lunch theorem. We don't need AI to solve any adversarially chosen task, we'll be happy if it can solve the typical task we expect it to solve. This is similar to saying we don't actually care about approximating any measurable function over images, we care about building tools that can do certain things in images, like recognise pedestrians.
The "We're optimizing the wrong objective function" argument
I have also been a long-time proponent of paying attention to the objective function, and less attention to things like architecture. I said that if the objective function you optimise doesn't reflect the task you're using your model for, no amount of engineering or hacks will help you bridge that gap. I made this argument repeatedly in the context of representation learning, generative modelling, video compression, etc.
I argued that maximum likelihood is not a good objective function for representation learning, creating the memorable poop-diagram, which I continue to use in my lectures today. As LLMs are being trained via likelihood, it's only natural that my first instinct was that maximum likelihood can't be a good objective function for generally intelligent behaviour™ either. Why would getting better at next-token-prediction lead to near-optimal behaviour in a range of tasks that are, at best, underrepresented in the training data?
I have now abandoned this argument as well. Why? Because these arguments of mine do not consider inductive biases of the training process. I have now realised that $\operatorname{argmin}\mathcal{L}$ is a very poor description of what actually happens in deep learning. It's pointless to hope that any minimum of a loss function will have a desired property; it's sufficient in practice if the loss function has some good minima with the desired attributes and SGD has a tendency to find those over the bad minima.
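A toy case where this can be checked exactly is underdetermined linear least squares, sketched below. The assumptions are loud ones: this is a linear model with made-up random data, it uses full-batch gradient descent rather than SGD, and it says nothing directly about GPT-style models. It only shows that a loss with infinitely many global minima does not by itself determine which one the optimiser returns: started from zero, gradient descent lands on the minimum-norm solution.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 20                      # fewer equations than unknowns
A = rng.normal(size=(n, d))       # made-up design matrix
b = rng.normal(size=n)            # made-up targets


def loss(w):
    """Squared-error loss; it has infinitely many global minima here."""
    return 0.5 * np.sum((A @ w - b) ** 2)


# Full-batch gradient descent from a zero initialisation.
w = np.zeros(d)
step = 1.0 / np.linalg.norm(A, 2) ** 2   # 1 / largest eigenvalue of A^T A
for _ in range(5000):
    w -= step * A.T @ (A @ w - b)

# The minimum-norm global minimum, computed via the pseudo-inverse.
w_min_norm = np.linalg.pinv(A) @ b

# A different global minimum: add a direction from the null space of A.
null_dir = np.linalg.svd(A)[2][-1]       # last right singular vector
w_other = w_min_norm + 3.0 * null_dir

print("loss at GD solution:     ", loss(w))
print("loss at other minimum:   ", loss(w_other))
print("distance GD to min-norm: ", np.linalg.norm(w - w_min_norm))
print("norms (GD, other):       ", np.linalg.norm(w), np.linalg.norm(w_other))
```

Both `w` and `w_other` minimise the loss, so $\operatorname{argmin}\mathcal{L}$ cannot tell them apart; the choice between them comes entirely from the initialisation and the update rule.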
True, we have barely a clue as to what inductive biases SGD on a model like GPT-3 has - even less if we consider components like RLHF or CoT prompting. But the fact that we can't describe them doesn't mean that unreasonably helpful inductive biases can't be there. And evidence is mounting that they are there.
As intellectually unsatisfying as this conclusion is, the LLM approach works, but most likely not for any of the reasons we know. We may be surprised again.