Deep Learning is Powerful Because It Makes Hard Things Easy - Reflections 10 Years On
Ten years ago this week, I wrote a provocative, bold post that blew up and made it to the top spot on Hacker News. I had just joined Magic Pony, a nascent startup, and I remember the founders Rob and Zehan scolding me for offending the very community we were part of and - of course - the deep learning developers we wanted to recruit.
The post aged in some very, hmm, interesting ways. So I thought it would be good to reflect on what I wrote, things I got very wrong, and how things turned out.
- 🤡 Hilariously poor predictions on low-hanging fruit and the impact of architecture tweaks
- 🎯 Some insightful thoughts on how simplicity = power
- 🤡 Predictions on the development of Bayesian deep learning and MCMC
- 🎯 Some good advice nudging people towards generative models
- ⚖️ PhD vs company residency: what do I think now?
- 🍿 Who is wrong today? Am I wrong? Are we all wrong?
Let's start with the most obvious blind spot in hindsight:
🤡's predictions on architecture and scaling
> There is also a feeling in the field that low-hanging fruit for deep learning is disappearing. [...] Insights into how to make these methods actually work are unlikely to come in the form of improvements to neural network architectures alone.
Ouch. Now this one has aged like my great-uncle-in-law's wine (he didn't have barrels, so he cleaned up an old wheelie bin to serve as a fermentation vat). Of course, today 40% of people credit the transformer architecture for everything that's going on, and 60% credit scaling laws, which are essentially existence proofs of stupendously expensive low-hanging fruit.
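(For the uninitiated, scaling laws are empirical power-law fits of loss against model size, data, or compute. In the form popularised by Kaplan et al., roughly

$$L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N},$$

where $N$ is the parameter count and $N_c$, $\alpha_N$ are fitted constants: a promise that the fruit keeps coming as long as you keep paying for taller ladders.)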
But there's more I didn't see back then: I - and others - wrote a lot about why GANs don't work, how to understand them better, and how to fix them using maths. Eventually, what made them work well in practice were ideas like BigGAN, which relied mostly on architectural tweaks rather than foundational mathematical changes. On the other hand, what made SRGAN work was thinking deeply about the loss function and making a fundamental change - a change that has been adopted in almost all follow-on work.
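To make the SRGAN point concrete: the fundamental change was to stop comparing images pixel-by-pixel and to compare them instead in the feature space of a pretrained network, alongside an adversarial term. A minimal sketch, assuming PyTorch and torchvision (the truncation point of the VGG features is illustrative, not the paper's exact choice):

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

# A frozen pretrained VGG serves as the feature extractor.
vgg = models.vgg19(weights="IMAGENET1K_V1").features[:36].eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def perceptual_loss(sr, hr):
    # Compare the super-resolved image and the ground truth in VGG
    # feature space rather than pixel space.
    return F.mse_loss(vgg(sr), vgg(hr))
```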
In general, lots of beautiful ideas were steamrolled by the unexplainably, unreasonably good inductive biases of the simplest methods. I - and many others - wrote about modelling invariances, and sure, geometric deep learning is a thing in the community, but evidence is mounting that deliberate, theoretically inspired model design plays a limited role. Even something as successful as the convolution, once thought indispensable for image processing, is at risk of going the way of the dodo - at least at the largest scales.
In hindsight: There is a lot of stuff in deep learning that we don't understand nearly well enough. Yet it works. Some simple things have surprisingly huge impact, and mathematical rigour doesn't always help. The bitter lesson is bitter for a reason (maybe it was the wheelie bin). Sometimes things work for reasons completely unrelated to why we thought they would. Sometimes people are right for the wrong reason. I was certainly wrong, and for the wrong reason, multiple times. Have we run out of low-hanging fruit now? Are we entering "the era of research with big compute", as Ilya said? Is Yann LeCun right to call LLMs a dead end today? (Pop some 🍿 in the microwave and read till the end for more.)
π― "Deep learning is powerful exactly because it makes hard things easy"
Okay, this was a great insight. And good insight is often perfectly obvious in hindsight. The incredible power of deep learning - defined as the holy trinity of automatic differentiation, stochastic gradient descent, and GPU libraries - is that it took something PhD students did and turned it into something 16-year-olds can play with. They don't need to know what a gradient is, not really, much less implement one. You don't need to open The Matrix Cookbook a hundred times a day to remember which way the transpose is supposed to go.
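A minimal sketch of what that trinity hides, assuming PyTorch (the toy model and data are made up): note that no gradient ever appears in the code you write.

```python
import torch

model = torch.nn.Linear(10, 1)                    # a toy model
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(32, 10), torch.randn(32, 1)    # toy data

for _ in range(100):
    loss = torch.nn.functional.mse_loss(model(x), y)
    opt.zero_grad()
    loss.backward()  # autodiff: every gradient, no Matrix Cookbook needed
    opt.step()       # SGD: the update rule, abstracted away
```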
At the start of my career, in 2007, I attended a Machine Learning Summer School meant for PhD students and postdocs; I was among the youngest participants, still only a Master's student. Today, we run AI retreats for 16-18 year olds who work on projects like RL-based approaches to the no-three-in-line problem, or testing OOD behaviour of diffusion language models. Three of the projects are not far from publishable work, and one student is first author on a NeurIPS paper, though I had nothing to do with that.
In hindsight: the impact of making hard things easy should not be underestimated. This is where the biggest impact opportunities are. LLMs, too, are powerful because they make hard things a lot easier. This is also our core thesis at Reasonable: LLMs will make extremely difficult types of programming - ones which kind of needed a specialised PhD to really understand - "easy". Or at least accessible to mortal human software engineers.
🤡 Strikes Again? Probabilistic Programming and MCMC
OK, so one of the big predictions I made was that
> probabilistic programming could do for Bayesian ML what Theano has done for neural networks
To say the least, that did not happen (if you're wondering, Theano was an early deep learning framework, a precursor to today's PyTorch and JAX). But it was an appealing idea. If the main thing about deep learning is that it democratized "PhD-level" machine learning by hiding complexity under lego-like simplicity, wouldn't it be great to do just that with the even more PhD-level topic of Bayesian/probabilistic inference? Gradient descent and high-dimensional vectors are hard enough to explain to a teenager, but good luck explaining KL divergences and Hamiltonian Monte Carlo. If we could abstract these things away the same way, and unlock their power, it could be great. Well, we couldn't abstract things to the same degree.
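To be fair, the abstraction did get genuinely good - just never Theano-for-everyone good. A minimal sketch, assuming NumPyro (the one-weight regression model is made up): you state the model, and NUTS runs Hamiltonian Monte Carlo without you ever writing a leapfrog step.

```python
import jax.random as random
import numpyro
import numpyro.distributions as dist
from numpyro.infer import MCMC, NUTS

def model(x, y=None):
    # A one-weight Bayesian regression: priors, likelihood, done.
    w = numpyro.sample("w", dist.Normal(0.0, 1.0))
    sigma = numpyro.sample("sigma", dist.HalfNormal(1.0))
    numpyro.sample("y", dist.Normal(w * x, sigma), obs=y)

# All of Hamiltonian Monte Carlo, hidden behind one constructor.
mcmc = MCMC(NUTS(model), num_warmup=500, num_samples=1000)
mcmc.run(random.PRNGKey(0), x=1.0, y=2.0)
```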
In hindsight: Commenters called it self-serving of me to predict that the areas I happened to have expertise in would turn out to be the most important topics to work on in the future. And they were right! My background in information theory and probabilities did turn out to be pretty useful, but it took me some time to let go of my Bayesian upbringing. I reflected on this in my post on secular Bayesianism in 2019.
🎯 Generative Modeling
In the post I suggested people learn "something harder" instead of - or in addition to - deep learning. One of the areas I encouraged people to look at was generative modelling, and I gave GANs and Variational Autoencoders as examples. Of course, neither of these plays a role in LLMs, arguably the crown jewels of deep learning. Furthermore, generative modelling in autoregressive models is actually super simple, and can be explained without any probabilistic language as simply "predicting the next token".
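Just how simple? A minimal sketch in PyTorch, where `model` is a stand-in for any network producing per-position logits over the vocabulary:

```python
import torch.nn.functional as F

def next_token_loss(model, tokens):
    # tokens: integer tensor of shape (batch, seq_len)
    logits = model(tokens[:, :-1])  # predict from every prefix
    targets = tokens[:, 1:]         # the "next token" at each position
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # flatten all positions
        targets.reshape(-1),
    )
```

That really is the whole generative model: a classifier over the vocabulary, applied at every position.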
In hindsight: Generative modelling continues to be influential, so at least this wasn't terrible advice to give in 2016. Diffusion models, early versions of which were emerging by 2015, power most image and video generation today, and diffusion language models may one day be influential too. Here, at least, it is true that deeper knowledge of topics like score matching and variational methods came in handy.
⚖️ PhD vs Company Residency
On this interesting topic, I wrote:
> A couple companies now offer residency programmes, extended internships, which supposedly allow you to kickstart a successful career in machine learning without a PhD. What your best option is depends largely on your circumstances, but also on what you want to achieve.
I wrote this in 2016. If you had gone and done a PhD in Europe (lasting 3-4 years) starting then, assuming you're great, you would have done well. You would have graduated just in time to see LLMs unfold - you didn't miss much. Plus, you would likely have done an interesting internship every single summer of your degree. But things have changed. Frontier research is no longer published. Internships at frontier labs are hard to get unless you're in your final year and the company can see a clear path to hiring you full-time. Gone are the days of publishing papers as an intern.
In the frontier LLM space, the field moves so fast that it's truly difficult to pick a research question that won't look obsolete by the time you write your thesis. If you pick something fundamental and ambitious enough - say, adding an interesting form of memory to LLMs - your lab will likely lack the resources to demonstrate it at scale, and even if your idea is a good one, by the time you're done the problem will be considered "essentially solved", and people will start copying whatever algorithm DeepSeek or Google happened to talk about first. Of course, you can choose not to engage with the frontier questions and do something else entirely.
Times have changed. Depending on your goals and interests and on what you're good at, I'm not so sure a PhD is the best choice anymore. And what's more! I claim that
> most undergraduate computer science programs, even some elite ones, fail to match the learning velocity of the best students.
I'm not saying you should skip a rigorous degree program. My observation is that top talent can and does successfully engage with what was considered graduate-level content while still in their teenage years. While back then I was deeply skeptical of 'college dropouts' and the Thiel Fellowship, my views have shifted significantly after spending time with brilliant young students.
🍿 Are We Wrong Today?
The great thing about science is that scientists are allowed to be wrong. Progress happens when people take different perspectives, provided we admit when we were wrong and update on evidence. So here you have it, obviously:
I was wrong on a great many things.
But this raises questions: Where do I stand today? Am I wrong today? Who else is wrong today? Which position is going to look like my 2016 blog post in retrospect?
In 2016 I warned against the herd mentality of "lego-block" deep learning. In 2026, I am marching with the herd. The herd, according to Yann LeCun, is sprinting towards a dead end, mistaking the fluency of language models for a true foundation of intelligence.
Is Yann LeCun right to call LLMs a dead end? I recall that Yann's technical criticism of LLMs started with a fairly mathematical theoretical argument about how errors accumulate: autoregressive LLMs, he argued, are exponentially diverging diffusion processes. Such an argument was especially interesting to see from Yann, who likes to remind us that naysayers doubted neural networks with arguments like "they have too many parameters, they will overfit" or "non-convex optimization gets stuck in local optima" - arguments he blamed for standing in the way of progress. Like others, I don't buy the error-accumulation argument today.
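For the record, here is the argument as I remember it, in my own simplified notation (the constant per-token error rate $e$ is the argument's assumption, not mine): if every generated token independently carries a probability $e$ of an unrecoverable mistake, then

$$P(\text{still on track after } n \text{ tokens}) = (1-e)^n,$$

which decays exponentially in the length of the output. The weak point, and the reason many of us stopped buying it, is the premise: per-token errors are neither constant nor independent, and trained models can and do steer back from their own mistakes.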
What is the herd not seeing? According to Yann, true intelligence requires an understanding of the physical world: in order to achieve human-level intelligence, we first have to reach cat- or dog-level intelligence. Fair enough. There are different aspects of intelligence, and LLMs only capture some of them. But this is not reason enough to call them a dead end, unless the goal is to create something indistinguishable from a human. A non-embodied, language-based intelligence has an infinitely deep rabbit hole of knowledge and intelligence to conquer: an inability to catch a mouse or climb a tree won't prevent language-based intelligence from having a profound impact.
As for other things the herd is not seeing, Yann argues true intelligence needs "real" memory, reasoning, and planning. I don't think anyone disagrees. But why could these not be built on, or plugged into, the language-model substrate? It's no longer true that LLMs are statistical pattern-matching devices that merely learn to mimic what's on the internet. Increasingly, LLMs learn from exploration, and reason and plan pretty robustly. Rule learning, continual learning, and memory are at the top of the research agenda of every single LLM company. These are going to get done.
I celebrate Yann going out there to make and prove his points, and wish him luck. I respect him and his career tremendously, even as I often find myself taking a perspective that just happens to be in anti-phase to his - as avid readers of this blog no doubt know.
But for now, I'm proudly marching with the herd.