June 8, 2023

We may finally crack Maths. But should we?

Automating mathematical theorem proving has been a long standing goal of artificial intelligence and indeed computer science. It's one of the areas I became very interested in recently. This is because I feel we may have the ingredients needed to make very, very significant progress:

1. a structured search space with clear-cut success criterion that can be algorithmically generated: the language of formal mathematics
2. a path to obtaining very good heuristics to guide search in the space - LLMs trained on a mixture of code, formal and informal mathematics.
3. learning algorithms that can exploit the above, like AlphaZero and MuZero, with demonstrated ability of tackling some tricky search problems (in Go, and now with AlphaDev).

Wicked and Tame problems

Some problems humans have to solve are just fundamentally harder than others. To reason about this, Rittel and Webber (1973) defined the concept of wicked and tame problems in “Dilemmas in a General Theory of Planning”. Wicked problems have the following characteristics:

• they elude definitive formulation
• there is no clear stopping criterion, i.e. it’s impossible to tell if a solution has been reached
• possible solution candidates cannot be enumerated or exhaustively described
• there is no opportunity to learn from trial-and-error either because the problem deals with one-off situations or because any solution attempt may incur significant costs

Most policy problems - where to put a highway, should our country go to war, how should we regulate AI, should Britain leave the European Union - are wicked problems. So are most business, artistic or lifestyle dilemmas we’re faced with. One can't simply use BOTECs, decision theory and optimisation to capture all aspects of these problems.

By contrast, tame problems are the exact opposite: well defined problems, with clear stopping criteria, where the solution space is exhaustively describable, and trial-and-error learning is possible. Many problems in mathematics, puzzle-solving, engineering and the natural sciences, playing chess and some other sports fall into this category, according to Ritter and Webber.

To me, this classification almost reads like:

• tame = solvable by AI
• wicked = requires human judgement, accountability, responsibility

Mathematical theorem proving is a clear example of a tame problem. Notwithstanding some intricacies, theorem proving is a pretty well defined problem, and especially in formal mathematics, in languages like LEAN, Isabel and others, the possible attempts at proving a statement are exhaustively describable. For the most part, it can be determined if a solution has been reached, and solutions are either true or false. Finally, getting a proof wrong has no cost or negative consequences attached to (other than perhaps energy or time cost), so the problem is amenable to trial-and-error learning.

Of course, there is more to mathematics than the generation of formally verified proofs of statements in an arguably ugly and hard-to-read language based on dependent type theory. There is conjecture generation: how do we discover statements that are likely provable. Some proofs have aesthetic value, or can be described as beautiful. Some theorems or definitions are more useful than others, as they may have implications or might make the discovery of other theorems easier. All of these aspects notwithstanding, I argue the core problem of formal theorem proving passes Ritter and Webber’s definition of a tame problem, and thus machine intelligence may be able to deal with the problem without any human judgment in the loop.

Tame Doesn’t Mean Easy

It’s important thought to note that tame doesn’t mean easy, nor does it mean it will be solved. Indeed, a convincing solution to general automated theorem proving eluded the field for decades. I believe the correct conclusion from this analysis is that it is ‘within reach’ for machine intelligence, and new evidence on the unreasonable efficiency of code generating LLMs points at new avenues that may result in significant breakthrough. But many, great difficulties remain:

• the search space is vast: The search space of all synthactically correct proofs is obviously enormous, up there with Go, and probably even larger/slightly less well structured.
• the reward signal is sparse: In two-player games like Go, one of the players always wins, and thus gets positive reinforcement after an episode of self-play. Mathematical theorem proving, and code generation in general, is a single-player game. Most proof attempts fail. Positive reinforcement only comes if the proof is verifiably correct, and that happens very rarely.
• data is - currently - limited: While there are plenty of data out there on informal mathematics (mathematical statements and proofs written in human language), as well as code in python or JavaScript, there is relatively little data on the internet containing formally verified software or formal mathematical proofs. LEAN's matlib is probably the largest repository of formal mathematical proofs, anecdotally it covers all of the Imperial College undergraduate curriculum, but it's still but a tiny sliver of mathematical knowledge humans developed. There are attempts to leverage LLMs to help automatically formalize informal mathematics thus creating more data, but it's still an open question how much of the data bottleneck can be eliminated this way.

What if we do crack this

So, let’s assume for a minute that we can crack the mathematical theorem proving puzzle, and in particular, we will build programs capable of generating correct proofs of non-trivial mathematical statements, and perhaps even generating new mathematical statements together with correct proofs. What are the implications of this?

Dual use

AI systems that do useful things can often be repurposed accomplish very bad things. For example, AI methods designed to help with safe drug design can also be used to design deathly toxic compounds simply by switching a sign in an objective function, as shown in Dual-use of artificial-intelligence-powered drug discovery.

So what would the paper ”Dual-use of artificial-intelligence-powered mathematical theorem proving” be about? I’m not sure it’s a question many folk working on the topic ask themselves. My approach was that theorems don’t hurt anyone (I know several young people who would passionately disagree with this sentiment). But it's one question I started thinking about.

Mathematical theorem proving, especially the one based on formal mathematics, is closely related to software verification, and formally verifiable software engineering. It turns out, the main source of funding for research in the US is the Department of Defence, and indeed, this technology has military applications. Talia Ringer discusses a couple concrete examples in her lecture on the ethics of verification.

Besides military applications, it is also conceivable that formal verification could play a role in cybercrime or the creation of malicious software of different types. Verification is used in the design of AntiVirus software - at least at a very handwavy level, it is conceivable that it can be used to make “provably bad” computer viruses, too.

As you can tell I'm very hand-wavy and high-level in my understanding of this, but it's an important topic to consider. If anyone has concrete pointers on the dual use of theorem proving, please send them my way.

Contributions to General Language-based AI

Another possible concern is that a breakthrough in mathematical theorem proving may further accelerate the development and deployment of general-purpose AI tools.  And that can be a good thing or a bad thing, depending on your perspective.

Why would this happen? There is anecdotal evidence - or urban legend - out there that including formal languages (computer code) in the training data improves the quality of language models, especially having a positive an effect on their common-sense reasoning ability. Of course, the idea that learning to code improves your ability to reason is sure to put a smug smile on any computer scientist’s face, so I’d be skeptical with how strongly supported this claim is.

But, assuming that the above hypothesis holds - that there is a positive transfer effect from training on formal languages to common sense reasoning ability in LLMs - we can reasonably expect that a bunch of new training data generated in the form of formally verified proofs could turn out to be very valuable, sweet training data for LLMs. In theory, automatic theorem proving or formal verification may provide virtually unlimited high-quality training data for LLMs, further enhancing their apparent reasoning abilities. Depending on what you think about the pace of development and deployment of general-purpose language AIs, this may be an outcome you applaud, or fear.

Loss of meaning

Of course, another topic that comes up often is just how depressing and sad forfeiting mathematics to machine intelligence is. For many, mathematics is not only a job, but a pursuit they derive meaning from. If AI is ever becomes better at Maths than people, as Fields-medalist Timothy Gowers discusses on Twitter, what happens to human mathematics, what will people derive meaning from.

First of all, one can enjoy chess even if not playing at grandmaster level, and even if a Chess program easily beats me. Similarly, I personally enjoy understanding mathematics topics that have long been solved, without being the one actually discovering or proving it. So the 'pleasure of proving something' may not be completely lost. On the other hand 'being the first to prove' may become difficult, but it is already a rare source of pleasure.

Secondly, I think breakthroughs in automated theorem proving will probably pose new kind of questions for folk with mathematical intuition. Perhaps the work is going to be less about trying to prove or disprove conjectures, but a bit more about categorising, making human sense of mathematical knowledge, or guiding an increasingly automated process of mathematical discovery.