Against Moloch
April 11, 2026

Can AI do Math? Part One

Impressive, but bounded

[Header illustration: an isometric view of a circular chamber of geometric constructions, with a mechanical apparatus guiding the final piece of a crystalline polyhedron into place.]

I’ve found the current discourse about AI and math deeply confusing: for those of us who aren’t mathematicians, it’s hard to figure out what’s hype and what’s substantive. Does solving an Erdős problem represent a meaningful breakthrough, or does it just mean the AI tracked down a previously-published answer to a problem nobody ever cared enough about to investigate?

The answer turns out to be complicated but interesting: frontier models are impressively good at math—and getting better fast—but they’re a long way from putting mathematicians out of work. In many ways, math is like coding: AI is getting quite good at doing many of the mundane things that mathematicians spend their time doing, but it lacks the taste and high-level understanding required to do genuinely novel work.

In Part One, I’ll review what AI has already conquered:

  1. Traditional evaluations like MATH (difficulty levels 1-5) and GSM8K are essentially saturated: to a first approximation, any problem you’d find on a grade school or high school math exam is now easy for AI.
  2. Competitive math is also largely solved as of 2025, with AI beating all but the best humans at both the International Math Olympiad (IMO) and the Putnam.

In Part Two, we’ll look at what remains, and what AI means for professional mathematicians:

  1. AI really has solved several Erdős problems. Some of the solutions are modestly novel, but none are truly notable.
  2. There’s a new generation of math evaluations based on interesting unsolved research problems. AI has made modest progress on those, but hasn’t yet solved any truly significant problems.
  3. What does all of that mean for professional mathematicians? AI is useful, but it’s far from being able to do groundbreaking work on its own.

1: Traditional evaluations are largely saturated

Just a few years ago, LLMs struggled to do basic math that any high schooler should be able to handle. Those days are gone: traditional evaluations are mostly saturated, and those that remain typically require proving college-level theorems. At this point, AI can outperform almost anyone without a STEM degree.

GSM8K

The GSM8K evaluation, introduced in 2021, tested basic grade school math:

Every day, Wendi feeds each of her chickens three cups of mixed chicken feed, containing seeds, mealworms and vegetables to help keep them healthy. She gives the chickens their feed in three separate meals. In the morning, she gives her flock of chickens 15 cups of feed. In the afternoon, she gives her chickens another 25 cups of feed. How many cups of feed does she need to give her chickens in the final meal of the day if the size of Wendi's flock is 20 chickens?
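For reference, the intended solution is a couple of steps of arithmetic; here’s the same calculation as a quick Python sketch:

    # Worked solution to the sample problem above.
    cups_per_chicken = 3
    flock_size = 20
    daily_total = cups_per_chicken * flock_size  # 60 cups across all meals
    morning, afternoon = 15, 25
    final_meal = daily_total - morning - afternoon
    print(final_meal)  # 20 cups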

GSM8K was fully saturated by 2024. AI is pretty much done with those types of word problems, and I doubt it misses them any more than I do.

MATH

The MATH benchmark (2021) tested competition-level high school math and was saturated by the end of 2024. It was perhaps the last benchmark whose problems most people without a STEM degree would have a shot at solving:

The equation x^2 + 2x = i has two complex solutions. Determine the product of their real parts.
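This one takes actual algebra: writing the roots as -1 ± sqrt(1 + i) and taking real parts gives a product of (1 - sqrt(2))/2 ≈ -0.207. If you’d rather not chase radicals by hand, here’s a quick symbolic check (assuming you have sympy installed):

    # Check the product of the real parts of the roots of x^2 + 2x = i.
    import sympy as sp

    x = sp.symbols('x')
    roots = sp.solve(x**2 + 2*x - sp.I, x)  # -1 + sqrt(1 + i), -1 - sqrt(1 + i)
    real_parts = [sp.re(sp.expand_complex(r)) for r in roots]
    print(sp.simplify(real_parts[0] * real_parts[1]))  # 1/2 - sqrt(2)/2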

miniF2F

The few unsaturated benchmarks mostly require proving formal results rather than merely solving problems. Here’s an example from miniF2F, a benchmark using Olympiad-level problems:

Prove that if |x-2| = p, where x < 2, then x - p = 2 - 2p
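Unlike the benchmarks above, miniF2F poses each problem as a formal statement to be proven in a proof assistant. Here’s a minimal Lean 4 sketch of this problem using Mathlib (the benchmark’s own formalization and theorem name differ):

    import Mathlib

    -- Since x < 2, the quantity x - 2 is negative, so |x - 2| = -(x - 2) = 2 - x,
    -- and the goal reduces to linear arithmetic.
    theorem abs_sub_two (x p : ℝ) (h : |x - 2| = p) (hx : x < 2) :
        x - p = 2 - 2 * p := by
      have hneg : x - 2 < 0 := by linarith
      rw [abs_of_neg hneg] at h  -- h : -(x - 2) = p
      linarith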

I think it’s fair to say that most college-educated adults wouldn’t get much traction on miniF2F. The original miniF2F was saturated by specialized theorem provers by the end of 2025, but has not yet been saturated by general-purpose LLMs.

ProofNet

ProofNet, like miniF2F, is a proof-based evaluation that tests undergraduate-level analysis, linear algebra, algebra, and topology. It’s still an active evaluation: the highest score to date is 41.4%.

Prove that a group of order 312 has a normal Sylow p-subgroup for some prime p dividing its order.
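As with miniF2F, the model’s actual task is to prove a formal version of this statement. The underlying math is a standard Sylow-counting argument: 312 = 2^3 · 3 · 13, and the number of Sylow 13-subgroups divides 24 and is congruent to 1 mod 13, so it must be 1, forcing that subgroup to be normal. A hypothetical Lean 4 rendering of the statement (ProofNet’s actual formalization differs, and I’ve elided the proof with sorry):

    import Mathlib

    -- Statement only: a group of order 312 has a normal Sylow p-subgroup
    -- for some prime p dividing 312.
    theorem order_312_has_normal_sylow {G : Type*} [Group G] [Fintype G]
        (hG : Fintype.card G = 312) :
        ∃ p : ℕ, p.Prime ∧ p ∣ 312 ∧ ∃ P : Sylow p G, (P : Subgroup G).Normal := by
      sorry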

Competition math

The best models can now match elite humans in competitive math at both the high school and college level. That’s super impressive, but competitive math is a niche endeavor, quite different from both applied and research mathematics: a model’s performance here says a lot about its overall capabilities, but not much about its ability to do useful math.

International Math Olympiad (IMO)

The IMO is the most prestigious high school math competition in the world. In 2025, models from Google DeepMind and OpenAI both scored 35/42 at the IMO, a score which would earn a human contestant a gold medal.

Putnam Exam

The Putnam Exam is the most prestigious college-level math competition. Several models achieved excellent scores in 2025, led by DeepSeek-v3.2-Speciale, which scored 103/120. That score would have placed a human in the top 3 contestants (out of 4,335) and earned them a Putnam Fellowship.

What next?

The results we’ve seen so far are impressive, but don’t say anything about research-level math. We’ll tackle that in Part Two, starting with the (in)famous Erdős problems.