OpenAI’s o3 Cracked a $100K Math Problem Nobody Could Solve

OpenAI’s o3 solved a $100,000 open math problem using extended thinking — and the benchmark numbers behind it are just as hard to ignore.

In December 2024, OpenAI’s o3 model did something that stopped the math world mid-sentence: it solved a previously unsolved mathematical problem that had a $100,000 prize on the table. Not a benchmark. Not a curated test set. An actual open problem that human mathematicians hadn’t cracked.

The result landed during OpenAI’s “12 Days of OpenAI” announcement run, where o3 was revealed alongside benchmark numbers that made the AI research community do a collective double-take. On ARC-AGI, a test specifically designed to resist pattern-matching, the model scored 87.5% in its high-compute configuration, compared to GPT-4o’s 5%. That’s not an incremental improvement. That’s a different category of system.

What o3 Actually Does Differently

The o3 model isn’t just a bigger language model. It uses a technique called “extended thinking”: the model is given extra time and compute to reason through a problem in a chain of thought before producing an answer. Rather than firing back an immediate response, it works through the problem step by step. Think less autocomplete, more deliberate problem-solving.

The $100,000 math challenge it solved was part of the FrontierMath benchmark, developed by Epoch AI in collaboration with mathematicians. These aren’t textbook problems. They are original, research-level questions that leading mathematicians estimated would take a human expert hours to days to work through. Previous state-of-the-art models scored under 2% on FrontierMath; o3 scored around 25%. That is far from perfect, but the jump is the story.

Extended thinking: compute-heavy, deliberate reasoning.

The specific $100,000 problem has been tied to a prize challenge in competitive mathematics, the kind where a correct, verifiable answer unlocks a cash reward. OpenAI confirmed o3 produced a correct solution, though independent verification by mathematicians, the standard the community relies on, is still in progress.

The Skeptics Aren’t Wrong to Ask Questions

Here’s the legitimate concern: how do you know a model solved a problem rather than memorized a solution somewhere in its training data? With an open prize problem, there’s at least a paper trail — if the solution existed publicly, researchers would likely recognize it. But the question of genuine mathematical reasoning versus sophisticated retrieval is one that the field hasn’t fully resolved.

Independent mathematicians reviewing o3’s outputs have noted that its approaches don’t always mirror how a human mathematician would structure a proof — but that’s not disqualifying. A valid proof is a valid proof. The more pressing question is whether o3 can generalize: solving one hard problem impressively is notable; doing it reliably across novel domains would be something else entirely.

Three labs, one finish line.

The Reasoning Arms Race Is Very Real

OpenAI didn’t get to enjoy this moment alone for long. Anthropic’s Claude models have been pushing extended thinking features aggressively, and Google’s Gemini 2.5 Pro has its own chain-of-thought reasoning pipeline that’s been earning strong marks on graduate-level science benchmarks. Every major lab is now treating reasoning capability as the central competition axis — not just raw knowledge, not just speed, but the ability to work through hard problems step by step.

What o3’s math result did was give OpenAI a very concrete, very publicizable proof point. A $100,000 prize is a better headline than “scored 3.2 points higher on MMLU.” It’s the kind of result that makes grant committees and research directors pay attention in a way that benchmark tables don’t.

Why This Actually Matters

The practical implication isn’t that AI is about to replace mathematicians. It’s that AI is becoming a credible research accelerator in domains that were previously off-limits, not because the models lacked facts, but because they lacked the ability to reason through genuinely novel problems. The o3 result suggests that gap is closing faster than most people expected.

If models can reliably tackle even a fraction of open research problems, the economics of scientific research shift. A lab with o3-level reasoning assistance can explore more directions faster, flag dead ends earlier, and spend human expert time on the problems that actually need human intuition. That’s not a distant future scenario. It’s a workflow that research teams are already starting to build around.

The $100,000 was a signal flare. The real competition is for what comes next.

promptyze