OpenAI’s o3 Just Hit 99.2% on ARC-AGI — The Benchmark That Was Supposed to Be Unbeatable
OpenAI’s o3 model hit 99.2% on ARC-AGI, the abstract reasoning benchmark designed to resist AI, jumping from o1’s 85% and reshaping expectations for machine intelligence.
OpenAI dropped a technical report on March 30, 2026, showing their o3 model hit 99.2% accuracy on ARC-AGI, the Abstraction and Reasoning Corpus benchmark that was designed specifically to be hard for AI. This is the test researchers built to measure general intelligence rather than pattern matching or memorization. For context, o1 scored 85% last year. GPT-4 barely cracked 5%.
ARC-AGI isn’t your typical benchmark. It’s a set of visual logic puzzles that require understanding abstract rules and applying them to new situations, exactly the kind of reasoning humans do naturally and machines have traditionally bombed. The fact that o3 is now solving roughly 99 out of 100 puzzles correctly has researchers scrambling to figure out what changed.
What Makes ARC-AGI Different
Most AI benchmarks test knowledge retrieval or pattern recognition. ARC-AGI tests whether a model can figure out a rule from a few examples and apply it to something new. Think of it like this: you see three grids where red squares always move right, then you’re asked to predict what happens in a fourth grid. Simple for humans. Brutal for AI.
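To make the red-squares example concrete, here is a minimal Python sketch of a toy puzzle in that spirit. The grid encoding (0 for empty, 2 for red), the helper names, and the sample grids are all assumptions made for illustration; real ARC-AGI tasks cover far more varied rules than a single horizontal shift.

```python
from typing import List, Optional, Tuple

Grid = List[List[int]]  # assumed encoding: 0 = empty, 2 = red

def shift_right(grid: Grid, k: int) -> Grid:
    """Move every non-empty cell k columns right; cells pushed off the edge are dropped."""
    out = [[0] * len(row) for row in grid]
    for r, row in enumerate(grid):
        for c, color in enumerate(row):
            if color != 0 and c + k < len(row):
                out[r][c + k] = color
    return out

def infer_shift(examples: List[Tuple[Grid, Grid]]) -> Optional[int]:
    """Return the smallest shift consistent with every (input, output) pair, or None."""
    width = len(examples[0][0][0])
    for k in range(width):
        if all(shift_right(inp, k) == out for inp, out in examples):
            return k
    return None

# Three demonstration pairs in which red squares move one column to the right.
examples = [
    ([[2, 0, 0], [0, 0, 0]], [[0, 2, 0], [0, 0, 0]]),
    ([[0, 0, 0], [0, 2, 0]], [[0, 0, 0], [0, 0, 2]]),
    ([[2, 0, 0], [2, 0, 0]], [[0, 2, 0], [0, 2, 0]]),
]

k = infer_shift(examples)                 # rule inferred from the demonstrations
test_input = [[0, 2, 0], [2, 0, 0]]       # the "fourth grid"
prediction = shift_right(test_input, k)   # apply the inferred rule to unseen input
print(k, prediction)
```

The catch is that this solver only works because the rule family (horizontal shifts) was hard-coded in advance. In ARC-AGI, the rule family itself changes from puzzle to puzzle, which is what makes the benchmark hard.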
François Chollet, who created ARC-AGI back in 2019, designed it specifically to resist brute-force learning. The test has only 400 training puzzles and 400 evaluation puzzles. You can’t just throw compute at it and hope for the best. Or at least, that was the theory until now.

How o3 Did It
OpenAI’s technical report is light on architectural details, but they confirm o3 uses extended chain-of-thought reasoning with what they call “adaptive search.” Translation: the model doesn’t just spit out an answer. It explores multiple reasoning paths, backtracks when it hits dead ends, and refines its approach based on intermediate results.
This isn’t entirely new, since o1 already used chain-of-thought reasoning, but o3 apparently does it at a different scale. The model can run reasoning loops for minutes instead of seconds, testing hypotheses and checking its own work. According to the report, o3 used a “high compute” configuration for the 99.2% score, which likely means thousands of reasoning tokens per puzzle.
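The general idea of exploring candidate solutions, backing out of dead ends, and checking each guess against the examples can be sketched with a classic program search. To be clear, this is not OpenAI’s method, which the report does not describe in detail. It is a toy illustration with an assumed set of three grid primitives and hypothetical function names.

```python
from typing import Callable, List, Optional, Tuple

Grid = List[List[int]]
Op = Tuple[str, Callable[[Grid], Grid]]

def flip_h(g: Grid) -> Grid:
    return [row[::-1] for row in g]

def flip_v(g: Grid) -> Grid:
    return [row[:] for row in g[::-1]]

def transpose(g: Grid) -> Grid:
    return [list(col) for col in zip(*g)]

# Assumed primitive set, chosen only for this sketch.
PRIMITIVES: List[Op] = [("flip_h", flip_h), ("flip_v", flip_v), ("transpose", transpose)]

def apply_program(program: List[Op], grid: Grid) -> Grid:
    for _, fn in program:
        grid = fn(grid)
    return grid

def fits(program: List[Op], examples) -> bool:
    """Check the candidate program against every demonstration pair."""
    return all(apply_program(program, inp) == out for inp, out in examples)

def extend(program: List[Op], depth: int, examples) -> Optional[List[Op]]:
    """Depth-first search; returning None makes the caller backtrack."""
    if depth == 0:
        return program if fits(program, examples) else None
    for op in PRIMITIVES:
        found = extend(program + [op], depth - 1, examples)
        if found is not None:
            return found
    return None

def search(examples, max_depth: int = 3) -> Optional[List[str]]:
    """Try short programs first, then allow longer ones (iterative deepening)."""
    for depth in range(1, max_depth + 1):
        found = extend([], depth, examples)
        if found is not None:
            return [name for name, _ in found]
    return None

# Hidden rule: rotate the grid 90 degrees clockwise.
EXAMPLES = [
    ([[1, 2], [3, 4]], [[3, 1], [4, 2]]),
    ([[1, 2, 3]], [[1], [2], [3]]),
]
program = search(EXAMPLES)
print(program)
```

No single primitive explains the examples, so the search rejects every one-step hypothesis before finding a two-step composition that fits. Spending more compute here simply means allowing a larger `max_depth` or a richer primitive set, which hints at why extended reasoning gets expensive quickly.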
The cost? OpenAI hasn’t published pricing yet, but researchers estimate a single ARC-AGI puzzle could cost $10-50 in compute at o3’s current rates. Not exactly practical for everyday use, but that’s not the point right now.
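For a sense of scale, here is the back-of-the-envelope arithmetic for a full benchmark run. It assumes the $10-50 per-puzzle estimate above and the 400-puzzle evaluation set; the report as described here doesn’t say which puzzle set the score was measured on, so treat the totals as rough.

```python
from typing import Tuple

def full_run_cost(per_puzzle_usd: Tuple[int, int], n_puzzles: int) -> Tuple[int, int]:
    """Scale a (low, high) per-puzzle cost estimate to a full benchmark run."""
    low, high = per_puzzle_usd
    return low * n_puzzles, high * n_puzzles

# Assumptions: $10-50 per puzzle (researcher estimate), 400 evaluation puzzles.
low, high = full_run_cost((10, 50), 400)
print(f"Estimated full run: ${low:,} to ${high:,}")  # Estimated full run: $4,000 to $20,000
```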
Why This Actually Matters
ARC-AGI was supposed to be the benchmark that separated narrow AI from general intelligence. The fact that o3 effectively solved it means one of two things: either the benchmark needs an upgrade, or we just crossed a threshold in AI reasoning capabilities.
Chollet himself acknowledged the achievement on X, saying the score “significantly exceeds what we expected to see in 2026” and that his team is already working on ARC-AGI-2, a harder version. But he also pointed out that o3’s approach — burning massive compute per task — isn’t how human intelligence works. We solve these puzzles in seconds with minimal mental effort.

Still, the implications are hard to ignore. If o3 can handle abstract reasoning at this level, what happens when you point it at scientific research, legal analysis, or strategic planning? OpenAI is clearly betting that reasoning-first models are the path to AGI, and this benchmark result backs that up.
When You Can Actually Use It
Don’t rush to the API just yet. o3 is currently in internal testing and won’t hit public beta until Q2 2026 — likely May or June based on OpenAI’s usual rollout patterns. When it does launch, expect tiered access: a cheaper “low compute” version for standard tasks and the full “high compute” mode reserved for complex reasoning where accuracy matters more than speed.
The model will probably slot in above o1 in OpenAI’s lineup, positioned as the go-to for research, analysis, and any task where you need the AI to actually think through a problem rather than pattern-match its way to an answer.
What Comes Next
This result will likely accelerate the arms race. Anthropic’s Claude Opus 4.6 currently leads on many reasoning benchmarks, but it hasn’t posted a public ARC-AGI score yet. Google’s Gemini 2.5 Pro has strong reasoning chops but no published ARC-AGI scores either. Expect both teams to start talking about abstract reasoning a lot more in the next few months.
The bigger question is whether 99.2% on ARC-AGI means we’re close to AGI or just that we got really good at one specific type of test. The fact that Chollet is working on ARC-AGI-2 suggests even he thinks the current version might be getting saturated. But for now, o3 just did something most researchers thought was years away, and that’s worth paying attention to.


