The o3 Leak That Asks an Uncomfortable Question: Is 10x the Compute Worth It?

Leaked OpenAI documents suggest o3 barely beats o1 on most reasoning tasks while costing 10x more to run. Public benchmarks arrive in March.

A set of internal OpenAI documents, reportedly obtained by Substack-based AI researchers, was circulating in the AI community as of February 23, 2026, and the numbers are not the triumph OpenAI’s pricing would imply. According to the leaked materials, o3 scores only marginally higher than o1 across most standard reasoning benchmarks, while requiring roughly ten times the compute to get there. The model’s standout performance, the documents suggest, is confined almost entirely to specialized mathematics and advanced physics problems.

OpenAI declined to comment on the documents. The company has said it plans to release public benchmark results by March 2026.

What the Numbers Actually Say

The central claim from the leaked materials is that o3’s gains over o1 are real but narrow on general reasoning tasks — the kind enterprises actually care about day-to-day: legal document analysis, code review, business logic, structured data interpretation. On those workloads, the gap between the two models appears thin enough that many organizations running o1 in production would struggle to justify upgrading on performance grounds alone.

Where o3 does pull ahead is on the kind of hard-science problem sets that populate olympiad-style benchmarks — the ones AI labs love to headline because the numbers look dramatic. On competition math and graduate-level physics, o3 reportedly delivers a meaningful step up. Whether that matters to an enterprise paying premium API rates for a customer support pipeline is, charitably, a different question.

The Compute Problem Nobody Wants to Talk About

The tenfold compute figure is the detail that should give enterprise buyers pause. Inference cost isn’t abstract: it flows directly into API pricing, latency, and the economics of running AI at scale. If o3’s edge over o1 is narrow on most tasks, operators are paying a significant premium for improvements that may never surface in their own workloads.
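To make that trade-off concrete, here is a minimal Python sketch of a break-even check. It rests on assumptions the leaked documents do not confirm: that price per call scales in proportion to compute, and that a deployment cares only about cost per successful result. The function names and every number passed to them are hypothetical, chosen for illustration rather than drawn from any benchmark.

```python
def cost_per_success(cost_per_call: float, success_rate: float) -> float:
    """Expected spend per successful result, assuming failed calls are discarded."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_call / success_rate


def break_even_success_rate(baseline_rate: float, cost_ratio: float) -> float:
    """Success rate the pricier model needs to match the baseline's cost per success.

    Derived from: cost / baseline_rate == (cost * cost_ratio) / required_rate.
    """
    return baseline_rate * cost_ratio


def can_break_even(baseline_rate: float, cost_ratio: float) -> bool:
    """True if the required success rate is achievable at all (at most 100%)."""
    return break_even_success_rate(baseline_rate, cost_ratio) <= 1.0


if __name__ == "__main__":
    # Hypothetical inputs: an assumed 10x price ratio and two assumed baselines.
    for baseline in (0.80, 0.05):
        needed = break_even_success_rate(baseline, 10)
        print(f"baseline {baseline:.0%}: needs {needed:.0%}, "
              f"achievable={can_break_even(baseline, 10)}")
```

Under these assumptions, a tenfold price can only pay for itself on tasks where the cheaper model succeeds less than 10% of the time. That is consistent with the article's point that the premium makes sense mainly on problems the older model largely cannot solve. The sketch ignores latency and the value of tasks only the stronger model can complete, so treat it as a rough screen, not a pricing model.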

This is not a new pattern in AI releases. Labs have historically led with benchmark headlines that reflect best-case performance on curated test sets, while real-world deployment tells a messier story. The o3 situation, if the leaked documents are accurate, fits that pattern uncomfortably well.

How Much Should You Trust a Leaked Memo?

Leaked internal documents are tricky. They can be genuine, selectively excerpted, outdated drafts, or some combination of the three. OpenAI’s silence on the matter doesn’t confirm or deny anything; companies routinely decline to authenticate leaked materials regardless of their accuracy. The March benchmark release OpenAI has pointed to will be the real test: if the public numbers tell a different story than the leaked ones, either the documents were outdated or incomplete, or the public benchmarks were chosen differently; if they align, the leak’s account of narrow gains at high cost stands.

What makes this particular leak credible enough to take seriously is its specificity. Vague claims about one model being “better” than another are easy to fabricate. Claims about compute ratios and performance breakdowns by task category are harder to invent convincingly, and harder still to get exactly right by accident.

What This Means for Enterprise Buyers Right Now

If you’re currently running o1 and considering an o3 upgrade, the honest answer based on what’s been reported is: wait for March. OpenAI’s forthcoming public benchmarks will either validate the leaked numbers or refute them, and either outcome tells you something useful. If o3 genuinely shines only on hard math and physics, and your workload involves neither, the case for paying the premium is weak. If the public benchmarks show broader gains the leaked docs missed, that’s a real reason to reassess.

For the broader AI market, this episode is a useful reminder that compute scaling and capability scaling are not the same thing, a lesson the industry keeps learning and keeps forgetting. Spending ten times more to run a model is only rational if the extra value it delivers covers the extra cost, and on most enterprise tasks, that bar remains stubbornly uncleared.

promptyze
