
DeepSeek R1 Is Beating o3 on Reasoning Benchmarks — and Costs a Fraction of the Price

DeepSeek R1 matches OpenAI’s o3 on AIME and MATH benchmarks while costing roughly 90% less to run — and the weights are open-source.


DeepSeek’s R1 model arrived in January 2025 and immediately made OpenAI’s pricing team nervous. The Chinese AI lab released full benchmark comparisons showing R1 performing at or above OpenAI’s o3 on some of the hardest mathematical reasoning tests available — while charging developers dramatically less to run it. For a field that spent 2024 convinced only the most expensive models could handle serious reasoning tasks, that’s a meaningful data point.

DeepSeek is a research lab based in Hangzhou that has built a reputation for getting frontier-level performance from architectures that don’t require frontier-level compute budgets. R1 is the clearest expression of that philosophy yet: a model built specifically for chain-of-thought reasoning that competes directly with o3, the reasoning model OpenAI announced in December 2024.

The Benchmark Numbers

OpenAI set the baseline when o3 launched: 92.3% on AIME 2024 and 96.7% on the MATH benchmark. Those numbers were genuinely impressive and positioned o3 as the reasoning model to beat. DeepSeek R1’s published benchmarks put it in the same tier on AIME and competitive on MATH — close enough that the performance gap, if one exists, is not the story. The cost gap is.

Running inference on o3 is expensive. The model generates substantial chains of reasoning tokens before producing an answer, and OpenAI prices accordingly: o3 sits at the top of OpenAI’s API price tiers. DeepSeek R1, available via DeepSeek’s own API, costs roughly 90% less per token for comparable reasoning tasks. For developers building applications that run thousands of reasoning queries, that difference scales with volume and quickly moves from “interesting” to “company-altering.”
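To make the scaling concrete, here is a minimal arithmetic sketch. The baseline price and the tokens-per-query figure are placeholders chosen purely for illustration, not real prices from either vendor; only the "roughly 90% less" ratio comes from the article, and `inference_cost` is a hypothetical helper.

```python
def inference_cost(queries: int, tokens_per_query: int,
                   price_per_million_tokens: float) -> float:
    """Total cost in dollars for a batch of reasoning queries."""
    total_tokens = queries * tokens_per_query
    return total_tokens / 1_000_000 * price_per_million_tokens


# Assumed inputs (placeholders, not published prices).
BASELINE_PRICE = 10.0      # dollars per million tokens, hypothetical
TOKENS_PER_QUERY = 5_000   # reasoning models emit long chains, hypothetical
DISCOUNT = 0.90            # the article's "roughly 90% less"

cheaper_price = BASELINE_PRICE * (1 - DISCOUNT)

for queries in (1_000, 100_000, 10_000_000):
    baseline = inference_cost(queries, TOKENS_PER_QUERY, BASELINE_PRICE)
    cheaper = inference_cost(queries, TOKENS_PER_QUERY, cheaper_price)
    print(f"{queries:>10,} queries: ${baseline:,.2f} vs ${cheaper:,.2f} "
          f"(saves ${baseline - cheaper:,.2f})")
```

The ratio between the two columns stays fixed at every volume; what changes is the absolute saving, which is why the gap matters more as query counts grow.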

Open-Source Changes the Equation

What makes R1 particularly disruptive isn’t just the API pricing. DeepSeek also released the model weights openly, so developers and researchers can download R1, run it on their own infrastructure, and pay no per-token API fees (the hardware and operating costs become theirs instead). OpenAI’s o3 is a closed model: you use it through the API, on OpenAI’s terms, at OpenAI’s prices. R1 can run on a server you control, in a country with different data regulations, without any ongoing dependency on a single vendor.

This matters beyond the obvious cost argument. Enterprise buyers who can’t send sensitive data to OpenAI’s servers have a credible alternative that benchmarks suggest performs at a comparable level. Academic researchers with limited compute budgets can run state-of-the-art reasoning experiments without burning grant money on API calls. The open-source release didn’t just undercut o3’s price — it removed the entire pricing conversation for a significant class of users.

A Fair Caveat Worth Stating

Benchmark comparisons between models from different labs deserve some scrutiny. Evaluation setups differ — the number of reasoning attempts allowed, temperature settings, prompting strategies, and hardware configurations all affect final scores. DeepSeek published its benchmark methodology alongside the results, but independent verification from third parties is still ongoing as of early March 2025. The AIME and MATH scores are close enough between R1 and o3 that small methodological differences could shift the “winner” either direction. The cost differential, however, is not close — that gap is large enough to survive any reasonable margin of error.
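One way to see how much the number of allowed attempts matters is the pass@k metric commonly used in model evaluation. The sketch below uses the standard unbiased estimator, 1 - C(n-c, k) / C(n, k), for n sampled answers of which c are correct. It is an illustration only: the sample counts are made up, and nothing here describes the setup either lab actually used.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k answers drawn (without
    replacement) from n samples is correct, given c correct samples."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k wrong samples exist,
    # which correctly yields a score of 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical problem: 10 sampled answers, 3 of them correct.
for k in (1, 5, 10):
    print(f"pass@{k} = {pass_at_k(10, 3, k):.3f}")
```

The same underlying model output scores very differently depending on k, so two headline numbers are only comparable when both were measured with the same attempt budget.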

It’s also worth noting that o3-mini exists as OpenAI’s own lower-cost reasoning option, and it closes some of the price gap. But o3-mini trades capability for cost, whereas R1 is making a claim that you don’t have to make that trade at all.

What This Means for the Reasoning Race

The reasoning model arms race just got a new entrant that doesn’t play by the same economic rules. If DeepSeek’s benchmark claims hold up under independent evaluation — and early community testing suggests they largely do — then the assumption that cutting-edge reasoning requires cutting-edge spending is officially up for debate. OpenAI’s o3 remains a powerful model with strong ecosystem integration, but “best reasoning model” and “only viable reasoning model” are increasingly different categories. Developers who were waiting for a credible alternative to expensive closed reasoning APIs now have one that comes with weights attached.

promptyze
