OpenAI o3 Benchmarks Drop: Superhuman Performance on Coding

The ARC-AGI Benchmark

The ARC (Abstraction and Reasoning Corpus) benchmark, created by AI researcher François Chollet, was designed to test general intelligence rather than narrow pattern recognition. Each ARC task is a novel visual puzzle: the solver sees a few example input and output grids and must infer the rule that connects them. Chollet argued that this kind of flexible abstraction requires genuine general intelligence rather than statistical pattern matching on training data. Before o3's preview, frontier language models had scored poorly on ARC-AGI despite near-perfect results on many other benchmarks, and even the best purpose-built entries in the 2024 ARC Prize competition reached only the mid-50% range.
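To make the puzzle format concrete, here is a minimal sketch of an ARC-style task. The puzzle, the three candidate rules, and the `infer_rule` helper are invented for illustration and are not taken from the benchmark. Only the general layout, "train" and "test" lists of input and output grids of small integers, follows the published ARC data format. Real ARC tasks are far harder because the rule is not drawn from a short, known list.

```python
from typing import Callable, Dict, List, Optional

Grid = List[List[int]]  # each cell is a colour code, 0-9 in ARC


def flip_vertical(grid: Grid) -> Grid:
    """Reverse the order of the rows."""
    return grid[::-1]


def flip_horizontal(grid: Grid) -> Grid:
    """Reverse each row (mirror left to right)."""
    return [row[::-1] for row in grid]


def transpose(grid: Grid) -> Grid:
    """Swap rows and columns."""
    return [list(col) for col in zip(*grid)]


# Hypothetical hypothesis space; a real solver cannot enumerate rules like this.
CANDIDATES: Dict[str, Callable[[Grid], Grid]] = {
    "flip_vertical": flip_vertical,
    "flip_horizontal": flip_horizontal,
    "transpose": transpose,
}

# Hand-made toy puzzle (NOT a real ARC task): the hidden rule is a mirror image.
toy_task = {
    "train": [
        {"input": [[1, 0, 0], [0, 2, 0]], "output": [[0, 0, 1], [0, 2, 0]]},
        {"input": [[3, 3, 0], [0, 0, 4]], "output": [[0, 3, 3], [4, 0, 0]]},
    ],
    "test": [{"input": [[5, 0], [0, 6]]}],
}


def infer_rule(task: dict) -> Optional[str]:
    """Return the first candidate rule consistent with every training pair."""
    for name, fn in CANDIDATES.items():
        if all(fn(pair["input"]) == pair["output"] for pair in task["train"]):
            return name
    return None


rule = infer_rule(toy_task)
prediction = CANDIDATES[rule](toy_task["test"][0]["input"]) if rule else None
print(rule, prediction)
```

The sketch works only because the correct rule happens to be in the candidate list. ARC is built so that each puzzle needs a rule the solver has not seen before, which is why brute-force enumeration and memorisation both struggle.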

o3's Score

OpenAI previewed o3's performance on December 20, 2024, reporting 87.5% on the ARC-AGI semi-private evaluation set in a high-compute configuration and 75.7% with a lower compute budget. Chollet himself acknowledged that the results exceeded what he had expected any current system to achieve, describing them as a step change in AI capabilities rather than more of the same pattern recognition, while cautioning that a high ARC-AGI score does not by itself amount to AGI.

Other Benchmark Results

Alongside ARC-AGI, OpenAI reported o3 scores of 96.7% on AIME 2024 (a near-perfect result on an invitational US high-school mathematics competition), 87.7% on GPQA Diamond (graduate-level science questions), and 71.7% on SWE-bench Verified, a significant improvement over o1's 48.9% on the same benchmark. The SWE-bench score in particular had direct commercial implications: on a set of 500 human-validated GitHub issues from open-source Python projects, o3 produced a passing fix for more than 70% of them. That is a benchmark result, not a guarantee of the same success rate on arbitrary production bugs.
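The size of the SWE-bench jump depends on how it is measured. The short sketch below uses only the two scores quoted above; the variable names are illustrative. It separates the absolute gain in percentage points from the relative gain and from the reduction in unresolved issues.

```python
# Scores reported by OpenAI on SWE-bench Verified (percent of issues resolved).
o1_score = 48.9
o3_score = 71.7

# Absolute gain, in percentage points.
absolute_gain_pp = o3_score - o1_score

# Relative gain: how much larger o3's resolve rate is than o1's.
relative_gain_pct = absolute_gain_pp / o1_score * 100

# Reduction in the share of issues left unresolved.
o1_unresolved = 100 - o1_score
o3_unresolved = 100 - o3_score
unresolved_reduction_pct = (o1_unresolved - o3_unresolved) / o1_unresolved * 100

print(f"Absolute gain: {absolute_gain_pp:.1f} percentage points")
print(f"Relative gain: {relative_gain_pct:.1f}%")
print(f"Unresolved issues cut by: {unresolved_reduction_pct:.1f}%")
```

Quoting the gain as 22.8 percentage points, as a roughly 47% relative improvement, or as a roughly 45% cut in unresolved issues describes the same two numbers, so it is worth checking which figure a headline is using.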

The AGI Debate

The results provoked a wave of commentary from AI researchers. Skeptics, including Gary Marcus, argued the scores reflected sophisticated pattern matching rather than genuine reasoning, and pointed out that the tested version of o3 had been trained on ARC-AGI's public training tasks. Others, including several OpenAI-affiliated researchers, argued the ARC-AGI performance was qualitatively different from previous benchmark results. The debate remained unresolved, but the measured results were hard to dismiss: whatever its underlying mechanism, o3 scored well on tasks that earlier models had largely failed.

What This Means for Indian Businesses

The o3 benchmark results are significant news for Indian AI researchers and the broader Indian tech community. The 87.5% ARC-AGI score means an AI system has cleared a benchmark specifically designed to resist pattern matching on training data, although a high score on one test is not proof of human-like general reasoning. For Indian AI policymakers and industry leaders, the results bring forward the timeline for considering AI's impact on professional knowledge work: legal, financial, medical, and engineering analysis.