
Did xAI Misrepresent the Benchmarks of Grok 3?

by admin

Discussions regarding AI benchmarks and their presentation by AI organizations are becoming increasingly visible in public discourse.

Recently, an OpenAI staff member accused Elon Musk’s AI venture, xAI, of publishing misleading benchmark results for its latest AI model, Grok 3. xAI co-founder Igor Babushkin, however, defended the company’s presentation.

The reality probably lies somewhere in the middle.

In a post on xAI’s blog, the organization released a graph indicating Grok 3’s performance on the AIME 2025, a series of challenging math questions from a recent mathematics invitational. Several experts have critiqued AIME’s effectiveness as an AI benchmark. Nevertheless, AIME 2025 and its earlier variants are frequently utilized to assess a model’s mathematical capabilities.

According to xAI’s graph, two versions of Grok 3, namely Grok 3 Reasoning Beta and Grok 3 mini Reasoning, outperformed OpenAI’s top-performing model, o3-mini-high, on AIME 2025. However, OpenAI team members on X were quick to highlight that xAI’s graph omitted the score of o3-mini-high at “cons@64.”

So, what is cons@64? It’s an abbreviation for “consensus@64,” a method that gives a model 64 attempts at each benchmark question and takes the most frequently generated answer as its final response. This approach typically boosts a model’s benchmark score, so omitting the metric from a chart is misleading: it can make one model appear to outperform another when that may not be the case.
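The consensus@64 idea described above is essentially a majority vote over repeated samples. A minimal sketch in Python (the function name and sample answers are hypothetical, for illustration only):

```python
from collections import Counter

def consensus_answer(answers):
    """Majority vote: return the most frequently generated answer."""
    counts = Counter(answers)
    answer, _ = counts.most_common(1)[0]
    return answer

# 64 hypothetical sampled answers to a single benchmark question
samples = ["204"] * 40 + ["210"] * 15 + ["96"] * 9
assert len(samples) == 64
print(consensus_answer(samples))  # -> 204, the consensus answer
```

Because an occasional wrong answer is outvoted by the model’s more common (often correct) answer, a model’s cons@64 score is usually higher than its single-attempt “@1” score.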

When evaluated at “@1” (the score from a single attempt), Grok 3 Reasoning Beta and Grok 3 mini Reasoning both scored lower than o3-mini-high. Grok 3 Reasoning Beta even slightly trails OpenAI’s o1 model running at medium computing effort. Despite this, xAI continues to promote Grok 3 as the “world’s smartest AI.”

Babushkin stated on X that OpenAI has previously published similarly ambiguous benchmark charts, though those compared its own models against one another. A more neutral observer compiled a graph offering what they argued was a more “accurate” picture of the various models’ performance at cons@64.

AI researcher Nathan Lambert noted in a post that perhaps the most critical metric remains unclear: the computational and financial costs each model incurred to achieve its peak score. This highlights the inadequacy of most AI benchmarks in conveying the true limitations—and strengths—of various models.

Compiled by Techarena.au.