
Did xAI Misrepresent the Benchmarks of Grok 3?

by admin

Discussions regarding AI benchmarks and their presentation by AI organizations are becoming increasingly visible in public discourse.

Recently, an OpenAI staff member accused Elon Musk’s AI venture, xAI, of publishing misleading benchmark results for its latest AI model, Grok 3. xAI co-founder Igor Babushkin, however, defended the company’s presentation.

The reality probably lies somewhere in the middle.

In a post on xAI’s blog, the organization released a graph indicating Grok 3’s performance on the AIME 2025, a series of challenging math questions from a recent mathematics invitational. Several experts have critiqued AIME’s effectiveness as an AI benchmark. Nevertheless, AIME 2025 and its earlier variants are frequently utilized to assess a model’s mathematical capabilities.

According to xAI’s graph, two versions of Grok 3, namely Grok 3 Reasoning Beta and Grok 3 mini Reasoning, outperformed OpenAI’s top-performing model, o3-mini-high, on AIME 2025. However, OpenAI team members on X were quick to highlight that xAI’s graph omitted the score of o3-mini-high at “cons@64.”

So, what is cons@64? It’s an abbreviation for “consensus@64,” a method that gives a model 64 attempts at each benchmark question and takes the most frequently generated answer as its final response. This approach typically boosts a model’s benchmark score, so omitting the metric from a chart is misleading: it can make one model appear to outperform another when that may not be the case.
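The consensus@64 idea described above is essentially a majority vote over repeated samples. A minimal sketch in Python (the function name and sample answers are hypothetical, for illustration only):

```python
from collections import Counter

def consensus_answer(answers):
    """Majority vote: return the most frequently generated answer."""
    counts = Counter(answers)
    answer, _ = counts.most_common(1)[0]
    return answer

# 64 hypothetical sampled answers to a single benchmark question
samples = ["204"] * 40 + ["210"] * 15 + ["96"] * 9
assert len(samples) == 64
print(consensus_answer(samples))  # -> 204, the consensus answer
```

Because an occasional wrong answer is outvoted by the model’s more common (often correct) answer, a model’s cons@64 score is usually higher than its single-attempt “@1” score.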

When evaluated at “@1” (the score from a single attempt), Grok 3 Reasoning Beta and Grok 3 mini Reasoning both scored lower than o3-mini-high. Grok 3 Reasoning Beta even slightly trails OpenAI’s o1 model running at medium computing effort. Despite this, xAI continues to promote Grok 3 as the “world’s smartest AI.”

Babushkin stated on X that OpenAI has previously published similarly ambiguous benchmark charts, though those compared its own models against one another. A more neutral observer compiled a graph offering what they argued was a more “accurate” picture of the various models’ performance at cons@64.

AI researcher Nathan Lambert noted in a post that perhaps the most critical metric remains unclear: the computational and financial costs each model incurred to achieve its peak score. This highlights the inadequacy of most AI benchmarks in conveying the true limitations—and strengths—of various models.

Compiled by Techarena.au.