This Week in AI: Perhaps It's Time to Set Aside AI Benchmarks for the Moment

Welcome to TechCrunch’s AI newsletter! We’re taking a break for a while, but you can catch all our AI articles, including my columns, daily insights, and breaking news, at TechCrunch. If you want these updates and much more delivered to your inbox daily, sign up for our newsletters here.

This week, billionaire Elon Musk’s AI venture, xAI, unveiled its latest flagship model, Grok 3, which powers the company’s Grok chatbot applications. Trained on approximately 200,000 GPUs, the model outperforms several top competitors, including models from OpenAI, on benchmarks in mathematics, programming, and other areas.

But what exactly do these benchmarks signify?

At TC, we often share benchmark figures, if with some skepticism, because they are one of the few (albeit limited) standardized ways the AI industry measures model progress. Popular AI benchmarks typically test obscure knowledge and produce aggregate scores that do not necessarily reflect proficiency in the tasks most relevant to users.

As Wharton professor Ethan Mollick noted in recent posts on X following the launch of Grok 3 on Monday, there’s a pressing need for improved testing methods and independent evaluators. Many AI companies tend to self-report benchmark outcomes, as Mollick highlighted, rendering these results harder to accept at face value.

“Public benchmarks are often lackluster and oversaturated, making AI evaluations akin to subjective food reviews,” Mollick stated. “If AI is essential to our work, we require more robust evaluations.”

There’s no shortage of independent evaluations and organizations aiming to introduce new benchmarks for AI, but their overall value remains a topic of debate within the industry. Some AI analysts suggest aligning benchmarks with economic outcomes to heighten their relevance, while others advocate for adoption and utility as the ultimate measures of success.

This debate may well go on forever. Perhaps we should, as X user Roon suggests, simply pay less attention to new models and benchmarks unless there is a major technical breakthrough in AI. For our collective peace of mind, that might not be the worst idea, even if it brings on some fear of missing out.

As previously mentioned, This Week in AI is going on pause. Thank you for sticking with us, readers, through this exhilarating journey. Until next time!

News

OpenAI seeks to “uncensor” ChatGPT: Max has written about OpenAI’s shift in its AI development strategy to actively promote “intellectual freedom,” regardless of how controversial or difficult the subject may be.

Mira’s new venture: Former OpenAI CTO Mira Murati has launched her new enterprise, Thinking Machines Lab, which aims to build tools that make AI work for people’s individual needs and goals.

Introducing Grok 3: xAI, the AI startup founded by Elon Musk, has released Grok 3 and showcased new features within its Grok apps for iOS and the web.

A major Llama conference: Meta is set to host its inaugural developer conference focused on generative AI this spring. Named LlamaCon after Meta’s Llama family of generative AI models, the event will take place on April 29.

AI and Europe’s digital independence: Paul has highlighted OpenEuroLLM, a consortium of around 20 organizations collaborating to create “foundation models for transparent AI in Europe,” aimed at safeguarding the “linguistic and cultural diversity” of all European Union languages.

Research Paper of the Week

OpenAI researchers have introduced a new AI benchmark called SWE-Lancer, designed to assess the coding capabilities of advanced AI systems. This benchmark includes over 1,400 freelance software engineering tasks that cover a variety of responsibilities ranging from minor bug fixes and feature deployments to complex “manager-level” technical proposals.

OpenAI claims that the best-performing model, Anthropic’s Claude 3.5 Sonnet, scores 40.3% on the full SWE-Lancer benchmark, indicating that AI still has a considerable way to go. It’s also worth mentioning that the researchers did not assess newer models such as OpenAI’s o3-mini or R1 from Chinese AI firm DeepSeek.

Model of the Week

A Chinese AI firm named Stepfun has launched an “open” AI model, Step-Audio, which can comprehend and generate speech in multiple languages. Step-Audio supports Chinese, English, and Japanese, and it lets users adjust the emotion and even the dialect of the synthetic audio it produces, including singing.

Stepfun is among several well-capitalized Chinese AI startups that are releasing models under a permissive license. Established in 2023, Stepfun recently completed a funding round totaling several hundred million dollars, backed by various investors, including Chinese state-owned private equity firms.

Grab Bag

Nous Research, an AI research organization, has launched what it claims to be one of the first AI models that merges reasoning with “intuitive language model capabilities.”

The model, DeepHermes-3 Preview, can toggle long “chains of thought” on and off, trading greater computational cost for improved accuracy. In “reasoning” mode, DeepHermes-3 Preview, like other reasoning AI models, spends longer working through complex problems and outlines its thought process on the way to an answer.

Anthropic reportedly intends to release a structurally similar model soon, and OpenAI has indicated that such a model is also on its immediate roadmap.

Compiled by Techarena.au.