A recent study from research teams at Cohere, Stanford, MIT, and Ai2 has raised serious allegations against LM Arena, the organisation responsible for the popular AI benchmarking platform, Chatbot Arena. The authors claim that LM Arena has provided a select group of leading AI companies—namely Meta, OpenAI, Google, and Amazon—with privileged access to private testing, which has significantly skewed leaderboard scores in their favour.
The researchers assert that these industry giants were allowed to test numerous AI model variants without disclosing the lower scores, thus inflating their rankings on the leaderboard. Sara Hooker, VP of AI Research at Cohere, described the situation as “gamification,” where only a few firms were privy to these testing opportunities, undermining true competition.
Chatbot Arena, launched in 2023 as a research project at UC Berkeley, pits AI models against each other and lets users vote for the best response. While LM Arena has positioned itself as an impartial benchmarking authority, the paper’s findings contradict that claim. The investigation revealed that Meta, for instance, privately tested 27 model variants ahead of the Llama 4 launch, disclosing only the top-performing one to the public.
In response to these allegations, Ion Stoica, co-founder of LM Arena and a professor at UC Berkeley, labelled the study’s conclusions as based on “inaccuracies” and questionable analysis. He emphasised the organisation’s commitment to fair evaluations and invited all model developers to participate in testing to enhance their models’ performance.
The investigation began in late 2024, prompted by suspicions of bias towards specific AI companies. Analysis of over 2.8 million AI model “battles” indicated that LM Arena allowed certain companies to collect disproportionately more data from the platform, giving them an unfair advantage over competitors. The researchers argued that this extra data could significantly boost performance metrics, although LM Arena contested this, maintaining that improvement on one benchmark does not directly translate to another.
Hooker acknowledged that it remains unclear how specific companies gained priority access, but stressed that LM Arena should increase transparency going forward. In ongoing correspondence with the researchers, LM Arena maintained that many of the claims were exaggerated or misrepresented.
The paper’s authors propose several recommendations for improving fairness at LM Arena. They suggest implementing limits on private testing and maintaining transparency about test scores. While LM Arena rejected these proposals, asserting long-standing disclosure policies, it expressed openness to adjusting its sampling rates to ensure equal exposure for all models in future battles.
This scrutiny follows prior incidents where companies, including Meta, were accused of manipulating benchmarks. The arrival of this study raises critical questions about the integrity of private benchmarking entities and whether they can reliably evaluate AI models without influence from corporate interests. As LM Arena gears up to become a commercial entity, these concerns are likely to grow in significance within the AI community.