Every generative AI model, from Google’s Gemini and Anthropic’s Claude to OpenAI’s latest GPT-4o, hallucinates: it invents information, producing outputs that range from entertainingly wrong to problematic fabrications.
These models don’t all fabricate at the same rate, however, and the nature of their inaccuracies varies with the data they were trained on.
A study by researchers from Cornell, the University of Washington, the University of Waterloo, and the nonprofit AI2 compared the frequency of inaccuracies in models like GPT-4o against trusted sources in fields such as law, health, and history. It found that no model was highly accurate across all areas, and that the more accurate models were typically the ones that sidestepped questions they might otherwise answer incorrectly.
“The core insight from our research is that we’re still a long way from being able to completely trust these AI-generated responses,” said Wenting Zhao of Cornell, one of the study’s authors. “Currently, the top-performing models only manage to produce error-free responses about 35% of the time.”
While other academic work has explored model accuracy, Zhao notes that previous assessments tended to focus on easily verifiable questions, a far cry from the complex queries models face in the real world.
For their more rigorous benchmark, the researchers included questions whose answers cannot be found on Wikipedia, challenging the models with real-world queries spanning topics from popular culture to the sciences. They evaluated more than a dozen models, both newly released and established, to gauge their performance.
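As a rough illustration of what such a benchmark involves, the sketch below poses questions to a model, scores its answers against trusted reference answers, and tracks abstentions separately. Every name in it (ask_model, REFERENCE_ANSWERS) is hypothetical and stands in for the study’s actual evaluation harness, which is far more sophisticated.

```python
# Illustrative sketch of a factuality-benchmark loop: ask, collect, score.
# All names and data here are placeholders, not the study's actual code.

from dataclasses import dataclass


@dataclass
class Result:
    question: str
    answer: str
    correct: bool | None  # None when the model abstains

# Illustrative reference data; a real benchmark draws on vetted sources.
REFERENCE_ANSWERS = {
    "Who wrote 'The Left Hand of Darkness'?": "Ursula K. Le Guin",
}


def ask_model(question: str) -> str:
    """Stand-in for a call to an LLM API; returns the model's raw answer."""
    return "Ursula K. Le Guin"  # placeholder response


def evaluate(questions: dict[str, str]) -> list[Result]:
    results = []
    for question, reference in questions.items():
        answer = ask_model(question)
        if answer.strip().lower() in {"i don't know", "i'm not sure"}:
            # Abstentions are tracked but not scored as right or wrong.
            results.append(Result(question, answer, correct=None))
        else:
            # Naive substring match; real factuality scoring is far subtler.
            results.append(
                Result(question, answer, reference.lower() in answer.lower())
            )
    return results


if __name__ == "__main__":
    scored = evaluate(REFERENCE_ANSWERS)
    answered = [r for r in scored if r.correct is not None]
    accuracy = sum(r.correct for r in answered) / max(len(answered), 1)
    print(f"Accuracy on answered questions: {accuracy:.0%}")
```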
The findings showed that hallucination remains a persistent problem across models, undercutting the progress claimed by leading AI developers like OpenAI and Anthropic.
In the benchmark tests, GPT-4o proved only slightly more accurate than the older GPT-3.5. OpenAI’s models were the best overall at minimizing inaccuracies, followed by models like Mixtral 8x22B and Command R.
Certain topics, particularly finance and celebrity news, proved challenging for the models, while topics well documented in training data, like geography and computer science, saw higher accuracy rates. The pattern suggests that all of the models are heavily anchored to Wikipedia.
Despite expectations of steady improvement, Zhao warns that hallucination is likely to remain a significant challenge.
“Our findings show that while there are methods to lessen the occurrence of such errors, the overall capacity for improvement is modest,” Zhao remarked. “Furthermore, our study reveals the inherent conflict in online information, exacerbated by the discrepancies found in human-authored training datasets.”
One stopgap measure would be to program models to decline to answer when they are uncertain, akin to teaching a presumptuous person to keep unnecessary comments to themselves.
In the researchers’ tests, for example, Claude 3 Haiku declined to answer nearly 28% of the questions, which made it the most factual model tested in terms of what it did say.
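Claude’s refusals come from its training rather than a bolt-on filter, but as a rough illustration of the idea, a developer could gate answers on the model’s own token probabilities. Below is a minimal sketch, assuming an API that exposes token log-probabilities (as OpenAI’s chat completions API does via its logprobs option); the threshold value and helper name are hypothetical.

```python
# Minimal sketch of confidence-gated abstention. The threshold is an assumed
# cutoff; tuning it trades answer coverage for accuracy.

import math

ABSTAIN_THRESHOLD = 0.75  # illustrative value, not from the study


def answer_or_abstain(token_logprobs: list[float], answer: str) -> str:
    """Return the answer only if the model's average token probability is high."""
    avg_prob = math.exp(sum(token_logprobs) / len(token_logprobs))
    if avg_prob < ABSTAIN_THRESHOLD:
        return "I'm not confident enough to answer that."
    return answer


# A confidently produced answer passes; a shaky one is withheld.
print(answer_or_abstain([-0.05, -0.02, -0.1], "Paris"))        # returned
print(answer_or_abstain([-1.2, -0.9, -2.3], "Quito, maybe?"))  # abstains
```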
Yet the usefulness of a model that frequently declines to respond is open to question. Zhao believes developers should prioritize research into methods that curb hallucination, even if eliminating it entirely proves unrealistic. She also advocates involving human experts in validating AI-generated information, and calls for policies and tools to ensure the accuracy and trustworthiness of AI output.