MLCommons and Hugging Face Collaborate to Launch Extensive Speech Dataset for AI Research

by admin

MLCommons, a nonprofit group focused on AI safety, has collaborated with the AI development platform Hugging Face to unveil one of the largest public domain voice recording collections available for AI research.

Named Unsupervised People’s Speech, this extensive dataset includes over one million hours of audio across at least 89 languages. MLCommons created this resource to foster research and development in “various areas of speech technology.”

“Promoting broader research in natural language processing for languages beyond English plays a crucial role in enhancing communication technologies for a larger global audience,” the organization noted in a blog post published on Thursday. “We foresee numerous paths for the research community to explore further, particularly in enhancing low-resource language speech models, improving speech recognition across various accents and dialects, and creating innovative applications in speech synthesis.”

This objective is certainly commendable. However, using AI datasets such as Unsupervised People’s Speech carries real risks for researchers.

Bias in data is a significant concern. The audio recordings in Unsupervised People’s Speech were sourced from Archive.org, a nonprofit entity renowned for its Wayback Machine web archiving service. Because contributions to Archive.org skew heavily toward English speakers, particularly Americans, nearly all recordings in Unsupervised People’s Speech feature American-accented English, as noted in the readme on the official project page.

This raises potential issues; without diligent filtering, AI systems such as speech recognition and voice synthesis models trained on Unsupervised People’s Speech may inherit similar biases. For instance, they might struggle to accurately transcribe English spoken by non-native speakers or encounter challenges in producing synthetic voices in languages other than English.

Furthermore, Unsupervised People’s Speech might include recordings from individuals unaware that their voices are being used in AI research, including commercial contexts. Although MLCommons affirms that all recordings in the dataset are public domain or covered by Creative Commons licenses, there is always a chance that errors were made.
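One defensive step researchers can take is to filter records by their declared license before training. Below is a minimal sketch in Python; the field names (`license`, `audio_path`) and the allowlist entries are hypothetical, not the actual Unsupervised People’s Speech schema.

```python
# Minimal sketch: keep only recordings whose declared license is in an
# allowlist of public-domain / permissive Creative Commons identifiers.
# Field names ("license", "audio_path") are hypothetical examples, not
# the actual Unsupervised People's Speech metadata schema.

ALLOWED_LICENSES = {"public-domain", "cc0-1.0", "cc-by-4.0", "cc-by-sa-4.0"}

def filter_by_license(records):
    """Return only records carrying an explicitly allowed license tag."""
    kept = []
    for rec in records:
        tag = (rec.get("license") or "").strip().lower()
        if tag in ALLOWED_LICENSES:
            kept.append(rec)
    return kept

records = [
    {"audio_path": "a.flac", "license": "CC0-1.0"},
    {"audio_path": "b.flac", "license": ""},  # missing tag -> excluded
    {"audio_path": "c.flac", "license": "all-rights-reserved"},
]
print([r["audio_path"] for r in filter_by_license(records)])  # ['a.flac']
```

Note that the filter excludes records with missing or empty license tags rather than assuming they are safe, which matches the cautious posture the dataset's known licensing gaps call for.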

An analysis by MIT indicates that many publicly accessible AI training datasets lack appropriate licensing information and contain inaccuracies. Advocates for creators, such as Ed Newton-Rex, the CEO of the AI ethics-focused nonprofit Fairly Trained, argue that creators should not be compelled to “opt out” of AI datasets, as this places an undue burden on them.

“A significant number of creators (for example, those using Squarespace) have no practical means to opt-out,” Newton-Rex stated in a June post on X. “For those who can opt out, the multitude of overlapping opt-out procedures is (1) highly confusing and (2) severely lacking in comprehensive coverage. Even if an ideal universal opt-out existed, it would be extremely unjust to place the burden on creators, especially since generative AI often uses their work to compete with them — many might simply not recognize that they had the option to opt out.”

MLCommons is dedicated to updating, maintaining, and enhancing the quality of Unsupervised People’s Speech. However, due to the potential shortcomings, developers should proceed with considerable caution.

Compiled by Techarena.au.
