A YouTube content creator is seeking to bring a class action lawsuit against OpenAI, alleging that the company trained its AI models on countless YouTube video transcripts without the consent of the video creators and without compensating them.
In documents submitted to the U.S. District Court for the Northern District of California last Friday, legal representatives for Massachusetts resident David Millette claim OpenAI covertly used transcriptions of Millette’s and others’ YouTube content to enhance the capabilities of ChatGPT, its AI chatbot platform, among other AI-driven tools. This, the lawsuit argues, not only enriched OpenAI on the backs of creators but also violated copyright law and YouTube’s usage policies, which prohibit using such data for external applications.
The lawsuit elaborates, “The advancements in [OpenAI’s] AI offerings, attributable to the harvested training datasets, make these products increasingly appealing to both existing and potential consumers, who sign up for subscriptions to leverage [OpenAI’s] AI solutions.” It further accuses OpenAI of enriching itself through unauthorized use of content, without giving due credit or financial recompense.
Represented by the Bursor and Fisher law firm, Millette is pressing for a trial by jury, along with damages exceeding $5 million on behalf of all YouTube creators affected by OpenAI’s data collection practices.
Unlike humans, generative AI tools such as those by OpenAI do not possess innate intelligence. Instead, they are “trained” on vast amounts of data (films, recordings, text, and so on), learning the patterns in that data and using them to make predictions.
Typically, these AI models are fed data compiled from public internet sources and datasets. Companies argue that fair use permits them to scrape data indiscriminately and use it to train commercial AI models. Numerous copyright owners disagree and have taken legal steps to stop the practice.
As AI models face diminishing sources of new data, transcriptions of videos have emerged as a crucial resource for training.
According to Originality.AI, over 35% of the world’s top 1,000 websites now actively block OpenAI’s web crawlers. A study by MIT’s Data Provenance Initiative found that about 25% of data from high-quality sources is now off-limits to the major datasets used to train AI models. Epoch AI, meanwhile, predicts that the industry might run short of training data between 2026 and 2032.
OpenAI developed its first speech recognition model, Whisper, specifically to transcribe audio from videos and gather more training data, as reported by The New York Times in April. This effort included transcribing over a million hours of YouTube content with Whisper to train OpenAI’s GPT-4 model, which sparked internal discussions about whether the practice complied with YouTube’s terms.
Proof News revealed in July that companies including Anthropic, Apple, Salesforce, and Nvidia had trained AI models on a dataset named The Pile, which includes subtitles from hundreds of thousands of YouTube videos, without the content creators’ consent. Following the backlash, Apple clarified that it had no plans to use these models in its products.
Google, the parent entity of YouTube, has similarly shown interest in employing such transcripts for training its algorithms.
Last year, Google updated its terms of service (ToS), enabling broader use of user data for generative AI development. The earlier terms were vague about whether YouTube data could be used for products beyond the video platform itself, and the update makes Google’s position considerably clearer.
We are awaiting comments from OpenAI and Google regarding the class action lawsuit and will provide updates accordingly.
Meanwhile, OpenAI is facing separate scrutiny this month from a lawsuit filed on Monday by Tesla and X CEO Elon Musk. In it, Musk accuses OpenAI of abandoning its non-profit roots to prioritize commercial customers and alleges that this constitutes racketeering activity. The suit follows similar claims Musk made in an earlier lawsuit against OpenAI, filed in February.
Compiled by Techarena.au.