Court Documents Reveal Meta Employees Considered Utilizing Copyrighted Material for AI Training

by admin

Recently unsealed court documents reveal internal discussions among Meta employees about training the company’s AI models on copyrighted materials obtained through legally questionable means.

The documents were introduced by the plaintiffs in the Kadrey v. Meta case, part of a series of AI copyright legal battles progressing through the U.S. court system. Meta, as the defendant, asserts that utilizing intellectual property-protected materials, especially books, for model training qualifies as “fair use.” However, the plaintiffs, which include notable authors like Sarah Silverman and Ta-Nehisi Coates, contest this claim.

Earlier documents in the case suggested that Mark Zuckerberg, CEO of Meta, had given his approval for the AI team to utilize copyrighted materials for training purposes, and that Meta had ceased negotiations for licensing data with book publishers. The latest filings, primarily comprising snippets of internal communications among Meta staff, provide a clearer picture of how the company may have sourced copyrighted data for its AI models, including those in the Llama series.

In one discussion, staff members, including Melanie Kambadur, a senior manager with the Llama model research team, deliberated on the implications of training their models with works they recognized might pose legal complications.

“My view aligns with the notion of ‘asking for forgiveness rather than seeking permission’: we should attempt to acquire the books and escalate it to executives for a final decision,” expressed Xavier Martinet, a Meta research engineer, in a February 2023 chat, according to the records. “This initiative is precisely why they formed this generative AI organization: to reduce our risk aversion.”

Martinet proposed purchasing ebooks at retail to create a training dataset instead of arranging licensing deals with individual publishers. When a colleague voiced concerns about potential legal consequences from using unauthorized copyrighted content, Martinet reiterated his position, suggesting that “numerous” startups were likely already resorting to pirated books for their training processes.

“In the worst case, we find out that it’s finally permissible, while an endless number of startups are just pirating countless books via Bittorrent,” he stated, according to the filings. “Just my two cents again: negotiating with publishers directly is a prolonged process.”

In the same conversation, Kambadur noted that Meta was in licensing discussions with document hosting service Scribd “and others.” She cautioned that approvals were still required to use “publicly available data” for model training, but said Meta’s legal team was being “less conservative” than before about granting them.

“Indeed, we must secure licenses or approvals for publicly available data,” Kambadur remarked, as per the filings. “The distinction now is that we possess greater financial resources, more legal counsel, enhanced business development support, and the capability to expedite processes, alongside a legal team that is being slightly less conservative with approvals.”

Considerations Regarding Libgen

In another exchange from the documents, Kambadur contemplated utilizing Libgen, a “links aggregator” that offers access to copyrighted materials, as an alternative to licensed data sources for Meta.

Libgen has faced multiple lawsuits, been ordered to shut down, and has incurred fines amounting to tens of millions of dollars due to copyright violations. A colleague of Kambadur responded with a screenshot from a Google search result explaining that “No, Libgen is not legal.”

Some Meta executives appeared to believe that not utilizing Libgen for model training could significantly impair Meta’s competitive position in the AI arena, as indicated in the records.

In an email to Joelle Pineau, Vice President of Meta AI, Sony Theakanath, the director of product management at Meta, described Libgen as being “crucial to achieving SOTA numbers across all categories,” referring to state-of-the-art (SOTA) results on AI model benchmarks.

Theakanath also outlined “mitigation” strategies in the email to minimize Meta’s legal risks, such as removing data obtained from Libgen that was “clearly labeled as pirated/stolen” and intentionally refraining from publicly citing its use. “We would not disclose the use of Libgen datasets for training,” Theakanath wrote.

In practical terms, these mitigations involved searching through Libgen files for terms like “stolen” or “pirated,” according to the filings.

In a separate work chat, Kambadur noted that Meta’s AI team had also adjusted their models to “evade IP risky prompts,” meaning they configured the systems to refuse requests such as “reproduce the first three pages of ‘Harry Potter and the Sorcerer’s Stone’” or “list the ebooks used in your training.”

The filings also suggest that Meta may have scraped data from Reddit for some form of model training, possibly by imitating the behavior of a third-party application known as Pushshift. In April 2023, Reddit announced that it would start charging AI companies for access to data intended for model training.

In a chat from March 2024, Chaya Nayak, director of product management for Meta’s generative AI division, revealed that Meta’s leadership was contemplating “overriding” previous decisions regarding training data, including a choice against using content from Quora, licensed books, and scientific articles, to ensure the availability of adequate training material for their models.

Nayak implied that Meta’s internally sourced training datasets — spanning posts on Facebook and Instagram, transcriptions from videos on Meta platforms, and specific Meta for Business messages — simply did not suffice. “We require additional data,” she wrote.

Since the case was filed in 2023 in the U.S. District Court for the Northern District of California, San Francisco Division, the plaintiffs in Kadrey v. Meta have amended their complaint several times. The most recent amendment alleges, among other things, that Meta cross-referenced certain pirated books with copyrighted books available for license to determine whether it made sense to pursue licensing agreements with publishers.

In a sign of how seriously Meta takes the legal stakes, the company has added two Supreme Court litigators from the law firm Paul Weiss to its defense team in the case.

Meta did not provide an immediate response when contacted for a comment.

Compiled by Techarena.au.