The German research organization LAION, known for creating the dataset used to train Stable Diffusion and other AI models, has released a new dataset that it says has been scrubbed of known links to suspected child sexual abuse material (CSAM).
The updated collection, called Re-LAION-5B, is a reworked version of the original LAION-5B dataset, cleaned up with guidance from organizations including the Internet Watch Foundation, Human Rights Watch, the Canadian Centre for Child Protection, and the now-defunct Stanford Internet Observatory. It comes in two variants: Re-LAION-5B Research and Re-LAION-5B Research-Safe, the latter of which also removes additional NSFW content. LAION says both were filtered to exclude thousands of links to known and suspected CSAM.
“From its inception, LAION has been committed to removing illegal content from its datasets and has implemented appropriate measures from the outset,” LAION wrote in a blog post. “LAION strictly adheres to the principle that illegal content is removed as soon as it becomes known.”
It’s important to note that LAION’s datasets do not — and never did — contain images themselves. Rather, they are indexes of links to images paired with their alt text, all drawn from a separate dataset, Common Crawl, which scrapes content from a vast range of websites and web pages.
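To make the distinction concrete, here is a minimal sketch of what a LAION-style record contains. The field names and URLs are illustrative assumptions, not the dataset's actual schema: each row holds a link and a caption, and fetching the image bytes is always a separate step performed against the linked site.

```python
# Hypothetical sketch of LAION-style rows: metadata only, no image data.
# Field names ("url", "text") and URLs are assumptions for illustration.
sample_rows = [
    {"url": "https://example.com/cat.jpg", "text": "a photo of a cat"},
    {"url": "https://example.com/dog.png", "text": "a dog in the park"},
]

def contains_image_bytes(rows):
    """Check whether any row carries actual image data.

    For a links-and-captions index like this, the answer is always
    False: the images live on the origin servers, not in the dataset.
    """
    return any("image_bytes" in row for row in rows)

print(contains_image_bytes(sample_rows))  # prints "False"
```

This is also why removing a link from the index does not remove the underlying image from the web, only from the dataset that points to it.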
Re-LAION-5B’s release follows an investigation by the Stanford Internet Observatory in December 2023, which found that the original dataset — specifically a subset called LAION-5B 400M — included at least 1,679 links to illegal images scraped from social media posts and popular adult websites. The report also found that the 400M subset contained links to a wide range of objectionable content, including explicit imagery, racist slurs, and harmful social stereotypes.
Although the Stanford report acknowledged that the dataset would be difficult to clean and that the presence of CSAM does not necessarily affect the output of models trained on it, LAION chose to temporarily take LAION-5B offline.
The Stanford report recommended that models trained on LAION-5B “be deprecated and distribution ceased where feasible.” Notably, startups like Runway, which partnered with Stability AI to develop the original Stable Diffusion model, have recently pulled versions of their AI models from public platforms — a move we’re currently seeking further details on.
The new Re-LAION-5B dataset, which contains roughly 5.5 billion text-image pairs and is released under the Apache 2.0 license, can be used by third parties to clean existing copies of LAION-5B by removing any entries that match the identified illegal content, according to LAION.
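One way a third party could do this cleanup, sketched below under simplifying assumptions: keep only the rows of an existing LAION-5B copy whose links also appear in Re-LAION-5B, so anything LAION dropped is dropped from the local copy as well. The field names and example URLs are hypothetical; in practice the matching would run over the released metadata files rather than small in-memory lists.

```python
# Hypothetical sketch: clean a local LAION-5B copy against Re-LAION-5B
# by keeping only rows whose links survive in the new dataset.
# Row schema ("url", "text") and URLs are illustrative assumptions.
re_laion_urls = {
    "https://example.com/ok1.jpg",
    "https://example.com/ok2.jpg",
}

local_copy = [
    {"url": "https://example.com/ok1.jpg", "text": "a landscape"},
    {"url": "https://example.com/removed.jpg", "text": "a removed entry"},
    {"url": "https://example.com/ok2.jpg", "text": "a city street"},
]

# Intersect the local copy with Re-LAION-5B's links: rows absent from
# the new dataset (i.e. the removed entries) are filtered out.
cleaned = [row for row in local_copy if row["url"] in re_laion_urls]

print(len(cleaned))  # prints "2"
```

The same effect can be achieved in reverse — deleting rows that match a list of removed links — but intersecting against the new release avoids needing the removal list itself, which LAION does not publish for obvious reasons.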
LAION stresses that its datasets are intended for research, not commercial use. Even so, prominent organizations, including Stability AI and Google, have used LAION datasets to train their image-generation models.
“In total, 2,236 suspected CSAM links were removed in coordination with our partner organizations, including the 1,008 identified by the Stanford Internet Observatory’s December 2023 investigation,” LAION said in its announcement. “We strongly urge all research labs and groups still using the old LAION-5B to migrate to the Re-LAION-5B datasets as soon as possible.”
Compiled by Techarena.au.