
Largest AI training image dataset taken offline after discovery of troubling illicit material

The LAION-5B dataset contains over 5.8 billion image-text pairs (Image Source: LAION - edited)
A Stanford study has discovered thousands of explicit images of child abuse in LAION-5B, the largest image dataset for training AI models, including Stable Diffusion. Following this revelation, LAION has temporarily taken its datasets offline to ensure they are safe before republishing.

A study published by the Stanford Internet Observatory has made a disturbing discovery – LAION-5B, the largest image dataset used for training AI image generation models, contains 3,226 images suspected to be child sexual abuse material (CSAM). LAION has since pulled its datasets from public access until it can make sure they are free of any unsafe content.

LAION-5B, an open-source dataset consisting of over 5.8 billion pairs of online image URLs and corresponding captions, is used to train AI models, including the highly popular Stable Diffusion. It was created by using Common Crawl to scrape the internet for a wide range of images.
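For readers unfamiliar with the format, the dataset is distributed as metadata tables of URL–caption rows rather than as the images themselves. Below is a minimal sketch of inspecting one such shard with pandas; the file name and column names are assumptions based on LAION's published metadata layout, not details verified against the actual release.

```python
# Illustrative sketch only: inspecting a LAION-style metadata shard with pandas.
# The file name and the column names ("URL", "TEXT") are assumptions modeled on
# LAION's published metadata format and may differ from the actual release.
import pandas as pd

# The dataset ships as many parquet files of URL/caption rows, not image files
shard = pd.read_parquet("laion5b-metadata-shard-00000.parquet")  # hypothetical local file

# Each row pairs an image URL with the alt-text caption scraped alongside it
for _, row in shard.head(3).iterrows():
    print(row["URL"], "->", row["TEXT"])
```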

David Thiel and the team of Stanford researchers who authored the study started by filtering the dataset using LAION's own NSFW classifiers, then relied on PhotoDNA, a tool commonly used for content moderation in this context. Since viewing CSAM is illegal even for research purposes, the team worked with perceptual hashing, which creates a unique digital signature for each image and compares that signature against a test image's signature to check whether the two are identical or similar. The 'definite matches' were then sent to the Canadian Centre for Child Protection for validation.
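To illustrate the general idea of signature-based matching – not the researchers' actual PhotoDNA workflow – a minimal sketch using the open-source imagehash library might look like the following; the file names and distance threshold are placeholder assumptions.

```python
# Minimal illustration of perceptual hashing with the open-source `imagehash`
# library. This is NOT the Stanford team's pipeline (which relied on PhotoDNA);
# it only demonstrates signature-based matching in general.
from PIL import Image
import imagehash

# Compute a perceptual hash (a compact signature) for each image
hash_a = imagehash.phash(Image.open("image_a.jpg"))  # hypothetical file
hash_b = imagehash.phash(Image.open("image_b.jpg"))  # hypothetical file

# Similar images yield similar hashes; the Hamming distance measures closeness
distance = hash_a - hash_b
if distance <= 5:  # example threshold, not a recommended value
    print("Images are likely identical or near-duplicates")
else:
    print("Images appear unrelated")
```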

Following the publication of the study, a spokesperson for Stability AI, the company behind Stable Diffusion, told 404 Media that it has numerous filters in place internally that not only remove CSAM and other illegal and offensive material from the data actually used in training, but also screen the input prompts and the images generated by the AI model.
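As a point of reference – and not a description of Stability AI's internal systems – the public open-source release of Stable Diffusion via the diffusers library bundles an output-side safety checker that flags and blanks generated images it classifies as unsafe. A minimal sketch, assuming the standard pipeline API:

```python
# Illustration of output-side filtering in the open-source diffusers release of
# Stable Diffusion. This shows the public safety checker only; it says nothing
# about Stability AI's internal training-data or prompt filters.
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
result = pipe("a photo of a cat")          # the default pipeline runs its safety checker
image = result.images[0]
flagged = result.nsfw_content_detected[0]  # True if the checker blanked this output
print("Flagged by safety checker:", flagged)
```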

Under US federal law, it is illegal to possess and transmit not just CSAM, but also "undeveloped film, undeveloped videotape, and electronically stored data that can be converted into a visual image" thereof. However, since datasets like LAION-5B contain only URLs and not the images themselves, their exact legal status is unclear. The broader issue is further exacerbated by the fact that AI-generated CSAM is hard to distinguish from actual CSAM, and is on the rise. Even though 3,226 images among more than 5.8 billion may seem insignificant, the potential influence of such 'contaminated' training data on the output of generative AI models cannot be ignored.

The study published by David Thiel and his team highlights one of the more disturbing consequences of AI's sudden proliferation. Finding solutions to such concerns will be a slow and difficult task over the coming years, involving legislators, law enforcement, the tech industry, academia and the general public in equal measure.

Vishal Bhardwaj, 2023-12-23 (Update: 2023-12-23)