
Largest AI training image dataset taken offline after discovery of troubling illicit material

The LAION-5B dataset contains over 5.8 billion image-text pairs (Image Source: LAION - edited)
A Stanford study has discovered thousands of explicit images of child abuse in LAION-5B, the largest image dataset for training AI models, including Stable Diffusion. Following this revelation, LAION has temporarily taken its datasets offline to ensure they are safe before republishing.

A study published by the Stanford Internet Observatory has made a disturbing discovery – LAION-5B, the largest image dataset used for training AI image generation models, contains 3,226 images suspected to be child sexual abuse material (CSAM). LAION has since pulled its datasets from public access until it can make sure they are free of any unsafe content.

LAION-5B, an open-source dataset consisting of over 5.8 billion pairs of online image URLs and corresponding captions, is used to train AI models, including the highly popular Stable Diffusion. It was created by using Common Crawl to scrape the internet for a wide range of images.
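For readers unfamiliar with the format, the dataset is distributed as metadata tables of URL–caption rows rather than as the images themselves. Below is a minimal sketch of inspecting one such shard with pandas; the file name and column names are assumptions based on LAION's published metadata layout, not details verified against the actual release.

```python
# Illustrative sketch only: inspecting a LAION-style metadata shard with pandas.
# The file name and the column names ("URL", "TEXT") are assumptions modeled on
# LAION's published metadata format and may differ from the actual release.
import pandas as pd

# The dataset ships as many parquet files of URL/caption rows, not image files
shard = pd.read_parquet("laion5b-metadata-shard-00000.parquet")  # hypothetical local file

# Each row pairs an image URL with the alt-text caption scraped alongside it
for _, row in shard.head(3).iterrows():
    print(row["URL"], "->", row["TEXT"])
```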

David Thiel and the team of Stanford researchers who authored the study started by filtering the dataset using LAION's own NSFW classifiers, then relied on PhotoDNA, a tool commonly used for content moderation in this context. Since viewing CSAM is illegal even for research purposes, the team worked with perceptual hashing, which creates a unique digital signature for each image and compares that signature against a test image's signature to check whether the two are identical or similar. The 'definite matches' were then sent to the Canadian Centre for Child Protection for validation.
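To illustrate the general idea of signature-based matching – not the researchers' actual PhotoDNA workflow – a minimal sketch using the open-source imagehash library might look like the following; the file names and distance threshold are placeholder assumptions.

```python
# Minimal illustration of perceptual hashing with the open-source `imagehash`
# library. This is NOT the Stanford team's pipeline (which relied on PhotoDNA);
# it only demonstrates signature-based matching in general.
from PIL import Image
import imagehash

# Compute a perceptual hash (a compact signature) for each image
hash_a = imagehash.phash(Image.open("image_a.jpg"))  # hypothetical file
hash_b = imagehash.phash(Image.open("image_b.jpg"))  # hypothetical file

# Similar images yield similar hashes; the Hamming distance measures closeness
distance = hash_a - hash_b
if distance <= 5:  # example threshold, not a recommended value
    print("Images are likely identical or near-duplicates")
else:
    print("Images appear unrelated")
```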

Following the publication of the study, a spokesperson for Stability AI, the company behind Stable Diffusion, told 404 Media that it has numerous filters in place internally that not only remove CSAM and other illegal and offensive material from the data actually used in training, but also screen the input prompts and the images generated by the AI model.
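As a point of reference – and not a description of Stability AI's internal systems – the public open-source release of Stable Diffusion via the diffusers library bundles an output-side safety checker that flags and blanks generated images it classifies as unsafe. A minimal sketch, assuming the standard pipeline API:

```python
# Illustration of output-side filtering in the open-source diffusers release of
# Stable Diffusion. This shows the public safety checker only; it says nothing
# about Stability AI's internal training-data or prompt filters.
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
result = pipe("a photo of a cat")          # the default pipeline runs its safety checker
image = result.images[0]
flagged = result.nsfw_content_detected[0]  # True if the checker blanked this output
print("Flagged by safety checker:", flagged)
```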

Under US federal law, it is illegal to possess and transmit not just CSAM, but also "undeveloped film, undeveloped videotape, and electronically stored data that can be converted into a visual image" thereof. However, since datasets like LAION-5B contain only URLs and not the images themselves, their exact legal status is unclear. The broader issue is further exacerbated by the fact that AI-generated CSAM is hard to distinguish from actual CSAM, and is on the rise. Even though 3,226 images among more than 5.8 billion may seem insignificant, the potential influence of such 'contaminated' training data on the output of generative AI models cannot be ignored.

The study published by David Thiel and his team highlights one of the more disturbing consequences of AI's sudden proliferation. Finding solutions to such concerns will be a slow and difficult task over the coming years, involving legislators, law enforcement, the tech industry, academia and the general public in equal measure.

Vishal Bhardwaj, 2023-12-23 (Update: 2023-12-23)