Leaked internal comms reveal Nvidia scraping a lifetime's worth of YouTube videos daily to train video AI model, Jensen happy with the progress
Nvidia is training AI models for its Omniverse platform, self-driving cars, and "digital human" products on data scraped from "80 years' worth of videos per day" from YouTube and other sources, an investigation by 404 Media has revealed.
Leaked internal communications obtained by 404 Media indicate Nvidia is using this data to train its AI video world model, dubbed Cosmos (not to be confused with the company's existing Cosmos Deep Learning service). Cosmos is internally slated to power other Nvidia lines, including GeForce, GPU architectures, DGX, deep learning frameworks, Omniverse, Avatar, Project GR00T, and autonomous vehicles.
Nvidia executives describe Cosmos as a state-of-the-art foundation model "that encapsulates simulation of light transport, physics, and intelligence in one place to unlock various downstream applications critical to Nvidia."
404 Media accessed internal employee Slack messages revealing that staff used the command-line program yt-dlp to download YouTube videos across 20 to 30 AWS virtual machines that refresh their IP addresses to avoid being blocked by YouTube. The video-sharing site was the main source of scraped footage, with employees also mulling other sources such as Netflix and the Discovery Channel.
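The report does not reproduce the actual commands or infrastructure code, but for illustration, a minimal sketch of what a single download worker might look like using yt-dlp's Python API is shown below. The URL list, output template, and proxy setting are hypothetical, not details from the leaked messages.

```python
# Illustrative sketch only -- not Nvidia's actual pipeline.
# Assumes the yt-dlp package is installed (pip install yt-dlp)
# and that each worker VM is handed its own batch of video URLs.
import yt_dlp

# Hypothetical batch of video URLs assigned to this worker
urls = [
    "https://www.youtube.com/watch?v=VIDEO_ID_1",
    "https://www.youtube.com/watch?v=VIDEO_ID_2",
]

ydl_opts = {
    "format": "bestvideo+bestaudio/best",  # grab the highest-quality streams available
    "outtmpl": "%(id)s.%(ext)s",           # save each file under its video ID
    "ignoreerrors": True,                  # skip failed downloads instead of aborting the batch
    # A per-worker proxy could be set here; rotating the VM's public IP
    # address serves the same blocking-avoidance purpose described in the report.
    # "proxy": "http://proxy.example:8080",
}

with yt_dlp.YoutubeDL(ydl_opts) as ydl:
    ydl.download(urls)
```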
Slack communications show employees raising the legal ramifications of scraping copyrighted content to train AI, only for project managers to dismiss the concern as an executive decision they needn't worry about.
Popular YouTube channels that Nvidia employees have shortlisted include MKBHD, PickUpLimes, Architectural Digest, Expedia, Mediastorm6801, 8kEarth, and The CriticalDrinker among others.
When contacted by 404 Media, both YouTube and Netflix said that scraping content from their platforms to train AI models is a clear violation of their terms of service.
The use of copyrighted data to train AI models remains a legal grey area. Public datasets such as InternVid-10M and HD-VG-130M, built from millions of YouTube videos, do exist, but they are intended for academic research rather than commercial use. Although Nvidia employs academic researchers, the resulting model will eventually make its way into commercial products.
Only a handful of proposed laws would mandate transparency standards and require companies building foundation AI models to work with the FTC and the Copyright Office. But companies do not necessarily disclose their source datasets, which makes auditing far more difficult.
As major AI companies continue to vacuum up all available public data to train ever more capable models, legislative changes are urgently needed to ensure consumer safety and protect creators' IP.
Last year, The New York Times sued OpenAI and Microsoft for unauthorized use of the publication's copyrighted articles to train AI models. In May, visual artists filed a lawsuit against Stability AI, Midjourney, DeviantArt, and Runway AI for using copies of their work to train AI models without permission.
YouTube is turning out to be a data goldmine for AI companies. Recently, Wired reported that heavyweights including Apple, Nvidia, Anthropic, and Salesforce scraped subtitles from 173,536 YouTube videos from more than 48,000 channels to train their AI.
As of late May, Nvidia staff had internally announced that they had compiled 38.5 million video URLs, the majority of them cinematic content. The engineers also added datasets such as Ego-Exo4D, Ego4D, HOI4D, and game data from GeForce Now.
While Ego-Exo4D and Ego4D can be licensed for both academic and commercial use, HOI4D is distributed under a CC BY-NC license that specifically prohibits commercial use.
The team is currently training 1B-parameter models on 16 nodes each, with plans to scale up to 10B parameters.
Nvidia told 404 Media via email, "our models and our research efforts are in full compliance with the letter and the spirit of copyright law."
Meanwhile, Nvidia CEO Jensen Huang seems to be happy with the progress his staff is making.
He reportedly exclaimed, “Great update. Many companies have to build video FM [foundational models]. We can offer a fully accelerated pipeline.”
SCOOP from @samleecole: Leaked Slacks and documents show the incredible scale of NVidia's AI scraping: 80 years — "a human lifetime" of videos every day. Had approval from highest levels of company despite staff legal/ethical concerns: https://t.co/DydXOyffUQ
— Jason Koebler (@jason_koebler) August 5, 2024
Source(s)
404 Media (requires signup)