Apple, NVIDIA, and Anthropic allegedly used transcripts from over 173,000 YouTube videos to train their AI models without any permission.
A recent Proof News investigation found that the companies obtained the dataset from a nonprofit called EleutherAI.
YouTube Video Transcripts Allegedly Help Train AI Models
Proof News reported that 173,536 subtitle files from YouTube videos were gathered from more than 48,000 channels to build the dataset. Silicon Valley giants like Apple, NVIDIA, Salesforce, and Anthropic have allegedly used the data.
The dataset featured transcripts from a wide range of YouTube sources, including Khan Academy, MIT, NPR, BBC, "The Late Show With Stephen Colbert," "Jimmy Kimmel Live," and more. Material from popular creators like MrBeast, Marques Brownlee, PewDiePie, and Jacksepticeye was also spotted.
Proof News also clarified that the dataset consists of plain-text subtitles and does not include any video imagery. The majority of the subtitles gathered were in languages like Japanese, German, and Arabic.
AI Companies Face Scrutiny From Content Creators
AI companies have been receiving complaints and copyright infringement lawsuits over data scraping. According to YouTube, using its data to train AI models is a violation of the platform's terms of service.
Brownlee, the creator of MKBHD, said that data scraping is "going to be an evolving problem for a long time." He also noted that Apple seemingly avoids being at fault because it did not scrape the data directly.
Similarly, Anthropic spokesperson Jennifer Martinez referred questions about potential violations to the authors of "The Pile," the larger collection that includes the dataset. The AI company emphasized that YouTube's terms only cover direct use of the platform and that using The Pile is a different case.
Apple, NVIDIA, and the other companies have not commented on the issue.