YouTube Shorts Challenges TikTok With Music-Making AI for Creators
Wired, Nov 14, 2023
An investigation has claimed that a dataset used to help train artificial intelligence (AI) models from companies such as Apple, Anthropic, and Nvidia contains subtitles from more than 173K YouTube videos that were included without the consent of the content creators.
YouTube Subtitles, which is part of a large dataset known as The Pile, contains captions from over 173K videos that span 48K channels. Taking data from the platform without prior approval would violate YouTube guidelines.
The dataset was first released in 2020. A Google spokesperson said that the company has taken action against "abusive, unauthorized scraping." Channels in the dataset include Harvard, MrBeast, and the BBC.