An investigation has claimed that a dataset used to help train artificial intelligence (AI) models from companies such as Apple, Anthropic, and Nvidia contains subtitles from over 100K YouTube videos that were included without the consent of the content creators.
YouTube Subtitles, which is part of a large dataset known as The Pile, contains captions from over 173K videos across 48K channels. Taking data from the platform without prior approval would violate YouTube's guidelines.