OpenAI reportedly used YouTube transcripts to improve its GPT-4 language model
OpenAI reportedly transcribed more than a million hours of YouTube video to improve GPT-4, its most sophisticated language model. Despite being aware of the potential legal repercussions, OpenAI defended the practice as fair use, arguing that the data broadened its model's understanding of the world. OpenAI's President, Greg Brockman, was directly involved in selecting the videos used for training.
OpenAI says it uses "numerous sources including publicly available data and partnerships for non-public data." The company is also contemplating generating its own synthetic data. It had previously trained its models on data such as computer code from GitHub, chess move databases, and educational content from Quizlet. After those resources were depleted, it considered using transcriptions of YouTube videos, podcasts, and audiobooks.
The report also notes that OpenAI had exhausted its supply of useful training data by 2021 and has been continually sourcing new data to improve its AI models since.
Google spokesperson Matt Bryant said the company has "seen unconfirmed reports" of OpenAI's use of YouTube transcripts, adding that Google's terms of service prohibit unauthorized scraping or downloading of YouTube content.
YouTube CEO Neal Mohan made similar comments this week regarding OpenAI's potential use of YouTube data to train its Sora video-generating model. Bryant also highlighted that Google enforces "technical and legal measures" to prevent unauthorized usage when there's a clear legal or technical justification.