OpenAI has reached a deal with Reddit to use the social news site’s data for training AI models.
In a blog post on OpenAI’s press relations site, the company said that the Reddit partnership will provide it access to “real-time, structured and unique content” — e.g. posts and replies — from Reddit, allowing its tools and models to “better understand and showcase” that content. Reddit content will be incorporated into ChatGPT, OpenAI’s popular conversational AI, and the companies will work together to bring unspecified new “AI-powered features” to both Reddit users and moderators.
OpenAI will also become a Reddit advertising partner.
“Reddit will be building on OpenAI’s platform of AI models to bring its powerful vision to life,” OpenAI wrote in the post. “Using LLMs, ML, and AI allow Reddit to improve the user experience for everyone.”
OpenAI has several similar licensing deals with content providers ranging from stock media libraries to news publishers. But the unusual angle to this one is that Sam Altman, OpenAI’s CEO, has an 8.7% stake in Reddit, making him the third-largest shareholder, and was once a member of the company’s board of directors.
In an attempt to discourage scrutiny, OpenAI says in its press release that, while Altman remains a Reddit shareholder, the partnership “was led by OpenAI’s COO [Brad Lightcap]” and “approved by [OpenAI’s] independent board of directors.” (I’ll note here that Altman himself is a member of OpenAI’s board.)
Reddit has made data licensing agreements an increasingly central part of its growth strategy as it navigates the market as a public company.
In its IPO prospectus, Reddit revealed that it has contractual agreements to license its data to customers, including Google, worth more than $200 million combined. And, in its first earnings report as a public company, Reddit reported a 450% year-over-year increase in non-ad revenue, attributable mainly to those agreements.
Reddit stock was up 11% in extended trading following the announcement of the OpenAI deal.
“The paradox I see is that, as more content on the internet is written by machines, there’s an increasing premium on content that comes from real people,” Reddit CEO Steve Huffman said during the company’s earnings call in March. “And we have nearly two decades of authentic conversation.”
Reddit’s platform — which has over 1 billion posts and more than 16 billion comments, figures that grow every day thanks to its hundreds of millions of active users — is a goldmine for generative AI companies, whose models learn from examples of content, like text and images, to generate new, similar content.
But the company could face pushback from users concerned about how it’s monetizing their data.
It’s instructive to look at Stack Overflow, the Q&A forum for software developers, which recently inked an agreement with OpenAI to supply data for the latter’s model training. In protest, some users deleted their top-rated answers to questions on the community. But Stack Overflow restored the deleted posts and banned those users, claiming that they weren’t in compliance with its terms of service.
Reddit has already voiced its displeasure with one attempt to afford Reddit users greater control over their own data.
Vana, a startup built on the blockchain, is attempting to launch a data “DAO” (Decentralized Autonomous Organization) to let Reddit users pool their data and decide together how that combined data is used (or sold). Reddit banned Vana’s subreddit dedicated to discussion about the DAO and, in a statement to TechCrunch, accused the company of “exploiting” its data export controls.