Major AI Training Data Deals (2020–2025)

Tracking the emergence of data markets

Source: Open Data Labs (opendatalabs.xyz)

2024
News Corp | OpenAI
$250M (5 years)
News Corp — OpenAI deal, 2024-05
Reddit | Google
$60M/year
Ongoing and historical data access with Google.
Shutterstock | Apple
$50M
Apple → Shutterstock | visual | image, video | lang: NA | 2024 | training
Yelp | Perplexity AI
$47M
Licensed review and location data to Perplexity AI and Neeva, plus other LLM companies.
Shutterstock | Amazon
$25M–$50M
Amazon → Shutterstock | visual | image, video | lang: NA | 2024 | training
Wiley | undisclosed
$23M
Select content used to train an LLM | Wiley's former SVP for Strategy is now SVP and General Manager for AI Growth. The job description is "Lead new AI growth opportunities in content licensing, AI appli
Dotdash Meredith | OpenAI
$16M
Dotdash Meredith — OpenAI deal, 2024-05
Taylor & Francis / Informa | Microsoft
$450M ARR
Informa / Taylor & Francis — Microsoft deal, 2024-05
Prisa Media | OpenAI
$1M–$5M
OpenAI → Prisa Media | news_media | text | lang: Spanish | 2024 | combined
Le Monde | OpenAI
$1M–$5M
Le Monde — OpenAI deal, 2024-03
HarperCollins | Microsoft
$5K
Uses nonfiction books for AI training purposes | The deal is for three years and authors must opt-in to be included. Terms of the agreement were one-off, not an ongoing AI licensing right. Additional
Financial Times | Microsoft
Undisclosed
Microsoft → Financial Times | news_media | text | lang: English | 2024 | rag
Hearst | Microsoft
Undisclosed
Microsoft → Hearst | news_media | text | lang: English | 2024 | rag
Reuters | Microsoft
Undisclosed
Microsoft → Reuters | news_media | text | lang: English | 2024 | rag
USA Today Network | Microsoft
Undisclosed
Microsoft → USA Today Network | news_media | text | lang: English | 2024 | rag
Adweek | Perplexity AI
Undisclosed
Perplexity AI → Adweek | news_media | text | lang: English | 2024 | rag
Blavity | Perplexity AI
Undisclosed
Perplexity AI → Blavity | news_media | text | lang: English | 2024 | rag
DPReview | Perplexity AI
Undisclosed
Perplexity AI → DPReview | news_media | text | lang: English | 2024 | rag
Der Spiegel | Perplexity AI
Undisclosed
Perplexity AI → Der Spiegel | news_media | text | lang: German | 2024 | rag
Fortune | Perplexity AI
Undisclosed
Perplexity AI → Fortune | news_media | text | lang: English | 2024 | rag
Gear Patrol | Perplexity AI
Undisclosed
Perplexity AI → Gear Patrol | news_media | text | lang: English | 2024 | rag
LA Times | Perplexity AI
Undisclosed
Perplexity AI → LA Times | news_media | text | lang: English | 2024 | rag
Lee Enterprises | Perplexity AI
Undisclosed
Perplexity AI → Lee Enterprises | news_media | text | lang: English | 2024 | rag
Mexico News Daily | Perplexity AI
Undisclosed
Perplexity AI → Mexico News Daily | news_media | text | lang: Spanish | 2024 | rag
Minkabu Infonoid | Perplexity AI
Undisclosed
Perplexity AI → Minkabu Infonoid | news_media | text | lang: Japanese | 2024 | rag
NewsPicks | Perplexity AI
Undisclosed
Perplexity AI → NewsPicks | news_media | text | lang: Japanese | 2024 | rag
Prisa Media | Perplexity AI
Undisclosed
Perplexity AI → Prisa Media | news_media | text | lang: Spanish | 2024 | rag
RTL Germany Stern | Perplexity AI
Undisclosed
Perplexity AI → RTL Germany Stern | news_media | text | lang: German | 2024 | rag
RTL Germany ntv | Perplexity AI
Undisclosed
Perplexity AI → RTL Germany ntv | news_media | text | lang: German | 2024 | rag
TIME | Perplexity AI
Undisclosed
Perplexity AI → TIME | news_media | text | lang: English | 2024 | rag
The Independent | Perplexity AI
Undisclosed
Perplexity AI → The Independent | news_media | text | lang: English | 2024 | rag
The Texas Tribune | Perplexity AI
Undisclosed
Perplexity AI → The Texas Tribune | news_media | text | lang: English | 2024 | rag
World History Encyclopedia | Perplexity AI
Undisclosed
Perplexity AI → World History Encyclopedia | news_media | text | lang: English | 2024 | rag
Shutterstock | Reka
Undisclosed
Reka → Shutterstock | visual | image, video | lang: NA | 2024 | training
Automattic (Tumblr/WordPress) | OpenAI, Midjourney
Undisclosed
Automattic (Tumblr/WordPress) — OpenAI, Midjourney deal, 2024-02
Axel Springer | Microsoft
Undisclosed
Axel Springer — Microsoft deal, 2024-04
Financial Times | OpenAI
$450M ARR
Financial Times — OpenAI deal, 2024-04
Reddit | OpenAI
Undisclosed + ad partnership
Reddit — OpenAI deal, 2024-05
Stack Overflow | OpenAI
Undisclosed
Stack Overflow — OpenAI deal, 2024-05
The Atlantic | OpenAI
Undisclosed
The Atlantic — OpenAI deal, 2024-05
Vox | OpenAI
Undisclosed
Vox Media — OpenAI deal, 2024-05
TIME | OpenAI
Undisclosed
Time — OpenAI deal, 2024-06
Multiple publishers (Time, Der Spiegel, Fortune, etc.) | Perplexity AI
Revenue share
Multiple publishers (Time, Der Spiegel, Fortune, etc.) — Perplexity deal, 2024-07
Condé Nast | OpenAI
Undisclosed
Condé Nast — OpenAI deal, 2024-08
FT, Axel Springer, The Atlantic, Fortune, UMG | ProRata AI
50% subscription revenue share
FT, Axel Springer, The Atlantic, Fortune, UMG — ProRata.ai deal, 2024-08
Oxford University Press | undisclosed
Undisclosed
OUP has confirmed only that it is working with "companies developing large language models"; no other details are available.
Reuters | Meta
Undisclosed
Reuters — Meta deal, 2024-10
FT, Reuters, Axel Springer, Hearst, USA Today | Microsoft
Undisclosed
FT, Reuters, Axel Springer, Hearst, USA Today — Microsoft deal, 2024-10
Hearst | OpenAI
Undisclosed
Hearst — OpenAI deal, 2024-10
Wiley | Potato
Not disclosed, but the SVP and GM for AI Growth mentioned that a revenue share agreement is a possibility they are open to for AI licensing deals
Used to help build Potato's tools. No details provided, but tools include automated paper review and a lab protocol generator. | This is the first deal announced as part of Wiley AI Partnerships, a "c
14 publishers (LA Times, The Independent, etc.) | Perplexity AI
Revenue share
14 publishers (LA Times, The Independent, etc.) — Perplexity deal, 2024-12
2025
Wiley | Amazon
$100–$150M
AWS has built an open source toolkit for healthcare and life sciences, which "offers a catalog of starter agents and an orchestration framework for organizations to build and customize their agentic s
Johns Hopkins University Press | undisclosed
$5K/title
Content used to train LLMs | In an email to authors, JHUP's executive director stated that an AI licensing contract provides legal protection for JHUP against scraping and pirating - contract includes
AFP | Mistral
Undisclosed
Mistral → AFP | news_media | text | lang: French | 2025 | rag
The Guardian | OpenAI
Undisclosed
OpenAI → The Guardian | news_media | text | lang: English | 2025 | rag
Associated Press | Google
Undisclosed
Associated Press — Google deal, 2025-01
Axios | OpenAI
Undisclosed
Axios — OpenAI deal, 2025-01
American Association for the Advancement of Science (AAAS) | ProRata AI
Not detailed in announcement, but in 2024 ProRata.ai had a 50/50 revenue split with content licensing partners based on usage
Used specifically for Gist.ai search engine with emphasis on bolstering transparency and reliability. They are not doing broad LLM training, but rather focusing on select, high quality content to keep
NEJM Group | OpenEvidence
Multiyear agreement
RAG model to inform OpenEvidence platform, which specializes in providing current medical research to doctors. | NEJM Group stressed the alignment of this deal with their values, as they hope their re
New York Times | Amazon
Undisclosed, multiyear
New York Times — Amazon deal, 2025-05
Wiley | Perplexity AI
Not disclosed, but the AI licensing segment is driving much of their revenue growth
"Students can access assigned Wiley curriculum materials through their institution's Enterprise Pro subscription, eliminating the need to switch between platforms." | Wiley is using this limited use o
Johns Hopkins University Press | ProRata AI
Not disclosed; copyright holders will be "credited and compensated for their material on a per-use basis" for the proportion of content used to answer queries, calculated by a proprietary AI algorithm
Content is used to power ProRata's Gist.ai search engine answers | ProRata's licensing agreements focus on establishing reputable search results from their licensing agreements and author attribution,
Taylor & Francis / Informa | undisclosed
Undisclosed
Not disclosed | A third LLM deal was reported in the Q&A session of Informa's Half-Year Results meeting. The LLM partner was not disclosed, though announced as a "different customer" from other AI lic
Wiley | Anthropic
Undisclosed; requires subscriptions to both Wiley and Claude to access Wiley content and Claude's MCP integration chatbot
Wiley's Model Context Protocol (MCP) integration is part of a Claude for Education pilot program. MCP is "an open standard that will enable integration between peer-reviewed content and AI platforms".
Bloomsbury | undisclosed
20% royalty payment to authors.
Bloomsbury retaining rights to license to LLMs | Bloomsbury has given authors the opportunity to opt-in to potential future licensing agreements with a 20% royalty payment. There have been questions r
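
Many entries above carry a pipe-delimited metadata line such as "Apple → Shutterstock | visual | image, video | lang: NA | 2024 | training". For readers who want to tabulate these, the following is a minimal Python sketch of a parser. The field names (buyer, licensor, category, modalities, language, year, use) are assumptions inferred from the visible values, not a schema published by the source, and the DealRecord class and parse_record function are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DealRecord:
    # Field names are assumptions inferred from the listing, not an official schema.
    buyer: str
    licensor: str
    category: str
    modalities: List[str]
    language: str
    year: int
    use: str


def parse_record(line: str) -> DealRecord:
    """Parse one 'buyer → licensor | category | modalities | lang: X | year | use' line."""
    fields = [field.strip() for field in line.split("|")]
    if len(fields) != 6:
        raise ValueError(f"expected 6 pipe-separated fields, got {len(fields)}")
    parties, category, modalities, lang, year, use = fields

    # The arrow separates the AI company (buyer) from the content owner (licensor).
    buyer, arrow, licensor = parties.partition("→")
    if not arrow:
        raise ValueError("missing '→' between buyer and licensor")

    # Strip the literal 'lang:' label if present.
    if lang.startswith("lang:"):
        lang = lang[len("lang:"):]

    return DealRecord(
        buyer=buyer.strip(),
        licensor=licensor.strip(),
        category=category,
        modalities=[m.strip() for m in modalities.split(",")],
        language=lang.strip(),
        year=int(year),
        use=use,
    )


record = parse_record(
    "Apple → Shutterstock | visual | image, video | lang: NA | 2024 | training"
)
print(record.buyer, record.licensor, record.modalities, record.use)
```

The sketch only handles the six-field arrow format; the free-text descriptions (for example "News Corp — OpenAI deal, 2024-05") and truncated entries would raise ValueError and need separate handling.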