Major AI Training Data Deals (2020–2025)
Tracking the emergence of data markets
Source: Open Data Labs (opendatalabs.xyz)
2020–2023
Shutterstock → Meta
$25M–$50M
Meta → Shutterstock | visual | image, video | lang: NA | 2023 | training
Axel Springer → OpenAI
$1M–$5M
Multi-year contracts giving OpenAI historical and ongoing access to news content.
Shutterstock → LG
Undisclosed
LG → Shutterstock | visual | image, video | lang: NA | 2023 | training
Shutterstock → NVIDIA
Undisclosed
NVIDIA → Shutterstock | visual | image, video | lang: NA | 2023 | training
Shutterstock → OpenAI
Undisclosed
OpenAI → Shutterstock | visual | image, video | lang: NA | 2023 | training
Associated Press → OpenAI
Undisclosed
Associated Press — OpenAI deal, 2023-07
2024
News Corp → OpenAI
$250M (5 years)
News Corp — OpenAI deal, 2024-05
Reddit → Google
$60M/year
Ongoing and historical data access with Google.
Shutterstock → Apple
$50M
Apple → Shutterstock | visual | image, video | lang: NA | 2024 | training
Yelp → Perplexity AI
$47M
Licensed review and location data to Perplexity AI and Neeva, plus other LLM companies.
Shutterstock → Amazon
$25M–$50M
Amazon → Shutterstock | visual | image, video | lang: NA | 2024 | training
Wiley → undisclosed
$23M
Select content used to train an LLM | Wiley's former SVP for Strategy is now SVP and General Manager for AI Growth. The job description is "Lead new AI growth opportunities in content licensing, AI appli
Dotdash Meredith → OpenAI
$16M
Dotdash Meredith — OpenAI deal, 2024-05
Taylor & Francis / Informa → Microsoft
$450M ARR
Informa / Taylor & Francis — Microsoft deal, 2024-05
Prisa Media → OpenAI
$1M–$5M
OpenAI → Prisa Media | news_media | text | lang: Spanish | 2024 | combined
Le Monde → OpenAI
$1M–$5M
Le Monde — OpenAI deal, 2024-03
HarperCollins → Microsoft
$5K/title
Uses nonfiction books for AI training purposes | The deal is for three years and authors must opt in to be included. Terms of the agreement were one-off, not an ongoing AI licensing right. Additional
Financial Times → Microsoft
Undisclosed
Microsoft → Financial Times | news_media | text | lang: English | 2024 | rag
Hearst → Microsoft
Undisclosed
Microsoft → Hearst | news_media | text | lang: English | 2024 | rag
Reuters → Microsoft
Undisclosed
Microsoft → Reuters | news_media | text | lang: English | 2024 | rag
USA Today Network → Microsoft
Undisclosed
Microsoft → USA Today Network | news_media | text | lang: English | 2024 | rag
Adweek → Perplexity AI
Undisclosed
Perplexity AI → Adweek | news_media | text | lang: English | 2024 | rag
Blavity → Perplexity AI
Undisclosed
Perplexity AI → Blavity | news_media | text | lang: English | 2024 | rag
DPReview → Perplexity AI
Undisclosed
Perplexity AI → DPReview | news_media | text | lang: English | 2024 | rag
Der Spiegel → Perplexity AI
Undisclosed
Perplexity AI → Der Spiegel | news_media | text | lang: German | 2024 | rag
Fortune → Perplexity AI
Undisclosed
Perplexity AI → Fortune | news_media | text | lang: English | 2024 | rag
Gear Patrol → Perplexity AI
Undisclosed
Perplexity AI → Gear Patrol | news_media | text | lang: English | 2024 | rag
LA Times → Perplexity AI
Undisclosed
Perplexity AI → LA Times | news_media | text | lang: English | 2024 | rag
Lee Enterprises → Perplexity AI
Undisclosed
Perplexity AI → Lee Enterprises | news_media | text | lang: English | 2024 | rag
Mexico News Daily → Perplexity AI
Undisclosed
Perplexity AI → Mexico News Daily | news_media | text | lang: Spanish | 2024 | rag
Minkabu Infonoid → Perplexity AI
Undisclosed
Perplexity AI → Minkabu Infonoid | news_media | text | lang: Japanese | 2024 | rag
NewsPicks → Perplexity AI
Undisclosed
Perplexity AI → NewsPicks | news_media | text | lang: Japanese | 2024 | rag
Prisa Media → Perplexity AI
Undisclosed
Perplexity AI → Prisa Media | news_media | text | lang: Spanish | 2024 | rag
RTL Germany Stern → Perplexity AI
Undisclosed
Perplexity AI → RTL Germany Stern | news_media | text | lang: German | 2024 | rag
RTL Germany ntv → Perplexity AI
Undisclosed
Perplexity AI → RTL Germany ntv | news_media | text | lang: German | 2024 | rag
TIME → Perplexity AI
Undisclosed
Perplexity AI → TIME | news_media | text | lang: English | 2024 | rag
The Independent → Perplexity AI
Undisclosed
Perplexity AI → The Independent | news_media | text | lang: English | 2024 | rag
The Texas Tribune → Perplexity AI
Undisclosed
Perplexity AI → The Texas Tribune | news_media | text | lang: English | 2024 | rag
World History Encyclopedia → Perplexity AI
Undisclosed
Perplexity AI → World History Encyclopedia | news_media | text | lang: English | 2024 | rag
Shutterstock → Reka
Undisclosed
Reka → Shutterstock | visual | image, video | lang: NA | 2024 | training
Automattic (Tumblr/WordPress) → OpenAI, Midjourney
Undisclosed
Automattic (Tumblr/WordPress) — OpenAI, Midjourney deal, 2024-02
Axel Springer → Microsoft
Undisclosed
Axel Springer — Microsoft deal, 2024-04
Financial Times → OpenAI
$450M ARR
Financial Times — OpenAI deal, 2024-04
Reddit → OpenAI
Undisclosed + ad partnership
Reddit — OpenAI deal, 2024-05
Stack Overflow → OpenAI
Undisclosed
Stack Overflow — OpenAI deal, 2024-05
The Atlantic → OpenAI
Undisclosed
The Atlantic — OpenAI deal, 2024-05
Vox → OpenAI
Undisclosed
Vox Media — OpenAI deal, 2024-05
TIME → OpenAI
Undisclosed
Time — OpenAI deal, 2024-06
Multiple publishers (Time, Der Spiegel, Fortune, etc.) → Perplexity AI
Revenue share
Multiple publishers (Time, Der Spiegel, Fortune, etc.) — Perplexity deal, 2024-07
Condé Nast → OpenAI
Undisclosed
Condé Nast — OpenAI deal, 2024-08
FT, Axel Springer, The Atlantic, Fortune, UMG → ProRata AI
50% subscription revenue share
FT, Axel Springer, The Atlantic, Fortune, UMG — ProRata.ai deal, 2024-08
Oxford University Press → undisclosed
Undisclosed
OUP has confirmed only that it is working with "companies developing large language models"; no other details are available.
Reuters → Meta
Undisclosed
Reuters — Meta deal, 2024-10
FT, Reuters, Axel Springer, Hearst, USA Today → Microsoft
Undisclosed
FT, Reuters, Axel Springer, Hearst, USA Today — Microsoft deal, 2024-10
Hearst → OpenAI
Undisclosed
Hearst — OpenAI deal, 2024-10
Wiley → Potato
Not disclosed, but Wiley's SVP and GM for AI Growth mentioned that a revenue share agreement is a possibility they are open to for AI licensing deals
Used to help build Potato's tools. No details provided, but tools include automated paper review and a lab protocol generator. | This is the first deal announced as part of Wiley AI Partnerships, a "c
14 publishers (LA Times, The Independent, etc.) → Perplexity AI
Revenue share
14 publishers (LA Times, The Independent, etc.) — Perplexity deal, 2024-12
2025
Wiley → Amazon
$100–$150M
AWS has built an open source toolkit for healthcare and life sciences, which "offers a catalog of starter agents and an orchestration framework for organizations to build and customize their agentic s
Johns Hopkins University Press → undisclosed
$5K/title
Content used to train LLMs | - In an email to authors, JHUP executive director stated that an AI licensing contract provides legal protection for JHUP against scraping and pirating
- contract includes
AFP → Mistral
Undisclosed
Mistral → AFP | news_media | text | lang: French | 2025 | rag
The Guardian → OpenAI
Undisclosed
OpenAI → The Guardian | news_media | text | lang: English | 2025 | rag
Associated Press → Google
Undisclosed
Associated Press — Google deal, 2025-01
Axios → OpenAI
Undisclosed
Axios — OpenAI deal, 2025-01
American Association for the Advancement of Science (AAAS) → ProRata AI
Not detailed in announcement, but in 2024 ProRata.ai had a 50/50 revenue split with content licensing partners based on usage
Used specifically for Gist.ai search engine with emphasis on bolstering transparency and reliability. They are not doing broad LLM training, but rather focusing on select, high quality content to keep
NEJM Group → OpenEvidence
Multiyear agreement
RAG model to inform OpenEvidence platform, which specializes in providing current medical research to doctors. | NEJM Group stressed the alignment of this deal with their values, as they hope their re
New York Times → Amazon
Undisclosed, multiyear
New York Times — Amazon deal, 2025-05
Wiley → Perplexity AI
Not disclosed, but Wiley's AI licensing segment is driving much of its revenue growth
"Students can access assigned Wiley curriculum materials through their institution's Enterprise Pro subscription, eliminating the need to switch between platforms." | Wiley is using this limited use o
Johns Hopkins University Press → ProRata AI
Not disclosed, but copyright holders will be "credited and compensated for their material on a per-use basis" for the proportion of content used to answer queries, calculated by a proprietary AI algorithm
Content is used to power ProRata's Gist.ai search engine answers | ProRata's licensing agreements focus on establishing reputable search results from their licensing agreements and author attribution,
Taylor & Francis / Informa → undisclosed
Undisclosed
Not disclosed | A third LLM deal was reported in the Q&A session of Informa's Half-Year Results meeting. The LLM partner was not disclosed, though announced as a "different customer" from other AI lic
Wiley → Anthropic
Undisclosed; access requires subscriptions to both Wiley and Claude in order to reach Wiley content through Claude's MCP integration
Wiley's Model Context Protocol (MCP) integration is part of a Claude for Education pilot program. MCP is "an open standard that will enable integration between peer-reviewed content and AI platforms."
Bloomsbury → undisclosed
20% royalty payment to authors.
Bloomsbury retains the right to license to LLMs | Bloomsbury has given authors the opportunity to opt in to potential future licensing agreements with a 20% royalty payment. There have been questions r