Editor’s note: Today’s post is by Jonathan Woahn, Co-founder of Cashmere.io, which helps publishers safely and responsibly monetize their content with artificial intelligence. Reviewer credit to Chef Tim Vines.
Part One: The Missed Expectation
In October 2025, at the FIPP World Congress in Madrid, a ballroom of media executives listened as journalist Ricky Sutton finally asked the question that captured both the anxiety and the hope of the entire room:
“Shouldn’t your AI system pay every time it references our content? Shouldn’t publishers be paid each time there’s a query for content, not just when an ad is served?”
The question was crisp, reasonable, and rooted in a decades-long erosion of content economics. Publishers have watched value slip from their hands through unbundling, aggregation, and search. They are now hoping — some quietly, some explicitly — that AI will reverse the trend. If artificial intelligence is going to ingest their work, understand it, and depend on it, then surely AI companies should pay for that privilege.
The question was directed at Tom Rubin, OpenAI’s Chief of Intellectual Property and Content. His answer was careful, neutral, and ultimately forgettable — not because the question was misguided, but because Rubin understands an uncomfortable truth: the economics that publishers are hoping for don’t exist today.
This exchange revealed the core misunderstanding shaping today’s debate. Many in the industry believe AI companies are about to become a new class of bulk content buyers: predictable, recurring, and highly motivated to pay for data at scale. But the reality is far more constrained. AI companies will not — and cannot — be the primary purchasers of content in a sustainable way.
In this two-part series, we first focus on the missed expectation: why AI will not become the content windfall that many in the publishing industry are hoping for. The second article will explore the historical playbook that reveals where the real economic opportunity lies, and describe the path to get there.

I. The Great Expectation — and the Wrong Customer
For the past two years, the industry has been awash in headlines that create the appearance of a new licensing boom. OpenAI announced deals with the Associated Press, Financial Times, and other major media groups. Anthropic begrudgingly struck agreements with professional authors and academic publishers. Google relied on privately negotiated licenses for Gemini and its Search Generative Experience. Perplexity launched revenue-sharing programs for premium publishers.
From the outside, it resembled the opening chapter of the streaming wars: large platforms racing to secure content, armed with big checks and strategic urgency. Beneath the surface, the economics tell a very different story.
Every leading AI company is burning enormous amounts of capital. Their cost structures — GPUs, proprietary chips, power consumption, data center construction — are unprecedented. Even the companies reporting strong earnings, such as Microsoft and Google, are doing so on the back of legacy businesses, not AI profitability. The AI divisions themselves remain deeply unprofitable.
A simple truth follows:
A market cannot scale if the buyers cannot afford to participate.
The licensing deals of the past two years have not been funded by healthy revenue. They have instead been funded by strategic budgets, competitive signaling, and investor subsidy. These deals represent experiments — not the foundation of a sustainable marketplace. To understand why these deals are unlikely to become recurring revenue for publishers, we need to dig into how AI companies and their models interact with content.
II. A Tale of Two Markets: Training vs. Inference
To most observers, “AI uses content” feels like one continuous event. A model reads material, absorbs information, and later produces answers that reflect what it has learned. But inside the industry, these are two different markets — training and inference — and they operate under entirely different economic and legal conditions.
A helpful analogy is a student in a professional program — say, engineering.
Training: The Education Phase
Training is the long period when the student studies textbooks, reviews examples, and internalizes the foundational concepts needed to think like an engineer. The goal isn’t memorization; it’s pattern formation and conceptual understanding.
That is what training an AI model is: Exposure to content. Pattern formation. Generalization of those patterns into concepts. Rinse and repeat.
Inference: The Practice Phase
Inference is when the student graduates and becomes a practicing engineer — they are handed a real problem and asked to solve it. This is the moment that value is created for clients. Inference is where firms bill clients. It is where learned skills become services.
When a user asks an AI system a question, the model is no longer “learning,” it is inferring.
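For readers who want to see the distinction concretely, here is a minimal sketch in PyTorch using a toy linear model and random stand-in data — nothing like a production LLM, but the structure is the same. Training updates the weights against a corpus once; inference leaves the weights untouched and fires on every individual query.

```python
# Minimal sketch of the training/inference split (toy model, random data).
import torch
import torch.nn as nn

model = nn.Linear(16, 4)                 # stand-in for a real model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

# --- Training: the one-time, front-loaded phase ---------------------
# Weights are updated by repeatedly exposing the model to a corpus.
# The corpus is consumed here and not consulted again afterward.
corpus = [(torch.randn(16), torch.randn(4)) for _ in range(100)]
model.train()
for x, y in corpus:
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()                      # weights change: "learning"
    optimizer.step()

# --- Inference: the recurring, per-query phase -----------------------
# No weights change. Each user query is a discrete, countable event,
# which is what makes inference meterable in a way training is not.
model.eval()
with torch.no_grad():
    query = torch.randn(16)
    answer = model(query)                # weights unchanged: "practice"
```

Note that the corpus appears only in the training loop; once the weights are set, each inference call is a fresh, countable event that never touches the original data.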
Why This Distinction Matters Economically
Educational exposure and professional practice follow different economics:
A university doesn’t get paid every time an engineer uses a concept learned years ago. The billable work happens when the engineer applies those concepts to solve a real problem that someone has right now.
Training is education. Inference is practice. Only the latter generates recurring revenue from clients.
III. Training: A Market That Looked Promising but Won’t Scale
The belief that training would become a major licensing opportunity rests on a straightforward intuition:
If AI companies learn from content, they should pay for it.
On its face, the argument is compelling, all the more so if we lean on the student analogy from earlier. After all, students pay hefty tuition fees and buy lots of textbooks.
But four forces — economic, geopolitical, legal, and technical — make clear why training revenue will be episodic rather than reliable and recurring.
1. The Economics of Training Are Fundamentally Front-Loaded
Training is extraordinarily expensive. AI companies spend hundreds of millions building training corpora, assembling proprietary datasets, running multi-month training cycles, and maintaining massive GPU clusters. Because training is such a large cost center, every technical and economic incentive pushes in the same direction: minimize the amount of fresh data required, reuse what has already been collected, and avoid perpetual acquisitions wherever possible.
Once a company builds a high-quality dataset, it becomes the backbone for multiple generations of the model:
- pretraining
- continual learning
- fine-tuning
- domain adaptation
- successor model families
Instead of seeking more content, companies optimize their pipelines to extract more value from the same dataset.
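To make the front-loaded economics concrete, here is a hypothetical toy sketch; the function names, stages, and dollar figure are all illustrative, not any lab's actual pipeline. The licensing payment happens exactly once, at acquisition, and every subsequent stage reuses the same corpus at no new cost.

```python
# Hypothetical sketch of why training-data purchases are one-time:
# the same licensed corpus is reused across every downstream stage.

def acquire_corpus(license_fee: float) -> list[str]:
    """One-time cost: the publisher is paid here, and only here."""
    print(f"Paid ${license_fee:,.0f} for the corpus (once).")
    return ["doc_1", "doc_2", "doc_3"]   # stand-in for billions of tokens

def run_stage(stage: str, corpus: list[str]) -> None:
    """Every later stage reuses the already-acquired data for free."""
    print(f"{stage}: reusing {len(corpus)} documents, $0 in new licensing.")

corpus = acquire_corpus(license_fee=10_000_000)
for stage in ["pretraining", "continual learning", "fine-tuning",
              "domain adaptation", "successor model family"]:
    run_stage(stage, corpus)
```

The asymmetry is the point: one payment event at the top, followed by an open-ended series of reuse events that generate no further revenue for the content owner.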
2. China and Open-Source Models Undercut Pricing Power
A growing share of frontier-model innovation now comes from Chinese labs — Baidu, Alibaba, 01.AI, DeepSeek, and others — and their models are trained on enormous corpora of unlicensed material. Many of these models are competitive with Western offerings and are increasingly released as open-source alternatives.
This creates a harsh pricing reality:
Why would Western companies pay premium rates for fully licensed corpora when their global competitors train comparable systems for free?
Even labs that want to train responsibly are squeezed by this environment. If the cost of “doing it right” slows their release cadence, raises their expenses, or results in even marginally weaker models, they risk falling behind competitors who don’t operate under the same constraints.
To be clear, this is not a defense of those practices, nor an argument that they are acceptable. It is simply the competitive landscape Western companies face. The existence of high-performing open-weight models trained on unlicensed data — whatever one thinks of their provenance — imposes a ceiling on what the commercial market is willing to pay for training data.
The pressure is structural, not moral. And it sharply limits the pricing power publishers can expect in the training market.
3. Courts Are Trending Toward Fair Use in Training
Two major rulings last summer — one involving Meta, one involving Anthropic — offered the clearest judicial signals yet around training-phase copyright.
In the Meta case, the court ruled that plaintiffs had not shown market harm from Meta’s use of books to train Llama. Because the model did not reproduce or substitute for the works, the court characterized training as transformative.
In the Anthropic case, the court drew a pivotal line:
- Training on lawfully acquired content: likely fair use
- Aggregating pirated books into a library for training: not fair use
These rulings expose the core legal challenge:
Unless a model regurgitates copyrighted text, plaintiffs struggle to prove the kind of market harm required to win.
The judicial trend increasingly favors the idea that training — standing alone — is transformative and therefore protected, which weakens the structural basis of a market for large-scale training-phase licensing.
4. The Future of Training Is Smaller and More Specialized
A final constraint is technical. The era of “just pour in more data” is ending.
AI leaders across the industry have acknowledged that frontier models have already consumed most of the high-quality text available on the public web. When you’ve already trained on the vast majority of useful public text, the marginal value of adding more general-purpose data declines rapidly. The next improvements in model capability will come from:
- highly specialized domain corpora
- well-structured technical datasets
- targeted refreshes rather than massive new ingestions
- data with deep internal organization, not broad volume
This is a very different market than many publishers imagine. It is not a recurring, everybody-wins licensing ecosystem. It is a narrow, specialist market where value is concentrated in specific domains at specific moments.
From Pattern to Practice
Training will remain part of the licensing landscape, but its ceiling is clear. Economic incentives push AI companies to minimize it, global competition limits pricing power, courts increasingly treat it as fair use, and technical progress reduces the need for bulk new data. These forces do not eliminate the training market, but they do define it: episodic, constrained, and incapable of supporting the broad, recurring revenue streams publishers are hoping for.
IV. Inference: The Economics Publishers Already Understand
If publishers want to see where AI can support a sustainable content market, they don’t need a new business framework. Many already operate inside one: the academic journals market.
Journals exhibit core traits of a healthy market:
- recurring demand
- user-driven value
- measurable interactions
- established monetization
- strong attribution
The economic event is tied to each individual use; need, value, and usage all recur. Inference behaves the same way.
When a student clarifies a method, a clinician checks a concept, or a researcher verifies a model, the AI system is performing an action that depends directly on authoritative content. Each interaction is:
- discrete and attributable
- measurable
- tied to user value
- and highly repeatable
Inference is driven by ongoing user need — not by a platform’s one-time training decision.
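As a rough illustration of why these traits make inference meterable, here is a hypothetical sketch of a per-query usage event. The record structure and the `record_usage` function are invented for illustration — not an existing API or any particular vendor's product. The idea is simply that each answered question can emit one discrete, attributable record, much as journal platforms already count article-level usage.

```python
# Hypothetical sketch of a per-query usage event for inference metering.
# Field names and record_usage are illustrative, not an existing API.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class InferenceUsageEvent:
    query_id: str                  # discrete: one event per user query
    cited_source_ids: list[str]    # attributable: which content was used
    publisher_ids: list[str]       # who should be compensated
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

def record_usage(event: InferenceUsageEvent) -> None:
    """Measurable: events like this can be counted, audited, and billed."""
    print(f"{event.timestamp.isoformat()} query={event.query_id} "
          f"sources={event.cited_source_ids} publishers={event.publisher_ids}")

# One user question, one countable event:
record_usage(InferenceUsageEvent(
    query_id="q-001",
    cited_source_ids=["doi:10.1234/example.2025.001"],
    publisher_ids=["example-publisher"],
))
```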
Figure 1: Market Comparison
| Market Signal | Training | Inference |
|---|---|---|
| Demand | One-time or infrequent; front-loaded | Continuous; every query creates demand |
| Buyer Base | A few AI labs | Broad: billions of users + institutions |
| Attribution | Weak; difficult to prove | Strong; traceable |
| Monetization | Limited | Multiple paths |
| Incentive Alignment | Labs seek to reduce data costs | Platforms need authoritative content |
Inference has the signals of a durable market. Training does not.
Intermission
We conclude Part One with a singular message: the training market will not become a dependable source of revenue for publishers. It is finite, episodic, and shaped by buyers whose incentives push them to reduce — not expand — their reliance on paid data. For many publishers, the meaningful upside from training may already be behind them.
The more important story is what comes next.
The real market is only now beginning to emerge, and it centers on inference rather than training. In Part Two, we will examine the historical patterns that emerging platforms follow, why AI is entering the same trajectory, and where publishers should position themselves as inference becomes the center of gravity, presenting an opportunity to reset the terms of engagement with technology providers.
Discussion
5 Thoughts on "Guest Post — AI Isn’t Going to Pay for Content … At Least Not How You’re Hoping It Will"
This is a clarifying piece that cuts through a lot of wishful thinking. The academic journals analogy for inference is particularly strong with recurring demand tied to measurable use. One question: if inference is where the value is, who actually pays? The AI companies per query, the end users through subscriptions, or enterprises through licensing? The buyer identity seems like it changes everything about market structure. Looking forward to Part Two!
Thanks for your question, Steve. That's exactly where I was hoping this article would lead, and it is the precise subject of Part Two (coming next week!).
Without giving away too much, the short answer is: the same audience that's paying today. The format, structure, and methods will likely be different, but the audience that finds value in the content is not changing. It's a cycle I call "The Great Reallocation".
Great article!
I just want to note that the author does what I struggle to avoid doing, which is to use words that imply that GenAI is conscious/sentient. "A model reads material, absorbs information, and later produces answers that reflect what it has learned." And later, the use of "understand". I use these products heavily in my work and personal life, and there are times I think, "they have to be lying to us about not having built an AGI, and that it's just pattern matching!" I had a GenAI tell me during a back-and-forth "discussion", "You are overthinking this." How could that possibly come from pattern matching? Yet, the rational part of me says, "it really is just patterns; it's just that it's on a scale so vast that humans can't begin to relate to it without injecting consciousness into the perception of what's happening." Then of course, there is the concept of "emergence", which is a fancy way of saying that something could happen with regard to intelligence/sentience that the programmers never intended, because of that scale.
So should we agree that when we use words about how the GenAI "learns" and "understands", there's a big asterisk pointing to a footnote saying "metaphorically, of course"? Or do we continue to fight to avoid those words?
It’s a fascinating question to entertain, for sure.
I chuckle to say this, but in the "early" days of generative AI (all the way back in 2023), it was much more obvious. ChatGPT was just regurgitating based on basic input, and it really felt like there wasn't any thinking taking place.
But now, as these agents have advanced, the infrastructure has improved, and "reasoning" models have entered the mix, the game has changed. They are literally taking input, reasoning with it, determining next steps, planning actions, executing them, gathering new information, and cycling through it all to generate answers. If you haven't used Claude Code, you should try it just so you can see what "AI thinking" looks like. It's incredible, as the steps so closely mirror my own experience of how I would attack problems. It's just much, much faster.
In a very real sense of the word, they _are_ thinking. But it's not the way we think, and it's not the same as the way we learn. When humans think and learn, it's as if we're fundamentally rewriting our underlying foundation model at the code level, which is roughly how training works for LLMs. The difference is that as humans we do it in real time, all the time.
Let me know where you land on your fight! Until then, keep it up 😀
Very insightful piece. Looking forward to Part II.