In recent months, several publishers have announced that they are licensing their scholarly content for use as training data for LLMs (large language models). These deals illuminate how major publishers are grappling with their strategy amid uncertainty, but thus far they have been unavailable to smaller and medium-sized publishers. To understand the dynamics of this fast-developing market, my colleagues Maya Dayan and Dylan Ruediger and I are launching a tracker of these licensing deals.

Salvator Rosa, Bandits on a Rocky Coast, Metropolitan Museum of Art (a painting with "narrative tension even without any identifiable story").

Strategy

Publishers have long wrestled with whether and under what conditions to license their content to discovery and aggregation services, but the LLM licenses are a different matter. Historically, the key considerations have been topics such as the money changing hands, exclusivity provisions, and the like. Publishers knew that access to the version of record was essential, and the question was only how to distribute it to maximize impact and/or economic value. The early days of the scholarly collaboration networks posed a different kind of challenge, since publisher content appeared on some of these sites under questionable circumstances. In the LLM landscape, publishers face not only what they once again see as rampant misuse of their content, but also real questions about the extent to which the version of record will retain its intellectual and economic value if high-quality summary and synthesis can be conducted by machines.

Different publishers are situated differently for this decision. For example, a subscription-focused publisher will be more concerned about whether access through an LLM will, over time, reduce the economic value of access to the version of record. By contrast, a pure Gold open access publisher may be more indifferent, if the scholarly need for — and therefore economic value of — certification is not likely to disappear in the near term. At the same time, it is possible that both will assess that direct access to the version of record will remain vital for scholarship and economically valuable over the long run.

So far, as the tracker shows, several major publishers have announced deals. For them, there is a substantial near-term revenue upside. One prominent absence is Elsevier, which was also slower than others to license content for access by certain discovery services, preferring to see how the market developed. These are strategic choices for all publishers, and not necessarily low-risk ones.

The Deals

The basic idea behind these deals is to generate revenue for the publishing house in exchange for easy, reliable, and legal access to the content for the LLM. A number of companies are in the hunt for this content, including not only OpenAI and Google but also Apple and more specialized providers. Investment has been pouring in as a result of the market's spike in interest in artificial intelligence, so striking deals now allows publishers to cash in before this investment dries up. Wiley has even established an executive-level general manager position to drive forward opportunities here.

There is plenty of background chatter about what negotiators at one publishing house can learn from those at another, but thus far no standard set of terms or overall model appears to have emerged from which to build these deals. Pricing, of course, is top of mind for everyone, but there are many other considerations as well. To take several straightforward examples: there are technical and reputational questions about how corrections or retractions will propagate through an LLM and whether an author can opt out, and there are business model issues such as whether provenance will be tracked through an LLM's output so that a citation or link can be provided back into the scholarly record.

The Tracker

Please take a look at the tracker, which we’ll update periodically. If you are aware of other deals that we have not yet documented in this tracker, please contact me either in the comments below or privately to share details, on or off the record. 

Roger C. Schonfeld

Roger C. Schonfeld is the vice president of organizational strategy for ITHAKA and of Ithaka S+R’s libraries, scholarly communication, and museums program. Roger leads a team of subject matter and methodological experts and analysts who conduct research and provide advisory services to drive evidence-based innovation and leadership among libraries, publishers, and museums to foster research, learning, and preservation. He serves as a Board Member for the Center for Research Libraries. Previously, Roger was a research associate at The Andrew W. Mellon Foundation.

Discussion

4 Thoughts on "Tracking the Licensing of Scholarly Content to LLMs"

As a publisher, our first question was whether the content licensed was purely for the model to learn from, or whether there would be outputs of text and figures taken directly from our published content. We were assured it is purely for learning purposes, but how this would be monitored down the line I am not sure. Thanks for the tracker, great work.

Curious to know if contractually the use of your content was time-limited and that whatever you provided was to be deleted/returned to you at some point?

The contract we were sent was for two years, with auto-renewal every year. The termination clause is a little unclear. It says that upon termination, new agreements (content) will not be added. Preexisting agreements (content) will continue under the terms of the aggregator's agreement. And upon termination of the aggregator's agreement with the AI company, content must be expunged (not the agreement between the publisher and the aggregator, presumably).

To my knowledge, Springer Nature also has a licensing deal, but I'm not sure it's been announced. Perhaps now that they are listed, the details may come through in their public accounts.
