In recent months, several publishers have announced that they are licensing their scholarly content for use as training data for large language models (LLMs). These deals illuminate how major publishers are grappling with their strategy amid uncertainty, but thus far they have been unavailable to small and medium-sized publishers. To understand the dynamics around this fast-developing market, my colleagues Maya Dayan and Dylan Ruediger and I are launching a tracker of these licensing deals.
Strategy
Publishers have long wrestled with whether and under what conditions to license their content to discovery and aggregation services, but the LLM licenses are a different matter. Historically, the key considerations have been topics such as the money changing hands, exclusivity provisions, and the like. Publishers knew that access to the version of record was essential and the question was only how to distribute it to maximize impact and/or economic value. The early days of the scholarly collaboration networks were a different kind of challenge, since publisher content appeared on some of these sites under questionable circumstances. In the LLM landscape, publishers face not only what they once again see as rampant misuse of their content, but also real questions about the extent to which the version of record will retain its intellectual and economic value if high-quality summary and synthesis can be conducted by machines.
Different publishers are situated differently for this decision. For example, a subscription-focused publisher will be more concerned about whether access through an LLM will over time reduce the economic value of access to the version of record. By contrast, a pure Gold open access publisher may be more indifferent, if the scholarly need for certification, and therefore its economic value, is not likely to disappear in the near term. At the same time, it is possible that both will conclude that direct access to the version of record will remain vital for scholarship and economically valuable over the long run.
So far, as we see in the tracker, several major publishers have announced deals. For them, there is a substantial near-term revenue upside. One prominent absence is Elsevier, which was also slower than others to license content for access by certain discovery services, preferring to see how the market developed. These are strategic choices for all publishers, and not necessarily low-risk ones.
The Deals
The basic idea behind these deals is to generate revenue for the publishing house in exchange for easy, reliable, and legal access to the content for the LLM. A number of companies are in the hunt for this content, including not only OpenAI and Google, but also Apple and more specialized providers. Investment has been pouring in as a result of the market’s spike in interest in artificial intelligence, and so striking deals now allows publishers to cash in before this investment dries up. Wiley has even established an executive-level general manager position to drive forward opportunities here.
There is a lot of background chatter as negotiators at one publishing house try to learn from those at another, but thus far no standard set of terms or overall model from which to build these deals appears to have emerged. Pricing is, of course, top of mind for everyone, but there are many other considerations as well. To take several straightforward examples, there are technical and reputational questions about how corrections or retractions will propagate through an LLM and whether an author can opt out, and there are business model issues such as whether provenance will be tracked through the output from an LLM so that a citation or link can be provided back into the scholarly record.
The Tracker
Please take a look at the tracker, which we’ll update periodically. If you are aware of other deals that we have not yet documented in this tracker, please contact me either in the comments below or privately to share details, on or off the record.
Discussion
9 Thoughts on "Tracking the Licensing of Scholarly Content to LLMs"
As a publisher, our first question was whether the content licensed for use was purely for the model to learn from, or would there be outputs of text and figures directly taken from our published content? We were assured it is purely for learning purposes, but how this would be monitored down the line I am not sure. Thanks for the tracker, great work.
Curious to know if contractually the use of your content was time-limited and that whatever you provided was to be deleted/returned to you at some point?
The contract we were sent was for two years with auto renewal every year. The termination clause is a little unclear. It says upon termination, new agreements (content) will not be added. Preexisting agreements (content) will continue under the terms of the aggregator’s agreement. And that upon termination of the aggregator’s agreement with the AI company (not the agreement between the publisher and the aggregator, presumably), content must be expunged.
To my knowledge, Springer Nature also have a licensing deal, but I’m not sure it’s been announced. Perhaps now that they are listed the details may come through in their public accounts.
I suspect large publishers who have the content and ability to broker a deal may be best positioned to do so now or very soon, not just for short-term gains but also to contractually ensure they have a say in how their content is being used and will be used in the future. Because it’s not just that “…striking deals now allows publishers to cash in before this investment dries up,” it’s that striking deals now allows publishers to cash in before the large LLMs get their content anyway, without the agreements. I second the thanks for this tool.
Thanks Roger (and Maya and Dylan). I just published an updated/revised version of an early-2024 article that treats these deals as an extension of surveillance publishing:
https://kula.uvic.ca/index.php/kula/article/view/291
I’m sure there will be many more such deals to come.
Thanks Jeff, I’ll share the link.
As a scholarly writer I’m appalled. Routledge didn’t even inform authors, let alone offer an opt out. There was no agreement or any information for authors about how our work is going to be used. No protection.
From correspondence with decision-makers it’s clear they don’t have a concept for royalties. There is no assurance that our work won’t be chopped up and spit out mashed up with garbage scraped from the Internet and other writers’ words.
One stolen book represents 15 years of research and careful development, all original artwork and figures. Now just bits and fragments for billionaires’ profit.
https://salmons.blog/2024/08/05/routledge-sells-out-authors-to-ai/
The tracker should have columns for author opt in/opt out and author compensation yes/no. The bias implicit in the construction of the tracker is in favor of the publishers and the revenue generated thereof, notwithstanding the note that Sage will pay royalties (bravo).