A robust conversation has been brewing about licensing and artificial intelligence (AI). It isn’t the lawsuit question, which is proceeding apace and, in my estimation, will likely lead to some form of settlement in the majority of cases. The conversation I’m considering is about licensing content for use in AI systems. Several publishers see this as an opportunity. Many authors have serious concerns. While the income from these deals seems big, is it really? Will it be enough? Will it bring any resources to support scholarly publishing and the digital transitions that still lie before it? Importantly, does it protect the interests of authors, particularly those of scholarly authors who care most about attribution? There will be time enough to assess the business opportunity. The questions about the interests of professional, scientific and scholarly authors seem to be less considered. Will their ideas be accurately represented? Will they get credit for their ideas?

Over the summer, I had the opportunity to speak with several academics who are authors with Taylor and Francis (T&F) imprints about the recently announced licensing deal for AI applications. They expressed some of the same concerns covered in a July article in The Bookseller by Matilda Battersby on the T&F licensing deal with Microsoft and likely other technology companies. The authors complained that they didn’t have a say in any deals; they questioned whether they would see any payment; they were concerned about their work being regurgitated by potentially inaccurate machines; and they wondered whether they would get any recognition for their work.

Image: a person stacking wood blocks that read "Find Credible Sources and Cite Them". Caption: A basic component of scholarly research is the citation, and scholarly authors need them in lieu of paid compensation.

In June, John Wiley and Sons announced it had secured a similar licensing deal, a “$23 million content-rights project for training generative AI with ‘a technology company’”, which was not identified. Another deal was hinted at as being finalized with a “second big technology company” in 2025. Also this summer, CCC announced a collective license opportunity for AI, signaling broad interest in securing licensing arrangements between publishers and the technology companies.

While the attention drawn by this new market and the hype that surrounds it is justified, $44 million represents just 2.35% of Wiley’s 2023 gross revenues. Whether these figures will grow significantly is an open question, but given the number of AI developers, the size of the deals announced, and the number of content providers seeking funds from those developers, caution is warranted about the overall income impact across the entire scholarly ecosystem. Obviously, in a publishing industry of modest margins, a lift of that size could be the difference between profitability and loss.

Many publishers are entering into license deals with AI tool developers of various sorts. Few publishers are making announcements about the deals (to be fair, licensing deals are generally not the things of press releases), but some companies refer to them in their earnings calls with investors, which, surely, few authors pay attention to. It is also a rare licensing deal in which an author has any involvement or any say. This arrangement is embedded in practically every copyright transfer agreement. As most authors don’t understand the secondary rights market, most probably skim over the section without a second thought. Even those who considered it closely likely didn’t envision the clause’s application to AI licenses. It might be worth lingering on those clauses in the future, as licensing to AI is one of those secondary markets. In their description of the deal, a Wiley spokesman said, “Wiley authors are set to receive remuneration for the licensing of their work based on their ‘contractual terms’”.

Of course, T&F, Wiley, and the other publishers will adhere to the letter of their agreements and pay the authors their due. If you are an author with any of the larger scholarly publishing houses, it would be worth having a look at your next royalty statement to see if there are any fees paid for these secondary licensing deals with AI companies and how the income is described.

With some exceptions, most scholarly monographs do not generate much of a financial return for the authors. In the monograph space, a slim majority of books are marginally profitable for the publishers, and royalty rates on scholarly books range from 5% to 15%. In an era when most monographs might only sell a few hundred copies, any total profit or total royalty to the author will be modest at best. In this market, the idea of an untapped licensing opportunity for additional revenue from AI companies should be viewed as welcome, with some caveats.

Of course, the journal business has operated on a very different model where author rights are transferred without any compensation. Increasingly, the trend has been for authors to pay for content distribution through open access article processing charges.

Considering all the authors of scholarly articles as well as monographs, a significant majority of the content creators of scholarly literature don’t see any compensation at all from publishers. In the realm of compensation for their publishing work, it is understood that academics generally see the benefit of authorship through secondary results of having published something. This could be through enhanced reputation and recognition, but most importantly through promotion and professional advancement.

To highlight this, one could reflect on a central tenet driving much of the open access movement since its origins: that author attribution is the only real value that authors should be concerned with. The Budapest Declaration stated, “the only role for copyright in this domain, should be to give authors control over the integrity of their work and the right to be properly acknowledged and cited.” The theory is stated like this: Since authors generally aren’t compensated for their contributions, that content should be distributed using a CC BY license to be truly “open access”. Licenses carrying further restrictions, such as CC BY-NC (non-commercial) or CC BY-SA (share-alike), were somehow meaningfully less open and were therefore criticized. Even when this openly available content is ingested by AI for training purposes, the requirement for attribution remains.

A key problem with increasing the secondary license market for academic texts by extending licenses to incorporate those texts into AI tools is that it has the potential to break the reputation ecosystem that supports scholarly publishing. Much of this secondary compensation market for researchers relies on the world of citation. Without digging into the well-worn arguments against and earned criticisms about the validity of citation as an assessment metric, it does survive and even flourish. However, if the market for attribution is a key driver motivating authorship, and AI tools are generating content or providing search results without attribution, this could become a significant challenge in our community. If AI tools do not have the capacity to reference back to the source material in their outputs, the reputation economy and the benefits that accrue to authors as a result will begin to fail. Some of the more popular tools have some form of citation functionality, such as Scite.AI, Perplexity and Consensus. Whether one can trust the citations derived from these tools is an important research question. Regardless, these tools are constantly improving, and functionality that is unavailable today could arrive soon. Publishers could use the licenses they are negotiating to push these tool developers to advance the incorporation of citation in their tools.

It is not just academic authors who care about attribution. Attribution was mentioned in a recent article noting a company-wide email from Condé Nast CEO Roger Lynch. However, attribution certainly means less than revenue for content creators who are paid directly for their writing. Realistically, if AI continues to disrupt popular media and news publishing as search has done in the past two decades, those industries could have trouble surviving in a world where paid authors have to compete with freely generated AI-based text.

If online distribution via LLMs (Large Language Models) and licenses to machine reading tools will be an increasingly important element of online consumption, and presumably “reading” all the world’s content is extremely valuable and profitable to AI developers, shouldn’t the contributions to that endeavor be more valued? Imagine if all the world’s search, discovery, analysis and regurgitation of content happens through AI systems and not through traditional usage of the titles by people obtaining and then reading a book. Much as AI search provides the answer rather than a link to a page that has the answer, there should be a concern that people will begin to skip the stage of actually reading the original content and possibly not even know it came from Author So-And-So. In this situation, the scholarly author will see some fraction of actual sales through licenses in future royalty statements but may also lose out on the attribution that is the main source of their compensation for the work.

There are likely only a few ways the tech community will be driven to incorporate attribution and the connectivity of culture and science. The first is because users demand it, which is not likely in most cases. It may happen in a research context and for research-specific tools, but it is far less likely in most other general interest tools. And as we have seen with consumer technology, the popular use cases are far more likely to become the norm than the scholarly option.

The second approach to force attribution in AI tools would be through litigation. Here again, most of the lawsuits around AI are brought by content creators who are accustomed to being paid for their work and who seek monetary damages for copyright violations. Even though the case is seemingly just as strong as the cases focused on “all rights reserved” content, the lack of adherence to the copyright stipulations regarding open access content isn’t likely to drive litigation. There are not likely to be many lawsuits alleging a breach of the attribution clause in the CC BY licenses that attach to most openly available content.

The final, and possibly the best, strategy would be to include a requirement for citation and attribution in the license to use the content. This is an important opportunity for the publishing community to engage AI developers and insist that attribution become a vital component of any future technical development of AI tools.

In this way, the modest details about the T&F deal are significant and should be commended as an important first step in this regard. The announcement details efforts to “collaborate to further develop automated citation referencing, using the latest technology to improve speed and accuracy” and “alignment on the importance of detailed citation references.” It would be interesting to see how detailed these sections of the deal are and whether the “collaboration” described extends to contractual requirements for citation functionality in AI output as much as it has “limits on verbatim text extracts”. Certainly, the contract covers specific details about the latter, but I expect the former is a bit vaguer. From the perspective of the author community, the former is more important than the latter. It is relatively easy to recognize a “copy and paste” extract and to put limits on how much is done. Technically, it is much harder to generate an output that accurately states “This section is a summary derived from this author and her ideas as described in her book, ______” as one might do in a traditional reference.

Hopefully, other publishers will keep in mind the interests of their authors when negotiating and agreeing these licenses. Many publishers will. However, attribution should be core to every license signed by every publisher or intermediary with any AI tool developer. No license agreement should be signed with an AI company unless it is explicit that attribution is a requirement. There is more to the scholarly publishing market than dollars, particularly for the authors.

Todd A Carpenter

Todd Carpenter is Executive Director of the National Information Standards Organization (NISO). He additionally serves in a number of leadership roles of a variety of organizations, including as Chair of the ISO Technical Subcommittee on Identification & Description (ISO TC46/SC9), founding partner of the Coalition for Seamless Access, Past President of FORCE11, Treasurer of the Book Industry Study Group (BISG), and a Director of the Foundation of the Baltimore County Public Library. He also previously served as Treasurer of SSP.

Discussion

5 Thoughts on "Ensuring attribution is critical when licensing content to AI developers"

Excellent post by Todd Carpenter as he outlines the issues and potential solutions to the use of published content to build LLMs. There are still issues to be addressed, as we have witnessed a number of published articles being retracted for a variety of reasons. Once these articles are part of the corpus of data for the LLMs, the truth of the corpus is stained. Will the LLM developers go back and remove the retracted papers? Will they advise the users of their LLM of this activity?

Once this genie is out of the bottle it will be challenging to correct.

The STM publishing industry must come together to discuss these issues and create principles that will ensure that any LLM’s corpus of data is pure (truth)!

Great post! I had not thought through the importance of attribution issues in this context, that “there is more to the scholarly publishing market than dollars, particularly for the authors.”

In that same light I’d highlight, and even challenge, your “attribution certainly means less than revenue for content creators who are paid directly for their writing.” Regardless of fees paid (and they’re never much), attribution/recognition of any sort is endlessly important, for all authors, for all kinds of reasons.

Thank you Thad for your comments. To be clear, I agree that attribution and recognition are always important regardless of how you are compensated, whether it is directly from the publisher (or even the consumer) or via secondary means like professional advancement. My point is that those who lack the benefit of a full-time position providing income and stability are more focused on the compensation from their writing. One important value of credit for those who do rely on income from their writing is that positive attention often leads to more opportunities to write.

Great and provocative article… seems like a ground truth at first….

The challenge as I see it is that AI attribution via citation or other means gets confused in a generative model, where an incredible number of sources are considered and combined via the algorithms to provide the eventual answer. Which should be given attribution when varying points of view have been reviewed and the programs reach the consensus response? I understand the idea that directly quoted paragraphs or even sentences should always be attributed. It seems straightforward to do so until we realize that the computers have found that quoted material in several places. Should all sources be quoted? Even those quoting the quotes? Redundancy in the data input causes confusion in the identification of sources in very Large Language Models. Training vectors built on training vectors quickly remove the information tidbits from the source links. Where the corpus is small, say less than 200,000 articles, pathways to the source are still clear. But when the LLM is millions or even billions of bits of source data, it is less clear which data it was and where it originated. Even the original material is altered in the ongoing operations, and so the source is different. So far, we discourage quoting the LLM and chatbot responses. I agree that attribution is the main reward for most authors, especially those in learned and scholarly areas. Optimistically, improved reputation and eventually possible remuneration follow. Attribution comes mainly via citation, which to date LLMs are poorly suited to provide. They will, however, provide a bibliography for their response when asked. Perhaps that would work for now as a resource citation, a.k.a. attribution (without compensation).

The compensation questions should be answered at the time of publishing and rights assignment, along with the agreed compensation models. When the rights are reassigned, or crawling / ingesting by the LLM is allowed, those rights and associated compensation should follow the published work into the LLM. I am guessing there is some remedial work to be arranged for the works already ingested.

Not everything gets a DOI; there are costs and metadata requirements that go along with them, as well as administrative activities. But anything could be assigned some kind of PID, even at the paragraph level for LLM ingestion. The LLM itself could set up such a system. It already tracks snippets and vectors, why not content sources?
