Editor’s Note: Today’s post is by Amy Brand, Dashiel Carrera, Katy Gero, and Susan Silbey. Amy is Director of the MIT Press and Co-founder of the MIT Knowledge Futures Group. Dashiel is a Visiting PhD Researcher in Computer Science at Columbia University. Katy is a human-computer interaction researcher focused on creativity, writing technologies, and the ethics of AI. Susan is the Leon and Anne Goldberg Professor of Humanities and Professor of Sociology and Anthropology at MIT, with an additional appointment in the Sloan School of Management.
The rise of large language models (LLMs) is reshaping knowledge production, raising urgent questions for research communication and publishing writ large. Drawing on qualitative survey responses from over 850 academic book authors from across a range of fields and institutions, we highlight widespread concern about the unlicensed use of in-copyright scientific and scholarly publications for AI training. Most authors are not opposed to generative AI, but they strongly favor consent, attribution, and compensation as conditions for use of their work. While the key legal question — whether LLM training on in-copyright content is a fair use — is being actively litigated, universities and publishers must take the lead in developing transparent, rights-respecting frameworks for LLM licensing that consider legal, ethical, and epistemic factors. These decisions will shape not only the future of authorship and research integrity, but also broader public trust in how knowledge is created, accessed, and governed in the digital age. In yesterday’s post, we discussed our survey results, and today we offer recommendations for stakeholders.
This is a decisive moment in academic knowledge production. It is clear from the rich and diverse survey responses we collected that researchers who write books are excited overall about the potential of LLM-enhanced research, even as they fear the consequences of unregulated training use of in-copyright publications. Academic stakeholders — including publishers, libraries, and university leaders — should consider the following measures to avoid unintended harms and preserve core scientific and academic values.

Engage in multi-stakeholder scenario planning
As the research community navigates complex questions about the use of published content and research data in AI model training, it must balance concerns over unauthorized use against the desire to train models on trustworthy sources. Multi-stakeholder study is required to clarify the legal, ethical, and practical implications of these issues and to offer initial guidance. Ideally, such study would consider different plausible scenarios and options, modeling long-term impacts in order to make informed recommendations on the training use of scientific and scholarly content.
Support authorial consent-based licensing
Make clear that authors retain rights over how their work is used. Overwhelmingly, the book authors surveyed were in favor of consent-based licensing, with a preference for an opt-in model in which every author must provide consent in order for their own works to be included in training data. We recommend defaulting to the opt-in model for any book licensing deals, at least until open questions about longer-term impacts are resolved. Where opt-out must be used instead, ensure the process is prominently visible and easily accessible, as opt-out regimes can produce consent in name only when authors struggle to find or exercise their opt-out options.
Engage in short-term partnerships
Given the fast-paced nature of generative AI development, engage in short-term deals that are renegotiated at pre-specified points, e.g., every two years. Put guardrails in place to ensure faithful compliance with license terms and permitted uses.
Require attribution in LLM outputs
Require AI partners to adopt traceable citation systems, and invest in infrastructure for provenance tracking, such as metadata tagging and knowledge graphs. Support approaches like Retrieval-Augmented Generation (RAG) or the Model Context Protocol (MCP) over model pre-training, as these offer more reliable attribution mechanisms and are easier to modify if licensing terms change in the future.
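To make the attribution mechanism concrete, here is a minimal sketch, in Python, of how provenance metadata might travel with retrieved passages through a RAG pipeline so that model outputs remain traceable to licensed sources. The corpus, metadata fields, and keyword-overlap scoring are hypothetical stand-ins for illustration, not any particular publisher’s or vendor’s system.

```python
# Minimal sketch: attribution-aware retrieval for a RAG pipeline.
# All names and metadata fields here are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class Passage:
    text: str
    # Provenance metadata travels with every retrieved chunk, so a
    # generated answer can cite, and be audited against, its sources.
    doi: str
    author: str
    license: str  # e.g., the negotiated training/RAG license terms


def retrieve(query: str, corpus: list[Passage], k: int = 2) -> list[Passage]:
    """Toy keyword-overlap retrieval; a real system would use embeddings."""
    q_terms = set(query.lower().split())
    ranked = sorted(
        corpus,
        key=lambda p: len(q_terms & set(p.text.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query: str, passages: list[Passage]) -> str:
    """Number the sources so the model can emit traceable [n] citations."""
    context = "\n".join(
        f"[{i + 1}] (doi:{p.doi}; {p.author}) {p.text}"
        for i, p in enumerate(passages)
    )
    return (
        "Answer using only the numbered sources below, citing them as [n].\n"
        f"{context}\n\nQuestion: {query}"
    )


corpus = [
    Passage("Opt-in consent gives authors control over training use.",
            doi="10.0000/example.1", author="A. Author", license="opt-in"),
    Passage("Provenance metadata enables attribution in model outputs.",
            doi="10.0000/example.2", author="B. Author", license="opt-in"),
]
print(build_prompt("How can outputs attribute sources?",
                   retrieve("attribution in model outputs", corpus)))
```

Because retrieval happens at query time, withdrawing or relicensing a work is a matter of updating the corpus and its metadata; this is one reason RAG-style access is easier to adapt to changing license terms than re-training a model.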
Screen potential AI partners for social and environmental responsibility
There are many AI organizations interested in licensing scientific and scholarly work, and their alignment with prosocial values varies. Consider the climate impact and public benefit commitments of potential AI partners, as well as how accessible their models are to different communities, and at what cost. Include these as metrics in Requests for Proposals (RFPs) and licensing negotiations.
Clarify for authors their current legal rights
Fair use in US copyright law allows limited use of copyrighted material without permission for purposes like criticism, comment, news reporting, teaching, scholarship, research, and parody. Clarify for authors that LLM training currently falls into a legal gray area, in large part because it can create outputs with the potential to compete in the marketplace with the content the model was trained on. Courts will have to determine whether LLM training is fair use on a case-by-case basis, presumably drawing on detailed economic analysis of its impact on authors and content owners.
An important question at this time is how LLM training relates to “text and data mining” (TDM), understood as a transformative fair use exemption to copyright law. The two practices may appear similar, yet it can be argued that they serve different goals and involve different data processing techniques, algorithms, computational resources, learning paradigms, and outputs.
Conclusion
The proliferation of LLMs challenges both access to verifiable knowledge and the methods of producing new knowledge. Notably, the most popular LLMs currently impose barriers to access through paid subscription, and they also tend to be the least transparent. Using published works en masse to train LLMs poses further risks of enclosure and privatization of knowledge itself. While better training data can improve models across the board, it does not eliminate fundamental limitations of LLMs, especially as they are subject to gaming, bias, and interference. Furthermore, there isn’t yet strong evidence that the hallucinatory behavior of LLMs is ameliorated by improved training content. The slogan “quality in, quality out,” now common in discussions of LLM training, is sometimes taken as a moral imperative to include published science and scholarship — potentially as a fair use — in LLM training, even though there is a strong argument that LLM hallucination is more of a system feature than a “bug.”
It behooves the academic community to collaborate in the design of generative AI technologies and, simultaneously, to work with legally and organizationally expert colleagues to create evidence-based policies that enhance, rather than diminish, the research enterprise. Institutional decision making in the area of generative AI requires proactive scenario modeling to avoid the pitfalls of ignoring underlying complexities or focusing on short-term benefits. We can learn here from the recent history of the open access movement, in which the laudable, uncontroversial goal of expanding access to published knowledge resulted in greater privatization and commercial consolidation in scientific and scholarly publishing. This unintended outcome is arguably the result of underinformed policy-making that pushed the journal publishing market to change too quickly rather than letting it evolve more gradually.
How, and under what conditions, published science and scholarship is used to train LLMs is not simply a copyright debate. In the end, it may not be governable through current copyright doctrines and pending legal decisions on the question of fair use. At its core, the incorporation into LLMs of published scholarship is a question of who determines and controls what constitutes trustworthy knowledge in the digital era. The inexact correspondence between words and their meanings has itself long been an area of academic study, and LLMs both exaggerate and obfuscate the problem. If the academic community fails to act, we contribute to the divide between communication and truth. We also risk allowing the work of scientists and scholars to be appropriated by private actors who have no accountability to the mission of research institutions. As one author put it: “Why would I want to enrich, at nearly no benefit to myself, a small number of massive US tech platforms? Publishers like MIT Press are not powerless. They have a central position holding the main thing which firms like OpenAI are desperate to get hold of, i.e., more high-quality training data for their models. This doesn’t just give them a trivial resource to cash in for pennies … it gives them the ability and the right to take a normative stance in shaping the future we are going to be forced to live in.”
In order to help construct new AI systems that support the production of trustworthy, accessible and verifiable knowledge, university leaders, publishers, and other stakeholders must act swiftly — and in collaboration — to build frameworks that preserve the interests of authors, the integrity of science and scholarship, and the future of knowledge itself.
Discussion
In answer to your headline question – the publishers control knowledge. This is self-evident.
If, as appears to be the case, we cannot trust publishers with our knowledge, it is time we found someone we CAN trust.
Given that AI companies are pirating huge amounts of content without the permission of publishers, is this really the case? (https://authorsguild.org/news/meta-libgen-ai-training-book-heist-what-authors-need-to-know/)
I cannot believe that the sophisticated systems operated by modern publishers have been quite unaware of what has been going on.
And, if publishers are happy to countenance piracy by global re-publishers, they really have no case against the Anna’s Archives and Sci-Hubs of this world, who do not even (so far as I know) make a profit from their activities.
I’m not quite sure what you’re getting at here. Publishers are very aware that their works are being pirated (as is the case for pretty much every digital file publicly available), and given the many, many lawsuits against AI companies, including several that have shown that the AI companies are illegally using pirated content for training their AIs, I think it’s not accurate to say that they are “unaware”.
And publishers do not countenance piracy, and have similarly filed (and won) many court cases against Sci-Hub and Libgen. Unfortunately, they are run in areas of the world where international laws don’t seem to be followed, and/or they continuously change their hosts once each one is shut down by authorities. Note also that these services have been credibly accused of being run by Russian intelligence agencies over whom publishers have little sway (https://www.washingtonpost.com/national-security/justice-department-investigates-sci-hub-founder-on-suspicion-of-working-for-russian-intelligence/2019/12/19/9dbcb6e6-2277-11ea-a153-dce4b94e4249_story.html).
You’re quite right to take me up on this – I know very well, and had ignored, that there have been court cases brought by publishers against LLM companies, although I didn’t know that there have been ‘many, many’. My apologies.
One judge held that it was reasonable to tear books apart to scan their contents, but not to download them from pirate sites to do so. This presumably means that anyone can do so legally, and sell the contents at a profit, as Anthropic will. It is unhelpful to see that the judge in the Meta case, while tangentially supporting copyright, said that his ruling ‘stands only for the proposition that these plaintiffs made the wrong arguments’.
I am also inclined to believe that Russia, and sympathetic nations, are probably behind attempts to undermine western academic publishing … although since much of Russian education now rests on a warped, rewritten, evidence base it is probably against their interests to support the spread of peer-reviewed, unbiased, knowledge.
Nevertheless, my feeling is that academic publishers have had it far too soft for far too long, and the AI onslaught is entirely deserved. The problem here is that, at the moment at least, it is no substitute for an efficient Open Access system.
My understanding of the more recent case you mention is that the judge (and the judge in a related case) both held that using content for AI training is transformative and fair use (although this is being appealed, is the focus of many more lawsuits, and remains unsettled), but also that 1) AI companies are violating the law if they illegally obtain that content (e.g., pirating it) and are likely liable for copyright infringement damages for doing so, and 2) while the training may not be a copyright infringement, the outputs of the AIs may indeed be infringing (hence the judge saying they made the wrong argument by going after ingestion, rather than outputs): see https://www.ce-strategy.com/the-brief/normalizing/#3
The idea of Russian intelligence interference is less about undermining academic publishing, and more about using it as a doorway to harvest identities and passwords useful for getting into university and corporate systems to grab research and financial data.
All that said, the publishers may not be the victims here (and at least in this article, the questions are more about protecting authors and their rights). Publishers are already profiting from selling legal access to materials for AI training and will continue to do so, even after it may or may not be declared fair use. Tech companies need legal access and they need easy access to huge amounts of content, and will likely pay for that access. Many companies (think Pharma) are going to want to have their own, private, internal AIs that they can train (rather than having their trade secrets exposed to public models), and they’ll license content as well. In the long run though, I think that most academic research is far too esoteric to be of much value in training LLMs, but will be of incredible value for trained LLMs to use for Retrieval Augmented Generation (RAG). In those circumstances, a Pharma company or a university would purchase a general purpose AI, which would then be able to access publisher content for specific queries, depending on which content the university subscribes to. Wiley is already doing this.