Editor’s Note: Today’s post is by Amy Brand, Dashiel Carrera, Katy Gero, and Susan Silbey. Amy is Director of the MIT Press and Co-founder of the MIT Knowledge Futures Group. Dashiel is a Visiting PhD Researcher in Computer Science at Columbia University. Katy is a human-computer interaction researcher focused on creativity, writing technologies and the ethics of AI. Susan is the Leon and Anne Goldberg Professor of Humanities and Professor of Sociology and Anthropology at MIT, with an additional appointment in the Sloan School of Management.

The rise of large language models (LLMs) is reshaping knowledge production, raising urgent questions for research communication and publishing writ large. Drawing on qualitative survey responses from over 850 academic book authors across a range of fields and institutions, we highlight widespread concern about the unlicensed use of in-copyright scientific and scholarly publications for AI training. Most authors are not opposed to generative AI, but they strongly favor consent, attribution, and compensation as conditions for use of their work. While the key legal question — whether LLM training on in-copyright content is a fair use — is being actively litigated, universities and publishers must take the lead in developing transparent, rights-respecting frameworks for LLM licensing that consider legal, ethical, and epistemic factors. These decisions will shape not only the future of authorship and research integrity, but also broader public trust in how knowledge is created, accessed, and governed in the digital age. Here we discuss our survey results; tomorrow we will offer recommendations for stakeholders.

Research institutions in the United States and elsewhere are navigating a moment of extraordinary potential and profound threat. Generative AI technologies, including large language models (LLMs), promise transformative pathways for scientific discovery, new knowledge, and accelerated learning. At the same time, they raise urgent questions about the honest and appropriate use of published content. The rights of authors and creators, and the longer-term integrity of our knowledge production ecosystems, are also at stake. These tensions are unfolding against a backdrop of sustained political attacks on science and higher education, declining public trust in institutions generally, rampant mis- and disinformation in our information channels, and growing efforts to commodify and privatize knowledge. As fabricated information thrives, undermining the value of facts supported by reliable and valid empirical evidence, how do we signal trust and verify assertions of truth, including scientific claims? In short, who controls the future of knowledge?

[Image: an open book exuding digital code]

In late 2024, the MIT Press surveyed ~6,000 of its book authors on attitudes towards LLM training practices and received over 850 qualitative survey responses from authors around the world, working in fields across the academic spectrum, with strong representation in STEAM (science, technology, engineering, arts, and mathematics) areas. The purpose of the survey was to inform the Press’s LLM licensing and partnership practices, in particular to ensure those practices align with the priorities of researchers who publish long-form works. The anonymized, coded author comments reveal deep discomfort with the widespread unlicensed use of their published work to train LLMs. Indeed, many view training on their in-copyright work without consent as a form of exploitation for commercial gain, and the unregulated growth of LLMs as a potential threat to the core mission of research institutions to advance knowledge and pursue truth.

That said, a clear majority of respondents expressed support for well thought-out partnerships between academic publishers and LLM developers, and interest in contributing to LLM-driven innovations in knowledge discovery under the right conditions. Those conditions include licensing transparency, attribution, and consent — principles that align with academic, prosocial values. They also include fair compensation, advocated not only from self-interest but also from an interest in sustaining non-commercial academic publishers like university presses and scientific societies.

The findings reported below offer a powerful foundation for academic and public policies. As institutions committed to the public good, universities should move swiftly and deliberately to establish multi-stakeholder governance and develop evidence-based AI policies and practices that embrace innovation without abandoning long-held academic and scientific values. It is essential for institutions of higher learning to strike the right balance in current debates over how scholarly, scientific, and creative work is used to train LLMs by taking legal, ethical, and epistemic factors into consideration.

LLMs depend on vast quantities of textual and other data for training, much of it currently being scraped without permission from books, journals, and websites. This enables them to perform a variety of useful tasks, such as producing contextually relevant text, inferring information, translating languages, summarizing documents, and answering questions. Published scholarship in science and other disciplines is uniquely valuable as training data for LLMs. Books have been shown to be especially effective in improving model performance due to their high-quality prose and long-form coherence.

While the legality of training on in-copyright works without prior authorization is currently being tested in the courts, the epistemological stakes are already clear. Authors and other creators are not merely “content producers”; they are producers of reliably truthful, original, and understandable explanations about the world and how it works. The experience of reading a book grants the reader genuine insight into someone else’s perspective and a depth of understanding that, for example, a fleeting social media update rarely achieves. As one of our respondents wrote, “Some of the most important moments of people’s lives are in the deep, rich encounters with written work; they shape who we are and who we become. Why would we seek to rip this up into an abstracted mess of training data, a series of trivial and often incorrect Cliff Notes and factoids?” Overall, the authors who responded to our survey seek to reap the promise of these new technologies without undermining incentives to consume and produce original long-form works.

Survey findings

The MIT Press survey asked authors whether and under what conditions they would support the licensing and use of their work for LLM training. The findings are clear: authors are not opposed to generative AI per se, but they are strongly opposed to unregulated, extractive practices and worry about the long-term impacts of unbridled generative AI development on the scholarly and scientific enterprise.

Strong opposition to unlicensed training

Some authors expressed strongly negative sentiments about the use of their works for LLM training, with emphatic responses such as “Absolutely not!” “HELL NO,” and “I am very strongly opposed to having my work used in this way.” These respondents provided a range of objections, from concerns about the trustworthiness and reliability of LLMs (for example, describing them as hallucination-prone machines that produce content “rife with vagaries, misattribution, and error”), to their environmental costs (the technology “uses vast amounts of water, and it accelerates irreversible climate change”), to research ethics (“there is no guarantee that [LLMs’] future outputs will be consistent with the ethical approvals that researchers (including your authors) adhere to”). Concerns about reductionism and epistemological distortions also surfaced; as one author put it, “The research, effort, focus, and human perspective … inherently get flattened or downright erased by such tools. It’s hard to see how the primary ideas of an academic text won’t be mashed together with other ideas, losing all nuance and texture.”

Whereas about 10% of authors were ambivalent or undecided, the largest group (50%) was open to, or actively supportive of, licensing under certain conditions. They expect licensing deals to include appropriate compensation and reliable attribution. Some in this group expressed resignation; as one author said, “I do not agree with the practice, but believe it is inevitable. Authors should be provided with details of our work’s use and receive monetary remuneration.” However, others saw training as a way to spread their ideas (“it increases the chances that an author’s ideas are incorporated into the intellectual landscape”), contribute to a new mode of knowledge synthesis and discovery (“I view it as an opportunity for the work to be integrated into a universal knowledge bank that can be synthesized with other works for positive benefit”), improve the quality of LLMs (“the quality of AI is dependent on the quality of the data”), or open up a source of income either for themselves or for publishers (“AI is making lots of money, so there should be plenty to go around”).

A small minority supports unregulated use

Only 3% of authors indicated that they support entirely unregulated use (without consent, compensation, or attribution) of their publications for LLM training, with an additional 3% supporting use without consent or compensation, but only as long as authorship is appropriately attributed. A handful of these authors explicitly noted that they believed training falls under the “fair use” exemption from copyright. As one author said, “I want my work to be part of the written legacy with or without my name attached … Ideas should spread and play with as many other ideas as possible.”

Attribution is a non-negotiable demand

Attribution and credit are bedrocks of academic knowledge production. They are not merely a matter of personal recognition or acknowledgement for effort or creativity; they are the means of identifying and constituting the community whose explanations and evaluations establish consensus on the validity of knowledge claims.

Indeed, the requirement of attribution is a norm in open content sharing too. The most popular Creative Commons license is CC BY, which allows others to use, distribute, remix, tweak, and build upon a work, even commercially, as long as they credit the original creator. Attribution also provides a record of how ideas connect and build upon one another, by making the links among knowledge producers visible. Standard LLM pre-training — where models train on large amounts of undifferentiated data — makes attribution difficult, as generated text reflects word patterns in the training data without linking back to specific sources. In contrast, approaches like Retrieval-Augmented Generation (RAG) or the Model Context Protocol (MCP) give LLMs access to specific data at inference time (rather than during training), allowing source information to be linked to generated content, as the sketch below illustrates.
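To make that distinction concrete, here is a minimal, illustrative sketch of retrieval-augmented generation with source attribution. It is a toy example under simplifying assumptions, not any vendor’s actual implementation: the Passage records, the word-overlap retriever, and the generate_fn hook are hypothetical stand-ins for a real document store, search index, and LLM API.

```python
from dataclasses import dataclass

@dataclass
class Passage:
    text: str
    source: str  # e.g., "Author, Title (Publisher, Year), chapter 3"

def retrieve(query: str, corpus: list[Passage], k: int = 3) -> list[Passage]:
    # Toy relevance score: number of words the query shares with each passage.
    query_words = set(query.lower().split())
    def score(p: Passage) -> int:
        return len(query_words & set(p.text.lower().split()))
    return sorted(corpus, key=score, reverse=True)[:k]

def answer_with_citations(query: str, corpus: list[Passage], generate_fn) -> str:
    # Retrieve relevant passages at inference time and hand them to the model,
    # so the answer can be traced back to named sources.
    passages = retrieve(query, corpus)
    context = "\n\n".join(f"[{i + 1}] {p.text}" for i, p in enumerate(passages))
    prompt = (
        "Answer the question using only the numbered passages below, "
        "citing passage numbers for each claim.\n\n"
        f"{context}\n\nQuestion: {query}"
    )
    answer = generate_fn(prompt)  # any LLM completion call
    citations = "\n".join(f"[{i + 1}] {p.source}" for i, p in enumerate(passages))
    return f"{answer}\n\nSources:\n{citations}"
```

Because the passages arrive at query time with their metadata intact, attribution can travel with the generated text, something that is far harder to recover from pre-training alone.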

When asked, most authors agreed that they would prefer any potential LLM partner to provide reliable attribution to their work if that work significantly informs an LLM query response. Many said publishers should require attribution as part of any licensing deal and require AI partners to find a solution or go without the data. Some wanted AI systems to align with academic attribution norms and stated that they would “be more comfortable with AI use of my work if the systems were held to the same standards of attribution and anti-plagiarism as human authors.”

Some authors were, however, unsure whether LLMs could ever meet such a requirement, or how to define an attribution requirement given the way LLMs are trained. As one author put it, “[LLM] query responses will never be curated/quality-controlled the way an academic text is supposed to be.” Others noted that “the technical problem of ensuring ‘true’ responses is a substantial one” and that “the effect of any single work is spread across billions of elements in the network, and there is no way to trace which work influence[s] which answers,” raising doubts about attribution as a solution to their concerns.

Many authors were also concerned that, if their writing were used to train LLMs, the models would misrepresent or misattribute their work, for example by using their name to spread disinformation. One author noted that “ChatGPT may confidently claim that I said/wrote something that I have not. There have been many examples of that.” Still others echoed common arguments against the central conceit of the technology: that LLMs cast the illusion of human-level writing and reasoning by probabilistically stringing together words. This, as one author put it, “fundamentally undermined the entire publishing and knowledge production project.”

Author compensation and the sustainability of the knowledge ecosystem

Several respondents expressed concern that generative models could “reduce the incentives for producing the original work on which these models are based,” perhaps leading to a “system which ultimately seeks to make [authors] redundant.” Such a future would not only deter human knowledge production but could also exacerbate existing issues of wealth and labor inequality in the publishing industry, in line with “long standing problems of for-profit publishers profiting on unpaid or barely paid work.” Some opined that it may be more appropriate to broker licensing deals to train LLMs that are publicly controlled and serve the public good. As one writer noted, “the tech giants earn billions from research funded with public money and we have to pay for their services.”

Some authors suggested that they deserve significant compensation and floated a variety of compensation models they would expect or prefer, such as one-time payments, micropayments when material is used, annual licensing fees, or a percentage of AI vendor profits. Their interest in licensing deals was contingent on the amount of compensation they could receive, with proposals ranging from “on the same order of total compensation for book sales” to “in the hundreds of thousands of dollars per book.” While these authors expected substantial compensation (“equal to or greater than any advance I received for the work”), others expected they would get only pennies. Many also indicated they would like to see publishers benefit from LLM training partnerships, in particular to sustain the kind of mission-driven academic publishing that already faces financial precarity.

Author questions

Although the survey asked only about licensing deals, authors raised a host of system-level concerns: how misinformation and misattribution from generative AI may undermine knowledge-production ecosystems, how LLM energy consumption could harm the environment, how AI affects research ethics, and how licensing deals may contribute to widening wealth inequality. Still, many remained optimistic that scientific and scholarly books could contribute to this new technology if publishers help secure licensing and other LLM training partnerships that align with the interests of researchers and research institutions.

Respondents were unclear about which legal frameworks might best serve their interests. Many called upon copyright law to protect their work (“Publishers should not sell copyright material to AI companies”), while others took the opposite view (“I am happy to share my copyrighted material without charge”). Several noted that copyright questions are still to be adjudicated in the courts, with a small minority explicitly arguing for or against deeming the use of copyrighted material for training a “fair use” exemption to copyright.

Similarly, there was disagreement over whether books that are published open access for reading are also open by default for LLM training. Some expressed the belief that “the use for training sets of open access materials is legally allowed,” while others noted that LLMs would have to properly honor the open license by “producing derivative works with proper attribution,” which is not currently the case for most models. Another author noted that “Open Access (which I support) should not mean Open to Theft of ideas or Credit.”

In tomorrow’s part 2, we offer recommendations for stakeholders.

Amy Brand

Amy Brand is Director and Publisher of the MIT Press, a role she has held since 2015. A cognitive scientist by training, she earned her PhD from MIT and has held leadership roles at CrossRef, Harvard, and Digital Science. She is a co-creator of the CRediT taxonomy, a founding member of the ORCID Board, and producer of the documentary Picture a Scientist. Brand is widely recognized for her contributions to research infrastructure, scholarly communication, and equity in science. Her honors include the Council of Science Editors Award and the AAAS Kavli Science Journalism Gold Award.

Dashiel Carrera

Dashiel Carrera is a Visiting PhD Researcher in Computer Science at Columbia University. His research is broadly concerned with the impacts of AI on the arts, and he runs workshops and gives talks to prepare arts communities for the onset of generative AI. He has previously conducted research at the MIT Media Lab and Harvard’s metaLab, and his work has been published in top venues such as CHI, DIS, CSCW, Creativity and Cognition, and Digital Humanities Quarterly. Also a novelist and sound media artist, he is the author of The Deer (Dalkey Archive, 2022), and his work has been exhibited at Inter/Access, UKAI Projects, ELO, HackPrinceton, and elsewhere.

Katy Gero

Katy Gero is a human-computer interaction researcher focused on creativity, writing technologies, and the ethics of AI. She is a Lecturer in the School of Computer Science at the University of Sydney and previously held fellowships at Harvard University and the Library Innovation Lab. Her research explores how language models impact creative practice, ownership, and learning, with a growing interest in community-driven AI. She holds a PhD from Columbia University and a BS from MIT, and her work has been supported by the NSF, Amazon, and the Brown Institute. Also a poet and essayist, she is the author of The Anxiety of Conception (2025) and co-edits Ensemble Park, a magazine for human-computer co-writing.

Susan Silbey

Susan Silbey is the Leon and Anne Goldberg Professor of Humanities and Professor of Sociology and Anthropology at MIT, with an additional appointment in the Sloan School of Management. She is a leading scholar of legal consciousness and organizational governance, known for her work on how people experience law and how institutions manage compliance and risk. She holds a PhD from the University of Chicago and has received numerous honors, including a Guggenheim Fellowship and MIT’s Killian Faculty Achievement Award. At MIT, she has also served as Chair of the Faculty and played a key role in interdisciplinary governance. Her influential books include The Common Place of Law and Law and Science.
