Editor’s Note: Today’s post is by Stephanie Decker. Stephanie is a professor of Strategy at Birmingham Business School, UK. Her work lies at the intersection of organization studies and historical research, focusing on using historical methodologies and digital archives for researching organizations. At the British Academy of Management, she leads on Open Access and contributes to the society’s current work on the potential benefits and limitations of artificial intelligence (AI) in research and education.
When the Open Access (OA) movement gained momentum in the early 2000s, its proponents envisioned a world where research findings would be freely available to all readers — breaking down paywalls that limited knowledge dissemination and hindered scientific progress. By the 2010s, an increasing number of countries began to mandate open access publication, arguing that publicly funded research should be freely available to the public. The readers envisaged by proponents of OA were obviously human (academics as well as the wider public). While text mining had been considered one potential application, those proponents could not foresee the development of large language models (LLMs), which would go on to ingest vast amounts of text rapaciously. OA literature has become particularly attractive for AI training precisely because it lacks the legal and technical barriers that might protect traditionally published content.
Barriers to using copyrighted content have not, however, provided effective protection (or remuneration) to published authors in other areas. As media reporting on ongoing lawsuits has made abundantly clear, the major AI models that have transformed the digital landscape since ChatGPT’s launch in late 2022 were trained on copyrighted material on a large scale. To the casual observer, this may appear an obvious breach of copyright. Yet the transformative use doctrine previously enabled Google Books (Authors Guild v. Google) to scan millions of books for search functionality. The doctrine emerged prominently in 1994 with the Supreme Court case Campbell v. Acuff-Rose Music, which established that a commercial parody could qualify as fair use because it transformed the original work. In academic publishing, fair use has typically covered uses such as quoting portions of articles or books or repurposing content for educational purposes.
However, AI training goes further: it does not merely index content, it learns to generate similar content. The ongoing litigation involving companies like OpenAI and publishers/authors may establish new boundaries for what counts as transformative use, especially whether the commercial nature of many AI systems weighs against fair use claims. While this would clearly affect paywalled academic content, the picture becomes far more complicated when we consider OA publishing, with its many variations, which is increasingly becoming the norm in academic publishing.
Most open access academic content is published under Creative Commons licenses, with CC-BY being the most common. These licenses were designed with human readers and human reuse in mind:
- CC-BY: This license enables reusers to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator. The license allows for commercial use.
- CC-BY-NC: This license enables reusers to distribute, remix, adapt, and build upon the material in any medium or format for noncommercial purposes only, and only so long as attribution is given to the creator.
- CC-BY-ND: This license enables reusers to copy and distribute the material in any medium or format in unadapted form only, and only so long as attribution is given to the creator. (This restriction can be combined with the noncommercial one above into CC-BY-NC-ND.)
In a recent consultation on OA for the UK’s Research Excellence Framework, the British Academy of Management, of which I am currently a director, and the British Academy (which represents many professional associations in the social sciences and humanities) jointly raised concerns about both the impact of AI reuse of OA content and the lack of any meaningful discussion of it. While OA licenses clearly allow for machine learning, researchers opt for OA to maximize human readership and reuse, and arguably not to provide free training data for commercial AI companies.
Current licenses, whether traditional copyright or the various CC-BY options, would benefit from explicitly considering what machine reuse means as opposed to human reuse – or indeed machine-mediated human reuse. When AI models ingest text, it is broken down into “tokens” (words or fragments of words) that are converted into numerical representations; the statistical relationships among these tokens are then encoded in the model’s weights, which in turn form the basis for query responses. Text is converted into statistical patterns – a process that does not fit neatly into traditional categories of copying, distribution, or adaptation, because it aggregates content on a massive scale.
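To make this concrete, the deliberately simplified Python sketch below shows the most basic version of that move: text is split into tokens and reduced to aggregate statistics (here, simple counts of adjacent word pairs). Real tokenizers use subword schemes such as byte-pair encoding, and LLM training learns billions of parameters rather than counting pairs, but the principle is the same: what the model retains is patterns, not the text itself.

```python
# Toy illustration only: production LLM tokenizers and training pipelines are
# far more sophisticated, but the basic move is the same -- text becomes
# discrete tokens, and the model learns statistics over those tokens rather
# than storing a copy of the source.
from collections import Counter

def tokenize(text: str) -> list[str]:
    # Crude whitespace/punctuation split standing in for a real subword tokenizer.
    stripped = (t.strip(".,;:!?\"'()").lower() for t in text.split())
    return [t for t in stripped if t]

def bigram_counts(tokens: list[str]) -> Counter:
    # Counts of adjacent token pairs: a minimal example of turning text into
    # aggregate statistics rather than a reproduction of it.
    return Counter(zip(tokens, tokens[1:]))

sample = "Open access articles are free to read. Are they also free to train on?"
tokens = tokenize(sample)
print(tokens)
print(bigram_counts(tokens).most_common(3))
```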
This novel use of scholarly publishing highlights that significant economic value can be extracted from free academic content. It is astonishing how the UK research funding bodies, for example, can hold a consultation on OA in 2024 that pays no attention to AI models being trained on academic publications and whether non-commercial restrictions (CC-BY-NC) should be considered more widely. While OA and open data were envisaged to benefit economic development, the focus has been on “small and medium-sized enterprises and unaffiliated researchers [to] have the widest and cheapest possible access to scientific publications of the results of research that receives public funding” (European Commission Recommendation 2012). Not only do the AI companies currently developing LLMs, with their significant venture capital backing, represent a very different class of beneficiaries; their reuse also shows distinct signs of intervening in the academic research system in ways that are potentially damaging to academics and the society they serve.
It is time for OA proponents to engage in public debate with academic associations, universities, and national funding agencies, because the widespread use of academic content in AI models poses significant risks for the research ecosystem. As AI becomes more deeply embedded in academic research practices, it may significantly disrupt knowledge creation and attribution standards, derailing careers and decontextualising research insights. AI outputs, especially when subsequently cited by human researchers, become a form of “citation laundering”, in which original sources are hidden or misattributed through AI generation. Not only will this make it impossible to trace the intellectual lineages of ideas, but the researchers who built foundational knowledge will no longer receive credit for their innovation. It is likely to be particularly disruptive to interdisciplinary research, which may miss important cross-disciplinary linkages; this matters especially when it comes to addressing the global challenges that humanity faces.
Academic citation practices ensure that building on prior work is transparent and that insights remain subject to critical review and fact-checking. Unlike a human author, who might cite specific sources, AI systems typically don’t maintain clear links between their outputs and specific inputs. New content is generated by synthesizing from many (unattributed) sources. Consequently, the link to underlying sources is interrupted, and the origin of ideas, including through recombination of existing insights, becomes impossible to reconstruct.
AI models were not built primarily for academic use, but as their impact on education has become enormous in less than two years, we need to urgently address how we are training future generations of human learners and researchers. Expanding cultural practices of instilling AI literacy and citation attribution is unlikely to be enough. Models could be expanded to track attribution of ideas to distinct academic papers and develop provenance fingerprinting – which would require a policy framework for academic publishing that mandates citation capabilities for AI models. Some of this may well be ongoing work – recent updates (February 2025) to several AI models, for example ChatGPT and Claude.AI, have led to improved citations and fewer hallucinations.
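What might such “citation capabilities” look like in practice? One hypothetical building block is retrieval that keeps provenance attached: every passage a system draws on carries its source identifier, so any generated answer can surface the records it leaned on. The sketch below is purely illustrative; the corpus, DOIs, and keyword-overlap scoring are invented for the example, and no existing AI model is claimed to work this way.

```python
# Hypothetical sketch of "provenance fingerprinting": passages are stored with
# their source identifiers, so supporting DOIs travel with whatever the system
# later generates. All data and scoring below are invented for illustration.

# A tiny "index" of passages, each tagged with the paper it came from.
corpus = [
    {"doi": "10.1000/example.001", "text": "open access licensing and reuse by machine learning systems"},
    {"doi": "10.1000/example.002", "text": "citation practices and attribution in interdisciplinary research"},
    {"doi": "10.1000/example.003", "text": "economic value extracted from publicly funded research outputs"},
]

def retrieve_with_provenance(query: str, k: int = 2) -> list[dict]:
    """Rank passages by crude word overlap, keeping their DOIs attached."""
    q_words = set(query.lower().split())
    scored = []
    for passage in corpus:
        overlap = len(q_words & set(passage["text"].split()))
        scored.append((overlap, passage))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Only passages with some overlap are returned, DOI included.
    return [p for score, p in scored[:k] if score > 0]

for hit in retrieve_with_provenance("attribution and citation in machine learning reuse"):
    print(hit["doi"], "-", hit["text"])
```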
Nevertheless, the central issue remains that commercial AI companies extract significant economic value from OA content without necessarily returning value to the academic ecosystem that produced it, while at the same time disrupting academic incentive structures and attribution mechanisms. We always pay a price for technology, as the cultural critic Neil Postman pointed out in 1998, and all great technology embodies epistemological, political or social prejudices. And as AI technology is about to change everything, it is already clear that it is set to disrupt our information economy in fundamental ways: the much maligned “essay mills” that provide students with ready-made assessments are collapsing, and business models based on exploiting the attention economy through strategically placed digital advertising or search engine optimization will likely follow suit as Google rolls out its “AI Overviews” to preface or indeed replace ranked search results. Neither is a great loss, but their demise is symptomatic of the wider problem facing any profession or industry that produces knowledge or information.
The very benefits of research that OA is supposed to make available to society at large may be undermined in the medium term. At minimum, OA licenses could mandate that AI tools trained on OA papers include citation capabilities in exchange for the free use of high-quality material, to ensure that the creators of academic work are appropriately recognized. As OA publishing rapidly becomes a significant share of academic output, OA policies around AI use could substantially reshape the technology’s engagement with research. Otherwise, the underlying incentive structure of academic careers could be quickly eroded through widespread AI use, disproportionately affecting a new generation of researchers. Going forward, we may need to consider alternative impact indicators for AI reuse alongside traditional citation counts, if the latter become artificially depressed by extensive AI reuse without attribution. These are urgent issues for scholarly publishing, especially when it comes to OA, and they will have an impact on future generations of research users worldwide.
Discussion
As you consider economic value, be careful not to exclude academic-adjacent uses. Every academic year, I work with a computer science capstone course, where groups of students work on a real project with people like me serving in the “client” role to develop software that, if they complete it, will have real value to my library and possibly many others. I always make sure we have an open source license on that project. After this year’s project was completed (successfully), the prof asked me if I could come up with a genAI-related project for next year. Being a librarian, of course my mind went to library-related ideas, especially around helping students take their poorly-formed paper topic idea (esp. for English 101 composition assignments) and helping them figure out how to narrow it down and create a proper boolean search for our discovery service. Would OA books and journals get included in that training data? I would certainly love it if they could. And it might help a huge number of students, not just at my university, but everywhere that academic librarians want to adopt this tool. Of course that assumes the students succeed, but if they don’t, the issue of the training data is moot anyway as the whole thing just disappears after the CS students get their final grade.
I would assume that OA books and journals are already included in the training data for most major AI tools (it’s very opaque, but it’s a fair assumption). As the Atlantic article from a few weeks back showed, Meta used a wide range of copyrighted materials and a substantial number of research papers (available through Green OA anyway, but in this case part of a pirated collection) to train their tool, Llama. At least half of my articles are in there and were therefore part of the training data – a colleague of mine said around 50 of her books and articles are in there.
It is astonishing how the UK research funding bodies, for example, can hold a consultation on OA in 2024 that pays no attention to AI models being trained on academic publications and whether non-commercial restrictions (CC-BY-NC) should be considered more widely.
I don’t think this is astonishing at all, when you consider the degree to which a particular orthodoxy has taken hold within the UK research funding community. The orthodoxy is that OA (meaning OA with CC-BY licensing or the equivalent) is an absolute good and is the only acceptable mode of scholarly publication. Once that orthodoxy is accepted, there is little incentive to examine carefully the possible downsides of “a global transition to open access and unrestricted access and reuse.” Talk of unanticipated costs, unintended consequences, and other downsides just gets in the way.
Good point, Rick, although it also shows how little thought was given by funders, through lack of real engagement with commercial publishers (except to denigrate them at times), to the possible consequences of mandating CC-BY licences, such as copyright protection. At least in many publishers’ contract terms, the typical phrase ‘…and any current or future technology not yet invented…’ was included in the assignment of rights clause.
I agree on the orthodoxy point – I see no problem with the public use of academic work under an OA license for the public good, but enabling commercial use? That, I have always found mind-boggling, and the economic argument behind it assumes that the commercial uses will benefit the societies that invested in OA… But there is no reason why this should be the case: First, the marketplace for knowledge is global (and the whole world has not signed up to OA); second, who benefits depends on institutional frameworks that can foster or inhibit economic inequality (the US is a good example of this).
I have thought the AI usage of CC BY work to be a bit of a blind spot but also one that no one really cared about. So I am glad this article exists. However, I think it misses some hard truths about OA/open licensing.
Firstly, AI reuse is entirely consistent with the thought process of OA advocacy. The public has already paid a researcher’s salary. So the output should be free. Just because many of us don’t like AI doesn’t change this basic logic. And so long as AI companies pay as much tax as the next company, they have contributed their bit to the academic ecosystem from which they extract value. In other words, there is no “central issue” and there never was.
Secondly, there is no solution to this ‘problem’. I assume all my papers have been ingested and used to train AI. If I signed a CTA, the publisher has already done a deal and gotten paid (none of which came to me or my institution). If I published a born free OA paper under a CC BY licence, I am also getting nothing. And I have no way of enforcing the failure to cite because I don’t know when enough of my paper has been used to warrant attribution. And even if I did, I don’t have the resources or institutional backing to take on some American AI giant.
So I don’t use or love genAI, but unfortunately allowing uses I might not love was part of the OA bargain that I (and many others) signed up for.
Well, I am somewhat dubious about the claims that the “public” pays my salary – it really doesn’t, I work at a business school in the UK. If it were about who paid for it, then Chinese parents should have the near exclusive right to read UK-produced academic research. Otherwise, why is there a major funding crisis in UK HE if the taxpayer is paying for it? The reality is that business studies expanded massively through foreign student recruitment, and we have been cross-financing the entire HE sector for years. The OA “the public pays for it” argument may work in continental Europe, but elsewhere, not so much.
And, as someone teaching business, the idea that giving away your IP for free will be good for your (own) economic development in a globalised economy is somewhat utopian (even more so in the current age of geopolitics restructuring the global economy).
I agree. It seems to me that a lot of this furor over AI training is basically some people seeing someone else making money (yes, off someone else’s work but work they were already paid for) and thinking they want to figure out how to get a piece of that action.
Exceptions to copyright (fair use/fair dealing) exist exactly to prevent copyright holders from trying to extract money for other people’s creative and non-duplicative (aka “transformative”) uses of that content. AI is showing us the incredible potential that such “transformation” can hold for the betterment of all people. We are already starting to hear stories about AI systems trained on biomedical literature finding “new” treatments that in theory humans could have found but could not, because no person can synthesize the huge volume of scholarly writing involved. Such treatments might never have been found without AI, or at least not before many humans suffered or even died.
It seems there are two ideas central to this article: 1) We need citations to the sources in LLM content. 2) Is it okay for LLMs to crawl OA content?
I agree that without citations and paper trails we lose a great deal of important information. Fortunately, the LLMs are beginning to realize this as well. However, it is particularly hard to build a huge vector / co-occurrence model and then cite one or a few sources for a “conclusion” when there may be hundreds of partial sources used to arrive at the final summary of information presented. We will have to expect that if a “citation” is provided, it may be just a weighted or ranked one and not at all comprehensive. Google Gemini and others are now providing backlinks, a.k.a. citations, in their search summary results. They are representative, NOT complete bibliographies. However, if the LLM provides citations to its sources, it quickly becomes susceptible to copyright infringement litigation, which is not an incentive to give citations.
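To make the “weighted or ranked” point concrete, here is a toy sketch (my own illustration only, not how Gemini or any other system actually works) of why the citations that do get surfaced are representative rather than complete:

```python
# Toy illustration: imagine a summary assembled from many partial sources,
# each contributing a small weight. Surfacing only the highest-weighted
# handful is what makes the resulting "citations" representative rather than
# a complete bibliography. Source names and weights are invented.
contributions = {
    "Source A": 0.22, "Source B": 0.18, "Source C": 0.11,
    "Source D": 0.06, "Source E": 0.05, "Source F": 0.04,  # ...plus a long tail
}

def representative_citations(weights: dict[str, float], k: int = 3) -> list[str]:
    """Return the k most heavily weighted sources; everything else stays invisible."""
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    return [name for name, weight in ranked[:k]]

print(representative_citations(contributions))  # ['Source A', 'Source B', 'Source C']
```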
It is free to read OA now. It was created as a model for very broad distribution of accessible content at essentially no cost. As LLMs continue to become the next generation of search, “conversational search systems”, I believe people will quickly move to asking (begging) that their information be included in the training sets, scraped automatically and regularly. It was not so long ago that publishers changed from telling Google “Don’t Touch my Content” to now working hard to ensure that it is included in Google results. The reason is that the exposure Google Scholar and others provide for awareness and distribution of works, leading back to scholarly websites, is crucial in today’s world. Even library OPACs find that known-title searches are done on Google first to find the reference, which is then entered into the local digital library system to get access to the actual piece (OPAC search is reliably difficult).
If only the ideas of the free web are represented in the training models, readers will only receive answers from a biased set. Good research will be excluded, and the populace will read and learn only the theories and concepts presented by those who flood the airwaves with their point of view. Which would you rather have: political entities and regimes providing the content, or peer-reviewed, accredited content included in the LLMs? For the common good… I prefer that OA be open to all, including LLM use.
I agree, except the magnitude is much larger: “there may be hundreds of partial sources used”, umm, try millions or even billions, not hundreds. I use a few of these LLMs every day, and their ability to communicate as they do, in such a human-like style, cannot possibly come from mere hundreds of sources. Separate out the moments when it gives specific facts from the overall “English competency” – that has to come from having analyzed vocabulary and syntax from English written sources on a scale we aren’t used to wrapping our brains around. I see evidence of this every time I make a typo in my prompt that is another English word and it “knows” what I meant in context and doesn’t have to ask for clarification, or times when I use a close but incorrect word and it paraphrases what I want back to me using better language (when I’m asking a technical question, NOT asking it to improve my writing).
While I am yet to try it, I understand that Perplexity is supposedly pretty good at providing references, and Claude is getting better at not hallucinating references with every update. So I am actually hopeful that LLMs will move in this direction and that it is technically possible, but as scholars and knowledge workers, we should also make the case why that is an important avenue for development. Ultimately, the people developing LLMs are researchers themselves.
We’ve been testing Perplexity and it provides some valid sources, but still woefully underperforms by citing irrelevant or shallow references.
Among the many concerns, which I think are well put here, is the growing question of researchers losing agency as scientists and authors in the conversation about scientific results. Increasingly, whether in funder OA mandates or in discussions about AI training, researchers are often reduced to recipients of X funds expected to produce Y result for society’s benefit, their agency seemingly waived the moment they agreed to accept the grant. As the scientists, shouldn’t the authors have some say in how their intellectual property is used?
Some say, perhaps, but not veto power. US and Canadian copyright law, which is where the “intellectual property” legal concept emanates from, has strong exceptions that make it clear the author/copyright holder does not have ALL say. And transformative uses are at the very heart of those exceptions; the kind of transformation AI performs is an extreme example of exactly why they exist. If Google Books’ snippets are legally acceptable under fair use, what AI does to the original text must be acceptable as well, as there is no trace of the original text’s specific sequence of words in the output. There may be exceptions caused by glitches, but those are bugs to be fixed, not inherent to the “use”.