Elsevier has been under the spotlight this month for publishing a paper that contains a clearly ChatGPT-written portion of its introduction. The first sentence of the paper’s Introduction reads, “Certainly, here is a possible introduction for your topic:…” To date, the article remains unchanged and unretracted. A second paper, containing the phrase “I’m very sorry, but I don’t have access to real-time information or patient-specific data, as I am an AI language model,” was subsequently found, and similarly remains unchanged. This has led to a spate of amateur bibliometricians scanning the literature for similar common AI-generated phrases, with some alarming results. But it’s worth digging a little deeper into these results to get a sense of whether this is indeed a widespread problem and, for the papers that have made it through to publication, where in the process the errors are occurring.

1950s style rendering of a robot invasion

Several of the investigations into AI pollution of the literature that I’ve seen employ Google Scholar for data collection (the link above, and another here). But when you start looking at the Google Scholar search results, you notice that a lot of what’s listed, at least on the first few pages, are preprints, items on ResearchGate, book chapters, or often something posted to a website you’ve never heard of with a Russian domain URL. The problem here is that Google Scholar is deliberately a largely non-gated index. It scans the internet for things that look like research papers (does it have an Abstract, does it have References?), rather than limiting results to a carefully curated list of reputable publications. Basically, it grabs anything that looks “scholarly”. This is a feature, not a bug, and one of the important values Google Scholar offers is that it can reach beyond the more limited (and often English-language and Global North biased) inclusion criteria of indexes like the Web of Science.

But what happens when one does similar searches on a more curated database, one that is limited to what most would consider a more accurate picture of the reputable scholarly literature? Here I’ve chosen Dimensions, an inter-linked research information system provided by Digital Science, as its content inclusion is broader than that of the Web of Science but not as open-ended as Google Scholar’s. With the caveat that all bibliometrics indexes lag and take some time to bring in the most recently published articles (the two Elsevier papers mentioned above are dated as being from March and June of 2024 and so aren’t yet indexed as far as I can tell), my results are perhaps less worrying. All searches below were limited to research articles (no preprints, book chapters, or meeting abstracts) published after November 2022, when ChatGPT was publicly released.

A search for “Certainly, here is” brings up a total of ten articles published over that time period. Of those ten articles, eight are about ChatGPT, so the inclusion of the phrase is likely not suspect. A search for “as of my last knowledge update” gives a total of six articles, again with four of those focused on ChatGPT itself. A search for “I don’t have access to real-time data” brings up only three articles, all of which cover ChatGPT or AI. That leaves just four potentially suspect articles. Over this same period, Dimensions lists nearly 5.7M research and review articles published, which puts the rate at which these three phrases slipped through into publications at roughly 0.00007%.
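To make the arithmetic explicit, here is a minimal sketch of the calculation, using the counts above (the script and its variable names are purely illustrative; nothing here queries Dimensions itself):

```python
# Back-of-the-envelope check of the slip-through rate quoted above.
# Counts come from the Dimensions searches described in the text.

searches = {
    "Certainly, here is": {"hits": 10, "about_chatgpt": 8},
    "as of my last knowledge update": {"hits": 6, "about_chatgpt": 4},
    "I don't have access to real-time data": {"hits": 3, "about_chatgpt": 3},
}

total_articles = 5_700_000  # research + review articles since November 2022

# "Suspect" = articles containing the phrase that are not themselves about ChatGPT/AI
suspect = sum(s["hits"] - s["about_chatgpt"] for s in searches.values())
rate_percent = suspect / total_articles * 100

print(f"Suspect articles: {suspect}")             # 4
print(f"Slip-through rate: {rate_percent:.5f}%")  # ~0.00007%
```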

Retraction Watch has a larger list of 77 items (as of this writing), compiled using a more comprehensive set of criteria to spot problematic, likely AI-generated text; it includes journal articles from Elsevier, Springer Nature, MDPI, PLOS, Frontiers, Wiley, IEEE, and Sage. Again, this list needs further sorting, as it also includes some five book chapters, eleven preprints, and at least sixteen conference proceedings pieces. Removing those 32 items leaves 45 journal articles and suggests a failure rate of 0.00056%.

While many would argue that this does not constitute a “crisis”, it is likely that such errors will continue to rise, and frankly, there’s not really any excuse for allowing even a single paper with such an obvious tell to make it through to publication. While this has led many to question the peer review process at the journals where these failures occurred, it’s worth considering other points in the publication workflow where such errors might happen. As Lisa Hinchliffe recently pointed out, it’s possible these sections are being added at the revision stage or even post-acceptance. Peer reviewers and editors looking at a revision may only check the specific sections where they requested changes, and may miss other additions an author has put into the new version of the article. Angela Cochran wrote about how this has been exploited by unscrupulous authors adding in hundreds of citations in order to juice their own metrics. Also possible, the LLM-generated language may have been added at the page proof stage (whether deliberately or not). Most journals outsource typesetting to third-party vendors, and how carefully a journal scrutinizes the final, typeset version of the paper varies widely. As always, time spent by human editorial staff is the most expensive part of the publishing process, so many journals assume their vendors have done their jobs, and don’t go over each paper with a fine-toothed comb unless a problem is raised.

Two other important conclusions can be drawn from this uproar. The first is that despite preprints having been around for decades, those both within and adjacent to the research community clearly do not understand their nature and why they’re different from the peer reviewed literature, so more educational effort is needed. It should not be surprising to anyone that there are a lot of rough early drafts of papers or unpublishable manuscripts in SSRN (founded in 1994) or arXiv (launched in 1991). We’ve heard a lot of concern about journalists not being able to recognize that preprints aren’t peer reviewed, but maybe there’s as big a problem much closer to home. The second conclusion is that there seems to be a perception that appearing in Google Scholar search results offers some assurance of credibility or validation. This is absolutely not the case, and perhaps the fault here lies with the lack of differentiation between the profile service offered by Google Scholar, which is personally curated by individuals, and its search results, which are far less discriminating.

Going forward, I would hope that at the journals where this small number of papers slipped through, an audit is underway to better understand where the language was introduced and how it managed to get all the way to publication. Automated checks should be able to weed out common AI language like this, but they likely need to be run at multiple points in the publication process, rather than just on initial submissions. While the systems in place seem to be performing pretty well overall, there’s no room for complacency, and research integrity vigilance will only become more and more demanding.
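As a rough illustration of what such an automated check might look like (a minimal sketch, not any publisher’s or vendor’s actual tool, and with a phrase list that is purely hypothetical), a simple screen run at submission, revision, and proof could flag the most common tells for human review:

```python
# Minimal sketch of a phrase screen for common LLM boilerplate.
# The phrase list is illustrative only; a production check would be far more
# extensive and would run at submission, revision, and page proof stages.

SUSPECT_PHRASES = [
    "certainly, here is",
    "as of my last knowledge update",
    "i don't have access to real-time",
    "as an ai language model",
    "regenerate response",
]

def screen_text(text: str) -> list[str]:
    """Return any suspect boilerplate phrases found in the manuscript text."""
    lowered = text.lower()
    return [phrase for phrase in SUSPECT_PHRASES if phrase in lowered]

if __name__ == "__main__":
    sample = "Certainly, here is a possible introduction for your topic: ..."
    flags = screen_text(sample)
    if flags:
        print("Flag for editorial review:", flags)
```

Anything this crude will produce false positives (as the Dimensions searches above show, most hits are articles legitimately quoting these phrases in papers about ChatGPT), which is why each flag still needs a human editorial decision.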

David Crotty


David Crotty is a Senior Consultant at Clarke & Esposito, a boutique management consulting firm focused on strategic issues related to professional and academic publishing and information services. Previously, David was the Editorial Director, Journals Policy for Oxford University Press. He oversaw journal policy across OUP’s journals program, drove technological innovation, and served as an information officer. David acquired and managed a suite of research society-owned journals with OUP, and before that was the Executive Editor for Cold Spring Harbor Laboratory Press, where he created and edited new science books and journals, along with serving as a journal Editor-in-Chief. He has served on the Board of Directors for the STM Association, the Society for Scholarly Publishing and CHOR, Inc., as well as The AAP-PSP Executive Council. David received his PhD in Genetics from Columbia University and did developmental neuroscience research at Caltech before moving from the bench to publishing.

Discussion

26 Thoughts on "The Latest “Crisis” — Is the Research Literature Overrun with ChatGPT- and LLM-generated Articles?"

There is a huge difference between an entire article being written by AI, as the title of this post suggests, and having a few paragraphs being written by AI, as the actual evidence in most of this post suggests. I think the most worrisome part of this is that the journal editors are doing such a poor job that they aren’t catching those obvious “certainly…” type phrases, not because of what they imply about authorship, but just because they don’t belong in the final text at all.

I find the debate about using AI and academic integrity is bringing out an inconsistency in our very reasons for opposing “plagiarism”. That is, is the problem that the ideas aren’t yours per se, or that you are “stealing” another person’s ideas? When the “other person” is not a human, suddenly this distinction is in sharp relief. If you aren’t stealing from someone else, is there still a bad thing happening here, or is this just a much more “humanities” version of using R or SPSS to do your quantitative analysis?
Usually I find it frustrating when people mix up copyright law with plagiarism, but in this situation I think copyright law has something useful to inform the plagiarism discussion. Copyright law in the US, at least, is very clear that non-humans (e.g., monkeys, elephants, and computers) can’t be credited with authorship/creatorship.

“Also possible, the LLM-generated language may have been added at the page proof stage…”. That seems a bit far-fetched. More likely some manuscripts don’t get carefully reviewed. More interesting to me than the inclusion of LLM text in low-quality articles is the sophistication of AI-generated cell biology images. They’re getting to the point that sleuths like Elisabeth Bik can’t tell them from the real thing.

The authors of one of the suspect papers have stated that the inclusion of the text was a cut-and-paste error. Why would such an error be “far fetched” during corrections but less so in other parts of the publication process?

This may be, but what’s the excuse for the failure to include the required disclosure statement? Everyone is focusing on the sloppy editing. But what this signals to me is that there’s probably a bigger crisis of non-disclosure, which is a different issue and, given that Elsevier doesn’t prohibit the use of generative AI for editorial assistance but does require disclosure, to me is the bigger ethical question. The disclosure should have been made even if there was no copy/paste sloppiness. So, how many authors are using AI and not disclosing it, and not tipping their hand by sloppiness?

In this case, I believe the authors claimed that they had asked ChatGPT for a summary, but found it lacking and decided not to use it. Then it got “accidentally” pasted into some version of the paper. If accurate and believable, then no disclosure was necessary (as they did not deliberately use an AI).

All that said, you do get to the heart of the issue, and why the fuss over a tiny number of papers found with obvious tells is less important than the bigger picture: many authors are likely using these tools and simply being less sloppy about it. While I do think it’s reasonable to require disclosure (if publishing is part of the scientific process, then you should record the tools used, just as you do in the Materials and Methods section for your experiments), I personally don’t think the use of AI in writing is all that problematic. What I care about is whether the experiments/research were done correctly and described accurately, and whether the conclusions drawn are supported by the data presented. In the end, I don’t really care who (or what) wrote the story about the research; I’m interested in the research and ensuring it’s valid. We know that ghostwriting is fairly rampant in the medical world, and that industry research companies often have publication planners and writers on staff who put together publications. Is this any different ethically, as long as the author who puts their name on the paper vouches for its contents? Is that what matters (I’m staking my reputation on what’s in this document) rather than who/what chose the actual words and put them in that order?

Of course this is a different matter from fraudulent papers where the research wasn’t actually done, and yes, AI does simplify the process of creating fraudulent work, but that’s a separate issue. It does strike me as interesting that the two papers linked to above had AI text in an Introduction and an Abstract, rather than in the Results or the Conclusion sections of the papers.

I’ve not seen any author statement on either of these examples. I’ve seen one person say they are reporting what one author told him, but that didn’t include that the AI text was found wanting. I did see the same intermediary say that one of the authors had emailed Elsevier about this issue post-publication. Everyone is decrying how long this is taking to fix, yet none of the authors are just getting the right version out there by posting it to a preprint server. I mean, for goodness sake, it’s not like we need to wait for libraries to tip in a new page of the printed journal. Anyway, somehow the authors also turned back page proofs without noticing the inclusion either. So, while I, like you, am not too concerned about how words get generated in this context (I have a statement in my syllabus this semester that students can use gen AI if they disclose it and explain why they decided it was the best choice), it matters if authors don’t take responsibility for their work and don’t make required disclosures. What other corners are they cutting?

I remember seeing something from the authors on Twitter, will try to dig it out. Regardless, I agree 100%. When you publish a paper, you are building/risking your reputation and your career. From my researcher days, the notion that I wouldn’t have carefully pored over every single character of every single draft of every paper I published escapes me. Those authors will now always be known as the ChatGPT people, I guess better than those permanently known as the gigantic rat genitals people, but still. If you are so sloppy in your writing, why wouldn’t I expect you to be similarly sloppy in your experiments? What did you accidentally “cut and paste” there?

This is an interesting discussion; thanks for advancing it. Given the challenges facing scientific and medical publishing, which is a bigger crisis? 1) That overworked, underpaid, stressed researchers and academics who don’t directly benefit financially from their content are now leveraging AI to speed time to publication, or 2) That the majority of entrenched publishers continue to act as if AI, which could significantly reduce the human friction and delays in getting novel science and groundbreaking medicine to the patients and doctors trying to relieve human suffering, is itself a problem to be dealt with? Full disclosure, I serve as Chief AI Officer of Inizio Medical and am a Founding Board Member of the Society for Artificial Intelligence and Health, but my comment is mine and mine alone.

It is alarming that this slipped through after all the work done to tighten up publishing integrity over the last year or two. It suggests more sophisticated fake/fraudulent AI content is also getting through. More generally, it is another embarrassing, public black eye for scholarly publishing and science.

It’s clear from this post that the crisis is not about generative AI, which simply reveals the underlying problem. The real issue is the faulty editorial workflow, which Angela Cochran identified as far back as 2017. The human editor should have the last check of the paper. If you allow authors to make changes, then those changes should be monitored by the editor, and a system such as track changes applied to ensure that any changes the author makes, whether or not requested by the editor, are identifiable for checking.

What this reveals is that scholarly publishing still retains a culture of trust, in this case, trust that the author will not make extensive changes at proof stage. It is no longer possible to run scholarly publishing with the assumption that the author will behave responsibly.

I think you’re largely right, but those needs are directly in opposition to the increasing pressure for journals to 1) publish faster, and 2) publish cheaper. As noted in the post, the sorts of human interventions you suggest are the most expensive parts of the process. Would the community accept slower publication and higher APCs/subscription prices in order to ensure this level of scrutiny?

Well, that’s for the scholarly community to decide, but personally, I don’t think there is a choice if the credibility of academic publishing is to be preserved.

“Automated” is a loaded word here — every check that I know of requires at least some level of human interpretation and intervention.

That said, there are good recommendations coming out of the STM Research Integrity Hub (https://www.stm-assoc.org/stm-integrity-hub/) and tools include plagiarism checkers, image integrity checkers, and paper mill checkers. As far as I know, there are no reliable automated tools for determining whether text or images were AI-generated. But screening for phrases like the ones mentioned above seems a reasonable step.

And as noted, this makes publishing slower and more expensive.

I think that screening for phrases, as you say, could work. A mechanism that detects hallucinated content, including fake references/citations, could also help. But then, AI will also evolve…

As the EIC of a journal, I read every manuscript that is submitted. Lately, I have detected a handful of manuscripts that have the hallmarks of being generated (all or in part) by an LLM. It is fairly easy (for me) to identify such manuscripts, because the text reads like “word salad”: bland, general sentences that do not seem to converge on a clear meaning and that lack specific details. A superficial reading is not good enough in such cases. As pointed out in the post, having actual humans read the manuscripts or proofs is necessary, but expensive.

Same here. I had one the other day that was not really in the scope of the journal, had that “word salad” feel you mentioned, and had only 5 total references. (And two of those were fake.)

Interesting. Another question is what percentage of peer review reports/peer reviewers are using AI, and whether incorrect use of AI in this area can be detected. I’ve heard this is potentially a growing, yet somewhat hidden, problem as well.

“time spent by human editorial staff is the most expensive part of the publishing process”

It turns out more and more that the most *expensive* part of the publishing process is editorial and other negligence actively encouraged by those who never tire of finding more “cost-effective” and “streamlined/frictionless” ways to “improve the author experience.” It’s a sure way to steer scholarly publishing towards obsolescence because however happy authors (or their administrators) might be for having their work published, fewer and fewer people will be inclined or able to read this work (and AI reading AI-generated or “enhanced” texts is like an empty house of mirrors).

The cost of not having enough human eyes (i.e., editorial staff and peer reviewers) will be (and already is) exorbitant for both science and the public. High-quality, sustainable science and scholarship is slow and tedious and does not respond to exhortations for speed and efficiency. It’s a non-negotiable feature, not a bug.

The one thing that is absolutely certain is that no copyeditors have been retained for the journals where these articles have slipped through. No proper copyeditor (one who edits for grammar, spelling, clarity, logic, and flow) is going to allow these kinds of things to pass through.

The big platforms do not do proper copyediting; they provide spelling and grammar checking from international outsourcing firms (packagers) and pay them literally one quarter of what I am paid, or they use their production team (layout staff, not copyeditors) to do the copyedits.

Copyeditors know that the vast majority of the time, we are the first ones to read a manuscript all the way through, after R&R or when there is no peer review. These instances prove that when we aren’t being hired, no one is reading a manuscript from start to finish.

Do you feel that Dimensions is an inappropriate tool to use for these purposes? Which database would have been better?

Thanks David, you brought some level-headedness to the conversation. Your deep dive into the Digital Science data has me feeling that this issue is more overblown than I would have originally thought. Also, you make a good point about trying to understand why and how these things happen. An author using GPT to help them with language and accidentally inserting a GPT phrase is very different from AI writing an article for them.

The main question that needs to be answered is whether these papers passed through peer review or not, and if so, what the reviewers’ comments said.
