Editor’s Note: Today’s post is by Gwen Weerts. Gwen is the Journals Manager at SPIE, and the Editor-in-Chief of the SPIE Society magazine, Photonics Focus. Gwen joined SPIE in 2008 and has 16 years of experience in scholarly publishing.

This post was originally published in Photonics Focus on January 1, 2024.

In 1454, Gutenberg’s prototype printing press began commercial operation, and the publishing industry was born. With it came a host of new concerns that are still relevant six centuries later: literacy, plagiarism, information censorship, proliferation of false or unvetted information, and — most worrisome to the Catholic church at the time — who should have access to information, what type of information could they have, and what type of people should be allowed to have it?

Many of these worries have been stirred up again at specific flashpoints in history, most recently by the November 2022 release of ChatGPT. Though chatbots existed before ChatGPT, it introduced realistic conversation and a surprising capability for idea generation that its predecessors lacked. In the past year, development of large language models (LLMs) has been rapid (we’re already on GPT-4), and their role in society, and in scholarly publishing in particular, has been debated with equal parts anxiety and excitement. In this article we’ll weigh these issues in the balance.

“Use ChatGPT at your own peril. Just as I would not recommend collaborating with a colleague with pseudologia fantastica, I do not recommend ChatGPT as an aid to scientific writing,” writes Robin Emsley in a March 2023 editorial in the Nature journal Schizophrenia, citing the bot’s tendency to invent references. These hallucinations, as they are known, are false statements that the bot presents in a dispassionate, factual tone that invites trust, and they often require an expert in the subject to detect.

Chatbot developers have heard that concern, and most chatbots allow users to adjust the “creativity” of responses, from very creative down to “just the facts, please.” In this conservative mode, Bing Chat, for example, does a decent job of citing real references, though they all still need to be double-checked.

But the takeaway shouldn’t be to avoid the “creative mode” setting on LLMs altogether.

“Keep in mind that hallucinations are closely related to creativity. What is a new idea if not a hallucination about something that doesn’t yet exist?” writes Tyler Cowen, economist and author of “How to Learn and Teach Economics with Large Language Models, including GPT,” a paper with relevance far beyond economics.

In fact, LLMs excel at generating new ideas, which is one of their strengths for researchers. Darren Roblyer, professor of biomedical engineering at Boston University, and Editor-in-Chief of the journal Biophotonics Discovery, says that he uses LLMs to help identify new research questions. “It’s very good for background knowledge,” he says. “You want to find out what other people have done. Has anyone used diffuse optics or OCT to monitor kidney dialysis?” He says that the LLMs can help identify research gaps, which are opportunities for new exploration.


LLMs excel at generating text and images, and a September 2023 article in Nature reports that more than 30 percent of 1,600 researchers surveyed relied on LLMs to generate code. Authors Richard Van Noorden and Jeffrey M. Perkel also report that 28 percent of those surveyed use LLMs to help write manuscripts, and 32 percent reported that these tools helped them write manuscripts faster — astounding rates of adoption given that few people could define “LLM” prior to November 2022.

This last statistic is the source of much handwringing in scholarly publishing. While chatbots excel at ideation, they cannot be held accountable for their output, so they cannot be credited as authors of scientific studies. The Committee on Publication Ethics (COPE) takes a clear stance: “Authors are fully responsible for the content of their manuscript, even those parts produced by an AI tool, and are thus liable for any breach of publication ethics.”

Few scholarly publishers have banned usage of generative AI tools completely; most, including Springer Nature and SPIE, prohibit attributing authorship to a chatbot, but otherwise allow its use, provided that this is disclosed in the methods or acknowledgements section.

Transparency like this will be key to the adoption of AI in publishing, says Jessica Miles, Vice President for Strategy and Investments at Holtzbrinck Publishing Group, who debated in favor of AI in scholarly publishing at the closing plenary of the 2023 annual meeting of the Society for Scholarly Publishing. She pointed to the €2 billion invested by the STM publishing industry since 2000 to digitize the scholarly record, make it more findable, and safeguard its integrity by developing tools to identify plagiarism and other types of fraud. She said, “These examples of how we’ve sustained trust and transparency by supporting industry-wide standards, and developed technology and infrastructure in response to transformation, provide a blueprint for how academic publishing will continue to evolve and endure in response to AI.”

Peer review/Confidentiality

Every journal editor will tell you that the greatest pain point in scholarly publishing is finding qualified and willing reviewers. Due to the surge in manuscript submissions, increasing research specialization, limited volunteer time, and a cultural shift towards work-life balance, there just aren’t enough reviewers. Can AI help?

The US National Institutes of Health (NIH) says no. In a June 2023 blog post, NIH emphasizes that using AI tools to analyze or critique NIH grant applications is a breach of confidentiality, because “no guarantee exists explaining where AI tools send, save, view, or use grant application, contract proposal, or critique data at any time.” The examples given describe uploading a proposal or manuscript to an LLM and asking it to write a first draft of the review — a two-minute task that would save hours of time for human reviewers.

But the scenario described by the NIH lacks nuance about the way AI tools can, and possibly should, be used. For example, LLMs can help reviewers conduct literature reviews without disclosing manuscript-specific information. They can also help identify overlooked seminal work in a research area. Roblyer says, “That can be really tedious for reviewers now, to find the right papers, who else has published in this area, does the current paper point to the relevant publications or not.”

Even the confidentiality issue can be overcome. Bennett Landman, chair of the Department of Electrical and Computer Engineering at Vanderbilt University and Editor-in-Chief of the Journal of Medical Imaging, notes that institutions like Vanderbilt are increasingly licensing privately owned LLMs that protect privacy. “That’s an addressable problem,” he says.

On the peer review question, Landman believes that rules strictly banning LLMs are futile because people will find a way around them. He continues, “All of this regulation and ‘thou-shalt-not’ generates cheating potential. Why don’t we take the problem head on? We should have an LLM input for reviews but shouldn’t mistake it for a human input. Human reviewers should be able to comment whether the LLM is correct.” He notes that asking ChatGPT to do a first review would avoid duplicating effort across reviewers, and editors might get deeper insight from reviewers who are not focused on language, structure, and clarity.


When used correctly, LLMs have the potential to help manage the enormous inflow of scholarly research that needs to be screened and vetted before it can be published. Journal editors, in particular, must make many important decisions in limited time. They must assess a paper’s relevance to the journal and its novelty. Chatbots can present this type of abstract compare/contrast information much more clearly than traditional internet search tools, with the usual caveat that their conclusions must be validated.

“You can see it as a useful tool that helps an editor be better,” says Roblyer, “but will it incentivize the same work that’s been highly cited in the past? Will it cause the whole field to be more conservative because you’re pushing things in the direction of a model trained on old data?”

That’s one concern, as is the potential for reinforced bias: LLMs rely on statistical relationships between written words. In fact, according to Landman, “ChatGPT by definition embodies the bias of our society.” When LLMs generate new text, they can propagate word relationships that are no longer accurate, or even downright harmful.

An October 2023 study by Omiye et al. in npj Digital Medicine reveals that LLMs, trained on outdated and sometimes discredited medical information, can perpetuate racism. For instance, the chatbots made false race-based medical assertions when queried about lung capacity and kidney function. This raises concerns about using LLMs in healthcare for assisting with diagnoses or treatments until these biases can be resolved.


Fraud, particularly from paper mills producing fake research papers, is an escalating issue in scholarly publishing that has been reported extensively in Photonics Focus and elsewhere. Previously, these fraudulent papers were easily identifiable due to poor grammar, structure, and substance. However, the advent of LLMs, which excel at generating grammatically perfect text with a specific tone, has made these fake papers harder to detect.

As Landman puts it, “ChatGPT is very good at talking to us like a trusted friend, but it has no social construct that makes it know it’s our trusted friend.” LLMs are a tool without a moral compass and rely on their operators to provide one — just like every other tool.

According to Miles, “It is people, not technology like AI, that fuel these threats. People can, working collaboratively, develop and implement strategies for overcoming these crises.”

Let’s hope she is right, because the same language proficiency that makes LLMs useful for malfeasance also makes them a very helpful tool for people whose first language is not English. English-language editing is the primary usage reported by 55 percent of the people surveyed in Van Noorden and Perkel’s Nature article.

LLMs make scholarly publishing more accessible to people who have historically struggled to publish and advance in their careers due to a language barrier. These tools allow a researcher to focus their efforts on the science rather than a grammatically perfect English-language manuscript.

Landman gives a useful analogy: “Before widespread use of calculators, my parents spent a lot of time learning slide rules and multiplication tricks that a cheap calculator can do now. We still learn our multiplication tables, but that education now peters out in middle school. It transforms into geometric reasoning and tables, and word problems, and structuring the problem so you can do it on a calculator.

“But we’re still writing paragraphs the same way as a 1940s textbook,” Landman continues. “If we have writing tools, much like a calculator for writing, that remind you what 7 x 13 is, and you don’t need to think about it, then could we teach kids to reason at a higher level earlier? Could we get people to structure the argument and the meaning of the language at a deeper level than the language itself?”

And here lies one of the greatest uses of generative AI: it can potentially equalize opportunities for researchers in non-English-speaking countries. In the future, scientists might skip the effort of learning English altogether. They could conduct their research and write their papers in their own language, submit them for publication in that language, and rely on integrated LLMs for on-demand translation for reviewers, editors, and readers in each of their preferred languages.

The issues with generative AI chatbots are known: accountability, potential to propagate bias, and not-always-reliable accuracy. But they can also help us humans to think in different and creative ways, and to streamline tedious tasks, which may ultimately be a boon for burdened researchers.

“Think of GPTs not as a database but as a large collection of extremely smart economists, historians, scientists, and many others whom you can ask questions,” says Cowen. “If you ask them what was Germany’s GDP per capita in 1975, do you expect a perfectly correct answer? Well, maybe not. But asking them this question isn’t the best use of their time or yours. A little bit of knowledge about how to use GPT models more powerfully can go a long way.”

Six centuries after the invention of the printing press, we can say confidently that history ruled in its favor. How will history rule on the introduction of generative AI? As the balance tips back and forth, it’s not yet clear whether AI will be a boon to scholarly publishing or a thorn in its side. What is certain is that this conversation has just begun.



8 Thoughts on "Guest Post — Hanging in the Balance: Generative AI Versus Scholarly Publishing"

Great article, thank you. The key messages we should be conveying as a community are that chatbots don’t ‘understand’ anything, so they shouldn’t be used for advice or decision-making, and that they ‘don’t do truth’. Last but not least: if the AI makers don’t tell us where their training data came from, perhaps we should not be using their bots (for they may have ingested incorrect, or even retracted, research, which can have very serious consequences).

Yes, these are good takeaways. I hope that transparent training data will become a stamp of trust for AI tools.

Thanks for this perspective; why must the decision be posed to us as publication professionals in this way, though? Why can’t we see it as a dialectic? A both/and, rather than an either/or? Why can’t we see this as a time of great opportunity, to see the power and promise that generative AI affords to help speed time to decision, improve access to content, and enable underserved communities who may never otherwise read or understand technical information to derive value from its usage? In a word, why can’t we see generative AI’s ability to augment?

Thanks for the question. I believe (hope, really) that AI will settle into a both/and dialectic for both science research and scholarly publishing. Early knee-jerk reactions to ban use of LLMs have already softened (see AAAS), and tools to detect unethical uses of AI are under rapid development. Hopefully most publishers and research institutions will find ways to embrace the augmentations that AI offers, while shoring up defenses against the vulnerabilities it could be used to target.

Thanks for taking the time to share your response; it is greatly appreciated. You could add the International Society for Medical Publication Professionals (ISMPP) to that list, as well as the Healthcare Communications Association (HCA), both of which have issued responsible guidelines for practitioners in this space. In full transparency, I’ll state that I co-chair the Artificial Intelligence Task Force for ISMPP and was the lead author on HCA’s AI Roadmap.

– Matt Lewis, MPA
Global Chief Artificial and Augmented Intelligence Officer, Inizio Medical.

Because many of the tools are of really, really bad quality. So sure, we can argue that cars are useful. But there are cars that protect neither the driver nor the pedestrian while also polluting a little bit more than others. That’s why folks are having these discussions. More examples here: https://arxiv.org/pdf/2312.04350.pdf, which make LLMs a bit less suitable for healthcare, finance, and law, for example.

Gwen, thanks for the article. I agree about language model quality. I would advocate a conscious choice of model (open source or commercial, large or small, or industry specific), along with the tool that gives you access to the model. To some of the points above, transparency about training data and quality is available on Hugging Face (HF), where you can access model card information for your “model of choice” covering training data and even how much CO2 was emitted to train the model (some of the CO2 stats are quite alarming in terms of what they equate to in the real world).
Training the latest version of one popular language model produced emissions equivalent to traveling 879,548 km by electric train.

There is also an interesting benchmark system called TruthfulQA, which allows language model suppliers of all shapes and sizes to score their models’ responses on being truthful and informative.

I believe an HF co-founder is a university academic, so it’s grounded in research rather than technology.
