Our expectations of the reliability of information resources exist on a spectrum. At one end of that spectrum is the land of fiction, where facts aren't terribly important. Think of coming up with a bedtime story for one's child: the bar for truth-telling in that kind of story needn't be high, if one exists at all. Then imagine proceeding up the truthiness chain to a child's elementary or middle school paper. The exact facts aren't as important as the general throughlines or the main concepts, and one might not criticize if every detail isn't perfectly related. A collegiate paper or a news article has a significantly higher bar for accuracy and truth. Here, perhaps, some level of bias or incompleteness remains, whether in the research or knowledge of the student or in the availability of facts as news is reported in real time. These may cause some errors to flow into the writing, and some may be forgivable. Even higher levels of accuracy are demanded in scholarly publishing, where the details might have serious consequences.

Scholarly publishing is most recognized for the characteristic rigor of the peer review process and editorial vetting prior to distribution. While certainly not a perfect system, the process has proven resilient and remains the gold standard for publication rigor. If a new manufacturing process is researched and developed to produce bridges, vehicle systems, or airplane parts, we collectively want to trust that the best methods and review were applied before those products were released into the world for millions of people to use.
Now imagine you are in a hospital, and the doctor is reviewing your case and your health information to determine a diagnosis and treatment options for you. The doctor may now, or in the near future, turn to an AI system to identify your malady and recommend a course of medications. Would you want that system to draw on the best, most up-to-date, vetted science, or on a preprint, or possibly just on something that looks scientific that was posted on Reddit? I, for one, would certainly want the tool to be trained on, and to represent, only the highest-quality materials when it provides guidance on care for me or my loved ones. Unfortunately, the current suite of popular Generative AI (GenAI) tools doesn't match this level of accuracy and trustworthiness. There is, however, certainly a path toward improvement.
Yesterday, the International Association of Scientific, Technical & Medical Publishers (STM) released a consultation brief, Toward Responsible Use of Research Content in Generative AI, that plants a flag around which AI tool providers and publishers concerned foremost with verifiability, accuracy, and trust can rally. GenAI tools are making significant progress toward improving the research workflow, in content discovery, accessibility, summarization, and even in the discernment of new scientific approaches. However, we don't want to lose what is valuable in the current ecosystem when we adopt these new tools into the research process, nor do we want to create risks for the people who rely on these tools to make important decisions, where accuracy is paramount. Addressing these concerns was the goal of the STM group that produced this white paper, which sets out an ambitious approach for how to improve these systems.
“If GenAI tools are to be trusted in scholarly contexts, it is important that they align with the core norms of scholarly communication and research practice, such as peer review, the Version of Record, attribution, citation, corrections, and transparency,” said Henning Schoenenberger, Vice President Content Innovation at Springer Nature and co-chair of the STM Task and Finish Group that produced the report. He continued, “Without clear and shared expectations for how research content is processed and presented in GenAI systems, there is a real risk of misinformation and erosion of trust, not to mention potential harm when it comes to sensitive domains like medicine.”
Significant issues led to the development of this report and to the concern among the research publishers that comprise the STM Association. Aaron Wood, Head of Product and Content Management for the American Psychological Association and the other co-chair of the group, noted, "the significant risks associated with the unregulated use of research content in GenAI, such as the hallucination of citations, the spread of biases, incomplete or outdated information, and the potential for misinformed research threaten to erode trust in the scholarly process and could lead to direct harm."
To address these concerns, the new STM paper lays out key considerations for GenAI systems as they interact with scholarly content and become a lens through which people encounter it. The report begins by stressing respect for copyright as a fundamental value. This holds for open as well as subscribed content, since open access content still carries a requirement of attribution, which most AI systems neglect. Building upon this, the paper includes other important themes: verifiability through attribution and citation; differentiation of peer-reviewed from non-peer-reviewed content and prioritization of the Version of Record; inclusion of retractions, corrections, and rebuttals in content processing; bias mitigation; privacy protections; and transparency throughout the process. Importantly, the paper summarizes the key values of scholarly publishing, why each matters, and the risks of deprioritizing those values in GenAI systems.
While the paper doesn't delve into technical detail about how AI systems work or offer direct solutions, it is illustrative to consider how scholarly research outputs and practice apply at each level of the GenAI technology stack. This grounding shapes how the recommendations apply as GenAI systems interact with content at different stages of processing, and it gives some sense of how interconnected and nuanced these issues can be.
As a user interacts with a GenAI system, multiple processes work behind the scenes in producing an output. There is the base layer of training the model, which is how the system comes to "understand" language. Think of this as something like the operating system. This training is conducted on millions or possibly billions of content objects, with the aim of absorbing as much background knowledge of language and its structure as possible. At this layer, it is important to have clarity about what content is selected and how comprehensive those sources are, what rights and permissions are attached to that content, and what identifiers and metadata connect back to the source objects.
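To make this concrete, here is a minimal sketch (in Python) of what a training-corpus record carrying that kind of provenance might look like. The field names, placeholder DOIs, and eligibility check are illustrative assumptions, not a schema proposed in the STM brief.

```python
from dataclasses import dataclass

@dataclass
class TrainingRecord:
    """One content object in a hypothetical training corpus, with provenance attached."""
    text: str                # full text supplied to the training pipeline
    doi: str                 # persistent identifier connecting back to the source object
    license: str             # rights/permissions under which the text may be used
    version: str             # e.g. "VoR", "preprint", "accepted manuscript"
    peer_reviewed: bool      # whether the item passed peer review
    retracted: bool = False  # flagged if the scholarly record was later corrected

def eligible_for_training(rec: TrainingRecord, allowed_licenses: set[str]) -> bool:
    """Minimal eligibility check: respect licensing and skip retracted items."""
    return rec.license in allowed_licenses and not rec.retracted

corpus = [
    TrainingRecord("…full text…", "10.1234/placeholder.1", "CC-BY", "VoR", True),
    TrainingRecord("…full text…", "10.1234/placeholder.2", "all-rights-reserved", "preprint", False),
]
usable = [r for r in corpus if eligible_for_training(r, {"CC-BY", "CC-BY-NC"})]
print(f"{len(usable)} of {len(corpus)} records cleared for training")
```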
The next layer is called the inference layer, where quality indicators, biases, and safeguards are applied so that the system can generate an output, in a process called "tuning" the model. In a simplistic analogy, this is the stage in which a model is further trained to be polite, to avoid encouraging self-harm, to be better at coding than at narrative, or perhaps to be more rigorous in fact-checking. At this level, bias, the incorporation of updates to the scholarly record, expectations of privacy, and prioritization are important considerations for scholarly research. Tying this stage specifically to the report's guidance for GenAI in scholarly research applications, Schoenenberger said, "Blurring validated and non-validated content, GenAI systems often fail to distinguish peer-reviewed research from preprints, retracted papers, or non-scholarly sources, undermining reliability."
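Continuing the hypothetical TrainingRecord sketch above, one way to picture how these concerns enter the tuning stage is as a weighting of the material the model is further trained on: retracted items are dropped, and peer-reviewed Version-of-Record content is favored over preprints and unvetted sources. The weights below are illustrative, not values recommended in the report.

```python
def tuning_weight(rec: TrainingRecord) -> float:
    """Illustrative sampling weight for the tuning stage (not values from the report)."""
    if rec.retracted:
        return 0.0  # retracted items should not shape the tuned model
    if rec.peer_reviewed and rec.version == "VoR":
        return 1.0  # favor the peer-reviewed Version of Record
    if rec.peer_reviewed:
        return 0.6
    return 0.2      # unvetted material contributes least, if at all

# Build the pool a tuning run would sample from, reusing `usable` from the sketch above.
tuning_pool = [(rec, tuning_weight(rec)) for rec in usable if tuning_weight(rec) > 0]
```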
The final layer where these ideas will play a role is the representation layer, the user experience through which content is presented to the end user. At this layer, tools might need consistent ways to signal things like citations, sourcing, and attribution, much as footnotes do. Similar to NISO's work on the Open Discovery Initiative, which gives discovery tools a way to signal the sources a search tool is indexing, there might also be ways for GenAI tools to disclose the sources a system draws on to provide responses. Similarly, consistent methods are needed for how tool providers signal, in a user-result display, peer-review status or whether the source material is a preprint or the Version of Record.
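One hypothetical way for the representation layer to carry those signals is to return structured source metadata alongside the generated text, so the interface can label version, peer-review status, and retraction state consistently. The structure below is an illustrative sketch, not a standard proposed by STM or NISO.

```python
from dataclasses import dataclass, field

@dataclass
class CitedSource:
    """Metadata a GenAI interface could surface next to a generated answer."""
    doi: str
    title: str
    version: str         # "VoR" vs "preprint", so the interface can label it
    peer_reviewed: bool
    retracted: bool = False

@dataclass
class GeneratedAnswer:
    text: str
    sources: list[CitedSource] = field(default_factory=list)

def render(answer: GeneratedAnswer) -> str:
    """Render the answer with footnote-style source signals."""
    lines = [answer.text, ""]
    for i, s in enumerate(answer.sources, start=1):
        status = "peer-reviewed" if s.peer_reviewed else "not peer-reviewed"
        flag = " [RETRACTED]" if s.retracted else ""
        lines.append(f"[{i}] {s.title} ({s.version}, {status}), doi:{s.doi}{flag}")
    return "\n".join(lines)
```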
Advancing these issues will not be simple. Wood elaborated on this noting, “there are significant technical and legal hurdles, such as ensuring that GenAI training and inference layers respect copyright laws and licensing agreements. There are also implementation challenges, such as developing reliable methods to identify the original source of an idea or ensuring that AI-generated responses are updated when a paper is retracted or corrected.”
Schoenenberger summarized by adding, "The framework also speaks to policymakers by offering a concrete, domain‑specific articulation of 'responsible AI' grounded in reliable scholarly norms, helping to inform balanced regulation without constraining innovation." Realistically, with technology advancing and being adopted as quickly as it is, existing processes for standards development or regulation are having trouble keeping pace. The paper suggests a lighter-weight model to advance some of these ideas among scholarly-focused AI tool providers.
One concrete proposal in the paper is the development of technical pilot implementations in which publishers and tool providers work through some of these issues together. Schoenenberger described how "early pilots [with] GenAI providers to prioritize the Version of Record, to clearly distinguish peer‑reviewed content, to provide verifiable attribution and citations, as well as to surface corrections and retractions in outputs would be a great signal for a joint commitment." Stakeholders interested in these discussions are encouraged to reach out to STM directly to engage in these next steps.
Looking forward, Wood envisions a process whereby the players can create a useful dialogue. A positive outcome in the near term would be the transition from this discussion document into a collaborative dialogue between publishers, tech providers, and the research community. "Ideally, this would lead to a shared vision for the responsible use of research content and how GenAI could align with the values and standards that underpin trusted research." Particularly for tool providers aiming to serve the scholarly marketplace, where the trustworthiness of content is paramount, the hope is that an explicit agreement to maintain existing norms is obvious and achievable. With this as a basis, the community can partner to build these existing principles into new tools and applications for research. It is in everyone's interest — AI tool developers, publishers, authors, librarians, and especially researchers — to make these new tools as reliable and trustworthy as possible. Schoenenberger concluded by saying, "GenAI can greatly accelerate discovery without undermining scholarly values – if built and governed responsibly." STM's framework is a good place to start from and then to launch the necessary dialogue, pilots, and consistent practice across the community.
Discussion
2 Thoughts on "STM Plants a Flag About Responsible Use of Research Content in GenAI"
Hi Todd,
Your articles make me think. Today I have two, three, maybe four thoughts after reading the article.
First, while I know that GenAI is short for Generative AI, it also can be thought of as General AI. I am not sure General AI will ever meet the needs of scholars. Google built Google Scholar. We will see ScholarAI soon. It will not be a single tool, but multiple tools that do very specific things. It will not just write articles. There will be a tool specifically trained on Medicare data, where scholars can go to get trusted insights into the data there. It will be a place where peer reviewers can go to get insights into the papers they are reviewing. These tools will not just share their magic results but detail the process that got them to the results.
Second, this will of course require open data. Publishers have been responding to the need for OA articles. Researchers have to share the data that results from their work. It should be a mandate of funders.
Third, I worry that today authorship has been relegated to graduate students and fellows. If AI helps them, that is okay. We need to see authorship return to being an important feature of a researcher's job. There have to be consequences for failure. You submit an article to the Journal of AI Authorship and the checks identify X number of fictitious citations, and you are banned from submitting to the journal for a year. Do it a second time and the ban is 10 years. The ban applies to all five authors on the article. A retraction results in a three-year ban.
Fourth, publishers have a lot of data on the content they publish that should be included in the AI submission checks. We tag articles with various metadata. We have data on how many citations and downloads each article has received. One piece of information an editor should receive when a submission is passed to them is how well similar articles have performed in the past. An author submits an article, "When do Mandarin-speaking children with both a cochlear implant and a hearing aid begin to talk?" The checker confirms all the details: the email was confirmed, the author has a valid ORCID iD, the citations are clean. The implication is that this is good science. The checker also reports that the publisher has published two similar articles, with links to them. Neither received a single citation in four-plus years. One was downloaded 89 times, and one was downloaded 112 times.
Not sure everything here is on target. I hope it is worth the read.
Hi Todd, great to see STM engaging proactively on this topic. I have some clarifications on the terms and timing implied in your article, which are hopefully helpful. The inference phase is not the same as the tuning / fine-tuning phase. During pre-training and fine-tuning the mathematical weights in the model are not frozen; the model is still learning and being adjusted. During inference, the model weights are frozen. The article above states that "As a user interacts with a GenAI system, multiple processes work behind the scenes in producing an output" – I want to clarify for readers that not all of these phases are happening as we enter our prompts into the common LLMs. Pre-training and fine-tuning have already happened; usually only the inference phase is happening when we enter our prompt.
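For readers who want to see this distinction in code, the toy PyTorch fragment below contrasts the two: during training or fine-tuning, gradients update the weights; during inference the weights are frozen and the model only produces outputs. The tiny linear model is a stand-in for illustration, not a representation of how any commercial LLM is actually run.

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 2)          # stand-in for a (vastly larger) language model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Training / fine-tuning: weights are NOT frozen; gradients update them.
x, target = torch.randn(4, 8), torch.randn(4, 2)
loss = nn.functional.mse_loss(model(x), target)
loss.backward()
optimizer.step()                 # the model's weights have now changed

# Inference: weights are frozen; the model only maps inputs to outputs.
model.eval()
with torch.no_grad():            # no gradients, no weight updates
    prediction = model(torch.randn(1, 8))
```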
However, even though the model weights are frozen, LLMs can of course still be directed to behave in certain ways during inference. Our initial context window looks blank, but most LLMs inject an unseen pre-prompt of text and instructions into the context window to remind the model what to do and not to do. During inference we can also tell a model to look only at certain content, e.g., to run inference only on peer-reviewed scientific content from particular sources (Model Context Protocol (MCP), Retrieval-Augmented Generation (RAG), etc.).
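A minimal sketch of that idea, assuming a hypothetical vector index that holds only peer-reviewed content, might look like this: an unseen system instruction (the "pre-prompt") plus retrieved passages are assembled around the user's question at inference time. The retrieval call is a stub; real RAG or MCP plumbing would replace it.

```python
# Schematic only: the "pre-prompt" (system message) and retrieved passages are
# injected around the user's question at inference time; no weights are updated.
SYSTEM_PROMPT = (
    "Answer only from the peer-reviewed passages provided below. "
    "Cite the DOI of every passage you rely on. "
    "If the passages do not contain the answer, say so."
)

def retrieve_peer_reviewed(question: str, index) -> list[dict]:
    """Stub: query a vector index assumed to hold only peer-reviewed content."""
    return index.search(question, top_k=3)  # hypothetical index API

def build_messages(question: str, passages: list[dict]) -> list[dict]:
    """Assemble the chat messages actually sent to the model."""
    context = "\n\n".join(f"[doi:{p['doi']}] {p['text']}" for p in passages)
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"{context}\n\nQuestion: {question}"},
    ]
```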
One other note on terminology – pre-training / fine-tuning / inference are best referred to as phases, rather than layers. In the context of LLMs and neural networks, layers are the multiple data-processing steps within a model through which the data is passed and weighted sums are computed.
For publishers, researchers, and indeed all users, it is always important to ask: if I input my data into a model, is it using that data to run fine-tuning or just inference? In the former case, it is ingesting that knowledge / IP into the model's weights and improving the provider's product. In the latter case, the model weights are not updated and your input remains separate. However, even then, users should always look carefully at the data retention policies of the LLM being run, as LLMs may still retain your data for abuse monitoring even if they are not training on it (e.g. https://developers.openai.com/api/docs/guides/your-data).
Finally, for those of you wanting to understand more about how these models work, let me recommend these excellent videos below from Andrej Karpathy, co-founder of OpenAI. They are very long but accessible and rewarding to the non-specialist. In the second one he pre-trains an LLM from scratch and you can see the mathematical magic start to take shape as he creates new Shakespearean text!
Deep Dive into LLMs like ChatGPT
https://www.youtube.com/watch?v=7xTGNNLPyMI
Let’s build GPT: from scratch, in code, spelled out.