Our expectations of the reliability of information resources exist on a spectrum. On one end of that spectrum is the land of fiction, where facts aren’t terribly important. Think of coming up with a bedtime story for one’s child: the bar for truth-telling in that kind of story needn’t be high, if one exists at all. Then imagine proceeding up the truthiness chain to a child’s elementary or middle school paper. The exact facts aren’t as important as the general throughlines or main concepts, and one might not criticize the writer if every detail isn’t perfectly related. A collegiate paper or a news article has a significantly higher bar for accuracy and truth. Here, perhaps, some level of bias or incompleteness remains, whether from the limits of the student’s research and knowledge or from the availability of facts as news is reported in real time. These may cause some errors to flow into the writing, and some may be forgivable. Even higher levels of accuracy are demanded in scholarly publishing, where the details might have serious consequences.

Scholarly publishing is most recognized for the characteristic rigor of its peer review process and the editorial vetting that occurs prior to distribution. While certainly not a perfect system, the process has proven resilient and remains the gold standard for publication rigor. If a new manufacturing process is researched and developed to produce bridges, vehicle systems, or airplane parts, we collectively want to trust that the best methods and review were applied before these products were released into the world for millions of people to use.
Now imagine you are in a hospital, and the doctor is reviewing your case and your health information to determine a diagnosis and treatment options for you. The doctors may now, or in the near future, turn to an AI system to identify your malady and recommend a course of medications. Would you want that system to draw on the best, most up-to-date, vetted science, or on a preprint, or possibly just something that looks scientific that was posted on Reddit? I, for one, certainly would want the tool to be trained on, and to represent, only the highest-quality materials when the AI system provides guidance on care for me or my loved ones. Unfortunately, the current suite of popular Generative AI (GenAI) tools doesn’t match this level of accuracy and trustworthiness. However, there certainly is a path toward improvement.
Yesterday, the International Association of Scientific, Technical & Medical Publishers (STM) released a consultation brief, Toward Responsible Use of Research Content in Generative AI, that plants a flag around which AI tool providers and publishers concerned foremost with verifiability, accuracy, and trust can rally. GenAI tools are making significant progress toward improving the research workflow, in content discovery, accessibility, summarization, and even in the discernment of new scientific approaches. However, we don’t want to lose what is valuable in the current ecosystem when we adopt these new tools into the research process, nor risk the implications for the people who rely on these tools for making important decisions, where accuracy is paramount. Addressing these concerns was the goal of the STM group that produced this white paper, which sets out an ambitious approach to improving these systems.
“If GenAI tools are to be trusted in scholarly contexts, it is important that they align with the core norms of scholarly communication and research practice, such as peer review, the Version of Record, attribution, citation, corrections, and transparency,” said Henning Schoenenberger, Vice President Content Innovation at Springer Nature and co-chair of the STM Task and Finish Group that produced the report. He continued, “Without clear and shared expectations for how research content is processed and presented in GenAI systems, there is a real risk of misinformation and erosion of trust, not to mention potential harm when it comes to sensitive domains like medicine.”
Significant issues led to the development of this report and to the concern among the research publishers that make up the STM Association. Aaron Wood, Head of Product and Content Management for the American Psychological Association and the other co-chair of the group, noted, “the significant risks associated with the unregulated use of research content in GenAI, such as the hallucination of citations, the spread of biases, incomplete or outdated information, and the potential for misinformed research threaten to erode trust in the scholarly process and could lead to direct harm.”
To address these concerns, the new STM paper lays out key considerations for GenAI systems as they interact with scholarly content and become a lens through which people encounter it. The report begins with, and stresses, the need for respect of copyright as a fundamental value. This is true of both open and subscribed content, since open access content still carries a requirement of attribution, which most AI systems neglect. Building upon this, the paper includes other important themes: verifiability through attribution and citation; differentiation of peer-reviewed and non-peer-reviewed content, and prioritization of the Version of Record; inclusion of retractions, corrections, and rebuttals in content processing; bias mitigation; privacy protections; and transparency throughout the process. Importantly, the paper summarizes the key values of scholarly publishing, why each matters, and the risks of deprioritizing those values in GenAI systems.
While the paper doesn’t delve into technical detail about how AI systems work or offer direct solutions, it is illustrative to consider how scholarly research outputs and practices apply at each level of the GenAI technology stack. This grounding shapes how the recommendations apply to the ways GenAI systems interact with content at different processing layers, and it gives some sense of how interconnected and nuanced these issues can be.
As a user interacts with a GenAI system, multiple processes work behind the scenes to produce an output. There is the basic layer of training for the model, which is how the system comes to “understand” language. Think of this as something like the operating system. This training is conducted with millions or possibly billions of content objects, with the goal of absorbing as much background knowledge of language and its structure as possible. At this layer, it is important to have clarity about what content is selected and how comprehensive those sources are, what rights and permissions may be attached to it, and what identifiers and metadata connect back to a source object.
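The kinds of provenance details the paper calls for at the training layer can be pictured as a simple metadata record attached to each content object before it enters a corpus. The following is a minimal sketch; the record fields, the `eligible_for_training` check, and the license values are illustrative assumptions, not part of the STM framework.

```python
from dataclasses import dataclass

@dataclass
class TrainingContentRecord:
    """Hypothetical provenance metadata for one content object in a
    GenAI training corpus. All field names are illustrative."""
    doi: str              # persistent identifier linking back to the source
    license: str          # rights and permissions, e.g. "CC-BY-4.0"
    peer_reviewed: bool   # peer-reviewed vs. preprint or other material
    version: str          # e.g. "VoR" (Version of Record) or "preprint"
    retracted: bool = False  # updated if the scholarly record changes

def eligible_for_training(rec: TrainingContentRecord,
                          allowed_licenses: set) -> bool:
    # Respect rights, and require a traceable link back to the source object.
    return rec.license in allowed_licenses and bool(rec.doi)

rec = TrainingContentRecord(doi="10.1000/example", license="CC-BY-4.0",
                            peer_reviewed=True, version="VoR")
print(eligible_for_training(rec, {"CC-BY-4.0"}))  # True
```

Even a record this small captures the two things the paper emphasizes at this layer: rights clarity before ingestion, and identifiers that allow outputs to be connected back to their sources.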
The next layer is called the inference layer, where quality indicators, biases, and safeguards are applied so that the system can generate an output in a process called “tuning” the model. In a simplistic analogy, this is the stage in which a model is further trained to be polite, or avoid encouraging self-harm, be better at coding than at narrative, or perhaps be more rigorous in fact checking. At this level, issues of biases, incorporation of updates to the scholarly record and expectations of privacy and prioritization are important considerations related to scholarly research. Tying this stage specifically to the guidance for GenAI for scholarly research applications in the report, Schoenenberger said, “Blurring validated and non-validated content, GenAI systems often fail to distinguish peer-reviewed research from preprints, retracted papers, or non-scholarly sources, undermining reliability.”
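One way an inference-time safeguard of this kind could work is to filter and rank candidate sources before the system generates a response: dropping retracted items and preferring the Version of Record over preprints. This is a hedged sketch of that idea only; the dictionary structure, status labels, and ranking order are assumptions for illustration.

```python
def rank_sources(candidates: list) -> list:
    """Hypothetical inference-layer safeguard: exclude retracted items,
    then order the rest so validated content comes first."""
    usable = [c for c in candidates if not c.get("retracted", False)]
    # Version of Record first, then other peer-reviewed content, then the rest.
    order = {"VoR": 0, "peer_reviewed": 1}
    return sorted(usable, key=lambda c: order.get(c.get("status"), 2))

candidates = [
    {"doi": "10.1000/a", "status": "preprint"},
    {"doi": "10.1000/b", "status": "VoR"},
    {"doi": "10.1000/c", "status": "VoR", "retracted": True},
]
print([c["doi"] for c in rank_sources(candidates)])
# ['10.1000/b', '10.1000/a']
```

The retracted Version of Record is removed entirely rather than ranked, reflecting the report’s point that updates to the scholarly record need to be incorporated, not merely deprioritized.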
The final layer where these ideas will play a role is the representation layer: the user experience of how content is presented to the end-user. At this layer, tools might need consistent ways to signal things like citations, sourcing, and attribution, much as footnotes do. Similar to NISO’s work on the Open Discovery Initiative, which gave discovery tools a way to signal the sources a search tool is indexing, there might also be ways for GenAI tools to provide information on the sources a system is drawing from in its responses. Likewise, consistent methods by which tool providers signal, in a user-result display, the peer-review status of a source, or whether the material is a preprint or the Version of Record, are also needed.
Advancing these issues will not be simple. Wood elaborated on this noting, “there are significant technical and legal hurdles, such as ensuring that GenAI training and inference layers respect copyright laws and licensing agreements. There are also implementation challenges, such as developing reliable methods to identify the original source of an idea or ensuring that AI-generated responses are updated when a paper is retracted or corrected.”
Schoenenberger summarized by adding “The framework also speaks to policymakers by offering a concrete, domain‑specific articulation of ‘responsible AI’ grounded in reliable scholarly norms, helping to inform balanced regulation without constraining innovation.” Realistically, with technology advancing and being adopted as quickly as it has, existing processes for standards development or regulation are having trouble keeping pace. The paper suggests a lighter-weight model to advance some of these ideas among scholarly-focused AI tool providers.
One concrete proposal in the paper is the development of technical pilot implementations in which publishers and tool providers work through some of these issues together. Schoenenberger described how “early pilots [with] GenAI providers to prioritize the Version of Record, to clearly distinguish peer‑reviewed content, to provide verifiable attribution and citations, as well as to surface corrections and retractions in outputs would be a great signal for a joint commitment.” Stakeholders interested in these discussions are encouraged to reach out to STM directly to engage in these next steps.
Looking forward, Wood envisions a process whereby the players can create a useful dialogue. A positive outcome in the near term would be the transition from this discussion document into a collaborative dialogue between publishers, tech providers, and the research community. “Ideally, this would lead to a shared vision for the responsible use of research content and how GenAI could align with the values and standards that underpin trusted research.” Particularly for tool providers aiming to serve the scholarly marketplace, where trustworthiness of content is paramount, the hope is that an explicit agreement to maintain existing norms is obvious and achievable. With this as a basis, the community can partner to implement these existing principles in new tools and applications for research. It is in everyone’s interest — AI tool developers, publishers, authors, librarians, and especially researchers — to make these new tools as reliable and trustworthy as possible. Schoenenberger concluded by saying, “GenAI can greatly accelerate discovery without undermining scholarly values – if built and governed responsibly.” STM’s framework is a good place to start, and from there to launch the necessary dialogue, pilots, and consistent practice across the community.