NOTE: This is the first of a multi-part post series on the issue of provenance tracking in generative artificial intelligence systems and its implications for usage and assessment. This series is one output of a workshop hosted by NISO, COUNTER, and Cambridge University Press in May in support of greater standardization related to output tracking, recognition and assessment for AI systems. Other upcoming pieces in this series will cover the meeting and its outputs, as well as forthcoming community work related to provenance tracking and usage.
“We stand on the shoulders of giants” is a famous quote that sums up the nature of reference and why recognizing the work and discoveries of those who came before us is so important. There’s a fascinating history behind that statement of Sir Isaac Newton, but its meaning shouldn’t be diminished even though Newton might have meant it as an insult against his colleague and scientific rival. Countless others prior to Newton recognized the value of referencing one’s predecessors to buttress an argument. It was a rhetorical technique described by the Greeks and the Romans, centuries before Newton began theorizing about how bodies move or the mathematics of change. The entire structure of scholarly communications is built upon how each piece of research connects with, and builds upon, other findings.

Despite this long history, our understanding of how content is connected to other content in a vast network of interconnected research outputs is being challenged by technological developments. Understanding what AI tools are fundamentally doing when they link back to sources is critical to evaluating them for research purposes. In a larger context, the extent to which some AI tools have the potential to break the chain of research, and how they then put things back together, will be important to their adoption and to people’s understanding of their trustworthiness. Failure to embed provenance and connect future claims to their sources could risk undermining the integrity of the scholarly record and even diminish authors’ motivation to share their results.
Let’s start with some definitions that are the most relevant to this discussion:
- Attribution – action of ascribing a work or remark to a particular author, artist, or person
- Reference – 1) provide (a book or article) with citations of sources of information or 2) mention or refer to
- Quotation – the act of repeating a passage, phrase, or short piece of writing taken from a book, play, speech, or similar source, and repeated because it is interesting, useful, or serves as an authority.
- Citation – a quotation from or reference to a book, paper, or author, especially in a scholarly work. Or (in Law) a reference to a former tried case, used as guidance in the trying of comparable cases or in support of an argument.
- Provenance – 1) The fact of coming from some particular source or quarter; origin, derivation. 2) The history or pedigree of an item (such as a work of art, manuscript, or rare book)
- Source – 3) a firsthand document or primary reference work
Many use these terms interchangeably in day-to-day usage, but the distinctions are very important, particularly in scholarly communications. In recent years, the imprecise application of these distinctions has led to career challenges for some administrators and even occasional gaming of the system for career advantage, driven by less-than-scholarly motives. We should carefully apply this same rigor to how we consider what AI systems are doing, how they function, and how we use their outputs. The historical record and the scholarly record are filled with primary source materials. These are the original ideas, papers, books, letters or other materials that convey ideas. An expanding list of AI tools includes links back to materials that were used in developing a response to a prompt. However, how these systems derive those links differs quite considerably. Whether and how these terms and their meanings are applied is critical to how much we trust the outputs of these tools, and it has downstream implications for things like assessment, usage and value tracking.
As noted above, I attributed a quote to Sir Isaac Newton. I did this in the first sentence, without noting where specifically Newton made this comment, so based on the first paragraph alone, you couldn’t go back and check that he did, in fact, make that statement. Moreover, what I wrote above is not an actual quote; it is a paraphrase of what he wrote. Specifically, Newton made this remark in a letter to a scientific rival, Robert Hooke, in 1675. In that letter, Sir Isaac Newton wrote:
“What Des-Cartes did was a good step. You have added much several ways, & especially in taking ye colours of thin plates into philosophical consideration. If I have seen further it is by standing on ye shoulders of Giants.”
While this is a reference, in the broader sense of referring to the letter from which the quote comes, it isn’t sufficient to find the source material. I might link to one of the many places on the internet where this full quote resides, such as Wikipedia, which for most people would be sufficient. This isn’t, however, what the scientific literature requires for an actual reference, in the sense of a formal link back to source material. For example, the above quote should be noted formally as:
Turnbull, H. W. ed., 1959. The Correspondence of Isaac Newton: 1661–1675, Volume 1, London, UK: Published for the Royal Society at the University Press. p. 416.
Although many attribute the quote to him, Newton wasn’t the first to use the phrase. Several variations date back to the 12th century. Whether we want the original source of the text, the implications of its use, or something else depends on why we are referencing Newton’s use of it. The appropriateness of the citation is dependent on the use case. The use case of a quotation and how long it is are also important considerations in whether the reproduction is fair use or not. This is a legal distinction, which has other implications about how much can be copied.
At this point, one might think the post has become rather pedantic and just a lesson in scholarly publishing 101. To a certain extent, that might be right. Beyond the fact that we should always reinforce basic knowledge, it is vitally important to refresh this understanding as we consider what AI systems are doing and what the implications of the differences might be. This foundation is an important place to start, particularly as these tools start to gain significant traction in the scholarly world. Fundamentally, what generative AI systems are doing when they reference the literature is quite different from what humans do as we reflect on our knowledge or research. How these systems use content, and how that use reflects the contributions of the authors, are both vital to the value that can be derived from these tools.
Let’s start with where this process might go wrong. Someone might write a paragraph about the protective effects of some vitamins in preventing a child from contracting measles (NOTE: Vitamins do NOT prevent measles!). Wanting to sound scientific or to make the claim appear more rigorous, they might then search the internet for some justification, and in that process they may or may not read the paper they link to. They can then insert the source into their invented story and claim it to be scientific and supported by other research. As Maryam Zaringhalam, Senior Director of Policy, Center for Open Science, commented during a recent event, there is enough open content that looks scientific to confirm whatever priors one might have.
Those who work in research would of course recognize this as being backwards from how real research is done. Unfortunately, it is not dissimilar to how some AI tools function (though certainly not all). A statement is made in a generative system and then citations are layered over the results after generation. This is not the same as how AI hallucinations happen, but there is often a lack of rigor in the process of generating content.
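To make the risk of after-the-fact citation concrete, here is a minimal sketch in Python. It assumes a deliberately crude matcher (shared words between a claim and a document), which is my simplification and not a description of how any particular AI tool works. The function names and the two snippets in the corpus are made up for illustration.

```python
import re

def tokens(text: str) -> set[str]:
    """Lowercase word set; a crude stand-in for real text matching."""
    return set(re.findall(r"[a-z]+", text.lower()))

def jaccard(a: set[str], b: set[str]) -> float:
    """Share of words two texts have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def attach_citation(claim: str, corpus: dict[str, str]) -> tuple[str, float]:
    """Pick the document whose wording best overlaps a claim that was ALREADY written."""
    claim_words = tokens(claim)
    scored = {doc_id: jaccard(claim_words, tokens(text)) for doc_id, text in corpus.items()}
    best = max(scored, key=scored.get)
    return best, scored[best]

# Hypothetical, made-up snippets for illustration only.
corpus = {
    "doc-A": "Vitamins do not prevent measles in children.",
    "doc-B": "Annual rainfall totals vary widely between regions.",
}
claim = "Vitamins prevent measles in children."

best_id, score = attach_citation(claim, corpus)
print(best_id, round(score, 2))  # doc-A 0.71
```

The matcher confidently attaches doc-A to the claim, even though doc-A says the opposite. The link looks like a citation, but it was never the origin of the statement, so it carries no provenance.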
One should be careful not to paint all AI tools as having this problem. Tools that rely on specific materials, say for summarization or analysis across a constrained, selected set of materials, as in a retrieval-augmented generation (RAG) model, can give the user a more deterministic connection to source materials, and therefore linking back to source content is not as difficult. Also, tools are rapidly developing, and many approaches to source linking are being explored, so this issue could be resolved for some (although not all) tools in time.
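As a rough illustration of why a constrained, retrieval-first design makes linking back easier, the following Python sketch selects passages before anything is generated and keeps each passage's source record attached. The class and function names, the identifiers, and the word-overlap ranking are all assumptions made for this example; a production RAG system would use a real retriever and a language model, both of which are omitted here.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Passage:
    source_id: str  # hypothetical identifier; a real system might use a DOI
    locator: str    # where in the source the text sits (page, section)
    text: str

def words(text: str) -> set[str]:
    """Lowercase word set used for the toy ranking below."""
    return set(re.findall(r"[a-z]+", text.lower()))

def retrieve(question: str, passages: list[Passage], k: int = 1) -> list[Passage]:
    """Rank passages by words shared with the question and keep the top k."""
    q = words(question)
    ranked = sorted(passages, key=lambda p: len(q & words(p.text)), reverse=True)
    return ranked[:k]

def answer_with_provenance(question: str, passages: list[Passage], k: int = 1) -> dict:
    """Return the retrieved evidence together with the records it came from."""
    hits = retrieve(question, passages, k)
    return {
        "evidence": [p.text for p in hits],
        "provenance": [(p.source_id, p.locator) for p in hits],
    }

# A two-passage collection built from the examples used in this post.
collection = [
    Passage("turnbull-1959", "p. 416",
            "If I have seen further it is by standing on ye shoulders of Giants."),
    Passage("einstein-1905", "pp. 639-641",
            "Does the Inertia of a Body Depend Upon its Energy-Content?"),
]

result = answer_with_provenance(
    "Who wrote about standing on the shoulders of giants?", collection
)
print(result["provenance"])  # [('turnbull-1959', 'p. 416')]
```

Because the source record travels with the text from the moment of retrieval, the reference is a by-product of how the answer was assembled instead of something matched to it afterwards.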
The foundational large language model (LLM) absorbs as much content and language as it can find to build a model for comprehending human communication, drawing out meaning from the connections between words and concepts. In that process, it absorbs and stores information, and it can regurgitate that in responses. Similarly, most people can connect Albert Einstein with the formula E=mc². Few can explain what that formula represents and means. Fewer still can reference the original paper that proposed the concept in 1905: “Ist die Trägheit eines Körpers von seinem Energieinhalt abhängig?” [Does the Inertia of a Body Depend Upon its Energy-Content?] in the publication Annalen der Physik on pages 639–641. Those who are familiar with the publication will recognize, however, that the formula itself wasn’t put forward in that publication, even though it is regularly referenced as its source. Like most humans, an LLM can confidently report that Einstein and the formula are connected, but without any definitive reference to the source.
If a human were able to read every article published in a field, instantly recall each item and link back to the source she read, that might be a perfect citation machine, combining human comprehension and automated recall. That is not, however, what machines do. AI tools do not “understand” what they are reading in the way humans do, at least insofar as we understand how human cognition works. What we do know is that human brains are not computing vector weights and comparing the statistical differentials between those weights in n-dimensional space. This is what machines are doing at scale, when they “read” content and attribute it to a source.
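For readers who want a feel for what comparing vectors in n-dimensional space looks like, here is a toy Python sketch using cosine similarity, one common way of measuring how closely two vectors point in the same direction. The four-dimensional vectors are invented for illustration; real models learn vectors with hundreds or thousands of dimensions, and I am not claiming any specific system uses exactly this comparison.

```python
import math

def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # similarity is undefined for a zero vector; treat as unrelated
    return dot / (norm_u * norm_v)

# Made-up vectors; the numbers carry no meaning beyond this example.
embeddings = {
    "einstein":   [0.9, 0.1, 0.3, 0.0],
    "relativity": [0.8, 0.2, 0.4, 0.1],
    "measles":    [0.0, 0.9, 0.1, 0.7],
}

close = cosine_similarity(embeddings["einstein"], embeddings["relativity"])
far = cosine_similarity(embeddings["einstein"], embeddings["measles"])
print(close > far)  # True
```

The calculation says that two items sit near each other in the model's space. It says nothing about which paper, page, or author established the connection, which is exactly the gap between association and provenance.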
The connection can be made through different architectures, but the choice has a profound impact on the provenance, the resulting reference, and its connection back to the source. The expectations researchers, scholars and science-grounded users of AI tools have for how these systems incorporate these concepts will have profound implications for our community as a whole. Recognition for a person’s ideas is the most common form of compensation that authors of research receive for sharing their outputs. Without direct attribution, those authors lose the incentives to write. Usage has also been fundamental to value assessments of content subscription costs and the return on investment calculations related to open distribution for the past three decades. Citation collection and analysis have been a foundational part of research assessment, despite their flaws, since the 1960s.
Building verifiability and source documentation into generative AI systems is essential to creating trust in these systems. For research, engineering, and clinical applications, verifiability and deterministic responses are requisite. If generative AI systems cannot accurately trace the claims in their outputs back to their sources, giving attribution, credit, and usage reporting in return, they will break the value chains and trust that underpin the scholarly communication ecosystem.
In the next post in this series, the co-organizers and I will describe a meeting hosted by Cambridge University Press and Assessment, COUNTER, and NISO last month, where these interconnected topics were discussed, along with the vital questions of tracking usage and assessment of content that is a source for generative AI systems. I’ll then focus specifically on potential work to help address the provenance and attribution problems.