Metadata are scholarly publishing’s favorite vexation, our equivalent of the underperforming local sports team or ne’er-do-well relative – easy to criticize, easy to fix in the abstract (“stakeholder X just needs to do Y”), but stubbornly hard to improve in reality.
Part of the problem is that some metadata are well established (e.g., articles being associated with their publishing journal), some are well established but error-prone (e.g., authors’ institutional affiliations), and others are patchy and incomplete (e.g., funding information for individual articles). Yet our expectations of high-quality metadata extend to these patchier fields too – giving the sense that most metadata are wrong and that metadata in general are in a state of perpetual crisis.
I argue here that the way we view metadata assertions – as a binary attribute that is either ‘right’ or ‘wrong’ – does not reflect the underlying reality. Moreover, this view inhibits the adoption of approaches that would fill in desperately needed metadata for many other fields (such as funder information for public datasets).
We need to move to a probabilistic framework for metadata. In this framework, metadata are not ‘right’ or ‘wrong’. Instead, each assertion is accompanied by a statistic expressing our confidence that the assertion is correct. High quality metadata have very high levels of confidence. Metadata from less reliable sources or that have been inferred from circumstantial evidence have lower levels of confidence.
Let’s first walk through an (imaginary) example of how metadata are inherently probabilistic.
We have in hand a single-authored article that credits ‘NSF’ for funding. We could at this point assert that the US agency the National Science Foundation is the funder – after all, the NSF is a massive funder of research and ~90% of the articles listing ‘NSF’ are indeed funded by the National Science Foundation. In the absence of any other information, we’d be right about the National Science Foundation being the funder nine times out of ten. To put it another way, we’d have 90% confidence that the National Science Foundation funded this piece of research.
Let’s add in more metadata: the author of this article is affiliated with a Norwegian institution. Although funding by the US National Science Foundation is still possible, it’s now much less likely – only 0.7% of articles with a Norwegian affiliation were funded by the US NSF in 2023. The ‘N’ in ‘NSF’ is now more likely to stand for ‘Norwegian’ or ‘Norges’.
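To make this updating step concrete, here is a minimal Bayes’-rule sketch. The 90% prior comes from the example above, but the two likelihoods for observing a Norwegian affiliation are invented purely for illustration:

```python
# Illustrative Bayesian update for the (imaginary) 'NSF' example.
# The prior is the 90% figure from above; the two likelihoods are invented.

prior_us_nsf = 0.90  # P(funder is the US NSF | acknowledgements say 'NSF')

# Hypothetical likelihoods of the author having a Norwegian affiliation:
p_norway_given_us_nsf = 0.002   # rare among US NSF-funded single authors
p_norway_given_other_nsf = 0.5  # common among the other 'NSF' funders

# Bayes' rule: P(US NSF | 'NSF' and Norwegian affiliation)
numerator = prior_us_nsf * p_norway_given_us_nsf
posterior_us_nsf = numerator / (
    numerator + (1 - prior_us_nsf) * p_norway_given_other_nsf
)

print(f"Confidence that the funder is the US NSF: {posterior_us_nsf:.1%}")
# With these illustrative numbers, confidence drops from 90% to under 4%.
```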
Another piece of metadata – the article title – bears this out, as it’s a study of head injuries in Norwegian skiers. The funder ‘NSF’ in this case is most probably the Norwegian Ski Federation (Norges Skiforbund). A search of their grants database reveals that an individual with the same name and affiliation as the author received a grant from the Norwegian Ski Federation several years ago.
The first take-home is that the intersection of three different metadata fields (the initials ‘NSF’, the author affiliation, and the entry in the grants database) led us to the right funder for the article. Each of those fields is not very informative on its own, but the probability that a study listing ‘NSF’ funding – one that is a) about skiing injuries, b) authored by a Norwegian researcher, and c) written by a researcher who holds a Norwegian Ski Federation grant – is nevertheless not funded by the Norwegian Ski Federation is very low.
The second take-home is that we could still be wrong. Errors arise in many ways.
For example, this author might actually have intended to write ‘MSF’, or neglected to provide their current (perhaps US-based) affiliation. Alternatively, we could have been misled by the entry in the Ski Foundation grants database, as that grant is actually to someone else with the same name at that institution. The chance of each individual error is often small, but in combination these errors can significantly undermine our certainty in the assertion. Forcing the assertion into a yes/no binary state brushes the uncertainty under the rug, where it could trip up some unfortunate future user of our metadata.
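As a rough illustration of how individually small error sources add up, here is a sketch that treats them as independent – the error rates themselves are invented for the purpose of the example:

```python
# How several small, independent error sources erode confidence in an assertion.
# The individual error rates below are hypothetical, chosen only for illustration.

error_rates = {
    "author wrote 'NSF' but meant another acronym": 0.02,
    "listed affiliation is outdated or incomplete": 0.03,
    "grants-database match is a namesake": 0.01,
}

confidence = 1.0
for source, p_error in error_rates.items():
    confidence *= (1 - p_error)  # assumes the error sources are independent

print(f"Combined confidence in the assertion: {confidence:.1%}")
# ~94% - noticeably short of certainty, even though each error looked negligible.
```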
To move away from this binary state, scholarly metadata should instead come with a provenance (the name of the entity that made the metadata assertion) and a confidence term (a measure of that entity’s confidence in the assertion). This structure has a number of benefits:
- Many organizations (particularly those using AI) could contribute new metadata about scholarly outputs. For example, DataSeer detects many instances of articles either generating or re-using online datasets. Since human curation is impossible at scale, these assertions about article–dataset connections contain AI-related errors, and we do not currently pass them on to (for example) DataCite because we know that some are incorrect. A probabilistic metadata framework would allow us to signal our level of confidence in each assertion and so make a more useful contribution to scholarly metadata. Contributions from a wider range of organizations (particularly those using AI) will be crucial because they will expand metadata into new territories and fill in major gaps – the fact that some of their assertions carry lower confidence is outweighed by our ability to quantify and monitor new aspects of research.
- Attaching confidence terms to metadata also allows (expert) users to adjust their tolerance for errors according to their needs (see the sketch after this list). For example, a survey aiming to capture as many articles re-using datasets as possible would tolerate more false positives than a survey quantifying a researcher’s output for a particular time period.
- Confidence terms would also enable contributor organizations or even researchers themselves to identify parts of the academic research graph where adding other metadata fields would increase confidence. In the (imaginary) example above, metadata from the grants database boosted our confidence that the ‘NSF’ mentioned in the acknowledgments was indeed the Norwegian Ski Federation. In the absence of confidence terms, it is impossible to locate these areas of scholarly metadata. A good example is described here.
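To illustrate what this could look like in practice, here is a minimal sketch of assertions carrying provenance and a confidence term, and of users filtering them to their own tolerance for errors. The field names, identifiers, sources, and confidence values are all hypothetical rather than an existing schema:

```python
# A minimal sketch of probabilistic metadata assertions. Each assertion carries
# its provenance (who asserted it) and a confidence term. Every field name,
# identifier, and number here is a hypothetical illustration, not a real schema.

assertions = [
    {"subject": "doi:10.1234/example.1", "predicate": "funded_by",
     "object": "Norwegian Ski Federation", "provenance": "publisher-supplied",
     "confidence": 0.98},
    {"subject": "doi:10.1234/example.2", "predicate": "reuses_dataset",
     "object": "doi:10.5678/dataset.9", "provenance": "AI text-mining",
     "confidence": 0.62},
]

def filter_by_confidence(assertions, threshold):
    """Keep only the assertions that meet a user's chosen confidence threshold."""
    return [a for a in assertions if a["confidence"] >= threshold]

# A broad survey accepts everything; a high-precision use case sets a high bar.
print(len(filter_by_confidence(assertions, threshold=0.0)))  # 2
print(len(filter_by_confidence(assertions, threshold=0.9)))  # 1
```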
Major metadata organizations such as DataCite already have confidence terms for some assertions, but these are hidden within their internal systems. Bringing them into public view would raise broader awareness that scholarly metadata are inherently probabilistic and could – maybe – usher in a new way to connect scholarly outputs.
Discussion
Thanks for this interesting take, Tim – but I’m a bit surprised that you didn’t mention the critical role of persistent identifiers in helping remove a lot of the sort of uncertainty you describe (e.g., if a ROR or Crossref Funder ID had been used for NSF, the correct funder would be known). Making provenance information about the metadata associated with PIDs readily available is also critical, as you say; happily this is increasingly happening, for example, through ORCID’s Trust Markers (see https://info.orcid.org/interpreting-the-trustworthiness-of-an-orcid-record/).
Hi Alice – you’re absolutely right, I should have put in more about PIDs as these do massively reduce (but not remove!) the uncertainty around a piece of metadata.
Tim, Alice, I had the same thought about PIDs, but I think one point the post is trying to make is that there’s a lot of existing metadata out there that hasn’t used PIDs, and in essence could be cleaned up, connected, or linked to enhance the scholarly record and make the research FAIR (Findable, Accessible, Interoperable, and Reusable) and machine readable?
Adopting PIDs early in workflows – such as adding drop-downs for ROR institutional IDs and FundRef during the submission process – is great, but practically, for some publishers, making these workflow changes takes time and can be a major process change. Having said that, I think most service providers and peer review systems do get the value of PIDs early in the workflow, but that’s not going to change all the historic metadata out there that doesn’t include PIDs – perhaps that’s a separate blog post and challenge?
Alice, I agree that PIDs are very helpful, and in Tim’s model, as in the ORCID discussion, an assigned PID also has an associated confidence level. The case of the acronym “NSF” is interesting because it is so common. Unfortunately, there are thousands of records in Crossref and DataCite where this acronym has been associated with the wrong funder ID somewhere along the data journey (see https://doi.org/10.59350/cnkm2-18f84). Community metadata corrections with associated confidence levels would help us all decrease the impact of these errors.
Thanks for highlighting the inherent uncertainty associated with any kind of metadata enrichment via inference. Beyond the cases mentioned, this of course also applies to well-established processes such as identifying citations and disambiguating author names.
What you propose is something I would welcome; however, the issue I see with the proposed approach is that confidence scores are inherently subjective, in the sense that they are deeply tied to the particular model used to generate a metadata assertion. Calibrating these scores so that they are on a reasonable scale is difficult enough, and comparing them across systems seems to me even harder. Thus, resolving conflicting assertions coming from different data providers will not be an easy task. It reminds me of the problem of merging results from federated searches using different search engines, which never worked satisfactorily.
Whatever the attribution of “NSF” means in this example, I have very strong doubts that anyone can assign a probabilistic value to it that means anything more than their guess. What’s important is to have metadata about the metadata: not just where the “NSF” attribution originally came from, but also, if you’ve done research and from context you think it’s likely to mean “Norwegian Ski Federation”, who made that judgement and why? Probabilistic values are scientistic: what would really be more useful to people is some form of free-text metadata based on a Wikipedia model, where people add successive annotations and conclusions that do not displace the original, or at least a documentation of the process by which someone made these judgements for a database.
Hi Rich – thanks so much for this comment. I should start with the caveat that this is largely a ‘think piece’ that’s intended to put forward a different way to approach metadata, rather than an implementable framework.
That said, it may not be too hard to come up with some simple probabilities for the NSF example. Suppose a journal published 500 articles in the past two years: fifty of those were funded by the US National Science Foundation, and three others by different funders with the acronym NSF. The probability that the next article to be published is funded by the National Science Foundation is then 10%, and the probability that it is funded by a different funder with the acronym NSF is 0.6%.
One can then keep subdividing the corpus to arrive at other probabilities: What proportion of the articles with a corresponding author with a Norwegian affiliation are funded by the US National Science Foundation? And so forth…
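A minimal sketch of that arithmetic, using the toy counts above – the Norwegian-affiliation counts are invented simply to show how subdividing the corpus refines the estimate:

```python
# Toy corpus from the reply above: 500 articles, 50 funded by the US National
# Science Foundation, 3 by other funders using the acronym 'NSF'.

articles = 500
us_nsf = 50
other_nsf = 3

p_us_nsf = us_nsf / articles        # 0.10  -> 10%
p_other_nsf = other_nsf / articles  # 0.006 -> 0.6%

# Subdividing the corpus: restrict to articles whose corresponding author has a
# Norwegian affiliation (hypothetical counts, purely for illustration).
norwegian_articles = 40
norwegian_us_nsf = 1
p_us_nsf_given_norway = norwegian_us_nsf / norwegian_articles  # 0.025 -> 2.5%

print(f"P(US NSF): {p_us_nsf:.1%}")
print(f"P(other 'NSF' funder): {p_other_nsf:.1%}")
print(f"P(US NSF | Norwegian affiliation): {p_us_nsf_given_norway:.1%}")
```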
Tim, I really like this idea (and appreciate the positioning as a thought piece). I’ve come to a similar conclusion in a related area of matching organisation names – for example, pulling in usage data from a variety of disparate sources (each of which may use a variety of PIDs or text fields to list the name) and trying to map them to a consistent organisation. (And this example could be repeated with many other types of metadata). While we all strive for the goal of using PIDs, the reality is obviously messier, and it’s often not feasible to invest the time/cost in cleaning up data. The huge volumes involved often mean that purely automated solutions are the only practical option. Perfect is the enemy of good enough in these situations, and so the idea of taking an automated best guess and assigning a confidence level is very appealing. Different users of the data can then use that confidence level to determine how to use the data – e.g., user A is happy to take it all, while user B only uses names with a 90%+ confidence level.