There is an elephant in the scholarly infrastructure room and, while some are ready to talk about it generally, few want to describe that elephant in all its glorious detail. That elephant is the guidance that organizations provide about the use of persistent identifiers in our community. At present, the guidance is too vague. It needs to be specific, at least at a high level, for the national and international mandates to be most effective.
The August 2022 OSTP “Nelson” Memo laid out in general terms what it would take for content to be ideally publicly available. This included when content should be released, and also its form and structure, suggesting that content should be made accessible in a structured form (i.e., XML or similar) along with associated “Digital Persistent Identifiers” (DPIs) and metadata. DPI is the OSTP memo’s language; these are more commonly referred to as persistent identifiers (PIDs). Because the memo provides guidance for the numerous agencies impacted by the new policy so that they can craft their own plans, it didn’t provide explicit instruction on what those DPIs should be or the exact structure of basic metadata. It is anticipated that the affected agencies will then put forward their own specific plans, due to be submitted by February, for implementing these principles.
Within the sphere of scholarly communications there is a common understanding of the value of PIDs, metadata, and this infrastructure in improving scholarly communication. There have been studies about their value, even in domains where their application might not be obvious. It is past time that we all agree on a core set of identifiers and basic metadata elements and begin to encourage researchers to use them at scale when communicating their results. To facilitate this, funders, publishers, and systems providers need to ensure ease of use and seamless interoperability so as not to create barriers to adoption.
Persistent identifiers should be unique within the context of their domain. This is part of the guidance provided by the ISO Principles of Identification (ISO/TS 22943). The introduction to that Technical Specification cautions that, “Any community of practice should carefully consider, and be appropriately cautious in adopting, any proposal that increases the number of identifiers used to deal with similar populations of referents.” This point bears specific focus, lest we get into the very familiar situation captured so well by XKCD (see featured image). Sadly, this has already happened and I fear it will continue to happen if agencies and other funders are not more explicit in their mandates for the use of PIDs in scholarly communication.
It is widely acknowledged that when people in our community say “Digital Persistent Identifier for people”, they are referring to ORCID. When people say “Digital Persistent Identifier for articles”, that is generally understood to mean a CrossRef DOI. “Digital Persistent Identifier for institutions” is increasingly understood to mean a ROR identifier. A “Digital Persistent Identifier for a data set” is ideally a DataCite DOI. For physical samples, that Digital Persistent Identifier is ideally the International Generic Sample Number (IGSN). For basic metadata, the 15 core elements that comprise the Dublin Core (or the ISO version) are generally sufficient as a baseline for almost every research output. This is not an exhaustive list of the PIDs that are consistently applied. Arguably, the more detailed that list gets, the more problematic some communities will find the guidance.
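To make the baseline above concrete, here is a minimal sketch in Python. It pairs each kind of entity with the identifier named in the paragraph above, and it checks that a metadata record uses only the 15 Dublin Core element names. The function name `unknown_elements`, the structure of the record, and the DOI `10.1234/example` are hypothetical and chosen only for illustration; this is not a format that any agency or registry prescribes.

```python
# The 15 elements of the Dublin Core Metadata Element Set (ISO 15836).
DUBLIN_CORE_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)

# Kind of entity -> identifier the community generally means,
# as listed in the paragraph above.
RECOMMENDED_PIDS = {
    "person": "ORCID",
    "article": "CrossRef DOI",
    "institution": "ROR",
    "data set": "DataCite DOI",
    "physical sample": "IGSN",
}


def unknown_elements(record):
    """Return the keys of `record` that are not Dublin Core elements."""
    return sorted(key for key in record if key not in DUBLIN_CORE_ELEMENTS)


# Hypothetical record: the DOI is made up; the ORCID iD is the sample
# iD that ORCID publishes for the fictitious researcher Josiah Carberry.
example_record = {
    "title": "An example data set",
    "creator": "https://orcid.org/0000-0002-1825-0097",
    "publisher": "Example University",
    "date": "2022-11-30",
    "type": "Dataset",
    "identifier": "https://doi.org/10.1234/example",
}

if __name__ == "__main__":
    print(unknown_elements(example_record))
    print(RECOMMENDED_PIDS["institution"])
```

A submission system could run a check like this before accepting a record, then prompt for the recommended identifier whenever an entity is described only by a free-text name.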
Are there other identifiers that can lay claim to being a digital identifier in each of these domains? Yes, absolutely, there are. For institutions, there are ISNI, GRID, DUNS, Ringgold, Legal Entity Identifier (ISO 17442), VIAF, WorldCat Identities ID, organizationally unique identifier (OUI), or even a WikiData identifier (among many others). Similarly, for people there is a broad list of other identifiers, such as ISNI, VIAF, WorldCat Identities, Scopus Author IDs, or even U.S. social security numbers and Facebook IDs. Each of these has its own purpose, market niche, and use case, though certainly with overlaps. EMBL’s European Bioinformatics Institute (EMBL-EBI) runs the identifiers.org service, which provides information on 805 identifiers (as of today) used in scholarly publishing. Many of these identifiers are very specific to their domain, such as PaleoDB, which provides identifiers and taxonomic data for “plants and animals of any geological age”, the Addgene collection for information on plasmids, and Software Heritage, which provides a universal archive of software source code.
Will promoting a select set of universally adopted and recognized identifiers mean these other PIDs don’t have value or meaning? No, absolutely not. Each identification structure and system has its own role in the community, and specialist services are appropriate and needed for some communities. The notion that there is a universal set of identifiers that can apply in most cases need not imply that this set will work for every situation and every use case. Realistically, this is an example of the Pareto principle at play. For the overwhelming majority of cases, the existing and adopted structures of scholarly exchange will work best. The ambiguity of the edge cases shouldn’t preclude us from stating (ideally, mandating) the obvious for the 80+% of cases where these identifiers can readily be applied. In fact, it will be more of a problem if we don’t limit the growth of new PIDs where others already exist, before the list extends even further.
Institutional IDs provide an illustrative example. Back in the early 2000s, Ringgold was formed as a service to unambiguously identify institutional subscribers to scholarly content. The Ringgold ID and the company’s institutional taxonomy were widely used by publishers to clean and maintain their subscriber lists. In 2012, when ISO published the International Standard Name Identifier (ISNI), the potential for using this system for institutions was already clear and within scope of the system. Although ISNI had originally been envisioned as an identifier for creators of content, primarily for the use case of tracking royalties, it became clear that “creator” could encompass any range of parties, including people, pseudonyms, groups of people (bands), or even corporations. The NISO Institutional Identification (I2) recommendation was published in 2013 and, noting the publication of ISNI, recommended its use for this purpose. In 2013, Ringgold was among the first registration agencies for the ISNI system. Although most of the institutions assigned Ringgold IDs now also have ISNIs, the two are not the same. Meanwhile, Digital Science developed and provided the community a public domain release of its institutional service, GRID. Building upon this, members of the PID community led by California Digital Library, CrossRef, and DataCite came together to launch the Research Organization Registry (ROR), which was initially seeded with the public domain data from GRID. While GRID is no longer publicly available, it has forked from ROR and remains in use within Digital Science systems, and other systems as well. Ideally its use will continue to be deprecated. Some argue that the openness and community governance of ROR justify the investment in creating and managing the system, which indeed they might.
A number of tools and services now exist to connect this network graph of identifiers and outputs, so that the interconnected world of science can be navigated. As one example of this, specific to organizations, OpenAIRE has developed the OpenOrgs Database, which can be used to “address the disambiguation of organizations.” One would think this would already have been solved by the proliferation of organizational IDs with different applications; instead, this growth seems only to have driven the need for more tools to solve the identification problem. Ideally, settling on a single approach would reduce overall costs and improve interoperability in the ecosystem.
What we may need is a community consensus document that funders can reference. Rather than funding organizations mandating these decisions, which they seem reluctant to do, the community would reach agreement that “what we mean when we say ‘Digital Persistent ID for ______’ is _____”. The funders would then simply be following what most people understand as the most reasonable path forward. Of course, the consensus can, and probably should, include the caveat that there are domains of research activity that need their own special snowflake of an identifier. However, these cases should be limited in both scope and domain application.
Ideally, more organizations — publishers, as well as funders — should not just suggest this as is mostly the case now, but also make these types of identifiers and metadata mandatory, with limited exemptions for the edge cases that do exist. Of course, mandatory doesn’t necessarily mean that every researcher will have to memorize their organization’s ROR ID. Ideally, the systems will be integrated with APIs and lookup tools to manage this ID assignment and verification process with minimal effort by the researcher during the submission or application process. The lack of investment here has hindered wider adoption and application of PIDs. It would be extremely helpful if funders and publishers were to use the influence that they have to bring the community closer to consistent application of these infrastructure elements.
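As one illustration of how a system could take on the verification step so that researchers do not have to, the sketch below checks an ORCID iD offline before a submission form accepts it. It relies on the check character that ORCID documents for its iDs (ISO 7064 MOD 11-2). The function names are hypothetical, and this is only a format and checksum test under those assumptions: it does not confirm that the iD is registered or that it belongs to the person submitting.

```python
import re

# Bare ORCID iD: four groups of four characters; the last may be "X".
ORCID_PATTERN = re.compile(r"^[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{3}[0-9X]$")
ORCID_URL_PREFIX = "https://orcid.org/"


def orcid_check_character(base_digits):
    """Compute the ISO 7064 MOD 11-2 check character for 15 digits."""
    total = 0
    for digit in base_digits:
        total = (total + int(digit)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(value):
    """Return True if `value` is a well-formed iD with a correct checksum.

    Accepts either the bare iD or the full https://orcid.org/ URL.
    """
    value = value.strip()
    if value.startswith(ORCID_URL_PREFIX):
        value = value[len(ORCID_URL_PREFIX):]
    if not ORCID_PATTERN.match(value):
        return False
    characters = value.replace("-", "")
    return orcid_check_character(characters[:15]) == characters[15]


if __name__ == "__main__":
    # Sample iD that ORCID publishes for the fictitious Josiah Carberry.
    print(is_valid_orcid("0000-0002-1825-0097"))
    # The same iD with a mistyped final digit.
    print(is_valid_orcid("0000-0002-1825-0098"))
```

A check like this catches typing errors at the point of entry. Confirming who the iD belongs to would still require a lookup or an authenticated sign-in against the registry itself.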
Daniel Sepulveda, Senior Vice President at Platinum Advisors, spoke on the path forward for this agency guidance at the Charleston Conference earlier this month, during a panel on the OSTP memo and its implications. After the submission of the plans and their review, the likeliest next stage will involve the Congressional committees for Commerce in the Senate, and in particular the Senate Subcommittee on Space and Science, and its counterpart in the House of Representatives. These committees will be involved in shaping the policy environment through their oversight and appropriations authority, and will have a significant impact on the eventual activities of the respective agencies. We therefore have a great opportunity to organize the community’s core understanding of these issues, perhaps through a consensus document that can be included in these policies by reference, before the plans become too settled and activity turns to the legislative process. The reason for this is simple. As Daniel said, “To the people in Congress, an orchid is a flower,” and they have no idea what an ORCID is apart from a misspelling of the word for a flower. Our community needs to come together and express in policy guidance exactly what we mean, which is to say a small set of PIDs should be applied and we should name them specifically as “ORCID” (and CrossRef DOI and… and… and…) before someone from these committees and outside of our community insists that we use “DPIs” that they have defined and decided are most appropriate.
This extends well beyond the situation with the OSTP guidance and the plans being developed by U.S. Federal Agencies. This guidance should be the same in the EU and the UK, each with their funding mandates, as well as with guidance issued by cOAlition S and the Open Research Funders Group, and any other funders of research. One can easily envision a scenario where some significantly large domain or country decides that the existing scholarly infrastructure of PIDs and metadata isn’t quite right for them, and they create their own identifier for people or outputs in that domain or region. The unnecessary confusion, programming, and resource deployment required to navigate the growing network of information will be significant. In the end, it won’t help anyone. To slightly misquote the XKCD cartoon, “Soon … There will be 15 competing PIDs.”