Editor’s Note: Today’s post is by Jamie Carmichael, Jessica Thibodeau, and Chef Roy Kaufman. Jamie brings nearly 20 years’ experience in publishing to her current role as Senior Director, Information & Content Solutions at CCC. Jessica is Senior Director, Information and Content Solutions at CCC, responsible for the strategic direction of CCC’s Ringgold portfolio and go-to-market efforts for products and services across the scholarly publishing ecosystem.
As scholarly communication rapidly adapts to seismic shifts in open science, technology, and culture, a renewed focus has emerged on metadata and persistent identifiers (PIDs) — about people, places, and objects — as an essential component of a vibrant industry. At the US policy level alone, leveraging metadata to accelerate industry transformation is a common theme across the Nelson Memo and recent Requests for Information from the NIH and the Department of Transportation.
Scholarly research is complex and interconnected; change in one area can spark improvement or deterioration throughout the ecosystem. By way of example, consider the role of PIDs in open access (OA) funding entitlements. OA management platforms rely on metadata elements, particularly organizational PIDs passed from upstream submission and peer review systems, to automate the process of matching manuscripts with potential funding sources. This typically happens at article acceptance, and increasingly at submission, eliminating manual administration for authors as well as supporting publishers, institutions, consortia, and funders in achieving OA at scale.
In order to perform a health check on organizational IDs, in 2021, we reviewed cross-publisher records of institutional affiliation and/or funder data in our OA workflow tool, RightsLink for Scientific Communications. We discovered that 82% of accepted manuscripts included such data, which was an improvement over prior years. However, these statistics masked an ugly truth; namely that in many cases those manuscripts used institutional email domains as a proxy for funding or discount eligibility instead of a PID. And within the 18% that carried no PID, missed funding opportunities created unnecessary work (and payments) for authors, institutions, and publishers to reconcile retroactively.
Even if CCC — either alone, or with its partners and publishers — were able to close these metadata gaps at acceptance of manuscripts, this is late in the process and advantages of PIDs earlier in the research lifecycle would be lost. Solving for metadata gaps would be more effective in upstream systems of record so the tail doesn’t wag the dog. This is precisely why we’ve encouraged the NIH to consider the grant application process as an early opportunity to mandate PIDs and cascade to other systems underpinning the research lifecycle, for example, Current Research Information Systems (CRISs).
But where to start? Let’s face it, PIDs are a wonky topic and we need to communicate to people who are not naturally interested in the intricacies of, e.g., ISNI and Ringgold. But these people will care if they know that lack of PIDs can lead to lack of funding. In order to break this down, we recently talked with dozens of stakeholders and mapped a range of metadata challenges through an OA lens. We built on an existing body of work to visualize the ripple effect of a fragmented metadata supply chain. The result is an interactive report of the research lifecycle designed to offer everyone a deeper understanding of the state of scholarly metadata in 2023. Though the issues are numerous, they are not insurmountable, and much infrastructure exists to support change.
About the State of Scholarly Metadata: 2023
Working with Media Growth Strategies, we interviewed representatives from institutions, publishers, funders, researchers, service providers, PID providers, and industry associations to capture a broad view of the current state of metadata and PIDs across the ecosystem. We asked questions such as:
- Who should create and maintain metadata? Where should it originate?
- What resources do you invest to create, curate, or maintain various types of metadata?
- What are your biggest challenges when it comes to metadata management and/or use of PIDs?
- What are the most critical metadata elements?
- What’s at stake if these elements don’t persist through scholarly communications?
- Who should own metadata quality and control?
Here is what they said about the costly implications of metadata breakages and complexities across the research lifecycle:
- Researchers: There was overwhelming consensus among stakeholders that researchers shoulder a significant administrative burden to assert or re-assert data (e.g., institution affiliation, funder ID), ultimately disrupting and delaying scientific discovery.
- Institutions: Because of metadata inconsistencies throughout the research lifecycle, institutions deploy labor-intensive workarounds to manually reconcile funding eligibility and APC billing, as well as normalize unstructured data across disparate systems for comprehensive analysis.
- Funders: Missing metadata (e.g., registered grant DOIs, institution affiliation) makes it difficult and costly to link funding to research outputs, presenting potential barriers to open access uptake, problematic impact tracking, and incomplete analysis to inform future investments.
- Publishers: Metadata breakages interfere with business transformation initiatives, contributing to high operational and opportunity costs and complicating fulfillment of open access agreement terms and analysis of deal performance to inform future decisions.
Many stakeholders we interviewed recognize that new metadata strategies, inclusive policies, and a robust framework of interoperable systems are essential for modernizing this element of scholarly communications. It’s also clear that an ecosystem-wide commitment to improving data quality across all groups will facilitate the transition to open while helping to preserve research integrity, expand discoverability, and improve impact measurement. If the industry works collectively to shrink these gaps by reexamining metadata policy and practice, stakeholders will undoubtedly feel less pain. Or, we can continue the current system of entropy, friction, and frustration. Together, we can decide our path.
Discussion
6 Thoughts on "The State of Scholarly Metadata: 2023"
Couple of reflections:
1. Wow! Don’t APCs add significant complexity (and therefore cost) to an already complex system; no wonder it’s painful for stakeholders.
2. This model is for research that ends up published in journals. It would be super interesting to see the equivalent for research that’s published as grey literature (where the current state of metadata is, to put it politely, in a real ‘state’, aka a typical teenagers’ bedroom).
Thanks for this, Roy – I look forward to digging into the report itself. In the meantime, I wanted to add that, while it’s easy to see PID mandates as part of the solution, they only work if it’s easy for everyone involved to comply – and if everyone can see the benefits. At the moment, neither of those things are the case across the board – for all sorts of legitimate reasons, including not enough integrations with key systems, lack of consistent outreach to researchers, limited understanding/support for/investment in PIDs at the leadership level, associated metadata that is poor/missing/incomplete, etc. I fully agree that registering PIDs and adding metadata as early as possible in the process is the right way to go (and I believe ARDC’s Research Activity Identifier, RAid – see https://ardc.edu.au/services/ardc-identifier-services/raid-research-activity-identifier-service/ – will be a big help there). But I think our first step should be to focus on making PIDs work better and ensuring they are easier to implement for everyone!
So sorry- I should have said many thanks to Jamie and Jessica too!
Thanks very much for this feedback Alice and for your contributions to the study early on. We hope this work can help support discussions focused on ways to make PIDs work more effectively and always welcome feedback to that end.
Fantastic work done by the CCC team, in highlighting the persistent bottlenecks we all need to address to make research discoverability and for the potential of the scholarly body of literature and data to actually be searched through meaningfully for knowledge building as well as transfer towards societal benefits, implementation, and product development where applicable; let alone increase efficiency in the publishing process, of course, which will liberate funds to where they are desperately needed.
Referring back to my previous TSK article on mapping OS resources (https://scholarlykitchen.sspnet.org/2023/04/19/guest-post-mapping-open-science-resources-from-around-the-world-by-discipline/), of which open infrastructure and PIDs, of course, are integral components – as is this tool mapping metadata gaps (https://kumu.io/access2perspectives/open-science#disciplines/by-os-principle/the-state-of-metadata) -, see an overview of all kinds of research-related and research-applicable PIDs at https://kumu.io/access2perspectives/open-science#disciplines/by-resource-type/persistent-identifier?focus=1
Thanks so much Jo. We look forward to continued discussions on these important topics. And kudos on the great work you are doing to map the open science resources.