In its earliest days, scholarly publishing grew out of exchanges among people seeking to distinguish themselves from the charlatans and supposed alchemists who made wild claims about transmuting materials into gold and the like. People had to prove their results, and they shared them so that others could replicate them. Later, formal peer review began as a way to formalize vetting prior to publication and thereby enhance trust. Today, hundreds of years later, it is technically possible to turn lead into gold, albeit at far greater expense than simply buying gold outright. Unfortunately, and for different reasons, we are returning to an era in which validation and verification need to become central concerns of our community. A variety of research integrity issues have been the focus of much attention, and scientific communication needs to return to an even more skeptical footing. We would do well to take a lesson from IT security and move scholarly communications to a zero-trust framework.

In recent years, the information security community has focused on the concept of trustless security. Zero trust assumes that no implicit trust should be granted to assets or users based solely on their physical or network location (such as internal networks or the wider internet) or on asset ownership (whether enterprise or personally owned). In a zero-trust architecture, systems are built on the presumption that anyone or anything interacting with them may have been compromised and must therefore validate and prove its identity throughout its interactions with the network. No one is presumed to have privileged access, and every engagement should be verified, monitored, and auditable for security purposes.
One of the original sins of the internet (you’d be amazed how many sins there are!) is the lack of verifiable identity for network interactions. Verifiable systems for credentials and identity management, needed to prove provenance and authorization, were missing in the first days of the early computer networks. In a small lab where few people had access to the network, or even to the devices attached to it, these were not seen as issues worth building IT infrastructure to manage (nor was the technology in its early days anywhere near robust enough to handle them). Why would you need to validate the people on the network when you knew them all personally? Those days are, or at least should be, long gone.
In information security, momentum has grown behind a framework called Zero-Trust Architecture (ZTA), defined by the National Institute of Standards and Technology (NIST) security team in 2020. The model developed out of years of discussion around network security dating back to at least 2010. Recognizing that scholarly publishing is facing a range of security-related issues, Chef alumnus David Smith remarked during the STM Future Labs meeting in December that perhaps we should consider the entire STM publishing ecosystem a zero-trust environment. Building on David’s suggestion, I thought it important to bring to readers’ attention both what a zero-trust security framework is and how it might apply to scholarly communications processes.
There are many elements to a zero-trust architecture, and no single correct or complete set of principles defines the entire framework. Since NIST articulated it five years ago, it has been applied in a variety of contexts and built upon. The basic principles can be distilled to a few core elements (a brief sketch of how they combine in practice follows the list):
- Identity curation – How identity validation takes place, and what authorizations it grants, needs to be a significantly more robust process. Varying levels of access tied to identity and role management need to be implemented and maintained, with both on-boarding and off-boarding of identities occurring regularly.
- Least privilege – Users at all levels should have privileges and authorizations to do a limited set of tasks. For a given activity, users should be authorized with the minimum level of privilege to conduct a task, with higher-level authorization preserved for when that access is necessary.
- Continuous verification – Even if credentials were previously validated, that shouldn’t guarantee continuing access. Credentials should be reverified over time and at key interaction points to ensure sessions haven’t been hijacked.
- Implement dynamic policies – Policies should adapt robustly as circumstances change, whether through policy shifts or behavioral and environmental attributes. For example, because the threat landscape constantly evolves, heightened security should be easy to implement at moments of elevated threat.
- Automation and orchestration – The user experience of a zero-trust framework needs to facilitate compliance, not drive users toward security-compromising workarounds, such as taping passwords to a device or choosing incredibly simple (and therefore hackable) passwords. Similarly, networked resources, whether people or assets, need to support secure engagement.
- Device access control – Network actors must be assured of who has access to which device and under what circumstances. Systems should be implemented on networks to ensure device access hasn’t been compromised.
- Monitor and measure integrity through analytics – Monitor more closely what is taking place on a network, including the security posture of its elements, by analyzing patterns of activity across services and assets.
- Presume breach – Always assume networked systems have already been breached and act accordingly. This includes regular identity validation, monitoring of behavior, asset engagement, and device management as discussed above.
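To make these principles concrete, here is a minimal sketch, in Python, of how a policy decision point might combine several of them: least privilege, continuous verification, auditable monitoring, and presume breach. All names, fields, and thresholds are illustrative assumptions, not drawn from NIST SP 800-207 or any real product.

```python
from dataclasses import dataclass
import time

# Illustrative sketch only: names, fields, and thresholds are assumptions,
# not drawn from NIST SP 800-207 or any real product.

REVERIFY_AFTER_SECONDS = 15 * 60  # continuous verification: re-check every 15 minutes

@dataclass
class Session:
    user_id: str
    roles: set            # roles granted during identity on-boarding
    last_verified: float  # epoch seconds of the last credential check
    device_trusted: bool  # result of a device posture check

# Least privilege: each action maps to the minimal role that permits it.
ACTION_ROLES = {
    "read_manuscript": "reviewer",
    "edit_metadata": "editor",
    "publish_version": "production",
}

def authorize(session: Session, action: str) -> bool:
    """Presume breach: deny unless every check passes, and log every attempt."""
    required = ACTION_ROLES.get(action)
    if required is None or required not in session.roles:
        return deny(session, action, "insufficient role")
    if not session.device_trusted:
        return deny(session, action, "untrusted device")
    if time.time() - session.last_verified > REVERIFY_AFTER_SECONDS:
        return deny(session, action, "stale credentials; re-verify")
    log(session, action, "allowed")  # monitoring: every decision is auditable
    return True

def deny(session: Session, action: str, reason: str) -> bool:
    log(session, action, f"denied: {reason}")
    return False

def log(session: Session, action: str, outcome: str) -> None:
    # Minimal audit record: timestamp, actor, action, outcome.
    print(f"{time.time():.0f} user={session.user_id} action={action} {outcome}")

s = Session("rev42", {"reviewer"}, last_verified=time.time(), device_trusted=True)
authorize(s, "read_manuscript")  # allowed
authorize(s, "publish_version")  # denied: least privilege
```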
Some aspects of a zero-trust framework map easily onto publishing processes we already know well: content creation, protection, and secured access to resources. The subscription side of content distribution has long been acquainted with trying to secure access and prevent ‘leakage’ of content outside the walled gardens of content hosting systems. Attempts at securing content with digital rights management (DRM), content watermarking, and various authentication protocols are all examples of asset protection. User authentication, first with passwords and then via IP addresses, is now moving toward identity federations, with tools like SeamlessAccess. It should be taken for granted that IP-based authentication, a long-standing practice, is extremely problematic: user-friendly, but a gaping security flaw. Eventually, access controls will move further in the direction of digital identity wallets, user tracking technologies, and embedded content provenance features. Essentially, these are all advances in content security, which is one component of a zero-trust environment.
One might think these problems are resolved by moving to open access publication. However, that reflects a simplistic view of the zero-trust architecture, which encompasses the entire ecosystem, not simply the point at which a service is delivered. In an open access publishing model, subscriber access is only the one point in the process where authentication and authorization drop away. Thinking more broadly: how can the authenticity of an author’s identity and affiliation be ascertained? During peer review, how can reviewers be assured that images, data sets, or results haven’t been tampered with to conform to some hypothesis? Within the network, how can readers be assured that the content hosted on a publisher’s servers, or in a library’s repository, hasn’t been manipulated? If content is removed and hosted elsewhere, is it still valid? If we want to ensure reader privacy, are we certain that communications about content are securely delivered? Each of these security issues touches on other elements of a ZTA that apply to our ecosystem, even where authorization and access control are absent.
While there is momentum to remove access controls on outputs, taking some forms of content out of the realm of security and identity management, this is not the only reason identity is important. Even if the output is free, there is reason to control, or at least verify, the inputs. Beyond this, not all content, tools, or services will be freely available to everyone. If researchers want to post content to arXiv, they must first be vouched for by colleagues in the domain through its endorsement system. The supercomputing resources of an institution might be free to some members of a community, one that may even span institutions, but access isn’t open to the general public. Some content might be sensitive and should not be broadly available for various reasons, including national security, privacy, and cultural sensitivity.
Recently, my organization received a note regarding an author who claims someone impersonated them and submitted a fake paper in order to smear them. They sought NISO’s help in rectifying the situation. I can’t speak to the veracity of the claim or the players involved, and NISO is in no way involved in the situation, so we won’t be engaging. Without lending credence to the individual’s claims: absent more secure identity management, such things can happen, or claims of them can provide a convenient excuse for something that did in fact happen. Everyone should be aware that spoofing, spearphishing, and identity capture all regularly happen on the internet. Information security is a serious problem, and impersonation is just one example of the challenges. One would think there are more nefarious things to do with hacked credentials than submit papers to smear a person’s reputation; then again, people are creative in their maliciousness. People are also known to do problematic things when submitting a paper, and claiming “someone else did it; it wasn’t me” might be a convenient defense that avoids more serious consequences. The fact remains that without verifiable identity, there is no real telling what happened.
Last December, the MoreBrains Cooperative, with the support of the STM Association, produced two important reports related to this concept. The first is on trusted author identity. The report highlights that “the scholarly record [is] at risk, [and] there is an urgent need to strengthen identity verification without creating barriers that hinder legitimate researchers.” Anyone can set up an ORCID account and be assigned an identifier, which many systems rely on for identity verification. Even within that system, one could easily create fake information on a record. How would anyone know whether I attended Johns Hopkins, UCLA, or MIT (only one of which I actually did)? Certainly, the institution would know, but if I assert that I have, outside of a formal job application process or security screening, is that information confirmed? Similarly, if I say my affiliation is with NISO, one might easily validate that. However, if I claim an affiliation with the University of Chicago or Microsoft (which I don’t have), how would that be vetted? If someone submits a paper and claims a connection to some prestigious institution to gain an advantage, there hasn’t been a process to validate those claims. None of this is a critique of ORCID or how it has been architected, simply of how it is being used in our ecosystem. Movement toward creating an ecosystem of trust markers based in the ORCID system could address this and should be adopted. Similarly, as the STM guidance suggests, simply using institutionally validated email addresses might be another start.
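As a hypothetical illustration of how a submission system might check assertion provenance, the sketch below queries ORCID’s public API for the employment section of a record and reports who asserted each affiliation. The pub.orcid.org endpoint is real, but the exact JSON field names should be treated as assumptions to verify against the current API documentation.

```python
import requests

def employment_sources(orcid_id: str) -> None:
    """Print each employment affiliation on a public ORCID record along with
    the party that asserted it (the researcher, or a trusted organization)."""
    url = f"https://pub.orcid.org/v3.0/{orcid_id}/employments"
    resp = requests.get(url, headers={"Accept": "application/json"}, timeout=10)
    resp.raise_for_status()
    # Field names below follow the v3.0 schema as I understand it; verify
    # against the current ORCID API documentation before relying on them.
    for group in resp.json().get("affiliation-group", []):
        for summary in group.get("summaries", []):
            emp = summary.get("employment-summary", {})
            org = emp.get("organization", {}).get("name", "unknown")
            source = emp.get("source", {}).get("source-name", {}).get("value", "unknown")
            print(f"{org}: asserted by {source}")

# An affiliation added by the institution itself (a "trust marker") shows the
# organization as the source; a self-asserted one shows the researcher's name.
# employment_sources("0000-0002-1825-0097")  # ORCID's fictional test researcher
```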
Another area where provenance could be important, even integral, to the scientific process is in validating research outputs. One of the drivers behind the research data sharing movement has been to allow others to validate the results of an experiment. In reality, data too can be faked: there are several notorious examples of fabricated data supporting research articles, and many examples of altered images that have led to retractions.
This is the focus of the second important report released by the MoreBrains Cooperative, on content provenance and image integrity as solutions for detecting image falsification. The report examined several image integrity systems and assessed the feasibility of each approach. Importantly, it recognized that fitting the scholarly ecosystem into existing or developing technical approaches makes sense, but that the social and adoption challenges are greater than the technical ones (although technical problems still exist). One potential solution is to automatically connect resulting data or images back to their originating device. In the consumer electronics world, an effort to advance a metadata model for content provenance, the Coalition for Content Provenance and Authenticity (C2PA) model, is gathering momentum. It will be interesting to see whether the scholarly community adopts variants of C2PA for connecting research devices with outputs, for example.
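To illustrate the underlying idea (though not the C2PA specification itself, which defines a much richer, standardized manifest format), here is a minimal sketch using Python’s cryptography library: a capture device signs a digest of the image at acquisition, and anyone downstream can detect subsequent manipulation.

```python
import hashlib
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

# Simplified illustration of the signing idea behind content provenance.
# C2PA defines a full manifest and certificate chain; this is not that format.

device_key = Ed25519PrivateKey.generate()  # would live inside the capture device
public_key = device_key.public_key()       # registered somewhere verifiable

def sign_capture(image_bytes: bytes) -> bytes:
    """At capture time: the device signs a digest of the raw image."""
    return device_key.sign(hashlib.sha256(image_bytes).digest())

def verify_capture(image_bytes: bytes, signature: bytes) -> bool:
    """Later (e.g., in peer review): any edit to the bytes breaks the check."""
    try:
        public_key.verify(signature, hashlib.sha256(image_bytes).digest())
        return True
    except InvalidSignature:
        return False

original = b"...microscope image bytes..."
sig = sign_capture(original)
print(verify_capture(original, sig))         # True
print(verify_capture(original + b"x", sig))  # False: tampering detected
```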
A valid critique of this architecture is that it could invade readers’ intellectual freedom by imposing tracking of their behavior and thereby impinge on their privacy. Certainly, this could be the case. However, it is possible to disconnect identity verification, authentication, and user behavior on a network. This is a core principle of federated identity within single sign-on systems. In this model, a user is authenticated in one environment, and the credentials, which can be anonymized, are then shared with a service provider. The process of identity verification and the process of delivering services or content can be kept distinct and separated. This separation needs to be a requirement of how these systems interact, and thus structured in from the outset, but it is possible.
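As a sketch of how that separation can work: federated protocols support pairwise pseudonymous identifiers, in which the identity provider derives a different opaque ID for each service provider, so a user can be recognized on return visits without any two services being able to correlate their behavior. The HMAC-based derivation below is an illustrative assumption, not any particular federation’s scheme.

```python
import hashlib
import hmac

# Held only by the identity provider; never shared with service providers.
IDP_SECRET = b"idp-private-salt"  # illustrative placeholder

def pairwise_id(user_id: str, service_provider_id: str) -> str:
    """Derive a stable, opaque identifier unique to a (user, service) pair.
    The service can recognize a returning user without learning who they are,
    and two services cannot link their records for the same person."""
    message = f"{user_id}|{service_provider_id}".encode()
    return hmac.new(IDP_SECRET, message, hashlib.sha256).hexdigest()

print(pairwise_id("alice@university.edu", "publisher-platform"))
print(pairwise_id("alice@university.edu", "aggregator-service"))  # different, unlinkable
```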
If the community is to move forward with a zero-trust architecture, an awareness of and commitment to privacy protection should be a core element of implementation. This would include minimization of data collection, gathering only the user attributes essential to providing a service. Privacy-preserving access control would also reduce risk through distributed access control models that avoid centralized surveillance points. Logging policies should be documented, and users should be informed about what data is collected and how they can access their log data. If AI models are used for predictive analytics to score risk, care should be taken to ensure that results are explainable, human-monitored, and free from bias. Of course, along with privacy impact assessments, these elements are core to regulatory compliance with policies like GDPR and CCPA.
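A hypothetical sketch of what data minimization might look like in a logging layer: only a pseudonymous identifier and a coarse resource class are recorded, with a documented expiry. The field names and 30-day retention window are assumptions for illustration, not a compliance recommendation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Illustrative assumptions only: the field choices and the 30-day retention
# window are examples of data minimization, not a compliance recommendation.

@dataclass
class AccessLogEntry:
    pseudonymous_id: str   # e.g., a pairwise ID as sketched above, never a raw identity
    resource_class: str    # coarse category ("journal-article"), not the specific title
    timestamp: datetime
    purge_after: datetime  # documented retention limit, disclosed to users

def minimal_log(pseudonymous_id: str, resource_class: str) -> AccessLogEntry:
    now = datetime.now(timezone.utc)
    return AccessLogEntry(pseudonymous_id, resource_class, now, now + timedelta(days=30))
```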
The scholarly publishing process (manuscript submission, peer review, editing, distribution, and preservation) involves multiple actors, intellectual property, and a high demand for security and integrity. ZTA principles can be applied to enhance security, protect data, and build greater trust in the entire ecosystem. They apply to ensuring author identity in the submission process, as well as to securing pre-publication research, virus checking, and peer-reviewer anonymity. Long-term preservation and content authenticity are also elements of the process that can be strengthened by applying ZTA to scholarly publishing. Fundamentally, ZTA principles aim to establish trust in the identity of the people on a network, the content they interact with, and the security of the network interactions that take place. Adopting them more robustly could help address some of the challenges we are presently facing.
Discussion
Thanks Todd, for this informative overview of ZTA. There is a lot to respond to, but we’ll address some of your points related to how ORCID is being used in the research ecosystem. It’s true that the centralized trust model hasn’t been working for a while; in fact, we’d argue that any model based on a single point of identity verification is flawed and likely to become a focus for circumvention. ORCID supports a decentralized trust model, or what might be termed a “web of trust.” You are correct in stating that ORCID allows researchers to self-assert claims about their identity and scholarly track record; this is important for inclusivity and to avoid erecting barriers for new scholars or those who work without formal institutional backing. However, baked into ORCID’s architecture from the start is the ability for trusted organizations to add their own assertions about a researcher’s identity, affiliation, and research outputs to ORCID records. This always happens with the researcher’s permission, and we stringently record the provenance of each assertion so that users of our data can see “who said what about whom.”
Let’s take your example. I can see on your ORCID record that you indeed attended Johns Hopkins, but the provenance metadata accompanying that data tells me that you added it yourself. Had an organization added it, it would carry a “trust marker,” shown as a green check mark in the ORCID interface. If I were reviewing your ORCID record while processing your application for, say, a grant proposal, I would not automatically assume you were lying about it. As we tell people, there are MANY legitimate reasons for someone to self-assert data on their records: perhaps the organization they’re referring to hasn’t added that data to their record, or the organization no longer exists, or they’re recording a historical affiliation. Instead, if I were reviewing your record, I would take your Record Summary as a whole and see that F1000 has added Peer Review Data to your record, as did Patterns, via the Elsevier Editorial Manager. Of course, this wouldn’t automatically confirm that your Johns Hopkins affiliation data was valid, either. But it is a signal that you are a legitimate researcher with a track record of contributions. The more organizations that have added validated data to your ORCID record, especially over a long period of time, the more likely you are to be who you say you are.
We have never advocated the use of a “bare” ORCID record alone as a signifier of validated identity. Instead, it’s necessary to look at the data on the record itself, and its provenance, and make your own decisions about whether it is sufficient for you to believe that the identity of the researcher is true, or whether you need to take other, perhaps more invasive, identity verification steps for your particular use case. Last year we implemented a record summary view to make it easier to view the green tick “trust markers” a record contains and make decisions based upon them.
The good news is that we are making strong progress. With the help of our community, we’ve already reached the point where nearly half of all active ORCID records contain at least one trust marker. With the rapid take-up of our recently-released validated professional email domain functionality, we expect this to continue to increase in the coming year.
In our view, organizations across the scholarly ecosystem have a responsibility to disseminate the verified information they hold about researchers — whether that’s affiliation, publication or funding information. That is why we advocate for our member organizations to contribute the data they hold about their researchers into their ORCID record, so that that data, accompanied by trust markers, can effectively be propagated throughout the other systems that researcher interacts with. It’s all of our responsibility to contribute to this Community Trust Network.
Chris,
Thank you very much for the detailed response. The post was already long enough without going into detail about the excellent trust marker work that ORCID has been doing – certainly something that is worth another post as well! As noted, STM has a full 29-page report on this topic.
ORCID is a great foundational system, upon which trust can be built, but having an ORCID alone isn’t sufficient. As you note, the validation of the information in the record provides markers of trust that people can begin to rely on. What you describe, a network of trust, is an excellent approach. Though, it should be noted, it isn’t the only approach. For a variety of reasons, institutions may wish to use their own approach to identity management, which should be fine, so long as the systems interoperate sufficiently to support user engagement across the ecosystem. An interoperable blend of technologies and strategies is needed to build out a zero-trust ecosystem for scholarly publishing.