The journal brand has proven to be the great intangible asset of the scholarly publisher. It signals trust and authority to authors and readers alike. So even as libraries came to license bundles rather than discrete titles and users came to discover and access content through platforms, publishers have worked hard to defend the journal brand and extend it, for example through cascades and author workflow integrations. The version of record, which publishers typically control exclusively, has been their vehicle for doing so. But everywhere you look, the version of record is declining in relative importance, as interest grows in preregistration, datasets, preprints, source code, and protocols, among other elements of the scholarly record. Looking ahead, we see real tensions emerging in how the scholarly record will be structured and who will have ownership and control over it.
What are the opportunities and challenges as publishers seek to extend the reach — and value — of their journal brands by supporting research materials beyond the version of record? Digging into the evolving context of preprints and research data offers valuable clues.
Preprints are inherently neither beneficial to the scholarly communication domain nor threatening to the established publishing networks. As Oya has analyzed, preprint services face a range of policy issues involved in building trust and durable business models in a contested information environment.
One strand of preprints, and the original motivation for this format, is in the informal sharing networks among research communities. arXiv is a well-known example of this model, starting with high-energy physics and then extending to a variety of adjacent fields in physics and mathematics. SSRN began with a similar approach focused on business and law. Services like these quickly became important components of the community infrastructure for scholarly communication in the fields that they covered. Initially, publishers entered into an uneasy truce with these preprint communities, recognizing that they could become a source of potential competition depending on how they developed.
More recently, as Oya and Roger analyzed in the spring, an alternative vision for preprints has emerged, one pursued by all of the major commercial publishers, among others. In this new model, publishers are promoting preprints but at the same time working to domesticate them, bringing them within their article submission workflows and linking preprints and versions of record in a way that will over time serve to deprecate the ability of the former to disrupt the latter. By repositioning preprints less as part of a global research community (for example, for high-energy physics) and more as directly linked with journal brands, publishers hope they will reinforce the existing value proposition. It remains to be seen how this vision will dovetail with, or perhaps over time impede, the mandate of community-based preprint services such as arXiv and bioRxiv to provide publisher-neutral platforms, decoupling the early sharing of research from the formal publishing stage in a way that enables authors to avoid having their findings associated exclusively with specific journals.
If anything, the landscape for research data is more complicated than that for preprints. It has come to include domain-specific structures, cross-institutional generalist structures, and increasingly substantial institutional investments. There are also some interesting new models developing for dataset discovery and capturing datasets within records associated with researcher identity.
The research data landscape is currently characterized by a vast array of domain-specific repositories. Many of these were developed from the ground up through the work of what Danielle has termed scholar-led data communities, which share a certain type of data, typically across disciplinary and institutional boundaries. Some data communities persist over decades while others may emerge and dissipate more quickly in response to specific research directions and specific societal needs, and we’ve profiled a number of both established and emerging data communities at Ithaka S+R. It is typically best practice for a researcher to use domain-specific repositories whenever possible in recognition of the importance of maintaining close relationships between the data and the scholars who use it.
While institutional data repositories have not emerged as a dominant approach for scholars to deposit their data, there is growing investment by individual universities to address enterprise needs for data security and compliant storage for both administrative and research data, along with an array of other institutional research data services. Institutional models are far less a factor for preprints. There is also a movement to provide data curation services cross-institutionally through the expertise and leadership of information scientists, such as through the work of the Data Curation Network in the U.S. and Portage in Canada.
Many publishers have a keen sense of the growing importance of sharing research data (2020 was, after all, the year of data) but have struggled to understand whether research data will become a meaningful part of their business. The research data landscape includes a number of well-known generalist repositories, some of which are owned by publishers. Partnering closely with the generalist repositories makes sense given the cross-institutional and cross-disciplinary nature of publisher infrastructure at scale, as well as the prospect of linkages into publisher workflows. Thus far, however, few seem to be pursuing models that would incorporate research data into publisher-specific services and workflows, as they have been doing with preprints, or other mechanisms for extracting value from them. Perhaps this is in recognition of the massive complexity of research datasets, which raise everything from privacy and other ethical concerns to questions of metadata description and standardization, and which are far more difficult to handle than the long-derided but ever-durable PDF.
An emerging strategic direction is for publishers to focus on ensuring data sharing compliance. DataSeerAI is a promising example of how a tool can be built out for publishers to offer better services in this space. Another approach is to exert tighter control over data sharing policy, as evidenced by how some publishers are involved in advocating for specific repository selection criteria. COAR offers a critique of this move, arguing that it will enable publishers to set a bar for data repository compliance that will privilege criteria that only their own commercial, generalist-oriented offerings can meet.
In contrast, few publishers have built strong partnerships with data communities, and even fewer have identified models to enable, support, or even provide services to data communities. In these respects, among publishers, scholarly societies may have an advantage in being able to connect data communities and other relevant research records with their publications. The trend of some publishers preferencing generalist repositories and seeking to more tightly control repository selection criteria also arguably will not help to foster the broader development of data communities and may even serve to impede it, given the need for researcher communities to take the lead in defining and delimiting how data sharing is useful for them. Brian Nosek also reminds us that too strongly intertwining data sharing with publisher interests runs the risk of exacerbating publication bias, in contrast to more expansive approaches to data sharing, which encourage researchers to share data regardless of the results.
The scholarly record is fracturing, as shown by these twin examples of preprints and research datasets. Publishers are pursuing an effort to integrate preprints into their workflows and value propositions, but whether they will succeed in doing so remains to be seen. They seem to be far less certain of how to similarly integrate research data, which does make sense given that datasets correspond less directly to the published article than does a preprint.
To truly engage with other research artifacts from a workflow perspective, publishers need not only to invest in bilateral connections with the version of record but also to develop a network with the researcher, the laboratory or other research team, and the research community more broadly. Only a few major publishers appear to have either the scope or the field-specific depth to take on such a project. Perhaps a white label service is needed.
For the publishing sector, this fracture seems to pose challenges. Those parties that are concerned about consolidation and profit margins in publishing might see in these challenges an opportunity. As a thought exercise, even if perhaps an unrealistic one, we wonder what it would look like to make a large-scale capital investment in promoting the fracture. Might scholarly societies or others interested in stewarding research communities find a way to promote a refactored scholarly record?