In recent years, mechanisms for sharing and preserving research data have grown considerably. But the landscape is crowded with divergent models for data sharing, and because these approaches are poorly distinguished in much of the discourse, it can be confusing to navigate. Some are driven by the needs of science, some by business strategy. Today, I propose that two fundamentally competing visions are emerging for sharing research data.
The publications vision
One vision is highly publisher-centric. It involves efforts to connect datasets to the publications with which they are associated. The publications vision can take several forms. First, it is mandated by many funders and publishers, sometimes out of an interest in reproducibility. Second, it is being developed by publishers with a strategic view of how the publication system is evolving.
Research funders have a substantial interest in data sharing. Ensuring the reproducibility and replicability of published results is a vital priority, one that clearly deserves greater attention and investment. Building in some cases on what they perceive to have been their successes in driving the open access movement through deposit mandates, several major funders have been initiating data deposit mandates.
When associated with reproducibility, funder data deposit mandates are sometimes triggered by the publication of a research article and are associated with the datasets underlying that article. The Gates Foundation policy is that “data underlying the published research results be immediately accessible and open.” The Howard Hughes Medical Institute (HHMI) policy is focused on the reproducibility of published results. It states that its grantees must “make data…that are integral to the publication available to other academic and nonprofit scientists for research purposes…on reasonable terms…unless [they] can readily be generated…or…obtained from third parties on reasonable terms.” In such cases, data that are not associated with a publication appear to be exempt from data-sharing requirements, which makes sense insofar as there are no published findings to reproduce. Of course, given what is known about publication bias and the failure to publish null results in many fields, this may be a significant limitation of these policies.
In other cases, publishers have mandated data deposit. PLOS created an early such mandate, insisting that data underlying all articles it published going forward would be shared. At the time, David Crotty praised PLOS in The Kitchen for this “bold move” in creating such a “forward-looking policy.” Many other publishers have since followed suit. In some cases, journal policies are structured to steer deposits towards data repositories affiliated with the publisher, while in other cases they are structured to steer at least some deposits towards data communities, as described below.
These approaches connect datasets to articles that have been brought through the editorial and peer review process of their journals. As a result, they provide a kind of “filter” for what data will be deposited and preserved, with the editorial and peer review process for the journal publisher serving by proxy as the curatorial mechanism for the datasets.
These article-centric approaches can be contrasted with the strategies being pursued by some other publishers (or, if you prefer, workflow and analytics companies). If the article as the principal artifact of the research project is to give way to a series of research components (dataset, protocol, code, and preprint, in addition to the final article of record), then today’s publishers have an interest in “publishing” the entire set of research components. Some publishers have invested in repositories of their own. This vision is one way of understanding the extensive workflow tool investments of Elsevier (with the data repository that it has branded “Mendeley Data”) and also Digital Science (with figshare).
The community vision
The second vision is far less well capitalized but seems to be more research-centric. This vision has been well described by my colleagues Danielle Cooper and Rebecca Springer: “A data community is a fluid and informal network of researchers who share and use a certain type of data.” The point of a data community is to reuse the research data of colleagues and others in a field of common interest, with the goal of building upon one another’s work.
Many data communities are instantiated in repositories. For example, it is the community norm for researchers working on fruit fly genetics to deposit their data into FlyBase. In these communities, there are clear standards and expectations for format, metadata, and so forth. Data community members may typically share, discover, and reuse a higher share of the data they gather through their repositories, in a trusted peer setting, than other researchers do through funder mandates or through the publication model. At the same time, data communities are less invested in open sharing, and not all their repositories are open access. In a data community, the shared interest of the community provides implicitly for curation.
In some cases, funders have spearheaded the development of and researcher participation in data community repositories. NIH has an array of data sharing policies that cover different types of projects and datasets, with some 86 repositories for many different data communities available for use. NSF’s policy states that “Investigators are expected to share with other researchers, at no more than incremental cost and within a reasonable time, the primary data, samples, physical collections and other supporting materials created or gathered in the course of work under NSF grants.” For example, the NSF has funded Duke and Penn State universities to create MorphoSource, a data community for high fidelity 3D data on morphological phenomes.
Curiously, data community repositories have not yet benefited widely from shared infrastructure. In the preprints landscape, SSRN and the Center for Open Science have each created infrastructure that has been deployed to create numerous preprint “servers.” Shared infrastructure might similarly be an enabler for data communities.
That said, the barriers facing emergent data communities are not exclusively technical. Instead, as we see in the cases of spinal cord injury, literary sound recording, and zooarchaeology, the most difficult challenges are reaching a critical mass of contributions and finding sustainable sources of funding that allow these communities to be maintained over time. Learned societies have a variety of interests in and opportunities to foster data sharing, and some may be unusually well positioned to facilitate and support data communities.
A model that is in some ways related to data communities is the data journal. The “articles” in data journals typically provide an overview of data collection methodology, sometimes along with notes about how the data might be used, and many data journals offer peer review of submissions. They also provide the means for accessing the data themselves, sometimes through a link to a repository and in other cases by simply publishing data tables. Some believe that data journals provide an incentive for data sharing because they can be claimed as “peer reviewed journal articles” on a CV, and in some cases these data journals have obtained notable journal impact factors, but it remains unclear whether hiring, tenure, and promotion processes value data “articles” equally with traditional research articles. Nature’s Scientific Data provides an overview of the rationale for publishing in a data journal. Because data journals are designed to enable discovery and reuse of data for particular field-specific communities, they represent an alternative type of data community model.
Each of these two visions is fundamentally about integrating data into a workflow, whether that be a research community workflow or a publishing workflow. Neither is primarily intended to create a static “collection” of research data. But I also want to acknowledge that several important research data efforts initially viewed data through a collections prism. By this, I mean efforts where the number of datasets being deposited was a key indicator of success, with minimal curatorial support. While these data collections initiatives will not all disappear, over time those that thrive will become more connected with a research community, with a publishing workflow, or perhaps in some cases with both.
Some collections models have been cross-institutional in nature. Several of them, as my colleague Rebecca Springer and I recently assessed, are growing into real businesses. In some cases, these repositories are being connected into the publication workflow, but in many cases they enable research dataset deposit independent of any broader project or ultimate publication.
Within individual academic institutions, there have been a variety of institutional efforts to build data collections, whether piggybacking on institutional repositories or through more dedicated data repositories. MIT notes that with respect to dataset deposit in its DSpace repository, it is “recommended that datasets be documented sufficiently so that other knowledgeable researchers can find, understand and use the data.” By contrast, Oregon State University has a review and approval process for dataset deposits and offers options for the local deposit of research datasets. At the same time, enterprise-grade cloud storage options have supplanted some of the demand for preservation-level solutions among university research groups. Data librarianship has developed steadily, such that a recent Ithaka S+R analysis found that nearly three-quarters of R1 universities have at least one data librarian, while many have more than one, and some as many as ten. Today, there is some evidence that first-generation efforts to create institutional collections of datasets are giving way to more sophisticated curatorial services such as those offered through the Data Curation Network.
These approaches are not equivalent, and they deserve to be examined comparatively and scrutinized from both a scientific and a strategic perspective. It appears that data collections are fading, at least comparatively, with datasets increasingly connected to research or publishing workflows or to data communities.
It is clear that the reproducibility objective of the “publications vision” has significant merit in helping to ensure the integrity of the scientific record, and it appears that the “community vision” can be invaluable in advancing scientific and scholarly progress, at least in certain fields, through data reuse. While the two visions are not entirely mutually exclusive, the question clearly arises: Do we need two separate types of data sharing going forward?
Publishers have a clear interest in advancing the publications vision. Using the article as the basis for data sharing reinforces their value proposition. Building an entire research workflow that allows the article to give way to dataset, protocol, code, and other research elements without sacrificing a publisher’s value proposition may be strategically sound. Tremendous energy and investment are being directed to support the publications vision.
By contrast, data communities are difficult to create. But when successful they appear to be far more valuable to certain kinds of scientific progress. What more can be done to facilitate their development?
Funders may play a substantial role in addressing this question going forward. The split in approaches among funders mentioned above, with Gates and HHMI mandating the publications vision and NIH and NSF advancing the community vision, is notable. It remains to be seen whether any funders will choose to provide ongoing support for data communities in their areas of interest — or whether in treating them like start-up projects they will capitalize them and then leave them to struggle to sustain and maintain their work. In this respect, the choices made by the EU in what it terms the European Open Science Cloud will be especially interesting to watch in the coming years.
I thank Adam Chesler for inviting me to present at AIP’s library advisory committee about research data and the participants in that meeting whose robust discussion on an early presentation of this taxonomy improved it considerably. I also thank Danielle Cooper, David Crotty, Kevin Guthrie, Shana Kimball, Kimberly Lutz, Tim McGeary, and Oya Rieger for comments on a draft of this piece.