In recent years, mechanisms for sharing and preserving research data have grown considerably. But the landscape is crowded with divergent models, and because these approaches are poorly distinguished in much of the discourse, it can be confusing. Some are driven by the needs of science, some by business strategy. Today, I propose that two fundamentally competing visions are emerging for sharing research data.

Vasily Kandinsky, Landscape with Two Poplars, 1912, Art Institute of Chicago.

The publications vision

One vision is highly publisher-centric. It involves efforts to connect datasets to the publications from which they derive. The publications vision can take several forms. First, it is mandated by many funders and publishers, sometimes out of an interest in reproducibility. Second, it is being developed by publishers with a strategic view of how the publication system is evolving.

Research funders have a substantial interest in data sharing. Ensuring the reproducibility and replicability of published results is a vital priority, one that clearly deserves greater attention and investment. Building in some cases on what they perceive to have been their successes in driving the open access movement through deposit mandates, several major funders have been initiating data deposit mandates. 

When associated with reproducibility, funder data deposit mandates are sometimes triggered by the publication of a research article and are associated with the datasets underlying that article. The Gates Foundation policy is that “data underlying the published research results be immediately accessible and open.” The Howard Hughes Medical Institute (HHMI) policy is focused on the reproducibility of published results. It states that its grantees must “make data…that are integral to the publication available to other academic and nonprofit scientists for research purposes…on reasonable terms…unless [they] can readily be generated…or…obtained from third parties on reasonable terms.” In such cases, data that are not associated with a publication appear to be exempt from data-sharing requirements, which makes sense insofar as there are no published findings to reproduce. Of course, given what is known about publication bias and the failure to publish null results in many fields, this may be a significant limitation of these policies.

In other cases, publishers have mandated data deposit. PLOS created an early such mandate, insisting that data underlying all articles it published going forward would be shared. At the time, David Crotty praised PLOS in The Kitchen for this “bold move” in creating such a “forward-looking policy.” Many other publishers have since followed suit. In some cases, journal policies are structured to steer deposits towards data repositories affiliated with the publisher, while in other cases they are structured to steer at least some deposits towards data communities, as described below.

These approaches connect datasets to articles that have passed through a journal’s editorial and peer review process. As a result, they provide a kind of “filter” for what data will be deposited and preserved, with that editorial and peer review process serving by proxy as the curatorial mechanism for the datasets.

These article-centric approaches can be contrasted with the strategies being pursued by some other publishers (or, if you prefer, workflow and analytics companies). If the article as the principal artifact of the research project is to give way to a series of research components (dataset, protocol, code, and preprint, in addition to the final article of record), then today’s publishers have an interest in “publishing” the entire set of components. Some publishers have invested in repositories of their own. This vision is one way of understanding the extensive workflow tool investments of Elsevier (with the data repository it has branded “Mendeley Data”) and of Digital Science (with figshare).

The community vision

The second vision is far less well capitalized but seems to be more research-centric. This vision has been well described by my colleagues Danielle Cooper and Rebecca Springer: “A data community is a fluid and informal network of researchers who share and use a certain type of data.” The point of a data community is to reuse the research data of colleagues and others in a field of common interest, with the goal of building upon one another’s work. 

Many data communities are instantiated in repositories. For example, it is the community norm for researchers working on fruit fly genetics to deposit their data into FlyBase. In these communities, there are clear standards and expectations for format, metadata, and so forth. Data community members typically share, discover, and reuse a higher proportion of the data they gather through their repositories, in a trusted peer setting, than other researchers do through funder mandates or through the publication model. At the same time, data communities are less invested in open sharing, and not all their repositories are open access. In a data community, the shared interest of the community provides implicitly for curation.
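
To make these community curation standards a bit more concrete, here is a minimal sketch of what a structured deposit record for such a repository might look like. This is purely illustrative: the field names are hypothetical, loosely modeled on common repository metadata conventions, and do not reflect FlyBase’s actual submission schema.

```python
# A minimal, hypothetical sketch of a community deposit record.
# Field names are illustrative only; they do not reflect FlyBase's
# actual submission schema or any particular repository's API.
from dataclasses import dataclass


@dataclass
class DepositRecord:
    title: str
    creators: list[str]
    organism: str                  # community-specific field for a model-organism repository
    data_format: str               # an agreed-upon community format
    license: str                   # reuse terms; community repositories need not be fully open
    related_publication: str = ""  # optional DOI; community deposits may precede any article


# An example deposit conforming to this hypothetical community schema.
record = DepositRecord(
    title="Wing development expression profiles",
    creators=["A. Researcher"],
    organism="Drosophila melanogaster",
    data_format="FASTA",
    license="CC-BY-4.0",
)
print(record)
```

The substance of a real community schema is far richer and more domain-specific; the point is simply that the community itself, rather than a publisher, defines and enforces the curation standard.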

In some cases, funders have spearheaded the development of and researcher participation in data community repositories. NIH has an array of data sharing policies that cover different types of projects and datasets, with some 86 repositories for many different data communities available for use. NSF’s policy states that “Investigators are expected to share with other researchers, at no more than incremental cost and within a reasonable time, the primary data, samples, physical collections and other supporting materials created or gathered in the course of work under NSF grants.” For example, the NSF has funded Duke and Penn State universities to create MorphoSource, a data community for high-fidelity 3D data on morphological phenomes.

Curiously, data community repositories have not yet benefited widely from shared infrastructure. In the preprints landscape, SSRN and the Center for Open Science have each created infrastructure that has been deployed to create numerous preprint “servers.” Shared infrastructure might similarly be an enabler for data communities. 

That said, the barriers facing emergent data communities are not exclusively technical. Instead, as we see in the cases of spinal cord injury, literary sound recording, and zooarchaeology, the most difficult challenges are reaching a critical mass of contributions and finding sustainable sources of funding that allow these communities to be maintained over time. Learned societies have a variety of interests and opportunities to foster data sharing, and some may be unusually well positioned to facilitate and support data communities.

A model that is in some ways related to data communities is the data journal. The “articles” in data journals typically provide an overview of data collection methodology, sometimes along with notes about how the data might be used, and many data journals offer peer review of submissions. They also provide the means for accessing the data themselves, sometimes through a link to a repository and in other cases by simply publishing data tables. Some believe that data journals provide an incentive for data sharing because their contents can be claimed as “peer reviewed journal articles” on a CV, and in some cases these journals have obtained notable journal impact factors, but it remains unclear whether hiring, tenure, and promotion processes value data “articles” equally with traditional research articles. Nature’s Scientific Data provides an overview of the rationale for publishing in a data journal. Because data journals are designed to enable discovery and reuse of data for particular field-specific communities, they represent an alternative type of data community model.

Data collections

Each of these two visions is fundamentally about integrating data into a workflow, whether that be a research community workflow or a publishing workflow. Neither is primarily intended to create a static “collection” of research data. But I also want to acknowledge that several important research data efforts initially viewed data through a collections prism. By this, I mean efforts where the number of datasets being deposited was a key indicator of success, with minimal curatorial support. While these data collections initiatives will not all disappear, over time those that thrive will become more connected with a research community, with a publishing workflow, or perhaps in some cases with both. 

Some collections models have been cross-institutional in nature. Several of them, as my colleague Rebecca Springer and I recently assessed, are growing into real businesses. In some cases, these repositories are being connected into the publication workflow, but in many cases they enable research dataset deposit independent of any broader project or ultimate publication. 

Within individual academic institutions, there have been a variety of efforts to build data collections, whether piggybacking on institutional repositories or through more dedicated data repositories. MIT notes that with respect to dataset deposit in its DSpace repository, it is “recommended that datasets be documented sufficiently so that other knowledgeable researchers can find, understand and use the data.” By contrast, Oregon State University has a review and approval process for dataset deposits and offers options for the local deposit of research datasets. At the same time, enterprise-grade cloud storage options have supplanted some of the demand for preservation-level solutions among university research groups. Data librarianship has developed steadily, such that a recent Ithaka S+R analysis found that nearly three-quarters of R1 universities have at least one data librarian, and many have more than one, some as many as ten. Today, there is some evidence that first-generation efforts to create institutional collections of datasets are giving way to more sophisticated curatorial services such as those offered through the Data Curation Network.

Looking ahead

These approaches are not exactly equal, and they deserve to be examined comparatively and scrutinized from both a scientific and a strategic perspective. It appears that data collections are fading, at least comparatively, as datasets are increasingly connected with research or publishing workflows or data communities.

It is clear that the reproducibility objective of the “publications vision” has significant merit in helping to ensure the integrity of the scientific record, and it appears that the “community vision” can be invaluable in advancing scientific and scholarly progress, at least in certain fields, through data reuse. While the two visions are not entirely mutually exclusive, the question clearly arises: Do we need two separate types of data sharing going forward? 

Publishers have at least a modest interest in advancing the publications vision. Using the article as the basis for data sharing reinforces their value proposition. Building an entire research workflow that allows the article to give way to dataset, protocol, code, and other research elements without sacrificing a publisher’s value proposition may be strategically sound. Tremendous energy and investment are being directed to support the publications vision.

By contrast, data communities are difficult to create. But when successful they appear to be far more valuable to certain kinds of scientific progress. What more can be done to facilitate their development? 

Funders may play a substantial role in addressing this question going forward. The split in approaches among funders mentioned above, with Gates and HHMI mandating the publications vision and NIH and NSF advancing the community vision, is notable. It remains to be seen whether any funders will choose to provide ongoing support for data communities in their areas of interest — or whether, treating them like start-up projects, they will capitalize them and then leave them to struggle to sustain their work. In this respect, the choices made by the EU in what it terms the European Open Science Cloud will be especially interesting to watch in the coming years.


I thank Adam Chesler for inviting me to present at AIP’s library advisory committee about research data and the participants in that meeting whose robust discussion on an early presentation of this taxonomy improved it considerably. I also thank Danielle Cooper, David Crotty, Kevin Guthrie, Shana Kimball, Kimberly Lutz, Tim McGeary, and Oya Rieger for comments on a draft of this piece.

Roger C. Schonfeld

Roger C. Schonfeld is the vice president of organizational strategy for ITHAKA and of Ithaka S+R’s libraries, scholarly communication, and museums program. Roger leads a team of subject matter and methodological experts and analysts who conduct research and provide advisory services to drive evidence-based innovation and leadership among libraries, publishers, and museums to foster research, learning, and preservation. He serves as a Board Member for the Center for Research Libraries. Previously, Roger was a research associate at The Andrew W. Mellon Foundation.

Discussion

7 Thoughts on "Two Competing Visions for Research Data Sharing"

Roger, thanks for the overview. Re. your comment that “in the preprints landscape, SSRN and the Center for Open Science have each created infrastructure that has been deployed to create numerous preprint ‘servers.’ Shared infrastructure might similarly be an enabler for data communities.”

arXiv permits data sets to be deposited. I think, however, that it would be good for an arXiv (or, for that matter, an SSRN) to require or strongly encourage deposit of data, or perhaps better yet, a link to shared data sets referenced in a preprint.

My guess is that peer reviewers do not always address, at least thoroughly, the question of whether research data indeed support the conclusions of a scientific article. (Has this been studied?) If so, this is entirely understandable. After all, the glut of publishing makes it difficult to do thorough peer review. This is where preprints can come into play, in terms of providing a forum in which competing research groups can post (in preprint format) an analysis of the research data used to support scientific or social scientific claims. Data sets archived with preprints, or to which preprints point if they reside on distinct data servers, can always be absorbed into traditionally published journal articles and, in this sense, don’t compete with the latter format.
Preprints provide no panacea for the peer review crisis (too many pubs, too few reviewers) but could go some distance toward serving as a de facto peer review.
This is not to suggest that preprints that evaluate the work of others should replace traditional peer review tied to traditionally published journal articles, just that these formats serve distinct functions in the scholarly communication ecosystem and each has long historical antecedents in doing so.
On a distinct front, is it accurate to say that most attention to issues of replicability of data has emanated from the social sciences and, if so, why? I think, of course, of the work of Brian Nosek.

Brian, Thanks for your comments and questions.

Linking data to the preprint would be an interesting step forward, although I’m not aware of any preprint initiatives that mandate such deposit just yet.

As to issues of reproducibility and replicability, Brian Nosek definitely deserves outsize credit for his leadership in these areas, especially for the field of psychology but also well beyond. As the National Academies report linked early in my piece makes clear, this is an issue that touches all fields, well beyond the social sciences, and where great improvements are needed. Data availability is but one piece of this complicated puzzle.
https://www.nap.edu/catalog/25303/reproducibility-and-replicability-in-science

“Curiously, data community repositories have not yet benefited widely from shared infrastructure. In the preprints landscape, SSRN and the Center for Open Science have each created infrastructure that has been deployed to create numerous preprint “servers.” Shared infrastructure might similarly be an enabler for data communities.”

Roger, you are regularly anticipating a step ahead! We just released our first instance of a new product “OSF Collections” that aims to support emergence of data communities like you describe. Here is the first instance: http://osf.io/collections/metascience/ The general product release is coming soon.

Brian, are you anticipating at some point close integration of preprints and the data product you mention here?

Yes indeed. In the back-end, preprints, data, materials, and registrations are already integrated. Our focus at present is improving the front-end submission and discovery interfaces. Essentially, there are three flavors: Papers/preprints (http://osf.io/preprints), Collections, and Registries. The collections and registry interfaces will be like preprints — brandable, community-operated services. After those interfaces, we will be working on improving metadata and integrated workflows across the interfaces and research lifecycle that they support.

Thanks Brian (Nosek) – looking forward to seeing how this develops!

Thanks Roger—this is an important topic. Three questions:

(1) Are you aware of any data on uptake patterns? That is, how much are datasets in demand, in what fields, etc.? We all have this vague vision that datasets are important, but how are researchers using them (and are they using them the way we think they should, and if not, why)?

(2) Do you think your first grouping should be further divided into “useful” and “not so useful” categories? At least some community datasets have usability standards that make sense for the community—fields, formats, etc., as you note. For the rest, quality and utility seems to vary greatly. In fact, most of the study “datasets” I’ve seen have been borderline useless—just the numbers behind the charts, for instance. Missing is the underlying data (and methods and calculations) used to derive the numbers behind the charts. Is there any movement afoot to help clarify or set some expectations for what we mean by “data”? (And beyond this, maybe even best practices for formatting/access?—e.g., use Excel, no macros, put the key and assumptions on worksheet 1 of your workbook, label your worksheets to match the figures in your paper, etc.)

(3) Are we advocating for a future of either/or? There are also tools and services out there (like LabKey) whose business is to deploy sophisticated data management frameworks that enable better data sharing and integration. It’s like having a community dataset without the community. Otherwise, widespread data use/reuse is a pipedream if we’re relying on scanning across thousands of disparate Excel spreadsheets, each with its own format, units, etc. Is there any discussion happening about common data management frameworks?
