Research data are getting a lot of airtime at the moment. 2020 is the STM Association’s ‘Research Data Year’. The upcoming Peer Review Week focuses on ‘Trust’, and trust in research articles often requires access to the underlying data. There’s also been a flurry of action (or calls for action) from stakeholders, including CODATA’s Beijing Declaration on Research Data and global research institutions’ Sorbonne Declaration.
These declarations and initiatives largely focus on ensuring that research data are FAIR: Findable, Accessible, Interoperable, and Reusable. The FAIR data principles are the current goalposts for promoting open research data, and efforts are thus focused on a) ensuring that individual datasets have comprehensive, machine-readable metadata (e.g., a link to the protocol used to collect the data, details of the instruments used, the license under which the data were released), and b) developing a network of FAIR-compliant repositories to host all these datasets.
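To make point (a) concrete, here is a minimal sketch of what machine-readable metadata for a single dataset might look like, loosely modelled on the schema.org ‘Dataset’ vocabulary. The field names are illustrative rather than a complete or authoritative FAIR profile, and the dataset, DOI, and protocol URL are placeholders:

```python
import json

# An illustrative metadata record for one dataset. Field names loosely
# follow the schema.org "Dataset" vocabulary; the dataset name, DOI, and
# protocol URL below are hypothetical placeholders, not real resources.
metadata = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Example lab measurements",                # hypothetical dataset
    "identifier": "https://doi.org/10.xxxx/example",   # placeholder DOI
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "measurementTechnique": "https://example.org/protocol",  # link to protocol
    "variableMeasured": ["temperature", "reaction yield"],
}

# Serialising to JSON(-LD) yields a record that repositories and dataset
# search engines can index, which is what makes the data Findable.
record = json.dumps(metadata, indent=2)
print(record)
```

Note that the license and protocol links are part of the record itself: a dataset deposited without them may technically be accessible, but it is much harder to reuse.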
The FAIR principles are the community consensus answer to the ‘How’ question of data sharing, in that they describe best practice for how to share a particular dataset. Community consensus about anything is very welcome, but by themselves, the FAIR principles don’t have the leverage to bring more data into the public sphere and thereby achieve the manifold benefits of an open research data ecosystem.
For that, we also need a consensus answer to the ‘What’ question: for a given study, what datasets do the researchers need to share? This question is of fundamental importance because it underpins data sharing policies.
First, to comply with a policy, researchers must be clear about which datasets they need to deposit, which can be difficult to determine when their data are complex or pass through multiple stages between raw and analysis-ready form. Second, to enforce the policy, the stakeholder (typically a journal, funding agency, or research institution) has to be clear about which datasets should have been shared, so it can compare that list to the datasets the authors actually shared.
To state the obvious, not all datasets are equal. Consider a dataset collected in the lab on a weekend, written into a notebook and promptly forgotten by the researcher. Here, there is no practical way for the stakeholder to know that dataset ever existed, and thus no mechanism to prompt the researchers to share it. Other datasets are collected over years or decades by researchers from multiple institutions who are in turn funded by different agencies, such that it’s almost impossible to know which data policy applies or when the data should be released.
Funders, journals, and institutions alike need a common focal point for the ‘What’ question – a fundamental unit of research effort where we ask “have all the data associated with this unit of effort been made available?”
The obvious ‘fundamental unit’ of data sharing is the research article:
- It is getting easier to identify the datasets underlying a given article.
- Research articles reflect data that have already been collected (unlike Data Management Plans which describe future data collection efforts).
- Perhaps most importantly, articles are intended to be published, and journal stakeholders can withhold publication until the data have been shared.
This concept of a fundamental unit is borrowed from evolutionary biology, where researchers have discussed the ‘fundamental unit’ of natural selection for decades. Does selection mostly act on individual genes, on individuals, or on groups?
The idea of a gene as the main target of selection is seductive (cf. the selfish gene), but like individual datasets, genes are closely integrated with other genes, and have little meaning by themselves. To take the analogy further, some genes (or datasets) are junk and others essential, and it is surprisingly hard to tell which is which. Promoting the sharing of individual datasets will therefore lead stakeholders to overlook corollary datasets that are either essential for interpreting the main dataset or contain unique and valuable information of their own.
A group of individuals could plausibly be the ‘unit’ of natural selection, but this idea runs into trouble when it is hard to define where one group ends and another begins. The parallel here is using research grants or the annual output of a research lab as the unit for data sharing, as datasets are often the product of multiple grants or are collected over multiple years. Some datasets are so ephemeral that they barely register before disappearing. Moreover, unlike the data in an article, data collected for a grant or in a particular lab have no unifying analysis to order and structure them. Ask a PI to list all the datasets produced in their lab in a particular year and they’ll typically default to listing the data underlying their recent articles, with the unpublished or otherwise obscure datasets being forgotten.
The individual is the most widely accepted ‘fundamental unit’ of evolution. An individual is the product of all of its genes working together, and, for the most part, it is easy to see where one individual stops and the next begins. These traits also apply to research articles, in that articles represent a coherent grouping of datasets that supports an analysis approach, and it is not too difficult to define which datasets are associated with an article, even when some of the data are being re-used from previous studies.
In addition, unlike groups or genes, individuals have a defined lifespan, and at the end one can sum up their contributions to the next generation. Articles have a similar feature, in that the key moment for data sharing is just before acceptance for publication – the contents of the article are set and the datasets defined, and the authors can be pressured to share the data before the article moves out into the public sphere.
Even putting the genetic analogy aside, the above illustrates the need for a discussion about both the How and the What of data sharing. Choosing a fundamental unit for the What allows stakeholders to align their policies on what needs to be shared (e.g., all of the data associated with an article) and when (e.g., at publication), so that we can all work toward the same goal.