Research data is/are getting a lot of airtime at the moment. 2020 is the STM Association’s ‘Research Data Year’. The upcoming Peer Review Week focuses on ‘Trust’, which for articles must often involve open data. There’s also been a flurry of action (or calls for action) from stakeholders, including CODATA’s Beijing Declaration on Research Data and global research institutions’ Sorbonne Declaration.
These declarations and initiatives largely focus on ensuring that research data are FAIR: Findable, Accessible, Interoperable, and Reusable. The FAIR data principles are the current goalposts for promoting open research data, and efforts are thus focused on a) ensuring that individual datasets have comprehensive, machine-readable metadata (a link to the protocol used to collect the data, details of the instruments used, the license under which the data were released), and b) developing a network of FAIR compliant repositories to host all these datasets.
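In practice, "comprehensive, machine-readable metadata" often takes the form of a structured record attached to the deposit, for example using the schema.org `Dataset` vocabulary that many repositories and dataset search engines can read. As a minimal sketch (all names, identifiers, and URLs below are illustrative placeholders, not a real deposit), such a record might look like:

```python
import json

# Hypothetical metadata record for a single dataset, loosely following
# the schema.org "Dataset" vocabulary. Every value here is a placeholder.
metadata = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Example field survey measurements",
    # A persistent identifier makes the dataset Findable.
    "identifier": "https://doi.org/10.9999/example",
    # An explicit license makes it Reusable.
    "license": "https://creativecommons.org/licenses/by/4.0/",
    # A link to the collection protocol aids interpretation and reuse.
    "measurementTechnique": "https://protocols.example.org/survey-v2",
    "distribution": {
        "@type": "DataDownload",
        # An open, well-documented format supports Interoperability.
        "encodingFormat": "text/csv",
        # A resolvable download location makes the data Accessible.
        "contentUrl": "https://repository.example.org/dataset.csv",
    },
}

print(json.dumps(metadata, indent=2))
```

Repositories typically generate records like this from the deposit form, and harvesters such as dataset search services can then index them, which is what connects the metadata work in (a) to the repository network in (b).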
The FAIR principles are the community consensus answer to the ‘How’ question of data sharing, in that they describe best practice for how to share a particular dataset. Community consensus about anything is very welcome, but by themselves, the FAIR principles don’t have the leverage to bring more data into the public sphere and thereby achieve the manifold benefits of an open research data ecosystem.
For that, we also need a consensus answer to the ‘What’ question: for a given study, what datasets do the researchers need to share? This question is of fundamental importance because it underpins data sharing policies.
First, to comply with a policy, researchers must be clear about which datasets they need to deposit, which can be difficult to determine when their data are complex or pass through multiple stages between raw and analysis-ready form. Second, to enforce the policy, the stakeholder (typically a journal, funding agency, or research institution) has to be clear about which datasets should have been shared, so it can compare that list to the datasets the authors have actually shared.
To state the obvious, not all datasets are equal. Consider a dataset collected in the lab on a weekend, written into a notebook and promptly forgotten by the researcher. Here, there is no practical way for the stakeholder to know that dataset ever existed, and thus no mechanism to prompt the researchers to share it. Other datasets are collected over years or decades by researchers from multiple institutions who are in turn funded by different agencies, such that it’s almost impossible to know which data policy applies or when the data should be released.
Funders, journals, and institutions alike need a common focal point for the ‘What’ question – a fundamental unit of research effort where we ask “have all the data associated with this unit of effort been made available?”
The obvious ‘fundamental unit’ of data sharing is the research article:
- It is getting easier to identify the datasets underlying a given article.
- Research articles reflect data that have already been collected (unlike Data Management Plans which describe future data collection efforts).
- Perhaps most importantly, articles are intended to be published, and journal stakeholders can withhold publication until the data have been shared.
This concept of a fundamental unit is borrowed from evolutionary biology, where researchers have discussed the ‘fundamental unit’ of natural selection for decades. Does selection mostly act on individual genes, on individuals, or on groups?
The idea of the gene as the main target of selection is seductive (cf. the selfish gene), but like individual datasets, genes are closely integrated with other genes and have little meaning by themselves. To take the analogy further, some genes (or datasets) are junk and others essential, and it is surprisingly hard to tell which is which. Promoting the sharing of individual datasets will therefore lead stakeholders to overlook corollary datasets that are either essential for interpreting the main dataset or contain unique and valuable information in their own right.
A group of individuals could plausibly be the ‘unit’ of natural selection, but this idea runs into trouble when it is hard to define where one group ends and another begins. The parallel here is using research grants or the annual output of a research lab as the unit for data sharing, as datasets are often the product of multiple grants or are collected over multiple years. Some datasets are so ephemeral that they barely register before disappearing. Moreover, unlike the data in an article, data collected for a grant or in a particular lab have no unifying analysis to order and structure them. Ask a PI to list all the datasets produced in their lab in a particular year and they’ll typically default to listing the data underlying their recent articles, with the unpublished or otherwise obscure datasets being forgotten.
The individual is the most widely accepted ‘fundamental unit’ of evolution. An individual is the product of all of its genes working together, and, for the most part, it is easy to see where one individual stops and the next begins. These traits also apply to research articles, in that articles represent a coherent grouping of datasets that supports an analysis approach, and it is not too difficult to define which datasets are associated with an article, even when some of the data are being re-used from previous studies.
In addition, unlike groups or genes, individuals have a defined lifespan, and at the end one can sum up their contributions to the next generation. Articles have a similar feature, in that the key moment for data sharing is just before acceptance for publication – the contents of the article are set and the datasets defined, and the authors can be pressured to share the data before the article moves out into the public sphere.
Even putting the genetic analogy aside, the above illustrates the need for a discussion about both the How and the What of data sharing. Choosing a fundamental unit for the What allows stakeholders to align their policies on what needs to be shared (e.g., all of the data associated with an article) and when (e.g., at publication), so that we can all work toward the same goal.
22 Thoughts on "Articles Are the Fundamental Unit of Data Sharing"
Great post Tim! I think the point you make regarding “articles are intended to be published” is key.
Researchers publish articles when they are ready to share their work with the world. At this point they are inviting the community to discuss, validate, and check the reasoning they present in their articles. Since the data supporting their assertions represent the evidence upon which the authors are making their (potentially novel) assertions, it’s important for the community to have access to those data, so they can make an informed judgement as to whether the assertions made in the article are (on the balance of the evidence presented) likely to be true.
So at article publication, authors are ready to share their research and the underlying data are key in allowing others to properly comprehend the findings. Article publication is therefore the perfect time to ensure that those underlying data are shared in the FAIRest way possible.
Hi Varsha, thanks for this comment – I completely agree. The next jump beyond that would be to treat the data the same way as we treat other aspects of the article, and have it assessed and improved through the process of peer review, or have shoddy data be the grounds for rejection.
There’s additional effort required to promote data sharing for 100% of submissions as opposed to the 10-30% that get accepted, but that’s where greater automation comes in (e.g. DataSeer).
I appreciate the post. Defining the article as the fundamental unit of data sharing has the virtue of being tractable and actionable in the current landscape of scholarly communication. But, I think focusing attention on the article will fail to address the primary aim of open science — making it easier to assess and establish the credibility of research evidence and claims.
The key flaw in an exclusive focus on the article is that the pervasive problem of publication bias is ignored. Researchers are incentivized to ignore negative results, and there is substantial evidence that the literature is systematically skewed toward reporting studies that “work.” Data sharing for only research that is reported in the article does not address this challenge. And failure to develop policies and pathways that incentivize the reporting of all studies and underlying data will exacerbate overconfidence in the published literature.
The research culture is not yet prepared to adopt it, but the fundamental unit for reporting and transparency should be the preregistered study — regardless of whether it ultimately ends up in a paper. I believe that it is in our collective interest to embrace that target, even though we are much farther from it, so that we do not mistake the excellent efforts to promote data sharing at the point of publication as the end goal but rather a feasible and productive step along the way.
Thanks Brian, really interesting comment. Since the data haven’t yet been collected when a preregistration is published, when do you picture the ‘moment of enforcement’ for preregistered reports? Would it be when the authors submit the accompanying results article?
Moreover, while preregistered reports are a great idea for fields where each study should be preregistered and approved by an ethics board (e.g. medicine and other live animal research), they seem less practical for fields where one can just start collecting data on a whim. Fields that are lab-based (e.g. chemistry, life sciences, or genetics) or that are otherwise based on small-scale experiments would probably struggle to make preregistered reports standard practice.
Hey Tim. I don’t think there need be a singular enforcement date. For some projects, there’s little risk (or even an advantage) for born-open or open as soon as viable. For others, upon submission makes sense, if only for permissioned access should reviewers wish to examine data and code. And, for others, embargo periods may be justified based on the present reward structures for publication priority (whatever the faults of such reward systems). Our approach is to provide flexibility with accountability being served by transparency. That is, if all studies are registered, and workflows facilitate reporting of outcomes, then we can move norms toward better reporting by making it plain when researchers are out of alignment with community norms or the funder, journal, or institutional policy.
On the latter point, I agree right now and disagree in the longer run. One of our projects with DARPA has started pushing the boundaries of preregistration in high-throughput research activities. Templates, effective interfaces, and integration with data acquisition systems can make preregistration (and reproducibility of protocols) extremely efficient. Increasing the efficiency and applicability of preregistration across research activities is a priority area for us.
I agree. And the other challenge of making the research publication the fundamental unit of data sharing is that we’re still dependent on the scholarly literature & concomitant metadata – still often owned by commercial entities – to monitor engagement. To me, the beauty of new forms of self-‘published’ output (data, code, protocols) is that both content and metadata are community-owned.
Are you certain that if data, code, and protocols become coin of the realm (similar to journal articles) that their availability, review, monitoring and validation won’t also end up being monetized by those same commercial entities? The existing commercial forces in the market seem pretty intent on adapting and absorbing every aspect of the researcher workflow and creating lock-in (https://scholarlykitchen.sspnet.org/2018/01/02/workflow-lock-taxonomy/), so I wouldn’t depend on openness of content or metadata to keep things out of their hands.
“journal stakeholders can withhold publication until the data have been shared” – interesting Twitter discussion on this topic just yesterday (copied below).
Richard Wynne – Rescognito
Highly disappointing response from the editorial team when asked to facilitate communication with unresponsive authors for data sharing: “this is not something we offer.”
“Data available on request” is just an empty formality rather than supporting #OpenScience?
Good article. I wonder if there is not room for a repository of Lab Log Books?
1) When data collection is expensive and requires much equipment and logistical support, would it not be FAIR to put a price on that data if collected through private means not public funds?
2) Would a valid alternative to sharing data then be a thorough description of the data, data sources, data selection, data collection, data preparation and data processing methods sufficient to determine the quality of the data and its adequacy for the research purpose?
3) Sometimes data’s only purpose is to validate a new method. The method is essential, not the data which could be described as above. Should the data still be shared then, or would the scientific community benefit from having new data to further validate the method?
Hi Jean-Luc – I agree that there might be a distinction between private and public data while it’s being collected and used by the authors, but once the data are part of a research manuscript that is intended to be published, they must be publicly available. Withholding the data at that point makes as much sense as withholding the figures.
For your second point, I think that’s the trap we fell into over the past few decades: assuming that a decent description of the methods covers everything that the raw data would. Just go search for e.g. #pruittdata on Twitter to disabuse yourself of this notion.
When we published Methods articles in Molecular Ecology Resources we always insisted that authors share their example and test data. What better way for others to get to grips with what the method/analysis does than re-running it themselves?
Thanks for this thought-provoking post!
I completely agree with you that in many cases – particularly when scholars are REQUIRED to share their data by a funder or publisher – the article is the fundamental unit of sharing (and the fundamental discovery mechanism).
But this doesn’t seem to be the case for some of the most well-established data sharing communities – genomics, crystallography, earth satellite images, etc. Publication may still be the workflow event that triggers sharing and an important discovery mechanism, but these communities have developed their own fundamental data sharing units that allow reference and reuse outside the publication ecosystem. This is partly due to the nature of the data, of course, but I’d argue that it’s also a marker of data community maturity, because the community has developed its own norms around sharing data in a way that is most useful to their research collectively. Not all data sharing can work in this way, but focusing exclusively on the article might get in the way of spotting opportunities to support the growth of data sharing activities that are adjacent, rather than attached, to the publication ecosystem.
Rebecca Springer, Ithaka S+R
To be fair, I might argue that data sharing in the Genomics community is a consequence of the early existence of GenBank (https://en.wikipedia.org/wiki/GenBank) and the actions taken by journals requiring a GenBank Accession Number for any DNA sequence that appears in a published paper, which have largely driven its growth.
Absolutely David – similarly with the crystallography CSD. But in these cases the DNA sequence and the crystal structure are the fundamental units of data sharing, even if publication is the triggering event for sharing them – right?
I also agree that journal requirements can play an important role in strengthening/maintaining data sharing community norms, though I can think of at least one example where that tactic failed when the journal requirements were put in place too soon, before there was a sufficient culture of data sharing in the field.
Yes, the journal article provides a point of leverage, which is why policymakers have been so keen to work with publishers on getting things like open data off the ground. Publication as the point of openness also puts an end to the argument that the data needs to be kept secret so it can be exploited further by the researcher before letting others see it. Here they’re making a conscious decision to make their research results public. As you note, ideally this would lead to cultures where data are no longer seen as something owned and secret, and this to me seems a logical first step.
Indeed, or perhaps ironically here, the success of the Human Genome Project came about because the labs agreed to share and release sequence data as they were collected, long (years) before and independent of publication: https://web.ornl.gov/sci/techresources/Human_Genome/project/share.shtml Likewise, there are many other cases where data are collected and shared in real time (after validation), independent of and separate from publication, that should be strongly encouraged and are already happening: planetary missions, astronomical observations, and many real-time Earth observations, including temperature records and COVID data. This is already, and should be, an “and” discussion, of which articles are one important part. For the “What”, this is a key question, and following the leading domain repositories can help greatly, as it has in genomics, where guidelines on what to deposit (reference genomes vs. raw reads, etc.) were developed by leading repositories and then adopted by journals. Including the repositories in the conversation is key, as in many cases they have already developed leading practices.
Good points! As noted elsewhere in these comments, the 1992 Human Genome Project policies came along about 10 years after GenBank, which really set the tone for how DNA sequence should be handled (and in which journals requiring an accession number as a condition of publication played a major role). Perhaps also worth noting that regardless of this policy, great efforts were made by those involved in sequencing the genome to patent DNA sequences and lock them up from further unpaid reuse. The much-maligned (and appropriately so) Jim Watson resigned as director of the HGP in protest over the patenting of sequences.
Regardless, one has to start somewhere to build that culture of data sharing. Where it is not present, it seems to me that tying it to publication, an undeniable career need for a researcher, provides the lever to start the process of getting us to a world where this is the norm.
Hi Brooks, thanks for these comments. The data I had in mind when writing this come from the ‘long dark tail’ – the vast ocean of medium to small datasets produced by researchers who aren’t part of any grand collaboration, and which may or may not end up in a published article.
As David mentions, data sharing norms have generally been communicated to researchers by firm journal policies, and community expectations can build from there. For the long dark tail the only clue that these datasets exist is their inclusion in an article, hence my recommendation that we focus on ensuring that all the data associated with articles get shared.
I absolutely agree that this isn’t a perfect solution, but right now most researchers and stakeholders are still a long way from high quality open data, and they’re not going to get there in a single jump. We need an interim goal that can inculcate community driven open research data across all fields, and not just those that contain big scale projects like the HGP or the LHC.
Thanks for the thought-provoking article! Defining the journal article as the most viable unit for sharing data seems appealing at first, but I would argue this is actually oversimplified and would lose a lot of the potential of research data. The proposed framework supports reproducibility of the presented analyses, but reusability is greatly diminished.

I agree with Rebecca that it is a sign of community maturity to have developed standards and practices of data sharing, and I would add that this has facilitated scientific discovery in these fields. Collating and linking datasets by methods and fields, and annotating them in the most suitable way, strongly facilitates reuse. In contrast, sharing data in an article-oriented way packages together data which will hardly ever be reused in that combination, while making integration more difficult. For example, data from a study which analyses brain activity as a function of personality traits would be most reusable if the imaging data were integrated with other, similar data, which could then serve to investigate unrelated questions. At the same time, the psychological assessment data can best be reused in an ecosystem of similar data, again allowing researchers to answer different questions, or similar questions with larger cohorts.

Of course, reproducibility is an important goal, and clearly referencing the datasets used (as well as possibly deploying data packages and containers for code and data) will contribute to that, but this can be achieved without losing potential for data reuse.
In addition, the post assumes that the article will remain the fundamental output unit of research. However, this is already changing, and given the incredible number of articles, which are often hardly read, and the accelerating pace of discovery, it seems implausible to me that it will remain so. Already, micropublications and versioned articles have begun to erode the article as a “clearly defined unit”. I see the unit of discovery as the observation or finding (for exploratory research) or the tested hypothesis (for confirmatory research), rather than the article. While these might be new developments, the workflows in many labs already show that datasets are generated over long periods of time, added to, and used as the basis for multiple publications, which also precludes a simple mapping of publications to datasets.
Lastly, I think there is not one single time at which research data should be shared. Sharing at article publication is certainly an option with many advantages, but in some cases it might be possible to share data earlier, and for long-term studies, for example, communities need to develop their own standards, which might include sharing data earlier, probably in a versioned way. Conversely, not sharing all data at publication and instead opting for an embargo might not be best practice, but in the current research system it is understandable and for now needs to be accommodated (which of course does not preclude possible obligations to share all data with reviewers).
Thanks Evgeny – these are very good points.
A focus on the article as the fundamental unit of data sharing is certainly oversimplified and other potential research data will be missed. However, it’s light years ahead of the current situation, where there’s no real focus and almost all research data are missed. We have to start somewhere – the perfect is the enemy of the good and all that.
I also don’t agree that sharing data alongside an article necessarily diminishes reusability. The key is to give researchers advice on metadata, formatting, and the most suitable repositories, which in turn ensures that the data are findable and (hopefully) in good enough shape to be combined with similar datasets. This is the approach we’re taking with DataSeer, and we’re looking forward to working with repositories and researchers to hone our advice on sharing various data types.
There certainly are alternative publication formats out there, and some are rising in popularity. For what it’s worth, I count preprints as ‘articles’, and preprint servers are certainly in a position to promote or require open data before a preprint becomes public. More broadly, an alternative system that replaces the ~ 2 million articles published each year isn’t anywhere on the horizon at the moment.
I think you might also be missing my point when you say that “there is not one single time at which research data should be shared”. This is obviously true, and some researchers continuously share their data. However, most researchers passively resist sharing their data, and have to be pulled through the sharing process by the nose. That’s why I’m suggesting we pick the article as the point where we say “right – these are the datasets you have to share, and your article isn’t going anywhere until you’ve shared them”. Once the community is used to that (maybe 10 years from now?), we can start pushing for other data sharing routes.
We were thinking along similar lines in this policy proposal for the US Federal Government that has now gone live on the Day One Project website: