On the first day of Open Access Week, Figshare released the 2017 edition of their ‘State of Open Data’ report. The document contains a number of thoughtful pieces from the European Commission, the Wellcome Trust, and Springer Nature (among others), as well as the results of Figshare’s second annual survey of researchers about their experiences with data sharing.
The report as a whole is a useful microcosm of where we currently stand with open data. Large, influential organizations like the European Commission and Wellcome have thrown their weight (and policies) behind the idea that the data underlying scientific research should be available to everyone. By contrast, the response from researchers has been mostly polite indifference: only 20-30% of respondents claimed to have shared their data ‘frequently’, and just under half claimed either ‘rarely’ or ‘never’.
Despite the optimism of the 2017 survey infographic, only a small portion of the researcher community is actively supportive of open data, especially when it’s their own data that might become open. Overcoming this lassitude and/or resistance is a formidable task, and will require coordinated effort from funders, publishers, and institutions. Mark Hahnel (of Figshare) despairs at the scale of the problem in the 2016 report:
“We are not the first, nor will we be the last to point out that our academic system is fundamentally broken – publication in particular journals assures funding and job prospects; no-one becomes a professor for sharing their data.”
Despite its pessimism, this quote does hint at an answer: the link between data sharing and academic success is realized only when “nobody publishes in particular journals without sharing their data”. The fact that one can currently publish even major discoveries without the supporting data seems absurd, but that’s where we are these days.
A fair proportion of journals have adopted data sharing policies (e.g. 44% of the biomedical journals examined here; 38% here). However, few journals ensure that the datasets are complete and well annotated (e.g. Roche et al.’s study), and only a tiny fraction include evaluation of the data in the peer review process. Enforcing data policies is hard work for journals because authors do not spontaneously supply all of their datasets. Instead, the journal has to cross-reference the manuscript with the data provided and chase down all of the missing datasets (this is typically about half). With specialist staff this effort is just about possible for accepted manuscripts; achieving it for all submissions so that reviewers can look at the data is nigh impossible.
Outside the ivory tower, a whole new economy is springing up around data. Some firms amass data on people and then sell access to advertisers who want to reach a particular segment. Others trawl open datasets with sophisticated algorithms in the hope of finding unforeseen (and perhaps profitable) insights. ‘Data is the new oil’ can be heard at seminars and conferences all round the world. A lot of money is being spent on collecting, curating, and storing all these data.
Back in academic publishing, we have seen open access (OA) grow from a fringe concept into a widespread business model, and the major publishers all now embrace and promote OA. Can we hope for the same with open data? The economic forces behind the adoption of OA are clear: authors are highly motivated to publish and the journal receives money for each article. More funders than ever now require authors to direct their papers towards OA journals.
Currently, this is not the case for open data: authors don’t expect to pay the journal for reviewing their datasets, and there is resistance to the idea that authors should pay more than a few hundred dollars for data hosting. Journals could roll the extra costs associated with reviewing data into their APCs or subscriptions, but investing in data sharing seems to make the journal less, and not more, attractive to authors. For example, the shift in submissions from PLOS ONE to Scientific Reports coincides with the former’s adoption of a more stringent data policy. As long as authors see strong data policies as just another obstacle to publishing their work, journals will struggle to commit editorial resources to enforcing those policies.
This situation could change if we shift to a world where papers without (peer reviewed) data are seen as weak and unsupported, as then authors would place a premium on publishing in journals with strong data policies. However, it is hard to see how publishers can hammer this message home without casting doubt on their back catalogue or alienating their more senior authors and editors.
Are there other economic forces that would push publishers to ensure that all their articles are accompanied by the underlying data? In some ways, datasets associated with research papers are highly desirable, as they are collected by trained experts under carefully controlled conditions. If the data have been included in the review process, they should also be well annotated and hopefully error-free. One immediate obstacle to making money with shared scientific data is licensing: many open datasets associated with research articles use the CC-BY-NC license to forestall re-use for commercial purposes.
Another issue is the granularity of research data. The vast consumer behavior datasets collected by Google and Facebook have the individual as the common thread; similarly, the big datasets coming out of cities have common reference points (e.g. locations). This is not the case for the datasets arising from scientific research: most are small and collected to answer a very specific question. There is no conceivable thread joining a dataset on the temperature of penguin eggs to the DNA sequences of 1,000 fruit flies. Of course, either dataset could be combined with other penguin or Drosophila data, but these are unlikely to achieve the scale where they could have some commercial application. This story on data from mosquito surveys in the US sheds some light on the scope of this granularity problem.
Going back to the 2016 Figshare report on Open Data, there’s a notable section on the commercial aspects of open data in the piece by Sabina Leonelli:
The very idea of scientific data as artefacts that can be traded, circulated across the globe and re-used to create new forms of value is indissolubly tied to market logics, with data figuring as objects of market exchange. National governments and industries that have invested heavily in data production (…) are keen to see results. This requirement to maximize returns from past investments, and the urgency typically attached to it, fuels the emphasis on data needing to travel widely and fast to create knowledge that would positively impact human health.
In this vision the data associated with published papers end up in closely related industries, and these industries somehow use the data to make money. This fulfills the funders’ goal of stimulating economic activity through scientific research. However, none of this income makes its way back to those tasked with collecting and vetting the dataset prior to its publication (i.e., the journal), such that journals have no incentive to ensure that the data are either error-free or even present at all.
It’s therefore hard to see how the editorial office costs associated with the peer review and sharing of research data will be defrayed: authors don’t generally value journals with strict data policies, and once the data are out there, there’s little scope for the journal to generate further revenue.
As with most problems of this sort, there are two immediate solutions. First, somehow make it much cheaper for journals to enforce their data policies, so that the costs can be covered with existing revenue. Second, get more money. One source could be the funders, as they are keen for the data from their projects to reach the public domain. Helping journals cover the cost of getting datasets from authors and putting them through review would ensure that a) authors do a much better job of data management, and b) data associated with that funder’s projects are complete and well annotated when they become public. Unfortunately, neither of these solutions seems likely, at least in the immediate future.
In sum, if open data is going to become commonplace (which it should), we have to recognize that journals are the most important conduit. Promoting open data has to be in their interests as well.
12 Thoughts on "Is There a Business Case for Open Data?"
The argument for peer reviewed data makes a lot of sense in the abstract, but what does it mean as a procedure for editors and reviewers? Does it simply mean that someone checks to see that there is data behind the study, or that the data support the claims made in the paper? If it is the former, then the verification can be done by an editorial assistant who simply checks a box and doesn’t require a trained peer reviewer.
If it is the latter, are reviewers required to reanalyze the dataset to confirm the authors’ results? Do they need to check to see whether there are errors (omissions, extreme observations) in the dataset? Do they need to suggest additional variables or models that weren’t included in the data?
If most scientific data has no generalizable use beyond supporting claims made in the paper, then it doesn’t make much sense to me that data need to go through editorial and peer review. Open does not necessarily imply vetted.
The key is that the data _might_ be reviewed, as authors will then work to make sure that the dataset isn’t the reason their paper was rejected. This alone would greatly improve data management practices in research.
When I review a paper I tend to prod it all over and then focus on the parts that seem fragile. If the datasets were available I’d check simple things, such as whether the columns are labelled and whether the number of rows matches the reported sample size. If the data are a mess, I’d perhaps try to rerun a few of the basic analyses to see if I can get the same answers. If not, I’d probably end up recommending the paper be rejected, as it’s very unlikely that the authors produced a robust paper from a badly managed dataset.
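As a rough illustration of the simple checks described above, here is a minimal Python sketch. The function name `basic_dataset_checks`, the assumption that the dataset arrives as a CSV file with a header row and one observation per row, and the sample data are all hypothetical; this is not any journal’s actual workflow.

```python
import csv
import io


def basic_dataset_checks(csv_text, reported_n):
    """Return a list of problems found by two simple sanity checks.

    Hypothetical helper: assumes a CSV dataset with a header row and
    one observation per subsequent row.
    """
    # Skip completely blank lines so they are not counted as observations.
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    if not rows:
        return ["dataset is empty"]

    header, data = rows[0], rows[1:]
    problems = []

    # Check 1: every column should carry a non-blank label.
    for position, label in enumerate(header, start=1):
        if not label.strip():
            problems.append(f"column {position} has no label")

    # Check 2: the number of data rows should match the reported sample size.
    if len(data) != reported_n:
        problems.append(
            f"found {len(data)} rows but the paper reports n={reported_n}"
        )

    return problems


# Made-up example: the second column is unlabelled and one row is missing.
example = "egg_id,,species\n1,36.5,adelie\n2,35.9,gentoo\n"
for problem in basic_dataset_checks(example, reported_n=3):
    print(problem)
```

An empty result means only that these two surface checks passed; it says nothing about whether the data actually support the paper’s claims.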
Perhaps one can create a journal of data by subject matter, e.g. Journal of Data: Genetics. The journal would use a subscription model. In this manner the market would speak as to whether it were needed or not. Authors could pen a brief overview of the data presented. In this manner, an author could get two publications out of his research. Of course, the free lunch crowd would clamor for an OA journal of data and now one would learn if an author or granting agency was willing to pay to play!
There are quite a few general data journals, like GigaScience and Scientific Data. They are both open access and doing well, so there is a segment of the community that’s willing to pay to get their data and a data paper out.
Just so it’s clear (since this is about authors being ‘willing to pay’), with regard to data hosting, we don’t charge an additional, or premium, APC for hosting data with their articles. This is included. Our authors and reviewers do have to put in extra time, for sure, but we also have staff to aid them.
Isn’t a service like Code Ocean (https://codeocean.com/) primed to address this issue? Then it would just become a matter of incorporating it into the submission process, then making sure that either the editor checks it or that the reviewers do.
My understanding is that the focus of Code Ocean is on software and code, not data.
Code and data live together – one provides the ecosystem to create the other. Just having the data without the pipeline that created it would be disingenuous. This is why figshare and git are deeply important in this space, and complementary. Fortunate as well, that software is decades ahead of data in terms of adopting sensible licensing.
I tend to think of data as more universal — there are a lot of experiments that generate data that don’t necessarily create or even use code. So in some ways it’s a bigger cultural issue, and maybe the starting point is to get everyone used to the idea of open data, which will lead to better transparency in other areas. That said, I agree 100% with you as far as availability of code, which I lump in with other research tools and methodologies. This is an area that so far has been sadly neglected by efforts to increase transparency and reproducibility in the literature. My rant about this (from 2014) can be found here:
Even if there’s a database (e.g. Dryad) integrated into the submission workflow, the stumbling block is the authors. Editorial Offices can tell them that they need to provide all their data for peer review, but only about 10% actually give you all their data. About half give you none. So then the EO has to go through each paper and work out which other datasets they need to supply, and then winkle them out of the authors. It’s very time consuming, even for EO staff with expertise in the field.