Author’s Note: Looking back at this 2017 post brings a mixed bag of thoughts. First, the fortunes being made with collecting, curating, and selling access to consumer data still haven’t spilled across into research data, and that’s likely because a) relatively few research datasets are available, and b) for the most part, the ones that are available have inadequate metadata and incompatible structures, so that combining datasets for meta-analyses is scarcely worthwhile. Until we address the problem of missing research data – which (full disclosure) we’re trying to do with DataSeer – we can’t really make much headway with getting it all into a consistent format.
However, while combining datasets for re-use is a core feature for consumer data, it’s only one of the reasons for sharing research data. Open data also allows readers to check the results for the paper itself, and perhaps this is where our attention for the ‘business model for open data’ should turn. In particular, peer review is considerably simpler when the authors submit computationally reproducible manuscripts. Editors and reviewers can then be sure that the datasets support the analyses and hence the results, allowing them to focus solely on the appropriateness of the experimental design and the significance of the conclusions. It’s therefore conceivable that journals could reduce the APC for computationally reproducible articles (or hike it for non-reproducible ones), thereby incentivizing the extra effort required to produce them.
No matter what route we choose, it’s clear that our current incentive structures around open science (mostly strongly worded policies and the lure of extra citations) are not getting the job done, and we need to consider alternatives. Money can enter the equation at a few places: by funding only open science, as exemplified by Aligning Science Across Parkinson’s; by offsetting the extra effort required of researchers with additional financial resources; or by making open science cheaper or non-open science more expensive. Let’s see where we go.
Is There a Business Case for Open Data?
On the first day of Open Access Week, Figshare released the 2017 edition of their ‘State of Open Data’ report. The document contains a number of thoughtful pieces from the European Commission, the Wellcome Trust, and Springer Nature (among others), as well as the results of Figshare’s second annual survey of researchers about their experiences with data sharing.
The report as a whole is a useful microcosm of where we currently stand with open data. Large, influential organizations like the European Commission and Wellcome have thrown their weight (and policies) behind the idea that the data underlying scientific research should be available to everyone. By contrast, the response from researchers has been mostly polite indifference: only 20-30% of respondents claimed to have shared their data ‘frequently’, and just under half claimed either ‘rarely’ or ‘never’.
Despite the optimism of the 2017 survey infographic, only a small portion of the researcher community is actively supportive of open data, especially when it’s their own data that might become open. Overcoming this lassitude and/or resistance is a formidable task, and will require coordinated effort from funders, publishers, and institutions. Mark Hahnel (of Figshare) despairs at the scale of the problem in the 2016 report:
“We are not the first, nor will we be the last to point out that our academic system is fundamentally broken – publication in particular journals assures funding and job prospects; no-one becomes a professor for sharing their data.”
Despite its pessimism, this quote does hint at an answer: the link between data sharing and academic success is realized only when “nobody publishes in particular journals without sharing their data”. The fact that one can currently publish even major discoveries without the supporting data seems absurd, but that’s where we are these days.
A fair proportion of journals have adopted data sharing policies (e.g. 44% of the biomedical journals examined here; 38% here). However, few journals ensure that the datasets are complete and well annotated (e.g. Roche et al.’s study), and only a tiny fraction include evaluation of the data in the peer review process. Enforcing data policies is hard work for journals because authors do not spontaneously supply all of their datasets. Instead, the journal has to cross-reference the manuscript with the data provided and chase down all of the missing datasets (typically about half of them). With specialist staff this effort is just about possible for accepted manuscripts; achieving it for all submissions so that reviewers can look at the data is nigh impossible.
Outside the ivory tower, a whole new economy is springing up around data. Some firms amass data on people and then sell access to advertisers who want to reach a particular segment. Others trawl open datasets with sophisticated algorithms in the hope of finding unforeseen (and perhaps profitable) insights. ‘Data is the new oil’ can be heard at seminars and conferences all round the world. A lot of money is being spent on collecting, curating, and storing all these data.
Back in academic publishing, we have seen open access (OA) grow from a fringe concept into a widespread business model, and the major publishers all now embrace and promote OA. Can we hope for the same with open data? The economic forces behind the adoption of OA are clear: authors are highly motivated to publish and the journal receives money for each article. More funders than ever now require authors to direct their papers towards OA journals.
Currently, this is not the case for open data: authors don’t expect to pay the journal for reviewing their datasets, and there is resistance to the idea that authors should pay more than a few hundred dollars for data hosting. Journals could roll the extra costs associated with reviewing data into their APCs or subscriptions, but investing in data sharing seems to make the journal less, and not more, attractive to authors. For example, the shift in submissions from PLOS ONE to Scientific Reports coincides with the former’s adoption of a more stringent data policy. As long as authors see strong data policies as just another obstacle to publishing their work, journals will struggle to commit editorial resources to enforcing those policies.
This situation could change if we shift to a world where papers without (peer reviewed) data are seen as weak and unsupported, as then authors would place a premium on publishing in journals with strong data policies. However, it is hard to see how publishers can hammer this message home without casting doubt on their back catalogue or alienating their more senior authors and editors.
Are there other economic forces that would push publishers to ensure that all their articles are accompanied by the underlying data? In some ways, datasets associated with research papers are highly desirable, as they are collected by trained experts under carefully controlled conditions. If the data have been included in the review process, they should also be well annotated and hopefully error-free. One immediate obstacle to making money with shared scientific data is licensing: many open datasets associated with research articles use the CC-BY-NC license to forestall re-use for commercial purposes.
Another issue is the granularity of research data. The vast consumer behavior datasets collected by Google and Facebook have the individual as the common thread; similarly, the big datasets coming out of cities have common reference points (e.g. locations). This is not the case for the datasets arising from scientific research: most are small and collected to answer a very specific question. There is no conceivable thread joining a dataset on the temperature of penguin eggs to the DNA sequences of 1,000 fruit flies. Of course, either dataset could be combined with other penguin or Drosophila data, but these are unlikely to achieve the scale where they could have some commercial application. This story on data from mosquito surveys in the US sheds some light on the scope of this granularity problem.
Going back to the 2016 Figshare report on Open Data, there’s a notable section on the commercial aspects of open data in the piece by Sabina Leonelli:
The very idea of scientific data as artefacts that can be traded, circulated across the globe and re-used to create new forms of value is indissolubly tied to market logics, with data figuring as objects of market exchange. National governments and industries that have invested heavily in data production (…) are keen to see results. This requirement to maximize returns from past investments, and the urgency typically attached to it, fuels the emphasis on data needing to travel widely and fast to create knowledge that would positively impact human health.
In this vision the data associated with published papers end up in closely related industries, and these industries somehow use the data to make money. This fulfills the funders’ goal of stimulating economic activity through scientific research. However, none of this income makes its way back to those tasked with collecting and vetting the dataset prior to its publication (i.e., the journal), such that journals have no incentive to ensure that the data are either error-free or even present at all.
It’s therefore hard to see how the editorial office costs associated with the peer review and sharing of research data will be defrayed: authors don’t generally value journals with strict data policies, and once the data are out there, there’s little scope for the journal to generate further revenue.
As with most problems of this sort, there are two immediate solutions. First, somehow make it much cheaper for journals to enforce their data policies, so that the costs can be covered with existing revenue. Second, get more money. One source could be the funders, as they are keen for the data from their projects to reach the public domain. Helping journals cover the cost of getting datasets from authors and putting them through review would ensure that a) authors do a much better job of data management, and b) data associated with that funder’s projects are complete and well annotated when they become public. Unfortunately, neither of these solutions seems likely, at least in the immediate future.
In sum, if open data is going to become commonplace (which it should), we have to recognize that journals are the most important conduit. Promoting open data has to be in their interests as well.
Discussion
3 Thoughts on "Revisiting: Is There a Business Case for Open Data?"
Hi Tim,
Thanks for this rerun. It’s still a timely and important article four years later. What I’ve noticed from my vantage is that, in addition to the reasons you’ve described, there also remains a lack of enthusiasm for data sharing. Part of this has to do with the sheer weight of the reality involved: standardizing fields, validating data points, finding missing data points and sets, confirming these procedures and analyses with the original PIs (who have often moved on to other work by then), and so on, and of course, building and maintaining systems that make all this data exploration possible and accessible. Because of this weight, data sharing of the real substantive variety you describe is all still very much a niche enterprise, moving disease by disease, field by field.
Now, you can argue that it’s the journal editors’ responsibility to pull all this data together and vet it prior to publication, but honestly, this task is way over the head of editorial staff. Vetting medical research data at the desk level requires teams of highly trained and specialized PhD statisticians, so what you more often see (and the reason this moves disease by disease) is that there first needs to be a consensus in a particular research community that this data collaboration effort would even be worthwhile. This agreement is then followed by one-off grant applications to create such an effort, all of which has nothing to do with journals and the datasets they may or may not have collected. So, funding for these undertakings ends up being tenuous and far from robust. Participation as well. There isn’t a long line of scientists who are waiting to play with all this combined data.
In the meantime, there are people in the open community (myself included) who are out there preaching about the benefits of an open data future. But until we can build these systems and these systems begin to prove their worth, it’s unlikely that they’ll be funded or used at scale.
What’s the answer? There isn’t just one. Maybe publishers will eventually get behind building some sort of unified data clearinghouse. Maybe scientists and/or their government funders will get behind some sort of All-Scholarship Repository. Or maybe the most realistic approach will continue to be one-off collaborations like we’ve seen to date. If and when these various efforts start to produce breakthroughs, then more systems like these will be built and enthusiasm for their use will increase.
Thanks again for the rerun!
Best regards,
Glenn Hampson
Hi Glenn, thanks for all these thoughts. Your line “Vetting medical research data at the desk level requires teams of highly trained and specialized PhD statisticians” reminds me of something I heard a while back (I can’t remember where): one route to high quality open data might run through the professional statisticians, whereby they push for open data as part of their professional and ethical obligations.
Statisticians are certainly heavily involved in current data collaboration efforts (e.g., DataSpace, for HIV/AIDS vaccine research), but they aren’t in charge—they’re essential parts of the solution but working under the employ of research consortia who have strict controls on who gets to see data and under what conditions. In this case, it would violate the agreements these groups make with participating researchers (and possibly jeopardize the science) to push for any type of open that is more open than envisioned. Personally, I think the route to large quantities of usable, cross-comparable, high quality open data begins with seeing “open” as just one tool among many in our collaboration toolbox, not as an end in itself. Collaboration networks like Sage share data effectively and at scale using guidelines that work for their group, not one-size-fits-all global prescriptions of what open data is and is not. We can encourage and enable more data collaborations like this, and let this enthusiasm and innovation grow best practices and lessons of experience that can be shared and replicated in science, rather than requiring that all roads go through Rome.