Street sign, park nearby
Image via RP Norris.

Last month, President Obama showed off his dad-joke skills while announcing the appointment of the first US Chief Data Scientist. The focus of much of the White House’s messaging around this appointment has been on making the government’s own data publicly available. In his ‘memo to the American people’, however, Dr. D.J. Patil talked about acting as a conduit between government, academia and industry. In some ways, this latest move can be seen as a continuation of a US government push toward open data that mirrors efforts in Europe and elsewhere.

For a long time, there has been an expectation that researchers share data upon request with other academics, but more recently, the trend has been towards making data widely and publicly available. In February 2013, the White House Office of Science and Technology Policy released a memorandum on Expanding Public Access to the Results of Federally Funded Research. While the 2013 statement got a lot of people’s attention, funding agencies have been moving towards open data for over a decade. In 2003, the NIH announced a data sharing policy in which they stated:

Data should be made as widely and freely available as possible while safeguarding the privacy of participants, and protecting confidential and proprietary data. To facilitate data sharing, investigators submitting a research application requesting $500,000 or more of direct costs in any single year to NIH on or after October 1, 2003 are expected to include a plan for sharing final research data for research purposes, or state why data sharing is not possible. [Emphasis theirs]

A similar Proposal and Award Policies and Procedures Guide (PAPP) came out of the NSF in 2007. For both of the largest governmental scientific funders in the US, researchers are required to describe how they intend to both manage their data and make it available to those who wish to build upon it. In other words, the US government fully intends that data sharing be a requirement for receiving federal funding.

Interest in data sharing isn’t just restricted to the sciences. In the humanities, the NEH also requires grant applicants to submit a similar data management plan. There is also an Office of Digital Humanities within the NEH that focuses on harnessing new technology. Its activities include educating researchers in data curation, the agency’s policies, and best practices for digital archiving.

While these policies and initiatives are clearly intended to show government support for data sharing, many in the data science world say that they don’t go far enough. There’s an argument to be made that merely requiring a plan in a grant doesn’t necessarily mean that data will be shared. After all, sharing data requires adding an extra step to the workflow of researchers that are already pressed for time. When the NIH began threatening that grants would not be renewed for those who failed to comply with their green OA policy, compliance jumped from 19% to 49%. Similarly, a truly effective data sharing policy may have to have consequences for noncompliance.

Internationally, the UK seems to be leading the field in terms of open data mandates. According to SHERPA/JULIET (which is jointly funded by JISC and RLUK), over a quarter of all UK-based funders now have data archiving policies in place, including the Wellcome Trust, MRC, BBSRC and, most aggressively, EPSRC. EPSRC’s policy is important because it’s the first to cross the line from a statement of policy to a mandate with teeth. The policy comes into full force in May 2015 and EPSRC promises to:

…investigate non-compliance; if it appears that proper sharing of research data is being obstructed EPSRC reserves the right to impose appropriate sanctions.

Given that the new EPSRC policy is based on RCUK guidelines, it seems likely that if the policy is successful, other research councils will follow suit.

That’s not to say that other governments are not being aggressive on this; everybody from the Canadian government to the Austrian Science Fund has policies in place. Many private funders are also getting involved; the Bill and Melinda Gates Foundation claims to have the world’s strongest policy on open access, which includes a requirement for open data. With the appointment of a Chief Data Scientist in the US, the new EPSRC policy, and the ever-quickening pace of mandates, it looks like we may be at a tipping point for open data.

Why has it taken until now for us to reach this point? One thing that has held the open data movement back in recent years is the concern that, while increased transparency is widely accepted to be good for science generally, sharing data might carry career risks for individual researchers. These concerns have been articulated very well in The Kitchen previously and include the lack of citability of data, fear of getting scooped, and the desire to get proper credit for work done.

For some time now, some researchers, like the ones in this article in Science magazine, have actively advocated for data sharing. These pioneers of open data claim that while there are risks, on balance, the benefits outweigh them. According to the Knowledge Exchange report Sowing the seed: Incentives and motivations for sharing research data, a researcher’s perspective, which is based on interviews with academics, many researchers see data sharing as an important strategy to make their research and research group more visible. More quantitatively, Piwowar et al. found a 69% increase in citations for microarray cancer clinical trial data when the data was made freely available. In order to explore these issues further, Digital Science recently organized the first in a series of open data spotlight events for researchers. Nicko Goncharoff wrote a summary of the event in Research Information. One theme of the meeting was the need for a shift in the way that we value research output to give greater credit to data. On the other hand, there was also a lot of talk about the benefit to science of sharing data, and the ways in which it can benefit researchers directly, often in unexpected ways, with other researchers applying techniques and ideas that the originating lab hadn’t imagined.

Last November, Alice Meadows wrote an excellent post based on Wiley’s data sharing survey of some 90,000 researchers in which she noted that a significant number are concerned that giving away their data might cause them to be scooped, lead to them not getting adequate recognition, or result in their work being undermined. On the other hand, in that same post, Alice noted that 53% of researchers globally now do share their data. We’ve been hearing for a while now about the theoretical risks and benefits of data sharing, but the proof of the pudding, as they say, is in the eating. With just over half of researchers sharing data, either because they find it to be beneficial or because their funders asked them to, it again looks like we’re at a tipping point.

But what about the minority of researchers that aren’t ready to share data? If we’re not simply going to ride roughshod over the understandable concerns that some researchers have, or end up with a significant minority of researchers who refuse to comply with data sharing mandates, we’re going to have to make sure that those concerns are addressed. Returning to the appointment of the White House Chief Data Scientist, part of Patil’s new job is to work with agencies to define best practices for data sharing. In his memorandum to the American people, he expressed a desire to…

position ourselves for the next wave of innovation and … for everyone to benefit holistically, and I want to emphasize that everyone benefit holistically

Perhaps part of the thinking here is that if the US is to avoid falling behind in the race towards open data, the government and the funding agencies must shape their mandates in such a way as to mitigate the concerns of individual researchers, maximize the benefits, and apply adequate consequences for noncompliance. It will be interesting to see what comes out of this during the next year or so, but one thing’s for sure: the growing need to support open data isn’t going away anytime soon.

Phill Jones

Phill Jones is a co-founder of MoreBrains Consulting Cooperative. MoreBrains works in open science, research infrastructure and publishing. As part of the MoreBrains team, Phill supports a diverse range of clients from funders to communities of practice, on a broad range of strategic and operational challenges. He's worked in a variety of senior and governance roles in editorial, outreach, scientometrics, product and technology at such places as JoVE, Digital Science, and Emerald. In a former life, he was a cross-disciplinary research scientist at the UK Atomic Energy Authority and Harvard Medical School.

Discussion

23 Thoughts on "Are We at a Tipping Point for Open Data?"

“position ourselves for the next wave of innovation and … for everyone to benefit holistically, and I want to emphasize that everyone benefit holistically”

While a laudable aim, is this not making perfection the enemy of the good? By this declaration anyone who claims they will not benefit would surely be a roadblock to implementation. I’m struggling to think of any policy that could ever give benefit (holistically or otherwise) to everyone.

For example, ridding the world of nuclear weapons (surely a good thing, the best thing, indeed) would leave defence manufacturers losing out.

We have to be grown up enough to accept some people will feel like they’re losing out if science/humanism as endeavours are to progress.

Phill, while I follow your argument and agree with your claims, I take issue with how you frame this topic. By framing this issue as a ‘tipping point,’ you assume that there is one market, one group of scientists, working from a single and unified scientific perspective. I would maintain that there is no such Science, but many scienceS (emphasis on the plural)–groups of academics working within like-minded communities, whose behaviors are largely defined by what these communities agree is best for them. This is why data archiving is standard (and required) for publication in some journals, but not for others. So, rather than talk about this as if you’re attempting to change herd behavior (or the movement of the markets), I think a better framework would be to understand why data sharing has benefitted some communities of practice but been ignored in others. And if the goal is to convert non-public sharing communities into public sharing communities, you’ll need to understand what is holding them back. Surely, it isn’t ignorance.

Hi Phil,

You’re absolutely right that in some fields data sharing is already the norm, which further goes to show that there’s an opportunity to be had in supporting the practice. Fields like genomics, for instance, have standard practices in place for gathering, formatting and curating data, and that’s great. What’s changing is that researchers who aren’t in those specific disciplines are now being asked to share their data. As time goes by and researchers get more and more used to the idea, I believe, based on talking to a handful of them, that attitudes are shifting. Just recently, there seems to me to be a shift from thinking of data sharing as a bit of a pain to recognizing that it has value.

Concerns remain, of course; I highlighted those in my post. I agree that more work does need to be done to understand the risks and benefits. There isn’t a lot of solid research on whether the predicted negative and positive effects for individual researchers actually materialize, aside from Heather Piwowar’s analysis of how data sharing affects citation rates.

I agree that more work needs to be done to make sure that whatever policies are put into place address those concerns so that people are happy to share data, and as I say in the post, perhaps Dr Patil is the man to lead that effort.

It sounds like you advocate more mandates, Phil. But if a large minority of researchers are against it and the academic value system has to change to make it work then it is a questionable policy, one that could do real damage. What seems to be lacking is any sort of cost-benefit analysis.

I don’t advocate more mandates per se. I do advocate more data sharing as I believe that science does need greater transparency and data is as good a place to start as any. I haven’t seen any arguments that have convinced me that data sharing is bad for science, so I’m not sure what damage would be done to science. The only arguments I’ve seen have been about people’s personal career progression. There’s a tragedy of the commons aspect to that; it’s not easy to decide to share when nobody else is doing so. So perhaps if mandates cause everybody to share their data, that may enable people to do so while remaining competitive. If data sharing will accelerate science, it’s worth putting in the time and effort to figure out how to make that work to everybody’s satisfaction.

I also believe that funders do have a right to say how their money is spent, although they have a responsibility to do so wisely.

The point is that if you’d asked researchers 10 years ago if they were in favor of sharing their data, most of them would probably have said no. The fact that most of them now say yes, and that the number is climbing, is pretty strong evidence that that is the way that opinion is going in science.

Damaging people’s careers (or incomes) certainly counts as a significant cost in the data sharing policy or mandate issue. The policy question is not what is good for science but what is good overall? I myself have done federally funded research where the data I developed has significant commercial potential. That I should be forced to give this up is not a simple issue. The government got my results, which is arguably what they paid for. There are presently procedures for claiming data as proprietary and overturning these requires strong reasons, which I have yet to see presented.

As I wrote in the post above, and also a number of times in the comments, it’s important to shape policies and mandates to address everybody’s concerns.

Note that a data management plan mandate is not a data sharing mandate. The standard DMP language allows the researcher to opt out of sharing, if they have a reason for doing so.

It does, and this is an example of some of the nuances and details that still need to be thought about. Just because questions and subtleties still exist doesn’t mean that data sharing is a bad thing.

No one is suggesting that data sharing is a bad thing, in its place. As Phil Davis notes, there is already a lot of data sharing in some specialized communities. The policy issue is what the rules should be, especially whether new rules are needed, and that is far from clear at this point. Practices and mandates are two very different things. I find the whole concept of regulating the research community to be problematic.

I’m glad that we can agree that data sharing isn’t a bad thing.

I think that it’s worth noting that as somebody interested in serving the academic market with technology, I’m primarily responding to what I see as a change in attitude to data sharing among academics and funders.

It’s worth noting that much of the advocacy surrounding data sharing is coming from academics, the researchers themselves and librarians. I myself am a former scientist, many of my friends are researchers and my wife is a tenured biologist. There is a growing belief that data sharing is important for academic transparency. I personally believe that that is true but am conscious that we must be careful in setting mandates and need to have constructive conversations about the details.

To refer to funder mandates as regulations is disingenuous. Funders have stipulations attached to their grants for obvious reasons. To get an extramural grant renewed, you have to actually show that you’ve done or made substantial progress towards doing what you said you would. That can include publishing articles, purchasing specific pieces of costed equipment, collaborating with other researchers and sharing data. Researchers who don’t do what they say that they would don’t get their funding renewed because the funders want a return on their investment. Funding stipulations aren’t regulation, they’re a means to accountability.

I know that some people don’t like the idea of greater funder involvement in research, but that’s beside the point. The reality is that the market is moving in that direction. On a purely business level, this represents an opportunity for publishers and information companies to support researchers in meeting these requirements while protecting their interests.

In Federal research funding the grant contract terms are governed by the Federal Acquisition Regulations, or FAR, so agencies cannot simply add requirements. The FAR includes extensive language governing rights in data. The provisions can be complex but basically I think these rights are negotiated. To my knowledge, requiring the release of data will require amending the FAR, but I may be wrong. It is a technical legal issue.

On the US Federal side there are two different programs, which Phill alludes to. The Open Data program is about making government data accessible via the Data.gov portal. The Public Access program created by the OSTP memo applies to researcher data developed during funded research and this is where the data sharing policy and mandate issues arise.

The Public Access program is still emerging but there is one major milestone that seems not to have been much noticed. To begin with the Agency for Healthcare Research and Quality (AHRQ) has published its Public Access plan at http://www.ahrq.gov/funding/policies/publicaccess/index.html. AHRQ is going to fund, hence build, the first full scale Federal data repository to collect, house and provide access to all of the data underlying all of the journal articles that flow from its funding. The prospect is conceptually breathtaking. The reality will be fascinating to watch.

In fact this could turn into something of a circus, as procurements like this sometimes do. (Federal procurement has been a major research area of mine.) This repository project is for something completely new and the number of prospective bidders is huge. The potential for confusion is very great. AHRQ says the data repository will be up and running in just eight months, but two years seems more likely.

To some of the points made above, I believe that “open data” is usually a shorthand for an attempt by those who have the resources to turn data efficiently into money, but lack the intelligence and skill required to create the data to begin with, to put their thumbs on the scale of what is very close to an efficient market in data management and release.

On the contrary. Some of the biggest proponents of open data are data scientists themselves. Part of the motivation for them is to expand the types of academic outputs that researchers can get credit for.

“Some of the biggest proponents of open data are data scientists themselves.”

QED. (The “data scientists” are precisely the folks who can efficiently turn data into money.)

Similarly, some of the biggest proponents of “free culture” are those who stand to profit from doing things like selling ads next to free content or devices upon which to view it.

The other issue here is the notion of credit for things like datasets. Yes, citation and credit is due, but it’s important to realize that any credit granted is likely to be a small fraction of that granted for actual discovery. Data is useful, but without interpretation it’s meaningless. A technician collects data, a scientist understands data. Careers are made on that understanding, on making the intellectual leap based on the data, not on just the collection of raw data itself. So to be fair, any credit given for data is likely to be a much smaller level of credit than that given to the person who uses the data to drive a breakthrough.

You’re right, I misspoke, or mistyped, or something. I meant that there are people who produce data who would like to be able to take credit for it.

It’s an interesting idea that people who want to commercialize data are trying to persuade researchers to make their data public but I don’t think that that’s really what’s going on here.

I agree–I’m not sure it’s a commercialization scheme, but I do think there are scientists who need the data that others produce, and it would certainly be in their career and financial interests if there were more good data publicly available for them to use.
