While public access to research articles is a fact of life for much of the scholarly community, access to research data – while a top priority for many governments and other funders, who see it as the key to future economic growth – remains a challenge. There are many reasons for this, both practical (eg, lack of infrastructure) and professional (eg, lack of credit, getting scooped). The publishing community can and does already help with the former, for example through support for NISO, CrossRef, CODATA, and other organizations and, increasingly, the development of data sharing and management solutions. Resolving the professional issues, however, will almost certainly require action by research funders and institutions.
Earlier this year, in an effort to establish a baseline view of data-sharing practices, attitudes, and motivations globally, across a cross-disciplinary set of researchers, Wiley surveyed around 90,000 recent authors of papers in health, life, physical, and social sciences, and humanities. For the purposes of our survey, in order to better understand how researchers themselves view data sharing, rather than imposing any definition of either data or data sharing on them, we allowed the respondents to decide what it means to them. Several of the more common activities researchers see as data sharing – such as sharing at conferences, or on request – would fall short of many definitions of data sharing, which demonstrates why it is so important to listen to the views of individuals in this discussion.
A total of 2,886 researchers responded (3.2% of those surveyed), of whom 2,255 (2.5% of those surveyed) were actively working on a research project or had been in the last two years. The respondents skewed towards the Americas (52%), with 30% from Europe, Middle East & Africa, and just 18% from Asia Pacific. The vast majority (87%) were from universities, colleges, and research institutes, with the remainder working in a mix of industry, government, medical, and other organizations. In terms of disciplines, the largest group of respondents (37%) work in life sciences, followed by social sciences and humanities (25%), physical sciences (22%), and health sciences (16%).
So, based on what our respondents told us, who is sharing what, how, and why (or why not)?
The overwhelming majority – 82% – produce data in spreadsheets and CSV files, etc, with only 12% of respondents creating relational databases. Thirty-eight percent of respondents create two-dimensional images as data in the course of their work; 3D images, 12%. Twenty-two percent of respondents are creating executable code/models, 14% collect transcripts and other data from interviews, and 11% are generating video/audio recordings. Surprisingly (to me anyway), for the most part, these files are (relatively) small – over 60% are less than 10GB and only 3.5% are larger than 2TB.
How researchers report that they are sharing data is also revealing. Over two thirds (67%) of the 52% who report sharing (ie around 38% of all respondents) stated that they have done so as supplementary material in journals (which broadly matches Wiley’s own experience), compared with only 19% who say they use a discipline-specific repository and just 6% who report using a general repository, such as Dryad or figshare. As mentioned above, many researchers report sharing data in informal, often impermanent ways that wouldn’t meet formal requirements, such as sharing at a conference (57%) and sharing on request via email, direct contact, etc (42%). In addition, a somewhat staggering 37% say they are using a personal, institutional, or project website to share data – again, unlikely to meet any data-sharing mandates, and certainly not the best way of ensuring any kind of long-term preservation of the data.
Perhaps most interesting is why researchers say they share data – and why they don’t. For example, German researchers reported sharing the most (55%), with three quarters of those respondents doing so in order to increase the visibility of their work and to ensure public transparency and reuse. Chinese researchers, however, reported being least likely to share (36%), with half the respondents stating that this is because it’s not a funder requirement. Chinese researchers are also less likely than their peers in other countries to say that they see data sharing as a personal responsibility. Other reasons given by respondents to share (or not) data highlight some interesting cultural differences. For example, twice as many Japanese researchers say they worry about being scooped; two thirds of Brazilian researchers say they’d be willing to share their data if they got proper credit or attribution for this; and only 14% of UK researchers say that they are sharing data via public or discipline-specific repositories, compared with a global average of 25%.
The main reason respondents cited for not sharing data was IP/confidentiality issues, especially in health science, where 68% of respondents cited it, compared with an average of 42% across all disciplines. The lack of a funder requirement was the second most common reason given for not sharing data (36%), while concerns about being scooped and about possible misuse or misinterpretation of data tied for third (26% each).
Unsurprisingly, respondents from different disciplines report different concerns and motivations. For example, as you can see from the infographic, respondents from social sciences and humanities as well as the physical sciences would be motivated to share their data in order to increase the visibility and impact of their work. Life scientists, however, would be more motivated to do so if they were guaranteed to get credit, while respondents working in health science told us that they are most concerned about privacy and ethical issues around data sharing.
So what are the overall lessons learned from our survey?
Publishers have, by accident rather than design, hosted data in the form of supplementary material for a number of years now. However, this compares poorly with other services which are far better suited to the curation and long-term preservation of research data. We could undoubtedly do more, for example by requiring journal authors to archive their data. Some journals and societies already do this, including Molecular Ecology, which introduced a data archiving policy in 2011 when it adopted the Joint Data Archiving Policy. Since then, the journal has deposited over 1,000 data packages in Dryad – an extraordinary achievement in just a few years. Examples of societies that have introduced data-sharing policies for their journals include the American Geophysical Union, the Society for the Study of Evolution, and the British Ecological Society, whose Executive Director Hazel Norman recently wrote an excellent article about their experience. And most genetics journals require authors to deposit any DNA sequences they reference in GenBank.
Next, with lack of funder requirements cited by respondents as a major reason for researchers not sharing their data, governments and other funding bodies need to play a more active role. Although some already require their researchers to submit data management plans, these are not always enforced – and many funders have not yet even developed their requirements. Support for and investment in the data repository infrastructure is also critical – discipline-specific and general repositories offer researchers a simple, safe, and permanent way to share their data, but they are underused at present and more investment will be needed in order to support any significant growth in future demand.
Last, but by no means least, we urgently need to find ways to credit and attribute researchers for data sharing – again, an area where institutions and funders need to show leadership. Not only will this require adoption of data citation standards, but also a cultural change in how researchers are rewarded for creating, analyzing, and preserving data. If adopted, the Contributor Roles Taxonomy (CRediT) initiative, led by Amy Brand and Liz Allen (see my interview with Amy), could help significantly with this. Varsha Khodiyar, Karen Rowlett, and Rebecca Lawrence also make some useful suggestions in their recent Learned Publishing article, including the expansion of data publishing opportunities and implementation of – and recognition for – data peer review.
There is certainly no one-size-fits-all, silver bullet solution to the challenges of data sharing – different communities, researchers, organizations, and countries are likely to continue to need different solutions. But there is a real opportunity here for the wider scholarly community – funders, institutions, publishers, societies, and others – to work together to develop data-sharing standards and best practices, as well as to encourage the use of existing repositories and create new ways of sharing data.
A full report including additional data will be available in due course. Thanks to colleagues Laura Fedoryk and Liz Ferguson for their help.
Stop Press: A Knowledge Exchange report, Sowing the seed: Incentives and motivations for sharing research data, a researchers’ perspective, published yesterday (November 10), provides a much more in-depth look at many of the issues we covered in our survey, and includes recommendations for how all stakeholders – funders, societies, publishers, research institutions, data centers and repositories, and Knowledge Exchange itself – can encourage and support research data sharing. (Note: the link was updated on December 18, 2014, as the original report was amended.)