While there’s a growing recognition of the value of data archiving and public availability of research data for reuse, putting things into practice is proving a long, slow process. One of the biggest stumbling points is that researchers only rarely receive formal training in data management, and are often left to work out their own schemes for how they will collect and store information. This can result in a beautifully organized information resource that is designed for long-term access, or in a haphazard set of poorly labeled files found strewn across a variety of obsolete devices and hard drives.

Data availability and re-usability start with best practices in collecting and storing data in the first place. The exasperatingly funny video below shows what happens when those best practices are ignored, something that’s much more prevalent than it should be.

Thanks to Seth Denbo for sending this video along.

David Crotty

David Crotty is a Senior Consultant at Clarke & Esposito, a boutique management consulting firm focused on strategic issues related to professional and academic publishing and information services. Previously, David was the Editorial Director, Journals Policy for Oxford University Press. He oversaw journal policy across OUP’s journals program, drove technological innovation, and served as an information officer. David acquired and managed a suite of research society-owned journals with OUP, and before that was the Executive Editor for Cold Spring Harbor Laboratory Press, where he created and edited new science books and journals, along with serving as a journal Editor-in-Chief. He has served on the Board of Directors for the STM Association, the Society for Scholarly Publishing and CHOR, Inc., as well as The AAP-PSP Executive Council. David received his PhD in Genetics from Columbia University and did developmental neuroscience research at Caltech before moving from the bench to publishing.


17 Thoughts on "A Painful (but True-to-life) Look at Data Availability and Reuse"

Perhaps the expectation is unrealistic. I did staff work for the US Interagency Working Group on Digital Data (IWGDD), which led to the present US Public Access data policy. It merely requires that each proposal include a Data Management Plan, and no specific practices are specified. The reason is that we determined that best practices are often different for different communities.

I don’t think it’s unrealistic to ask researchers to store their data in a manner where 1) they can find it, and 2) they can understand it. If these aren’t realistic expectations, then why have a Data Management Plan at all? Aren’t these the basics of Data Management?

Who is they? The basic DMP requirement is that the data shall be made available, unless it is explained why it cannot be.

“They” is the researcher himself or herself, as the starting point. If one is going to use one’s own data (let alone make it available to others), this is necessary. Sometimes I wonder, David, if you just enjoy arguing for the sake of arguing.

The NSF includes in their DMP requirements, “plans for archiving data, samples, and other research products” (http://www.utc.edu/research-sponsored-programs/success-strategies/nsf-data-management.php).

The NIH requires that “each dataset will require documentation” (https://grants.nih.gov/grants/policy/data_sharing/data_sharing_guidance.htm), wants key behaviors to include that the researcher “Learns, stays up to date on and incorporates the basic components of a data, records, and knowledge management process.” (https://hr.od.nih.gov/workingatnih/competencies/occupation-specific/2210/datamgmt.htm), and asks, “What data documentation will be shared (e.g., metadata, descriptors, schema) so that others can understand and use the dataset and to prevent misuse, misinterpretation, or confusion?” (https://grants.nih.gov/grants/sharing_key_elements_data_sharing_plan.pdf).

If basic organization, storage and documentation of data is “unrealistic”, then perhaps we should give up on the scientific enterprise entirely.

That’s okay, David. I wonder about your apparent inability to understand what I am saying. I do not think we need regulations that require people to understand their own data. The idea is so strange that I had to ask about your meaning.

I am responding to your claim that “One of the biggest stumbling points is that researchers … are often left to work out their own schemes for how they will collect and store information.” Given that this freedom is the essence of Federal data policy, I do not see it as a stumbling point.

I struggle to understand what you’re saying when it is nonsensical, or when it has very little to do with the points made in the post or the video. Here you are clearly creating a strawman argument and then tearing it down. No one is arguing in any way that there must be regulations on how data is organized and archived. The argument is that training in doing so is sparse, and that this harms our ability to maximize the value we can derive from that data. My argument is that more training in best practices and data management strategies would improve our ability to use and reuse that data. A call for education is not the same thing as a call for regulation.

Where do things like pre-registration come into play in this case? Can deciding in advance what will be collected and how it will be used help minimize poor data management?

Brilliant video. Still unaddressed is the reproducibility of the data, and thus the validity of the reported research results and the potential for amplifying both the interpretation of the original work and its subsequent use, whether in the same area or in the translation across research fields suggested in the video.

This issue has come up in the past (some of which has resulted in retraction of articles). It is one of the results of how articles are vetted, where reviewers do not access the data to validate it as part of approval for publication, and authors don’t have the resources or time, which could significantly delay publication.

I am not sanguine that even having a uniform, transferable, protocol would solve this problem, exacerbated by the use of impact factors and similar measures as surrogates for the measure of a researcher’s oeuvre.

At a minimum it has to start with a repository that can accommodate multiple contributors and multiple storage formats / locations, all tied to a single, perpetual GUID that can be cited later. That way, the data at least has a home and a basic organizational structure–wiki, data, code, hypotheses, resulting analysis, etc. that can be accessed later for a variety of purposes including reproduction, preprints, review, or post-publication support.

We have built the Open Science Framework to address these basic organizational and storage issues for research projects. We have also developed ways to surface best practices by allowing groups to self-manage their own project structure without concern for WHERE the data is stored or how it is produced. By enabling connections to most major software and storage platforms, data can live where it needs to but can still be accessed and shared from a common OSF project that anyone with permission can access.

Wonderful video. It points out many of the problems in retrieving and using legacy data. I loved it that he mailed the only existing copy!
The only way to be sure data are retrievable is for someone to independently retrieve them. Maybe this should be part of the acceptance process, though it’s effectively adding another review step.

I worked as what used to be called a ‘computer programmer’ in a local government agency before I went to grad school. We had extremely strict standards for documentation, probably because there were financial issues involved. I was astonished when I went to grad school at how cavalier people were about data and documentation and how little they understood about how to keep track of things. Throughout my career, I would find people who were using, for example, data sets someone else had created without full knowledge of how those data sets were created. On one project that I worked on, I was the only person to have saved a copy of the questionnaire that had been used, even though I had joined the project after all the data had been collected. People also do not test programs, and there are some high-profile examples in the literature where a lab used an ad hoc program written by a lab member that turned out to have some kind of bug in it. The resources, time and general knowledge required to produce good documentation are considerable and often under-estimated. There is little or no training on how to do this either.

I know the feeling. My field is survey research and I have created dozens of data sets and associated documentation to standards fixed as long ago as 1972. I also taught my students to use the same standards (see my page Survey Analysis Workshop). However, I have recently been looking at data files distributed by UKDS for the British Social Attitudes Survey and encountered numerous problems, as detailed on my page British Social Attitudes 1983 to 2014: Cumulative SPSS files. The cleaning of the files has taken several months, but they have now been deposited with UKDS, to whom all enquiries should be made.

I spend much of my day job working with people throughout my university trying to develop appropriate policies, services and infrastructure to effectively manage research data. I’ve used that video (created by librarians at NYU) many times and the reaction is always the same — nervous laughter and an acknowledgment that it is all too common. While a great deal of good work has been done in the past several years on developing standards and best practices, most researchers are not aware of them and do not implement them. Even in areas like genomics, where there are decent community standards and well developed disciplinary repositories, practitioners recognize that they’re still a long way from where they need to be. UAB is typical in that data management practices are completely decentralized and training is left to the whims of individual lab directors and mentors, with perhaps some basic required IRB training that barely scratches the surface. Regarding federal requirements, NIH has indicated that their DMP requirements will continue to evolve and eventually be scored as part of the grant review process. This will increase the pressure on investigators to become more sophisticated with their data management practices as well as increasing pressure on institutions to develop resources to help them do that.

I’ve been watching this video ever since I became a data librarian – nearly 5 years ago. I used to hate it because I thought it was just a long list of everything that could go wrong, and didn’t model good (advanced) data management which I was so keen to learn and promote.
Now, I find it hilarious because it’s true, and it just keeps becoming truer! Every time we show this to new grad students and researchers, the same nervous laughter bubbles up. Despite the impressive work already done developing policies, institutional infrastructure and training, data sharing in some disciplines is still an ‘affordance’: a nice-to-have love note to the future that only happens in the movies.

The paradigm shift that is required for researchers to share data will happen when data sharing advances their own research endeavours, and becomes part of the workflow. Understanding data sharing as a social movement will help us get there, I think, and forgive ourselves for the current limitations of policy development and training. Collaborative efforts between researchers and librarians in the software carpentry space give me hope.
