We tend to think of research as either reproducible and thus valid, or irreproducible and questionable. This sort of binary thinking is problematic, because there’s a large body of research that’s entirely accurate but not easily reproduced. Do we need a new term for results that fall into this in-between zone?

At the recent STM Annual Meeting in Washington, Moshe Pritsker, founder and CEO of the Journal of Visualized Experiments (JoVE), gave a talk about the gaping hole present in efforts to drive scientific reproducibility. Enormous amounts of effort, money, and regulation have been put toward opening up the data behind published experiments. But very little attention seems to have been directed toward the protocols and methodologies used to collect those data.


While I’ve made this argument before, it bears repeating: If I want to reproduce your experiment (or check to see if your conclusions are valid), then access to your data is only part of the puzzle. Your data may accurately support your claims, but if you performed your experiment in a biased or poorly conceived manner, I may not be able to see that from just looking at the data. I need to know how you gathered it. Further, if I want to reproduce your experiment, then I need to know how you did it.

While there is great promise in the reuse of data, there is likely just as much, if not more, progress to be gained from the public release of detailed experimental methodologies. Many experiments are designed to answer specific questions under specific conditions. The data generated may not be of much use outside of answering just those questions, but the methods used to generate the data can be broadly adapted to ask new questions. Failing to transfer that methodological knowledge hinders scientific progress.

Journals can greatly improve the reproducibility of research by requiring methodological transparency. The print paradigm of journal publishing led us to poor practices in an attempt to save space and reduce the number of printed pages. When trying to cut down an article to reach an assigned page/word limit, usually the first thing to go was a detailed methods section. In a digital era where journals are doing away with page limits, why not add back in this vital information? For a journal that still exists in print, why not require detailed methodologies in the supplementary material? If you have a policy requiring public posting of the data behind the experiments, why not a similar policy for the methods? To their credit, Nature has expanded its methods sections and Cell Press has implemented STAR Methods, doing away with page limits to create methods sections that are actually useful.

But even with openly available methodologies, we still need to recognize that science is hard. Some research results stem from once-in-a-lifetime events, like a particular storm or celestial event. A hurricane can’t be replicated.

Often, a bench technique will take years to perfect, and even then, some things can only be done by the most skilled practitioners. This can lead to scientific results that are entirely accurate yet very difficult to reproduce. An inability to replicate an experiment can tell us more about the technical skills of the replicator than the validity of the original work. Maybe you can’t reproduce my experiment because you’re not very good at this particular complicated technique that I spent much of my career mastering. Does this mean that my work should be labeled “invalid”? Mina Bissell from the Lawrence Berkeley National Laboratory puts it succinctly:

People trying to repeat others’ research often do not have the time, funding or resources to gain the same expertise with the experimental protocol as the original authors, who were perhaps operating under a multi-year federal grant and aiming for a high-profile publication. If a researcher spends six months, say, trying to replicate such work and reports that it is irreproducible, that can deter other scientists from pursuing a promising line of research, jeopardize the original scientists’ chances of obtaining funding to continue it themselves, and potentially damage their reputations.

This is why claims like the oft-cited one from Amgen are so vexing. First, any time this “study” is mentioned, it must be made clear that it is not an actual study, but was instead a commentary that was published with no supporting data whatsoever. They claimed that they tried to reproduce 53 landmark studies and could only make 6 of them work. To this day, despite calls for further information, Amgen has only made public data from 3 of the 53 claimed experiments.

We have no idea how hard Amgen really tried to reproduce these papers. How many people worked on each one, what were their qualifications, and how much time did they spend troubleshooting each experiment and mastering every technique involved? Did they do the dogged detective work that researchers like those in Bissell’s lab did, and spend an entire year tracking down the one minuscule methodological variance upon which reproducibility rested? Are their claims of irreproducibility reproducible in any way?

While we can reduce the number of experiments that fall into this “too hard for me to reproduce” category through the availability of detailed research protocols, we still need to tread lightly around those results that do require such expertise. It is always preferable instead to do a new experiment that tests the original experiment’s conclusions, rather than just trying to replicate it, and that may be the best way to consider whether an experiment is valid: do its conclusions stand up to further experimentation? The benefit of this type of approach to reproducibility is that it offers potential for uncovering new knowledge, rather than just repeating something already known.

At the very least, we need a new term for these works that are essentially irreproducible, but not invalid. Any suggestions for a term that is neutral and does not denigrate the validity of the work in question but that notes the difficulty of reproduction would be welcome.

David Crotty

David Crotty is a Senior Consultant at Clarke & Esposito, a boutique management consulting firm focused on strategic issues related to professional and academic publishing and information services. Previously, David was the Editorial Director, Journals Policy for Oxford University Press. He oversaw journal policy across OUP’s journals program, drove technological innovation, and served as an information officer. David acquired and managed a suite of research society-owned journals with OUP, and before that was the Executive Editor for Cold Spring Harbor Laboratory Press, where he created and edited new science books and journals, along with serving as a journal Editor-in-Chief. He has served on the Board of Directors for the STM Association, the Society for Scholarly Publishing and CHOR, Inc., as well as The AAP-PSP Executive Council. David received his PhD in Genetics from Columbia University and did developmental neuroscience research at Caltech before moving from the bench to publishing.

Discussion

22 Thoughts on "Reproducible Research, Just Not Reproducible By You"

Repositories (hopefully more than one) are indeed useful, but as we’ve seen with every other type of repository, participation is pretty minimal unless one offers motivation. That may be in the form of the stick (no publication, no funding unless you release your methods) or the carrot (publish a methods paper alongside your results paper and receive additional career credit).

It is true that getting participation and sharing is very hard for new repository efforts. In the first years, most people simply don’t know that figshare or Dryad exist. Even when the effort is from a major publisher (e.g., Nature’s Protocol Exchange), getting adoption is challenging.

It takes time and dedication to create something that really improves the life of the scientist (the carrot is usually better than the stick). It also takes time to connect with the funders and publishers that can then encourage scientists to participate. In the case of protocols.io, contributions from scientists were indeed minimal a year ago, at 5-10 new public protocols each month. However, for the last 5 months, we’ve been growing at 50-100 new public methods per month. As we connect with more and more journals and links to us from published papers increase, participation should become even stronger.

We don’t seem to have any bug reports of the site being down recently. Not sure what happened. But in general, to ensure that the protocols are always accessible, we will soon be mirroring all public methods from protocols.io with the Center for Open Science.

“we need a new term for these works that are essentially irreproducible, but not invalid.”

What about “provisionally true,” meaning that the scientific community trusts these works as true, subject to further confirmation? This is the phrase used in the sociology of science to describe how experimental observations eventually become accepted as fact (or Truth, with a capital T).

The feasibility of reproducing results is very field dependent. There are experiments that are in principle reproducible, but doing so would require a budget a million times bigger than what you have. There will never be many Large Hadron Colliders or gravitational wave detectors around.

The policy of scientific journals is also relevant. I have seen very many articles (in the capacity of author, referee or editor) rejected with the argument “been done before”. There are very few brownie points available for reproducing somebody else’s result, even though the scientific value might be real.

Issues at hand:
a) research papers are supposed to be “peer reviewed”. How do these papers escape the reviewers’ validation? What is the responsibility of the editors, reviewers and even the publishers?

b) the publish-or-perish pressures, along with claims to priority, push for a “rush to print” and simultaneously militate against spending time and money on independent validation, unless current research hangs on the validity of the conclusions of the published research and the results are needed as a step forward in the current efforts. Of course, issues of proprietary control also play here, whether for prestige, advancement or fiscal benefit.

c) Interestingly, metrics such as the impact factor play here. An article that gathers page views, pro or problematic, benefits the publisher as well as the original authors. Publishers cannot absolve themselves.

I strongly agree that reproducibility will be improved via better sharing of research protocols and materials. Indeed, that is one of the major lessons of our Reproducibility Projects in Psychology and Cancer Biology (e.g., https://elife.elifesciences.org/collections/reproducibility-project-cancer-biology; https://osf.io/e81xl/wiki/home/).

However, I strongly disagree with your statement “It is always preferable instead to do a new experiment that tests the original experiment’s conclusions, rather than just trying to replicate it, … The benefit of this type of approach to reproducibility is that it offers potential for uncovering new knowledge, rather than just repeating something already known.” The main problem is “already known” — this ignores the uncertainty of evidence and the potential for false positives. The mindset of published=true is probably one of the most significant contributors to reproducibility challenges. The approach you prefer is called conceptual replication, as contrasted with direct replication. Both are important, but they serve different purposes. As we wrote in Nosek, Spies, & Motyl (2012):
“Because features of the original design are changed deliberately, conceptual replication is used only to confirm (and abstract) the original result, not to disconfirm it. A successful conceptual replication is used as evidence for the original result; a failed conceptual replication is dismissed as not testing the original phenomenon (Braude, 1979). As such, using conceptual replication as a replacement for direct replication is the scientific embodiment of confirmation bias (Nickerson, 1998).”

Three papers that go into further detail:

“Restructuring Incentives and Practices to Promote Truth Over Publishability”: http://journals.sagepub.com/doi/abs/10.1177/1745691612459058

“Estimating the reproducibility of psychological science” http://science.sciencemag.org/content/349/6251/aac4716

“Making Sense of Replications”: https://elife.elifesciences.org/content/6/e23383

Hi Brian, thanks for the terminology. My argument is based both in the practical and the philosophical. For the practical, I get more career credit (jobs, funding) for being someone who discovers new things, rather than someone who repeats things that others have done. One can certainly argue that confirmation is important, but in the end, we do science to learn new things. If validation was all we cared about, we’d stop doing new experiments and just repeat what we already know. While credit should be given to those who replicate the work of others, it won’t (and probably shouldn’t) be the same amount of credit granted to those who make the intellectual leaps that drive knowledge forward.

From the philosophical point of view, science progresses more when I add a new piece to the puzzle, rather than just creating a copy of a piece that we already have. So, if I can do a new experiment that adds new knowledge and at the same time validates and verifies your conclusions, then that to me is preferable to just repeating what you did.

The caveat you raise, essentially “you didn’t do the experiment right/the right experiment,” can also be raised for a direct replication. A failed direct replication does not invalidate the original because one can always claim that the replicator messed up or wasn’t technically good enough to make it work. Further, in my experience, when a conceptual replication fails, it usually leads to doing a direct replication as a control — did my conceptual replication fail because of poor experimental design, or because the original claims are wrong?

Agree with your first and third paragraphs. The second repeats the untenable claim — “copy of a piece we already have” — that assumes the original result is true. Uncertainty of evidence means that we don’t “have it” yet, and direct replication is a means of reducing that uncertainty. I’d be interested in your reactions to the papers I linked that unpack these issues in much greater detail (but no expectations that you spend time on them!).

You’re right, that’s probably poorly phrased — maybe “copy of a piece we provisionally have” would be better.

I think that authors should have to upload photos of their laboratory protocols, scribbles included. The printed methods are just not how science is performed. A few labs maintain websites that include key methods from their labs. Protocol and method journals should be used more often. I also totally agree about the Amgen debacle. It can take a PhD student 2 or 3 years to become proficient enough with a technique to generate quality data. They live and breathe the experiment and their careers depend on it. That level of motivation is often what leads to those great experiments. A group trying to do that 53 times over? Seriously? It would have been great had Amgen performed only 6 experiments and published all of their data (at the same time as the news articles). That would have been useful.

Welcome to the life of many scientists! Not only can it be hard to replicate someone else’s results, sometimes you can’t replicate your own! Culture room light bulb gets replaced, the weather changes, your consumables provider changes, etc. Detailed protocols and video of methods help, facilitating collaboration and communication helps. Accepting the validity of studies that show reproducibility helps. Including more bench scientists in this debate helps!

For much of my career, I conducted studies that were under-powered (because my understanding of statistical power was lacking). I was not unusual in that regard. When power is low, even real effects are hard to replicate. When a low-powered study yields a significant effect, the effect obtained is very likely an exaggeration of the true effect size. Consequently, other similarly powered studies are unlikely to replicate the initial finding (even if there really is an effect). In my lab, we often made conjectures about such failures (e.g., maybe it was the RA who ran the study, or maybe those summer subjects are weird), but now I believe that the inconsistency was mostly about low power. If you haven’t watched Geoff Cumming’s Dance of the p Values, Google it!
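
A minimal simulation can make this point concrete. The sketch below is purely illustrative (it is my own addition, not part of the original comment, and assumes Python with NumPy and SciPy available): it draws many low-powered two-group studies with a modest true effect, keeps only those that reach p < .05, and reports how much those “significant” results overestimate the true effect and how rarely a same-sized replication also reaches significance.

# Illustrative sketch: low statistical power inflates the effect sizes of the
# studies that happen to reach significance, and same-sized replications mostly fail.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_d = 0.3      # modest true standardized effect (both groups have SD = 1)
n = 20            # per-group sample size -- low power for an effect of this size
n_sims = 20_000

sig_effects, replications = [], 0
for _ in range(n_sims):
    a = rng.normal(true_d, 1, n)   # "treatment" group
    b = rng.normal(0, 1, n)        # "control" group
    t, p = stats.ttest_ind(a, b)
    if p < 0.05 and t > 0:                        # the "original" study is significant
        sig_effects.append(a.mean() - b.mean())   # observed effect (roughly standardized, since SD = 1)
        a2 = rng.normal(true_d, 1, n)             # attempt a same-sized direct replication
        b2 = rng.normal(0, 1, n)
        t2, p2 = stats.ttest_ind(a2, b2)
        if p2 < 0.05 and t2 > 0:
            replications += 1

print(f"Significant originals: {len(sig_effects)} of {n_sims}")
print(f"Mean observed effect among them: {np.mean(sig_effects):.2f} (true effect = {true_d})")
print(f"Same-sized replications that also reached p < .05: {replications / len(sig_effects):.0%}")

Run as written, only a minority of the simulated studies reach significance, and the ones that do report effects considerably larger than the true effect, which is exactly the pattern described above.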

Video-based publication is the best answer to increase the reproducibility of laboratory experiments. Video provides a step-by-step visual demonstration of the technical details and allows quick learning of the methods. Everyone who has worked in a lab can tell you how important it is to see an experiment instead of just reading about it. We implement this at JoVE, and get very good usage and feedback from scientists.

Text protocol journals (Current Protocols, Springer Protocols, Nature Protocols) and text protocol repositories (Protocols Online) have been around for decades; they provide a detailed text description of methods, yet we don’t see a significant improvement in reproducibility. I agree that, in the absence of video, having text protocols is better than nothing, but they are not a solution.

Compare for yourself the difference between the video and text protocols:
Video: http://www.jove.com/video/1938/the-subventricular-zone-en-face-wholemount-staining-and-ependymal-flow
Text: http://www.jove.com/pdf/1938/jove-protocol-1938-the-subventricular-zone-en-face-wholemount-staining-and-ependymal-flow

Disclaimer: I am the CEO and co-founder of JoVE (www.jove.com), the science video journal.

Can we nibble away at the edge of the methods reporting problem by improving the reporting of search strategies for systematic reviews and meta-analyses?
The status quo is not good. Few systematic review authors include complete search strategies for all the databases that were searched; some don’t include any complete search strategies. If anyone thinks there’s enough detail in statements like “databases X, Y, and Z were searched using relevant terms such as A, B, and C” — find a librarian and ask why not.
But shouldn’t it be easier to document and replicate the methods of systematic reviews than those of many other experiments? We’re talking about textual database queries. There are no special reagents, no differences in individual technicians’ techniques, no need for expensive equipment or hard-to-recruit subjects. Documenting systematic reviews in sufficient detail to allow for replication should be relatively easy (see the sketch following this comment for one possible form such documentation could take).
The culture change — persuading stakeholders to create better methods documentation — might also be easier in this domain. Systematic review authors presumably understand that complete reporting is valuable, since they depend on the authors of studies they cite to properly report their own findings. They already have standards for reporting that address not only the search methodology but also the eligibility criteria, study selection, data extraction, and risk of bias assessment. If systematic review authors aren’t well-disposed towards efforts to improve reporting quality, then no one will be.
So – can authors of systematic reviews and meta-analyses be persuaded to report in sufficient detail to allow replication? Will journal editors refrain from publishing systematic reviews and meta-analyses that don’t include such detailed reporting?
And most important — if we can’t achieve high-quality methods reporting in a well-suited domain like systematic reviews, is there any chance of improving methods reporting in other types of science?
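As a hypothetical illustration of what “sufficient detail to allow replication” could mean in practice, each database search could be deposited as a machine-readable record alongside the review’s supplementary material. The sketch below is an invented example, not an established reporting standard; the field names, query syntax, and numbers are all illustrative assumptions.

# Hypothetical sketch of a machine-readable record for one database search,
# detailed enough that a replicator could rerun the query verbatim.
import json
from dataclasses import dataclass, field, asdict

@dataclass
class SearchRecord:
    database: str                               # e.g., "MEDLINE"
    platform: str                               # e.g., "Ovid"
    date_searched: str                          # ISO date the search was run
    query: str                                  # the full, copy-pasteable search string
    limits: list = field(default_factory=list)  # language, date, publication-type limits
    results: int = 0                            # number of records retrieved

example = SearchRecord(
    database="MEDLINE",
    platform="Ovid",
    date_searched="2017-06-01",
    query='(exp Neoplasms/ OR cancer*.ti,ab.) AND (exp Exercise/ OR "physical activity".ti,ab.)',
    limits=["English language", "2000-2017"],
    results=1342,
)

# Deposit the JSON with the review's supplementary material so the search can be rerun as-is.
print(json.dumps(asdict(example), indent=2))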

Bravo for this post. In everyday life, people around the world look for video instructions before print, whether it is working in R, lab or field techniques, fixing a ski binding, or assembling furniture from IKEA, which comes with infamously terse methods sections.

My one experience with JoVE was very positive and I hope this journal gets more traction. They’ve thought it through and produce high quality methods “articles.” It starts with a conventional peer-reviewed text manuscript; once ours was accepted, they worked with us to write a storyboard, lined up a professional videographer to come out to do the shoot, and handled the editing. An amazing amount of work on the part of the JoVE personnel in comparison to a standard print article, not to mention the hands-off approach of the megajournals.

The downsides or limitations to JoVE are that this is a high bar, and more than what the majority of authors are willing to take on. The logistical coordination between getting the experiment up and running and the journal’s videographer scheduling was nontrivial. And expanding it to field work in the environmental or earth sciences is another logistical challenge. A further downside is cost. JoVE’s quality doesn’t come cheap, and my library balked at the subscription cost. Authors can’t share a hosted video like they can a PDF, so to me, the JoVE approach really only works for Open Access articles. My recollection was that the JoVE OA fee was about double that of the typical Open Choice etc. fees from Wiley/Springer/Elsevier. For all that JoVE provided compared to the latter, this was excellent value in my view. Still, the money had to be found, and I suspect that, altogether, an open access JoVE article might seem too much trouble and expense for most authors and projects.

Alternatively, authors can produce and publish decent videos themselves. If the quality of amateur video equipment and the high resolution video shot at all those dreadful children’s school music or dramatic performances could be replicated in the lab with a careful narration of the procedures, it could go a long way toward improving procedural replication. Even smartphones produce decent quality video with some attention to lighting, stabilization, and ambient noise. Free editing software is serviceable and easy to use. Adding a methods video as Supplemental Information is directly supported by most publishers, and if not, there are plenty of other ways to link to it. At the minimum, decent still photos of experimental setups or field sites go well into SI and can go a long way toward properly illustrating methods. I’ve done that in most of my recent articles, but I’ve had mixed reviews for doing so. More than one snooty comment was to the effect that lab/field photos were more appropriate for inclusion in a presentation, but that the article’s SI should be reserved for science results.

Chris,

Thank you for your kind words about JoVE.

About the costs. Most scientists are not professional film-makers, so we (JoVE) have to produce videos for them, in their labs, at multiple locations around the world. This required us to create and maintain an enormous infrastructure including:
– hundreds of videographers in 25 countries around the world
– a video-editing team of about 30 people
– a team of about 40 script-writers and science reviewers with PhDs in specific fields

We apply a flat author fee of $2,400 for each video article for all the work we do: filming at your lab, travel of our videographer to your location, scripting and video-editing, in addition to the traditional expenses of science publishing such as editorial and peer-review management. For comparison, for my last academic article in a text-only subscription journal (PNAS), I paid the same $2,400. So our author fees are comparable to those of text-only journals, despite much higher expenses.

To be sustainable, we apply a subscription-based model. At this time, about 1,000 academic institutions subscribe to JoVE, and more are coming. If your institution is not yet subscribed, please speak to your librarian. I am confident they will listen. For authors wishing to publish in Open Access, we provide such an opportunity for an additional fee of $1,800, to at least partially recover our expenses.

I am glad to hear you find value in JoVE.

Moshe

Perhaps journal publishers should publish a companion journal of the methods used in the “descriptive/results” journal. Both journals could or could not be OA, with the OA route requiring the author to pay for the publication of two articles. I am sure funding agencies would agree to such a proposition just for the sake of making science available to all. Of course, subscription-based journals would have to double the subscription rate.
Oh, how I hate the laws of economics!

I don’t know about other fields, but research on truly “irreproducible” one-off events in any physics-based discipline (high-energy, astro, meteorology) is built on a solid theoretical background. Many of the observations in these fields can be predicted (calculated) from first principles, so unless there is something very wrong with the calculations, the starting confidence levels are generally higher than in disciplines like the biological sciences. Furthermore, research on these one-off events often involves large scientific infrastructure (accelerators, telescopes, aircraft), which is usually mandated to provide clear methodology reports. More importantly, such research attracts multiple independent teams that carry out their own analyses of the data, which in effect provides some measure of replication. Successful reporting of results also hinges on having at least two teams agreeing with each other, and even then, a third team may report a disagreement in another article.

Regarding the Amgen debacle, I personally think that it is a difference of priorities. Companies want robust processes (methods) which can provide clear-cut data on the hypothesis in question within a reasonable time frame. The irreproducible results simply imply that the methods used in the original papers are not robust enough to be applied in an industrial/medical setting. This may not be a problem for science per se, but it is a big problem in translational research. And we all know that’s where the funding is.

You make a very valid point here, David.

One thing that has always occurred to me about the relationship between data sharing and reproducibility is that it’s a bit post hoc. The drive towards sharing more components of the research pipeline pre-dates recognition of the reproducibility crisis and has broader motivations. A big part of it, for instance, is enabling other researchers to more easily build on the work.

I think it’s important to take a balanced approach here. The fact that there does seem to be a lot of research that isn’t done to a high enough standard to be reproducible and robust doesn’t mean that every time somebody can’t reproduce a result, the original experimenter was either lax or fraudulent. Equally, the fact that there are lots of experiments that are technically challenging to reproduce doesn’t mean that there aren’t people cutting corners.

In the short term, there are low-hanging fruit, like the misuse of statistics, that can be identified and hopefully cleaned up through researcher education and better training. In the long term, making more and more of the research pipeline accessible, from methods, to lab notebooks, to analyses and code, to data, will help us refine our approaches by surfacing weaknesses, or at least help us not get any worse.

Might also be worth noting that this is an inherent problem for a lot of (non-experimental) social science work. It simply cannot be reproduced because you cannot step in the same river twice. The passage of time at a field site inevitably changes it, so a new investigator cannot expect to see the same things happening. Even asking participants about their experiences at a prior time runs into problems of recall and reconstruction – the controversy over Derek Freeman’s questioning of Margaret Mead’s fieldwork in Samoa is an obvious case in point.
