Reproducible Research: A Cautionary Tale

Access to data is the new frontier for scientific researchers. Governments, funding agencies and journals are increasingly calling for researchers to provide public access to their data. More access to more information is a good thing, and over time, the current controversies over these policies (embargo periods, patient confidentiality, archive costs and maintenance, etc.) will be worked out, and releasing at least some portion of one’s data, post-publication, will become just another part of the research process. While one of the goals for these policies is to increase verification and reproducibility of research, successfully doing so is going to vary wildly depending on the data and the types of experiments performed.

Some types of data and areas of research lend themselves readily to reuse and rapid verification. It’s no surprise that computational research and areas like bioinformatics are leading the charge for data access. If your experiment consists of running numerical data through an algorithm, then releasing your data and your code allows others to quickly verify that you’ve done what you’ve said you’ve done. But when it comes to other types of research, wet bench experiments or observational work for example, reproduction is not quite so simple.

You’re probably familiar with attention-grabbing articles from Amgen and Bayer, claiming that the majority of research results are irreproducible. The fact that drug development companies, often working at a fast pace and under economic pressure, could not reproduce the findings of complex experiments should perhaps not be surprising. Research, particularly cancer research, is incredibly complex and one must contend with a near-infinite number of variables. Scientists literally spend years developing and mastering the difficult and esoteric techniques necessary for their experiments, and the slightest environmental or methodological variance can produce very different results.

In a recent article in Cell Reports, Lawrence Berkeley National Laboratory breast cancer researcher Mina Bissell and colleagues offer a cautionary tale about how difficult reproducibility can be. Bissell’s lab set up a collaboration with a group in Boston. The groups were working with breast tissues, and as part of their experiments, these tissues were broken down and the resulting cells sorted by type using a method called Fluorescence Activated Cell Sorting (FACS). But the two labs, working with the same tissues and the same protocols, could not get similar data sets from their FACS techniques no matter what they tried. After a year of painstakingly breaking down every aspect of the process, it was discovered that one small mechanical detail of the technique, essentially how rapidly the tissues were stirred during breakdown into cells, was the culprit.

Bissell and her collaborators deserve an enormous amount of credit for their doggedness and attention to detail. Many researchers would have given up at some point, and I suspect that’s more than likely what happens when a company like Amgen or Bayer encounters a problem like this. In the Amgen article cited above, they tried to reproduce 53 different studies over a 10-year period. While the details are not given, I assume it is unlikely that Amgen devoted the time and resources of entire research groups to spend a year focused solely on the minutiae of each of these 53 experiments. But that’s what it may take to reproduce some results.

A former colleague shared the story of his laboratory and several others who shared a tissue culture technique but couldn’t reproduce each other’s findings. Again, after nearly a year of delving into the details, it was discovered that the collagen they had all ordered from the same supplier had enough variability between different lots to fundamentally change the results seen. I know at least two laboratories that moved to new universities and couldn’t reproduce their previous results and both, after much perseverance, found that minute ionic differences in the local water supply altered their solutions. I’ve seen laboratories where the cells in the dishes kept near the top of the incubator were different from those at the bottom due to vibration from an internal fan.

This is the life of the cell biologist, and working with cancer cells adds further complexity. A cancer cell researcher recently offered the following, suggesting that varying results are the norm, not an anomaly:

In the specific context of cancer cell biology, there is an additional factor that needs more thought and discussion as far as reproducibility of experimental results. Many investigators strongly favor the use of human cancer cell lines, but most of the common cancer cell lines have chromosomal abnormalities. For example, individual MCF-7 cells can each have a wide range of duplications of chromosomes, resulting in different genetic content within each cell. So, even assuming that everyone is conducting their experiments perfectly and that everyone has “authentic” MCF-7 cells, they could still be analyzing cells with chromosome numbers anywhere from 44 to 87. As concerning as the variation in the mean chromosome number is the large range of observed values. Every experiment is therefore conducted on a hugely diverse population of cells and selective pressure can bring different variants to the fore. Change the experiment, change the selective pressure, and you have different cells from the “same” cell line.

Layered on top of this genetic variation is phenotypic variation. Tumors are now recognized to consist of cancer cells in very different phenotypic or differentiation states, even from the same genetic background. So the cellular material for cancer cell line experiments is highly heterogeneous. An additional complication is that most common assays (e.g. scratch assay, transwell migration) don’t have a large dynamic range. Even “big” effects are typically around 2-fold up or down. Given that there are many cell behaviors that could impact these assays and the hugely variable cellular inputs, what is reasonable to expect in terms of reproducibility?

And that’s just one sub-field of cell biology. I’m sure you can find similar comments about complexity and variability from any number of researchers in any number of fields.

The notion then, that public access to researcher data is a magic bullet which can end questions of experimental reproducibility, is naïve at best. Given that the data access movement has been primarily driven by the computational research world, where reproducibility can be achieved by re-running the data, it’s an understandable assumption but it also falls victim to a myopic worldview where all researchers work the way one’s own field works.

While broad data availability is worth pursuing, we should temper our expectations for the benefits it offers. It is likely going to have greater value in some fields than others, and for some experimental methods than others. We should also take great care when interpreting replicative studies. A positive result, showing that results can be reproduced, offers meaning. A negative result, however, does not immediately invalidate previous findings because we cannot differentiate between an initial experiment that was wrong and a failure by the second group to handle all of the complex details of variability.

David Crotty

@davidacrotty

David Crotty is a Senior Consultant at Clarke & Esposito, a boutique management consulting firm focused on strategic issues related to professional and academic publishing and information services. Previously, David was the Editorial Director, Journals Policy for Oxford University Press. He oversaw journal policy across OUP’s journals program, drove technological innovation, and served as an information officer. David acquired and managed a suite of research society-owned journals with OUP, and before that was the Executive Editor for Cold Spring Harbor Laboratory Press, where he created and edited new science books and journals, along with serving as a journal Editor-in-Chief. He has served on the Board of Directors for the STM Association, the Society for Scholarly Publishing and CHOR, Inc., as well as The AAP-PSP Executive Council. David received his PhD in Genetics from Columbia University and did developmental neuroscience research at Caltech before moving from the bench to publishing.

Discussion

35 Thoughts on "Reproducible Research: A Cautionary Tale"

My read on the reproducibility issue is that it is a conceptual confusion. The core concept is that science is based on observation, but this has morphed into what amounts to an expensive audit trail requirement, to no purpose. Are half of the scientists going to devote their lives to checking the work of the other half? Surely not, but then what is the ratio that justifies providing the detailed data for every bit of work done? When something really needs to be checked it will probably be checked. Universal backup data for all work is not required.

By David Wojick
Mar 26, 2014, 7:31 AM

In my opinion, scientists do not check the reproducibility of others’ research just to check it, but to take advantage of the method. In the case of the work by Haruko Obokata et al. published in Nature in Jan 2014, which claimed to reprogram adult cells back into pluripotent stem cells, a lot of stem cell scientists tried the method in the hope of getting pluripotent stem cells for regenerative medicine, but no one until recently has succeeded. These unsuccessful attempts to reproduce Obokata et al.’s result have led to a hullabaloo in the world of stem cell scientists, and thus to an investigation of the article, which found issues that might represent scientific misconduct.

By Jeanne A. Pawitan
Mar 26, 2014, 10:08 AM

Indeed Jeanne, this is a good case of science trying to reproduce an unexpected result. But it is not a case for universal data preparation and submission. Researchers have better things to do.

By David Wojick
Mar 26, 2014, 11:18 AM

I don’t think the point is so much reproducibility as other researchers using the data for further analysis. I just had to fill out a “data management” section in a grant proposal here in Australia. It’s a new requirement this year. This is despite our study not producing any primary data, just processing existing datasets (mainly from governmental agencies). In my field (economics), journals are increasingly asking researchers to submit data. It can be useful for understanding how to use a technique you haven’t used before to apply it to a known dataset and see if you can reproduce the results.

By David Stern
Mar 26, 2014, 8:38 AM

That stuff might be useful some day is the reason for archives, but collecting everything is not reasonable. Universal data preparation and curation looks more like a fad than a rational policy.

By David Wojick
Mar 26, 2014, 9:51 AM

David, these are good points that you don’t normally hear in the Open Data debate. I applaud Bissell et al for publicizing complications related to data sharing in their area of research.

However, I don’t think that what they found means we can’t or shouldn’t share even this data openly. Instead, we should continue to improve our means for capturing our methods–including their minute divergences from the ‘script’–in an automated manner, taking the onus of creating the “audit trail” off of researchers.

We should also be more specific about our own methods when writing them up in the literature (in ways like the Reproducibility Initiative suggests), to help avoid the problem of “I couldn’t validate that experiment because I’m not using cells from the same exact lot as my colleagues”.

By skonkiel
Mar 26, 2014, 10:14 AM

I agree completely–there are important reasons for making data available other than reproducibility, but where that particular use is concerned, we need to be realistic. I was the Editor in Chief for a biology methods journal for several years where we tried to address the problem of labs not sharing their detailed methodologies by offering them a chance to publish a peer reviewed paper describing their protocol which would count on their CV toward career advancement and funding.

Even with that incentive, getting authors to spend time writing up their methods was like pulling teeth, and nearly everything we published was commissioned rather than spontaneously submitted. So I think there are still some cultural issues in the research community that need to be overcome.

By David Crotty
Mar 26, 2014, 10:25 AM

Interesting David C. What you apparently see as cultural issues that need to be overcome I see as a rational decision not to spend unnecessary time preparing data that is likely never to be used.

By David Wojick
Mar 26, 2014, 11:14 AM

I would suggest there’s a difference between making data available and writing up a detailed protocol for a new methodology that others may wish to use in their own experiments.

By David Crotty
Mar 26, 2014, 11:17 AM

Which, if any, do you see as the proper focus of government mandates?

By David Wojick
Mar 26, 2014, 11:22 AM

David,

As a publisher of the methods journal (JoVE), I agree that publication of data (results) is much more attractive to authors than publication of methods (how results are achieved). But I think the issue is economic rather than cultural, specifically, how science funding is structured. Scientists get grants for new results (e.g. a new cancer gene discovered), and not for demonstrating to other scientists how these results were achieved. The funding agencies put the priority on the results, and so far have been less interested in thinking about the productivity of the scientific work. The Amgen and Bayer stories provided the first systematic data to reconsider this approach.

By Moshe Pritsker
Mar 26, 2014, 1:02 PM

With all due respect to economics, grants, career incentives etc, most of us working scientists are in the business because we are curious about the results of our experiments, and not about the methods used to get them. ‘Tis the simple difference between the end goals, and the means to those ends….

By Mike_F
Mar 26, 2014, 1:47 PM

I think the first sentence in your comment addresses 2 different issues. One-by-one:

>With all due respect to economics, grants, career incentives etc,
>most of us working scientists are in the business because we
>are curious about the results of our experiments

This may be true until your grant application gets rejected. Scientists need to eat too, and therefore are as responsive to economics and incentives as any other professional group.

>we are curious about the results of our experiments,
>and not about the methods used to get them.

This may be true until you need to reproduce a paper published by another scientist (and you often need it if you work in the lab). Then you care very much about how they did it.

By Moshe Pritsker
Mar 26, 2014, 2:53 PM

Moshe and Mike, I think you’re both right here and you’ve both surfaced what I was referring to as “cultural” issues.
The culture of science does focus heavily on streamlining activities down to just those that provide career advancement or funding. Time pressures are so intense that little else can fit in.
So, spending time writing up a detailed protocol to help others often doesn’t even occur to a researcher.

By David Crotty
Mar 26, 2014, 3:40 PM

+1 “So I think there are still some cultural issues in the research community that need to be overcome.”

By skonkiel
Mar 26, 2014, 5:23 PM

Biological field sampling in streams always has been a problematic area because there are so many variables. But new technologies can help. Set up a GoPro camera on the bank (or strap it on!), and you’ll record more information about your sampling technique than you’d ever put in a journal article. And, the video file can be archived.

By Ken Lanfear
Mar 26, 2014, 12:10 PM

Ken,

This is why you have JoVE (www.jove.com), the peer reviewed video journal devoted to visualized publication of research methods. Video is a much better medium for documentation of complex how-to technical processes than traditional text. We publish 70 video articles (indexed in Medline/PubMed) per month filmed at the labs of leading research universities. Every day we receive numerous requests for our content from scientists and students who suffer from this painful problem of reproducibility. Everyone who has worked in a lab knows what I am talking about. I suffered from this myself when I was doing my Ph.D., and this is how the idea of JoVE was born. Already 650 universities and colleges subscribe to JoVE at this time.

By Moshe Pritsker
Mar 26, 2014, 12:44 PM

I think video can be incredibly valuable for reproducing some complicated techniques like dissections and physical manipulations. But time constraints require that some generic parts of a protocol be left out of the video, lest it end up being hundreds of hours long.
For example, a video protocol is likely going to omit making a 1M NaCl solution. Pour in the salt and stir. Let’s watch it stir until it goes into solution…
So for this experiment, would a basic part of the method (stir to break up the tissue) have been included?

By David Crotty
Mar 26, 2014, 3:45 PM

Good question. This is how we solve this problem. Due to the popular demand for basic experiments, and to avoid their inclusion in every research video article, we created a video database of the most common research procedures (e.g. PCR, DNA gel, making solutions, Western blot, etc….). We call it JoVE Science Education. Here is an example of the DNA gel procedure:

http://www.jove.com/science-education/5057/dna-gel-electrophoresis

Note the combination of animation to explain the concept and video to demonstrate how-to. This database was released 4 months ago, at the end of 2013, and more than 100 institutions have subscribed to it so far. We receive a lot of requests for its content, especially from teaching professors and students, on a daily basis.

By Moshe Pritsker
Mar 26, 2014, 4:56 PM

But for the problems described here, you’d have to have everyone using the same generic protocol, no variations. Not to mention the same lots of the same reagents, the same equipment in the same state from the same manufacturer. Even the same water source.

By David Crotty
Mar 26, 2014, 7:04 PM

Well, reproducibility means using the same reagents and equipment. This is one of the first things a scientist does in the lab when a published experiment does not work in their hands – they begin to buy the same reagents. I think the proper word would be “standardization”.

Maybe modern science, especially biology, needs more such standardization. This is because too much time and money are wasted today when Ph.D.s spend their time struggling with trivial details of lab protocols instead of looking for a cancer cure.

Practically, it is challenging to account for all variations in equipment, reagents and protocols between different labs in any publication. But at least we can and should provide an effective format to properly document all these details. I believe such a format is the combination of video, animation and text. Implementing this format may not bring reproducibility to 100% but will raise it from the 10-30% found in the Amgen/Bayer studies today (crazy number, right?).

For instance, in the examples you provided, the difference in the tissue preparation for FACS could be easily documented by video. The difference in the collagen provider could be easily found if the manufacturer was mentioned in the text protocol. So systematic documentation in the video + text format would solve a lot of such issues.

By Moshe Pritsker
Mar 26, 2014, 8:24 PM

Does that eliminate some level of progress and serendipity though? If everyone follows the exact same path, how do you break new ground?

By David Crotty
Mar 26, 2014, 8:38 PM

David,

Thanks for highlighting that reproducibility is more than just reanalysis of data, but I have to say that of the 50 studies we’re looking at for the Cancer Biology Reproducibility Initiative, none of them are using special or esoteric techniques. It’s all basic molecular biology like you learn your first year in the lab.

I feel like so many biologists have a sense of learned helplessness about reproducibility. Small variations in starting materials or procedures can make a big difference to the outcomes, which is one reason reproducibility is poor, but we can do more than just shrug our shoulders and keep plugging away. We actually can do something about reproducibility. The same lab at Amgen that did the study you dismiss above actually published a set of best practices to encourage reproducibility, and they largely make sense, things like properly powering your studies, using proper controls, showing all your data, etc.

He wrote them up for Nature here: http://www.nature.com/nature/journal/v497/n7450/full/497433a.html
Those without access can see his presentation to the White House here: http://www.whitehouse.gov/sites/default/files/microsites/ostp/PCAST/Begley.pdf

While I wouldn’t take as harsh a stance as Dr. Begley, this means there is something we can do other than just invoke fundamental irreproducibility and throw salt over our shoulder.

Another set of reasons is that people are encouraged to publish the data which tells the sexiest narrative. The point of the Reproducibility Initiative is to understand and characterize these differences.

The Bissell lab story is a case in point: it took a year to uncover a basic protocol difference. If they had had the experiment independently replicated by a third party, the difference would have been apparent immediately because they could have just compared protocols.

By mrgunn (@mrgunn)
Mar 26, 2014, 1:17 PM

Except that they were not able to identify the difference by comparing their written protocols and methods – they found this out only when actually doing the experiment side by side. Written protocols have a finite level of detail, and having a written protocol replicated by a third party would not have solved this case.

By Mike_F
Mar 26, 2014, 2:05 PM

Thanks William. I agree that there are some really important pathways to reproducibility that are still underutilized. A set of best practices, or setting standards for particular assays for example.

But it’s unclear to me why having a 3rd party, likely a less experienced and less expert 3rd party, replicate these experiments would have made a difference here. The issue became apparent immediately and there were no shortcuts to the serious detective work of going through all the details.

And as MikeF has pointed out, the issue here was in fact in one of the most basic of techniques, breaking up a tissue for FACS analysis. That’s likely why it took so long to track down, as the more complex parts of the assay were more obvious candidates.

I will be very interested to see what your group can replicate. But as stated in the post, if you fail to repeat a particular set of results, it does not invalidate those results because we can’t distinguish between whether they were wrong and whether your technique wasn’t up to snuff.

By David Crotty
Mar 26, 2014, 3:55 PM

The “good hands” argument is among the first to be trotted out as to why a result doesn’t replicate, but it’s a terrible argument. If only one person using one technique can ever show the existence of a marker on a cell line, what good is that to anyone? If the marker actually exists, it’ll be detectable by a number of methods, any of which could have been employed in the above case without spending a year digging around.

It’s great they did so, but given that Dr. Bissell told me on a call, with witnesses, that she can’t endorse anything Dr. Iorns does because of a prior scientific disagreement, one should question what lessons we are to take from this story.

I also think the experienced and professional core facilities that we’re using for the Initiative would be surprised to hear someone automatically assuming that they would be less experienced and less expert. These facilities have SOPs which document in detail exactly how each step is done, which are quite different from the kind of protocols used in the labs I’ve been in.

By mrgunn (@mrgunn)
Mar 27, 2014, 3:43 PM

The fact that some people are better at their job than others, that some workers are more skilled than other workers, may not always be welcome news, but that does not make it any less true. Core facilities tend to focus on common techniques–that’s their reason for existence, to provide a more cost-effective means of accomplishing common tasks needed by many at an institution. Which is fine if the assay being tested is indeed one of those common techniques.

But much scientific research involves specialized tasks, often tailored specifically for the experiment in question. As a postdoc, I spent two years creating and honing a survival surgery, trans-uterine viral vector microinjection technique for labeling mouse neural crest cells. While I was able to teach this technique to others, it took each of these very skilled postdocs about six months of regular practice to master the technique and achieve consistent results. I sincerely doubt that any of your core facilities is familiar with this esoteric technique, so in this context it would certainly be appropriate to describe them as less experienced (do they regularly do this method?) and less expert (how good are they at this method?).

If you, or Amgen/Bayer, wanted to repeat my old experiments, you would need to commit to at least half a year’s worth of work as preparation and practice, before beginning the experiments. And at least in the case of the drug development companies, I suspect that such a leisurely pace is not part of their culture. It’s unclear whether your project is willing to commit to years of detective work like that described in the Bissell paper should a replicative experiment yield results inconsistent with the original.

Further, if you’re re-doing older experiments, it is likely that even standard techniques and equipment have evolved since those experiments were done. And those differences, as noted in the post above, may alter the results seen. I think Marie McVeigh’s comment below is particularly insightful. Experiments yield conclusions. Those conclusions should indeed be repeatedly tested, but there are ways to verify if conclusions are true that go beyond mere replication of previous experiments.

I won’t go into any questions of disputes between Bissell and Iorns as I have only heard the story third-hand, but Bissell’s reputation speaks for itself, as does her laboratory’s work.

By David Crotty
Mar 28, 2014, 10:11 AM

To David Crotty:

I am answering your last question above because there was no button Reply below it – maybe a glitch.

>>Does that eliminate some level of progress and serendipity though?
>>If everyone follows the exact same path, how do you break new ground?

I don’t think you want serendipity to come from variations in NaCl or collagen solutions, or configuration of the DNA gel apparatus. If you standardize these trivial factors, the complex biological system under study will still include millions of variables as a source of exciting unexpected discoveries. Besides that, with a 10-30% reproducibility rate, should serendipity be a major concern?

But we both know that the standardization will not come soon for many reasons. So the massive visualized documentation of experimental procedures provides a practical solution, maybe the only one.

By Moshe Pritsker
Mar 26, 2014, 11:01 PM

Don’t get me wrong, good resources and standard protocols are of tremendous importance (though we likely disagree on how much of this needs to be done by video). But one can go too far in that direction and result in a level of orthodoxy that stifles progress. There’s a good blog article that talks about this in terms of data handling (http://drugmonkey.wordpress.com/2014/02/25/plos-is-letting-the-inmates-run-the-asylum-and-this-will-kill-them/) but the same principle, that “diversity of science makes it so productive,” applies to methodologies as well.

If you allow only one method for doing a particular assay, that ends progress as far as that assay goes. Methods must evolve and improve over time, and strict standardization halts that.

Strictly relying upon only one supplier for reagents and/or equipment creates unfavorable economic conditions (a monopoly for one company would lead to higher prices for reagents/equipment). Similarly, such a situation creates a single point of failure. If company X’s collagen is flawed in some way, that skews all research to give false results that are undetectable because there’s nothing to compare them to. Similarly, if company X is the sole supplier of collagen and they go out of business, the entire field has to start over.

Life sciences and medical research, because they involve living creatures, are inherently messy and variable because living creatures vary from one another. There is a certain level upon which one must accept that variability, as per the quote from the researcher in the post.

By David Crotty
Mar 27, 2014, 8:20 AM

Somewhere in the background of the whole discussion is this fact: science has made progress. Even though the exact replication of results is difficult, even though the public distribution of datasets is a new requirement, even though video recording experimental methods is still narrowly practiced and more narrowly published, science has made progress.

This has been possible because of the basic structure of the scientific method. An hypothesis, if true, will imply experimentally observable outcomes. You test for the occurrence of the predicted outcome – and if it’s there, you take that as supporting evidence of your hypothesis. You “control” for other factors that you think could influence the results – but you will only control for those factors to which your hypothesis would be sensitive. (If you’re looking at bone density in the forearm, you might consider whether your subjects are left- or right-handed…but you probably won’t worry about whether you tested them during a period of active solar flares; if you’re looking at astronomical events, you control for solar flares but not for handedness).

Fortunately, once you test, and control, and gather your results, that conclusion is seldom left static and is not immediately given to the canon as imperishable truth. Scientists are incurably curious – and one result will not satisfy. If X is true, then Y might also be true, let’s test for Y – hypothesis, experiment-with-controls, results. Testing the same hypothesis from multiple directions, according to its expanding circle of experimental implications. The range of knowledge expands. Repeating the same experiment – exactly – and getting the same results every single time, oddly is less able to “prove” your hypothesis than a range of slightly different experiments all of which support the same general conclusions.

It is necessary, in some circumstances, that other labs expand an area of research by establishing that they can repeat (reproduce, replicate) exactly the same result reported by a different lab. This goes, in part, to David W’s position about the need to reproduce “surprising results.” Surprising results put you in territory that is not immediately adjacent to related work; it becomes necessary, then, to ensure that a new lab can re-establish the baseline of results to ensure you’re all talking about the same thing.

Science has made progress because of the drive to expand upon results. Sometimes this takes the form of re-creating the prior work (and this is so that you can build outward with certainty of your starting place), mostly it takes the form of separately confirming the conclusion by testing its further implications.

The process is not very efficient – but it has been remarkably successful. “You never learn anything from an experiment. You learn from series of experiments.” (Arnost Kleinzeller)