Sharknado poster
One crisis the scholarly publishing industry has yet to face

In the “chicken little” world of scholarly publishing, reeling from crisis to crisis is business as usual. We seem averse to the concept of steady, incremental improvement and must instead face constant impending doom. The serials crisis has become a permanent fixture in our culture, and we remain in the throes of the access crisis and the (at least perceived) peer review crisis. The data access crisis is just coming over the horizon, and it’s going to be a doozy. The reproducibility crisis and the negative results crisis are both coming into their prime, and represent an interesting conflict. While both chase similar goals–increased transparency, increased efficiency and trust in the literature–their proposed solutions seem at odds with one another.

The reproducibility crisis is based on reports suggesting that the majority of published experiments are not reproducible. One of the particular concerns is that researchers will waste time and effort trying to build on results that are not true:

There is, however, another group whose careers we should consider: graduate students and postdocs who may try to build on published work only to find that the original results don’t stand up. Publication of non-replicable findings leads to enormous waste in science and demoralization of the next generation. One reason why I take reproducibility initiatives seriously is because I’ve seen too many young people demoralized after finding that the exciting effect they want to investigate is actually an illusion.

The proposed solution to the crisis is that time, effort and funds be put toward repeating published studies, and that some sort of career credit be offered for doing so. It is unclear who will provide those funds, and whether using them for replication means they won’t be spent on new experiments, slowing progress. Further, if one assumes that the most talented researchers will be those pushing the envelope, then how much should we trust the skills of those whose careers are based on the unoriginal work of repeating experiments? More importantly, those who advocate for creating new publication outlets for replication and validation studies seem to ignore that science has both built right into the process.

The negative results crisis (also known as the “file drawer problem”) comes from the notion of publication bias, the idea that researchers never publish experiments that don’t work or that provide null results. Again, one of the concerns is that when one researcher hides this sort of work away, other researchers may waste their time and efforts:

He and others note that the bias against null studies can waste time and money when researchers devise new studies replicating strategies already found to be ineffective.

The proposed solution is the creation of a registry for all data generated by all experiments, as well as, “Creating high-status publication outlets for these [null] studies.” Problems arise with this solution as well–is writing up a failed experiment a valuable use of a researcher’s precious time? How willing are researchers to publicly display their failures? How much career credit should be granted for doing experiments that didn’t work?

While both sets of solutions are aimed at greater transparency and time savings, the contrast seems obvious. Don’t trust published positive results, spend time and money to repeat them, but trust unpublished negative results, and don’t waste your time repeating them.

There’s no easy solution here. Both approaches suffer from a similar problem–negative results, experiments that don’t work and failed replications are very difficult to interpret. Did the experiment fail to work because the theory behind it was truly incorrect, or was the methodology flawed? Or was the theory right, and the methodology sound, but the researcher just messed up and missed a decimal point or made up a vital solution incorrectly?

Similarly, if you are able to reproduce someone else’s results, that’s meaningful, but if you are unable to repeat their experiment, it’s hard to know what that means. Was their theory/methodology wrong or are you just not as good at the bench as they were? Cell biology offers myriad examples of the complexity of research specimens and techniques, and the almost absurd level of detail required for proper troubleshooting. Who should we trust?

I suspect that we may just have to accept the notion that research requires some level of redundancy. If we create a repository of failed experiments and no one will risk performing (or funding) a new attempt where others have failed in the past, then we may block important discoveries from happening. Suppose some researcher proposes that compound X will cure disease Y, but unknowingly uses a contaminated sample of X in the experiments and the cure fails. Do we prevent anyone else from testing that hypothesis and miss out on the potential cure? Given the state of research funding, how likely is it that funding agencies are going to offer grants for a project that has already been declared a failure? Is it better to let people keep plugging away at theories that seem sound, even if it does mean some redundancy and waste?

As for reproducibility, should it be a separate act required for confirmation or instead is it just a normal part of doing the next experiment? If you’re going to spend the next few years of your life building on someone else’s results, doesn’t it make sense to be certain things work the same way in your hands? I know that when I was a graduate student, I had to go back and resequence areas of published genes to find out whether the oddities I was seeing were my own errors or the original author’s. But overall, are we better off testing the validity of research results by performing new experiments to test the conclusions they offer (if X is true, then Y should happen)? That seems a way to combine progress with confirmation, rather than holding in place until everything is double checked.

There’s probably a spectrum of right answers for resolving both of these crises. Both proposed solutions here increase researcher workload, taking valuable time away from doing new experiments. The gains in accuracy must be carefully weighed against the losses in progress. Some results, particularly those which will directly impact health and care for humans, may require a higher level of redundancy than less vital research areas, and warrant the extra time and expense. Overall though, scholarly inquiry is fueled by skepticism. It’s perfectly reasonable to doubt anyone else’s results, even your own results. That may require doing the same thing a few times. The question is how to make best use of those redundancies.

 

David Crotty

David Crotty is the Editorial Director, Journals Policy for Oxford University Press. He serves on the Board of Directors for the STM Association, the Society for Scholarly Publishing and CHOR, Inc. David received his PhD in Genetics from Columbia University and did developmental neuroscience research at Caltech before moving from the bench to publishing.


Discussion

30 Thoughts on "When Crises Collide: The Tension Between Null Results and Reproducibility"

It is not clear that these two issues are crises; on the contrary both suggest opportunities for new publications. In particular, knowing what others have tried that did not seem to work could be quite useful. Perhaps we need a PLOF (Public Library of Failure). It might be a money maker.

More broadly, all of these supposed crises are aspects of a general social movement that seeks to reform science, not just science publishing. Much of it is utopian but some of it makes sense. In no case would I call it a crisis, but hype is a big part of it, as with all social movements. Things are actually progressing pretty well.

On the other hand there are lots of people who either do see these various issues as crises, or want them to become crises. No one knows where social movements will go so you are right to engage these issues here, even if you think they are overblown. Call it crowd reasoning, which is a big part of the blogosphere, perhaps its best part.

Had to laugh at PLOF!

Not sure it sends the right message, though …

I think part of the problem here is the idea that what we call a “failed replication” is a failure. It’s not, of course: it’s perfectly legitimate new knowledge. Perhaps if it had been given a different name way back when, it would not have attracted the stigma that often prevents its publication now.

Gary King from Harvard is quoted in one of the articles linked above:
“But I suspect another reason they are rarely published is that there are many, many ways to produce null results by messing up. So they are much harder to interpret.”

That’s the real problem, not some stigma–that it’s really hard to tell a legitimate null result from a screw up.

Yep, which is why it’s hard to get much out of a straightforward replication and why, as you note in another comment, it’s better to test replication by doing new tests on the conclusions drawn from an experiment rather than just directly trying to re-do someone else’s work.

Right; but what you quoted from Gary King earlier implies that you consider a positive result a priori more likely to be correct than a negative one. I hope you don’t endorse his (implied) position.

A really good question. I suppose that a positive result gives me more to work with–it’s easier to take the claims being made and see if they hold up, if they did the right controls, etc. That’s not impossible with a null result, just harder. Positive results are also perhaps easier to verify, as it’s difficult to prove a negative.

The main issue I run into with positive results revolves around overstatement–making claims beyond what the data really tells you, or confusing “sufficient” with “necessary”.

Right. No doubt you’ve seen the graph showing that a suspiciously high proportion of published results have outcomes that are just statistically significant. It’s hard not to think that there are plenty of people pumping their results into the p < 0.05 zone.

You seem to be confusing a null result with a “failed experiment”

“Problems arise with this solution as well–is writing up a failed experiment a valuable use of a researcher’s precious time? How willing are researchers to publicly display their failures? How much career credit should be granted for doing experiments that didn’t work?”

Aside from the problematic issue of the very concept of a “failed experiment” (which suggests you are looking for a particular outcome, rather than genuinely testing a hypothesis), a null result may well be the outcome of a perfectly well executed study and is a valid part of the record, surely?

Sometimes it’s hard to tell the difference between the two. My hypothesis is that X causes Y. I test the effect of X on Y and don’t see anything. Is this because X does not cause Y, or is it because I messed up my experiment in some way that prevents me from seeing any effect X has on Y? The former is a null result, the latter a “failed experiment”. But it may not be easy to tell the difference between the two.

Yes, no doubt, there are instances where negative results can provide real value to the research community. But there are also times when they offer ambiguity and confusion.

This is dangerous territory. What if you do indeed see Y in your experiment? The danger is that you then say “Ah, it’s worked” (because you were expecting Y). The fact is, that it is just as likely to be due to some kind of error in your setup as not seeing Y. But your expectation bias means you accept it as a “result” when it may be no more valid than not seeing Y.

Hence the need for appropriate experimental design and controls for that bias (“assumption controls”). But there’s a limit to how deep one can control. Are you going to do an individual control for the concentration of every one of 100 different solutions you may use in a protocol? Probably not, but that sort of screw up can unknowingly alter your results.

David Glass does a great job of detailing the different types of controls and why you need them in his book on experimental design:
http://cshlpress.com/default.tpl?action=full&cart=14103542076097405&–eqskudatarq=1020&typ=ps&newtitle=Experimental%20Design%20for%20Biologists%2C%20Second%20Edition

Given a positive result I doubt that error is just as likely as confirming evidence. But in any case these sorts of uncertainty are present in all scientific activity, which is why we speak of evidence not proof (or should). So I see nothing dangerous about this territory per se. It is part of the game.

There’s a relevant series on NPR this week, called, “When Scientists Give Up.” Basically, it’s about the creeping incrementalism demanded by constrained funding, which is narrowing the range of possible experiments and leading to boring hypotheses. This is making a number of high-level researchers just abandon scientific research. This shows that asking talented scientists to do confirmatory research would only exacerbate these problems — constrain funding more, drive ambitious scientists away.

http://www.npr.org/blogs/health/2014/09/09/345289127/when-scientists-give-up

I don’t think that’s what we want.

I don’t think the idea is so much to make people do replication studies for the sake of replication; more that when someone tries to replicate for other reasons (e.g. as the first stage of their own work) and that replication doesn’t show what’s expected, then they should publish that result.

Really, not publishing because you didn’t get the result you wanted would be pretty poor stuff.

Precisely! This was what I meant by “dangerous territory” (above) – the idea that a “positive” result might be accepted more readily than a “null” result and potentially more likely to be published.

It’s not a question of “not publishing because you didn’t get the result you wanted,” as that, at least in my experience, isn’t the way most scientists work. Serendipity is one of the great joys of being a researcher. I worked in a spermatogenesis lab and because one of our experiments didn’t give us the result we wanted, we became an enteric nervous system lab (and published the unexpected result on the cover of Nature). I’ve seen top labs turn on a dime when they got results that contradicted their hypotheses. That’s how good scientists work, you poke around and you see where the data takes you.

But your first point is really important, and why I have a hard time with paying researchers to replicate the work of others without adding anything new. Every result opens up new questions. Part of answering those new questions relies on the first result being correct. If it’s not, that should become apparent as you work on the new experiments. It seems much more productive to me to keep driving forward with that confirmation built into the scientific method. And hopefully that way even a failure to replicate someone else’s work opens up new ideas and pathways.

Kent: I am not too sure that the goal is not to have an abandonment of science. After all, 46% of the US population believes in creation science! Lamar Smith, House chair of the science and technology committee, rejects climate change and evolution.

There are many complex issues hidden here, but it’s an interesting rough point. As someone else suggested, someone isn’t being careful (I won’t blame you) about the difference between a null result and a failed experiment. A properly powered experiment, with proper manipulation checks to make sure that you haven’t done something stupid, should produce a positive support/refutation of the null hypothesis, and could sensibly be replicated; anything else can’t. But scientists generally know this, and the larger problems, as you allude to, are time/money and pure irreplicability, which results, for example, from non-enclosable sources. For example, suppose that you use Google searches, or the current state of the Arctic climate, or any other thing that is basically irreplicable and uncapturable? As with all real world activities, science has to deal with irreplicability (and lack of time/money) as best it can, and muddle through. It seems to be doing okay so far.

The “Emperor has no clothes” paper can be hugely interesting and valuable. If “everybody knows” A correlates with B, research that finds no correlation might start the ball rolling for a whole new paradigm.
