Do better journals publish more rigorous statistical analyses? An Italian team recently set out to answer that question in an article published in PLoS ONE. Unfortunately, the article itself is a prime example of what it argues against: shoddy analysis, lax statistical approaches, and a data shell game.
The title of the paper is non-descriptive and provocative: "High Impact = High Statistical Standards? Not Necessarily So." The paper lectures the reader on how to do null hypothesis testing properly, yet it does a poor job of rigorously testing its own null hypothesis.
The problems begin when the authors adopt inclusion criteria that make little sense. Worse, the authors hide their actual data inside percentages, a presentation choice that obscures more than it reveals. Because the PDF version of the article includes only these percentages, it feels as though the authors are intentionally misleading the reader; only the online data supplements make the ruse apparent.
The nine journals included were the New England Journal of Medicine, Lancet, Nature, Nature Medicine, Nature Neuroscience, Science, Journal of Experimental Psychology-Applied, Neuropsychology, and the American Journal of Public Health. Articles from 2011 meeting the inclusion criteria were entered into the dataset.
Despite this motley assemblage of journals, all of the data in the PDF version of the article (which does not include the supplemental data) are presented in charts that rely on percentages, creating a false impression of comparability. In addition, Nature, Nature Medicine, and Nature Neuroscience are lumped together in the charts under the label "Nature," so individual journals end up being compared against a group of journals.
Given inclusion criteria like "all articles related to psychological, neuropsychological and medical issues," you would expect the candidate journals to be limited to those that publish many such papers, providing a set of comparables and a better chance of achieving statistical validity. Instead, the authors included four journals that publish very little research of the kind described and that do not follow the same statistical guidelines as the others. Those four journals are represented as two journals in the charts.
The authors tested whether these articles included confidence intervals, effect sizes, power calculations, error bars, and other statistical elements, the notion being that a high impact factor should correlate with rigorous statistical presentation. In this case, "rigorous" equates to "busy" or "complete."
A look at the supplementary tables makes the problem with the inclusion criteria for the articles (and, by extension, the journals) clear, especially Table S2, which lists the number of articles included from each journal:
- NEJM — 173
- Lancet — 122
- Nature — 5
- Nature Medicine — 9
- Nature Neuroscience — 23
- Science — 24
- American Journal of Public Health — 147
- Neuropsychology — 75
- Journal of Experimental Psychology – Applied — 30
Three journals have more than 100 articles in the analysis, while two have fewer than 10. Even clumped as Nature, the three Nature journals have only 37 articles included between them — less than half the number coming from Neuropsychology, and far fewer than the major medical journals.
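The paper's own preoccupations make the problem easy to illustrate. Here is a minimal sketch in Python, using a purely hypothetical 40% "compliance" rate (only the article counts come from Table S2), of how much uncertainty hides behind a percentage when the denominator is 5 rather than 173. The Wilson score interval is used as a generic choice for a proportion, not anything taken from the paper.

```python
from math import sqrt

def wilson_interval(successes, n, z=1.96):
    """95% Wilson score confidence interval for a proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# Hypothetical scenario: both journals show the same nominal 40% rate,
# but the denominators are the article counts reported in Table S2.
for journal, n in [("NEJM", 173), ("Nature", 5)]:
    k = round(0.4 * n)  # same nominal percentage in both cases
    lo, hi = wilson_interval(k, n)
    print(f"{journal}: {k}/{n} = {k/n:.0%}, 95% CI roughly [{lo:.0%}, {hi:.0%}]")
```

With n = 173, a 40% rate carries an interval of roughly 33% to 47%; with n = 5, the interval stretches from about 12% to 77%. That is exactly the kind of uncertainty a percentage-only chart conceals when it puts journals with such different denominators side by side.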
There are plenty of uncontrolled variables in this research beyond journals that are not comparable in the amount of qualifying research they publish or in their fields of focus. Large general journals, for instance, often publish shorter articles, and that brevity may keep lengthy statistical analyses out of print, while specialty journals tend to publish longer articles with more detailed statistical analyses. The authors did not control for article length. Nor did they account for study design: no differences were recorded or analyzed among epidemiological studies, case-control studies, and double-blind randomized controlled trials. In short, the researchers have a motley sample of journals and papers indeed. It may be a "convenience sample" in the worst sense of that research euphemism.
Ultimately, even after mistreating the data and study design severely, the authors arrive at an anti-climactic and perfectly predictable conclusion:
Our results suggest that statistical practices vary extremely widely from journal to journal, whether IF is high or relatively lower. This variation suggests that journal editorial policy and perhaps disciplinary custom, for example medicine vs. psychology, may be highly influential on the statistical practices published . . .
In other words, not all journals and not all fields are the same, so expect there to be some differences in how they publish statistical findings.
Oddly, for all this emphasis on confidence intervals and effect sizes and error bars, there isn’t one statistic beyond percentages published in this paper. And if you don’t get to the supplemental data, you don’t even get raw numbers. It’s a very weak paper.
This also raises, once again, the specter of PLoS ONE's peer-review standard of "methodologically sound." This paper is such a mess of biased inclusion criteria, purposeful data conflation, and hype that I wonder how it got through even that filter. As Phil Davis wrote back in 2010:
Can a scientific paper be methodologically sound, but just not report any significant or meaningful results?
This paper demonstrates that, yes, a report can produce no significant or meaningful results. I’d argue that this paper also is not methodologically sound, technically sound, or scientifically sound.
Do the better journals practice better statistical review and publish better statistics? If this paper is any indication, then the answer is clearly and unequivocally in the affirmative.