Mention the name Brian Wansink around Cornell and you’ll make a lot of people uncomfortable. Wansink, a food researcher, was known for his innovative experimental designs and headline-grabbing results. He was also exposed as someone who put good stories ahead of good results, was investigated for “p-hacking” his data, and unceremoniously left Cornell in 2018. Though his departure was more than four years ago, the damage he did to the reputation of his school and colleagues is still raw. For top research institutions and researchers, reputation is everything. And reputation rubs off in both directions.
p-hacking is a term from the quantitative life and social sciences for analyzing data in ways that ensure a statistically significant result. In practice, it can mean using inappropriate statistical tests, excluding inconvenient data points, or collecting lots of variables in the hope that some associations will come out significant, among other inappropriate behaviors. Fear of p-hacking is the very reason rigorous medical journals require authors to register their experiments in advance and follow a strict protocol for reporting their results.
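One flavor of p-hacking, testing many outcomes and reporting whichever comes out significant, is easy to simulate. The sketch below uses made-up data with no true effect at all; the group sizes and number of outcomes are illustrative choices, not anything from the study discussed here:

```python
# Hypothetical simulation: two truly identical groups, 20 unrelated
# outcome variables. With enough outcomes, "something significant"
# appears by chance alone most of the time.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments, n_outcomes, n_per_group = 1000, 20, 30

false_positive_runs = 0
for _ in range(n_experiments):
    a = rng.normal(size=(n_outcomes, n_per_group))  # group A, no effect
    b = rng.normal(size=(n_outcomes, n_per_group))  # group B, no effect
    pvals = stats.ttest_ind(a, b, axis=1).pvalue    # one t-test per outcome
    if (pvals < 0.05).any():                        # "at least one finding"
        false_positive_runs += 1

rate = false_positive_runs / n_experiments
print(f"Chance of at least one p<0.05 among 20 null tests: {rate:.2f}")
```

With 20 independent null tests, theory puts that chance at 1 − 0.95²⁰, roughly 64%, which is why pre-registration of outcomes matters.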
In a previous post, I described how a recent study on the effects of Twitter on paper citations violated several of the journal’s rules: it failed to describe how the sample size was calculated, omitted a statement on how readers could access the paper’s underlying data, and its authors refused to answer questions about the validity of the paper’s findings.
On 13 June 2022, more than a month after my initial request, the editorial office of the European Heart Journal provided me with a link to the authors’ data (as of this writing, there is no public link from the published paper). The data were provided in .dta format, a proprietary format used by the statistical software Stata. Nevertheless, with the help of a Cornell data librarian (go librarians!), I was able to access the data in a generic format that allowed me to verify the results reported in the paper.
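For readers facing the same hurdle, pandas can read Stata’s .dta format directly and write out a generic CSV. The snippet below is a sketch using a tiny made-up stand-in dataset and placeholder filenames, not the actual journal file:

```python
# Convert a Stata .dta file to CSV via pandas (illustrative round trip;
# the column names and values here are invented placeholders).
import os
import tempfile
import pandas as pd

# Stand-in for the real dataset.
df = pd.DataFrame({"tweeted": [1, 0, 1, 0], "citations": [12, 7, 9, 4]})

with tempfile.TemporaryDirectory() as tmp:
    dta_path = os.path.join(tmp, "twitter_study.dta")
    csv_path = os.path.join(tmp, "twitter_study.csv")
    df.to_stata(dta_path, write_index=False)               # proprietary format
    pd.read_stata(dta_path).to_csv(csv_path, index=False)  # generic format
    roundtrip = pd.read_csv(csv_path)

print(roundtrip)
```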
The statistics for analyzing and reporting the results of Randomized Controlled Trials (RCTs) are often very simple. Because randomization makes the treatment groups similar in all respects except the intervention, no fancy statistical analysis is normally required. This is why most RCTs are analyzed using simple comparisons of sample means or medians.
Deviating from this norm, the authors of the Twitter-citation paper used Poisson regression, a more complicated model that is very useful in some fields (e.g., economics) for analyzing data with many independent variables. However, Poisson regression is limited in its application because it comes with a strong assumption: the mean must equal the variance. When this assumption is violated (as it is for overdispersed data, where the variance exceeds the mean), the researcher should use a more flexible model, such as the Negative Binomial. Applying an inappropriate model to data can produce unreliable results, which is exactly what I found with the Twitter data.
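A quick diagnostic for that assumption is simply to compare the sample mean and variance. Here is a minimal sketch on simulated citation-like counts (the distribution and its parameters are illustrative choices, not the study’s data); citation counts are typically overdispersed, and the ratio makes that visible:

```python
# Overdispersion check: Poisson assumes variance/mean is about 1.
import numpy as np

rng = np.random.default_rng(1)
# Negative-binomial draws mimic skewed citation counts (mean around 10,
# variance much larger). Parameters are invented for illustration.
citations = rng.negative_binomial(n=2, p=2 / 12, size=500)

mean, var = citations.mean(), citations.var(ddof=1)
dispersion = var / mean
print(f"mean={mean:.1f}, variance={var:.1f}, ratio={dispersion:.1f}")
```

A ratio well above 1 is a strong hint that a plain Poisson model will understate uncertainty.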
…the only test that provided a statistically significant finding was the one where the model was inappropriate for the data
Using Poisson regression on their data, I got the same results as reported (12% citation difference, 95% confidence interval 8% to 15%, p<0.0001), which appears to be a robust and statistically significant finding. However, the model fits the data very poorly. When I analyzed their dataset using a Negative Binomial model, the difference was no longer significant (13%, 95% C.I. -5% to 33%, p=0.17). Yes, the point estimate was similar, but the confidence interval straddled zero. Using a common technique for highly skewed data (normalizing it with a log transformation) and fitting a simple linear model also produced a non-significant result (8%, 95% C.I. -7% to 25%, p=0.33). Similarly, a simple comparison of means (t-test) was non-significant (p=0.17), as was its non-parametric (signed-rank) equivalent (p=0.33).
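The pattern above can be illustrated on simulated data. This is a minimal sketch, not the authors’ analysis: with one binary covariate, a Poisson regression reduces to a closed-form rate ratio (the ratio of group means, with SE of the log ratio equal to sqrt(1/sum_treated + 1/sum_control)), so we can count how often it declares a “significant” difference between two identical, overdispersed groups:

```python
# How often does each test reject when there is NO true difference?
# Both groups are drawn from the same overdispersed distribution
# (parameters invented for illustration).
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, reps = 200, 500
poisson_rejections = ttest_rejections = 0

for _ in range(reps):
    control = rng.negative_binomial(2, 1 / 6, size=n)
    treated = rng.negative_binomial(2, 1 / 6, size=n)

    # Two-group Poisson regression in closed form (Wald test).
    log_ratio = np.log(treated.mean() / control.mean())
    poisson_se = np.sqrt(1 / treated.sum() + 1 / control.sum())
    if abs(log_ratio / poisson_se) > 1.96:
        poisson_rejections += 1

    # Simple comparison of means, which tolerates overdispersion.
    if stats.ttest_ind(treated, control).pvalue < 0.05:
        ttest_rejections += 1

print(f"false-positive rate, Poisson model: {poisson_rejections / reps:.2f}")
print(f"false-positive rate, t-test:        {ttest_rejections / reps:.2f}")
```

On data like these, the t-test stays near its nominal 5% false-positive rate while the misapplied Poisson model rejects far more often, which is the mechanism by which an ill-fitting model can manufacture significance.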
In sum, the only test that provided a statistically significant finding was the one where the model was inappropriate for the data.
The authors didn’t register their protocols or even justify their choice of a Poisson regression model in their preliminary paper. A description of how the sample size was determined was missing, as was a data availability statement; both are clear violations of the journal’s policy. The editorial office was kind enough to provide me with a personal link to the dataset, but it is still not public. The editors have continued to invite me to submit a formal letter to the journal’s Discussion Forum, but they have not responded to questions about how this paper made it through to publication without these required elements or what they plan to do about it. No one is willing to admit fault, and the undeclared connection of several authors to current or past EHJ editorial board roles raises questions about special treatment.
I stopped using Twitter recently because it began serving me inappropriate ads and senseless personal tweets. Elon Musk’s pronouncements on how he would change the service if he became CEO added to my decision. Social media companies wax and wane, and their influence in the lives of researchers can be ephemeral. In five years, a study on the citation benefit of Twitter may be heaped into the same bin as other fill-in-the-blank-leads-to-citations papers. But a tarnished reputation is deep and long-lasting. I hope the editors of EHJ understand what they are sacrificing with this paper.