The scientific community needs to put less emphasis on Journal Impact Factors (JIFs), argued a bibliometrician and a group of distinguished science editors and publishers, and start embracing citation distributions. Their paper, posted to bioRxiv, provides detailed steps for downloading article-level citations and plotting frequency histograms.
In my last post, I argued that this solution does not address their litany of criticisms of the JIF and may serve to confuse more than clarify. In this post, I describe how one can manipulate a histogram to selectively highlight or obscure the underlying data.
Unlike calculating a mean or median, both of which are mathematically defined, a histogram is a technique for visualizing data. There are no rules about how to bin one’s data or scale one’s axes. These decisions are left entirely to the author and their appropriateness is largely a matter of context.
Consider the following three histograms. While they look very different, they are created from the very same underlying dataset: the performance of articles and reviews published in Nature in 2013 and 2014, as measured by citations received in 2015.
In the first panel (red), I plot a simple frequency histogram of Nature papers. The vast majority of the data are jammed into the head of the distribution and most of the plot is used to detail highly-cited papers. This histogram emphasizes the long tail of the distribution.
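To make concrete how much of this comes down to plotting choices, here is a minimal sketch of that first panel in Python with matplotlib. The synthetic, skewed citation counts below are only a stand-in for real article-level data, which would have to be exported from a commercial citation index:

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in for per-article citation counts; real figures would be
# drawn from a citation index export, not generated like this.
rng = np.random.default_rng(42)
citations = rng.lognormal(mean=3.0, sigma=1.0, size=1700).astype(int)

# Panel 1 (red): one bin per citation count. The x-axis runs out to the most
# highly cited paper, squeezing the bulk of the data to the left and
# emphasizing the long tail.
plt.hist(citations, bins=range(0, citations.max() + 2), color="red")
plt.xlabel("Citations received in 2015")
plt.ylabel("Number of papers")
plt.show()
```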
In the second panel (blue), I dump every paper that received 100 or more citations into a single bin, which appears as the tall spike at the right of the distribution. My decision to use 100 as the cut-off was entirely arbitrary; I could have chosen 50, 75, or 250. Had I been working with lower-impact journals, I might have chosen 10 or 5. I chose this cut-off because it was used in the Larivière paper. Whatever the reason, this distribution looks very different from the first panel. It emphasizes the head of the distribution, and it is in this panel that we see the performance of papers that received few citations or none at all. This point was largely obscured in the first histogram.
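The only change needed to produce something like that second panel is to cap the counts before plotting; the cut-off itself is just a parameter. A sketch, using the same synthetic data as above:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
citations = rng.lognormal(mean=3.0, sigma=1.0, size=1700).astype(int)  # same synthetic stand-in

# Panel 2 (blue): pool every paper with 100 or more citations into a single
# right-most bin. The 100-citation cut-off is arbitrary and could be any value.
capped = np.minimum(citations, 100)
plt.hist(capped, bins=range(0, 102), color="blue")
plt.xlabel("Citations received in 2015 (100 or more pooled into the last bin)")
plt.ylabel("Number of papers")
plt.show()
```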
In the third panel (green), I decided to bin my data, a routine practice when reporting frequency data, especially when a researcher is unable to record a true count (e.g., how many journal articles did you read this month? 0, 1-5, 6-10, 11-20, 21+). You'll note that my decision to bin the data this way obscured the fact that there were uncited papers, as they were grouped with all papers receiving fewer than 20 citations.
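A comparable sketch of the third panel; the exact bin edges are my own assumption, apart from the lowest bin pooling everything under 20 citations, which is what hides the uncited papers:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
citations = rng.lognormal(mean=3.0, sigma=1.0, size=1700).astype(int)  # same synthetic stand-in

# Panel 3 (green): coarse, ad hoc bins. Everything below 20 citations lands
# in one bar, so uncited papers vanish from the picture entirely.
edges = [0, 20, 40, 60, 80, 100, max(101, int(citations.max()) + 1)]
plt.hist(citations, bins=edges, color="green")
plt.xlabel("Citations received in 2015 (binned)")
plt.ylabel("Number of papers")
plt.show()
```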
Beyond manipulating the presentation of the citation histogram, I could also selectively decide what I report. For example, I could include just the performances of original research papers and reviews but exclude perspectives and commentary. After all, why should Thomson Reuters define what I report?
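If the underlying records carry a document-type field, as exports from the major citation indexes generally do, that kind of selective reporting is a one-line filter. The field names and labels below are purely illustrative, not any particular index's schema:

```python
# Keep only articles and reviews; drop perspectives, commentary, and the rest.
records = [
    {"title": "Paper A", "doc_type": "Article", "citations": 42},
    {"title": "Paper B", "doc_type": "Commentary", "citations": 3},
    {"title": "Paper C", "doc_type": "Review", "citations": 118},
]
included_types = {"Article", "Review"}
reported = [r["citations"] for r in records if r["doc_type"] in included_types]
print(reported)  # [42, 118] -- the commentary quietly disappears
```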
Whatever decisions went into creating one’s citation distribution, it can’t be overemphasized that histograms are not mathematical formulae but data visualization tools. Without strict definitions on what elements get reported and how axes are defined, histograms can be manipulated to emphasize what you want to promote or obscure what you want to hide.
While Larivière and others call on editors and publishers to create their own citation histograms (a time-consuming process restricted to those with subscription access to a commercial citation service), a more workable solution for avoiding selective and non-standard reporting across tens of thousands of journals would be to ask the citation services themselves (Thomson Reuters' Journal Citation Reports and Elsevier's Scopus) to plot the distributions for users. These services would create a standardized citation histogram and include within the figure a stamp of authenticity. To ensure that these histograms are not counterfeit, they would link directly back to the issuing service.
Without the citation indexes stepping up to provide standardization in distribution reporting, it is unlikely that Larivière's call to action will result in widespread adoption. At worst, it will encourage the production of histograms that selectively highlight certain features while obscuring others.