A recent paper by John Ioannidis and colleagues is generating some buzz because it claims to have found that 1% of all scientists form what the authors label “the publishing core in the scientific workforce.” These are the scientists who publish consecutively, year after year.
According to their calculations, this 1% amounts to about 151,000 scientists. The putative downside of this concentration is two-fold: a lack of opportunity for the vast majority of working scientists, whose contributions are sporadic or drowned out, and a literature dominated by a small group of scientific researchers and hypotheses. The authors also claim it may show that the research system “may be exploiting the work of millions of young scientists.”
However intriguing the usage of the politically charged term “1%” might be, it’s important to look at the quality of the study before accepting the premise and its downstream arguments. It’s also worth contemplating whether the finding, if true, is unexpected or important.
The use of the term “1%” suggests a strawman of a sort, appropriating for sensationalistic purposes a meme based on inequity and greed. Every information medium skews toward high producers — fiction authors, singers, journalists, record producers, photographers, directors, actors, data experts, you name it. Is it a bad thing that much of published sports photography comes from 1% of all professional photographers?
Scientific publishing seems less skewed, but it clearly must be skewed, as it is a media business with constraints on time, talent, and, most importantly for scientists, funding. A small proportion of people are able to work at powerhouse institutions, and only a small proportion of those do a great job of running their labs, securing grants, planning experiments, working with postdocs, and so forth. These gifted researchers produce a lot, or help to produce a lot, and get their names on a higher proportion of papers. But the first question is whether any of this can be measured accurately.
The authors’ approaches generally left me scratching my head about whether their measurements can possibly be accurate.
First, there is the huge problem of author disambiguation that any study of this type faces. Is the “John Smith” who publishes in economics from England reliably distinguished from the “J. Smith” who publishes in microbiology and hails from Australia?
To form the basis of their work, Ioannidis and his colleagues used the entire Scopus database in XML to analyze for unique authorship. Looking for what they call “uninterrupted, continuous presence (UCP)” in the literature, they claim to have identified authors who published every year during the period 1996-2011.
In Scopus, there are more than 15 million author identifiers. Whether each identifier is a unique author is a question the authors don’t address, in this or another paper they’ve written using the same database. By the end of the paper, however, the authors have made the leap, writing of “over 15 million publishing scientists” in their discussion and equating individual records with individual authors in table headings and so forth.
Other data sources show far fewer active scientists conducting research. World Bank data show about 8.9 million participants in research and development worldwide. If the Scopus identifiers are not disambiguated accurately (whether split too often or not often enough), the discrepancy could throw the estimates off. For instance, if the World Bank figure were the denominator, the percentage of UCP authors would nearly double, from about 1% to roughly 1.7%. That’s not huge in absolute terms, but the estimate nearly doubles with just this one number changing, and the meme of social injustice falls by the wayside.
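The sensitivity to the denominator is simple arithmetic. Here is a quick sketch using the approximate figures cited above (the paper’s ~151,000 UCP authors, the ~15 million Scopus identifiers, and the World Bank’s ~8.9 million R&D participants):

```python
# How the "1%" figure shifts with the choice of denominator.
ucp_authors = 151_000         # UCP authors reported in the paper (approx.)
scopus_ids = 15_000_000       # Scopus author identifiers (approx.)
world_bank_rd = 8_900_000     # World Bank R&D participants (approx.)

share_scopus = ucp_authors / scopus_ids         # ~1.0%
share_world_bank = ucp_authors / world_bank_rd  # ~1.7%

print(f"Scopus denominator:     {share_scopus:.1%}")
print(f"World Bank denominator: {share_world_bank:.1%}")
```

The headline percentage, in other words, is only as good as a denominator the paper never validates.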
This also presumes that the population described by author identifiers is a stable one of 15 million over the 15-year period studied, which it clearly is not — for two reasons. First, there have been rapid increases in publications from Asia and other regions in that timespan, on top of more scientists coming from US and European centers. There are simply more publishing scientists now than in 1996. Second, the definition of “author” has evolved in many fields during the 15-year timeframe covered by the dataset. In many fields, what would have earned you authorship in 2011 would have amounted to an acknowledgement footnote or no mention at all in 1996. Scopus properly reflects what journal articles assert as authorship. But when this changes, Scopus changes. The stability the authors assume in the dataset doesn’t actually exist.
Scopus captures every author listed in an author list, but for collaboration groups, it only captures the group. This is another blind spot the authors fail to account for in their study. Therefore, making the leap from “identifier” to “individual” is a mistake. Some identifiers are groups, and when publishing as part of a group (something that has become more common in the 15 years covered), individuals aren’t given identifiers.
The issue of accuracy has some more straightforward concerns. In a presentation made last year, Paul Albert at Weill Cornell Medical College found that Scopus disambiguation is prone to creating more records than there are people, a form of error called “splitting.” In his sample of more than 1,000 records, more than 700 suffered from this type of error. If you click on the video I’ve linked to, the relevant portion starts at approximately 10:50. Albert estimates that Scopus has 31% precision in matching the right person to a name.
The authors made what strikes me as a rather feeble attempt to check the accuracy of their disambiguation: they took a random sample of 20 published authors’ names and checked their publication histories. They found that all 20 sampled identifiers reflected a specific author who had published at least one paper in each calendar year. Yet elsewhere in the paper, they state that what they found in this sampling “had no major impact” on the results. The insertion of the word “major” is interesting, as is the use of “reflected” instead of “corresponded,” especially given the large assumption that author identifiers map one-to-one to individual authors. Based on the slightly unclear wording, I’m not sure they had a completely clean fetch, and testing 20 out of 15 million seems less than satisfactory. Why not 100? Why not 1,000?
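A back-of-the-envelope calculation (my own, not the paper’s) shows why 20 is so unsatisfying. When zero errors are observed in n checks, the exact binomial (Clopper-Pearson) lower bound on the true accuracy at 95% confidence reduces to 0.05 raised to the power 1/n:

```python
# 95% lower confidence bound on accuracy when 0 errors are seen in n spot checks.
# With zero observed failures, the Clopper-Pearson bound is simply 0.05 ** (1/n).
def accuracy_lower_bound(n: int, confidence: float = 0.95) -> float:
    alpha = 1.0 - confidence
    return alpha ** (1.0 / n)

for n in (20, 100, 1000):
    print(f"n = {n:4d}: true accuracy could be as low as {accuracy_lower_bound(n):.1%}")
```

Even a perfect 20-for-20 result is consistent with a true disambiguation accuracy as low as roughly 86%; a sample of 1,000 would have pinned it above 99.7%.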
As an aside, in a paper that is so reliant on disambiguation, the authors themselves make a mistake in how they reference one of their own papers, getting the title wrong (reference 12). Accuracy is elusive for us all.
Then we get to the 1% of UCP authors the paper claims to have identified. It’s an interesting number the authors arrive at — approximately 151,000 names. Why is this interesting? Well, according to the US Census Bureau, there are 151,671 unique last names in the United States. The United States not only provides a high percentage of published papers, but as a culture it reflects a nice mix of European, Asian, and other advanced societies with productive scientists. The similarity of the numbers makes me wonder all the more about the data — that is, did the authors simply reverse-engineer Census Bureau findings? Last name is, after all, the easiest thing to disambiguate.
The second issue with the paper is that Scopus covers far more than journals. It includes books, conference papers, and patents. Within journals, it includes most types of editorial material — scientific papers, reviews, editorials, letters to the editor, and so forth. Many authors who publish papers might be invited the next year to write an editorial, a book review, or a review article. Many reply to letters to the editor in the following year. The authors did not differentiate between new research and other editorial material in their study. UCP may simply reflect corresponding authorship: the authors who reply to letters and deal with correspondence.
The authors state that UCP scientists were, according to their analysis, involved with 41.7% of the 25,805,462 published items indexed in Scopus. That’s 10,760,878 items divided by 150,608 UCP authors, or 71 items per author over the 15 years, which boils down to nearly five items per year. If that count includes book chapters, patent applications, conference proceedings, and all types of editorial matter, it’s plausible. For a busy author, it’s clearly doable. After all, according to Google Scholar, the most recognizable author on this paper (Ioannidis) has been a co-author on 58 papers this year alone.
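The per-author arithmetic above can be reproduced directly from the paper’s figures:

```python
# Reproducing the per-author output arithmetic from the paper's reported figures.
total_items = 25_805_462    # items indexed in Scopus, per the paper
ucp_share = 0.417           # fraction of items involving UCP authors
ucp_authors = 150_608       # UCP authors identified

ucp_items = total_items * ucp_share           # ~10.76 million items
items_per_author = ucp_items / ucp_authors    # ~71 items over 15 years
items_per_year = items_per_author / 15        # ~4.8 items per year

print(f"{ucp_items:,.0f} items -> {items_per_author:.0f} per author, "
      f"or {items_per_year:.1f} per year")
```

The numbers are internally consistent; the open question is what counts as an “item,” since nearly five research papers a year is a very different claim from five letters, reviews, and proceedings entries a year.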
A third issue is timing. The authors used calendar years to define each interval, which seems reasonable but may not capture rolling 12-month periods accurately. For instance, if I published a paper in January 2009 and another in December 2010, I’d be counted as having published in consecutive years, even though I’d been given 23 months to do so. If I instead published in February 2009 and then in January 2011, I’d have published just as frequently but be counted as non-consecutive. There is no indication in the paper of how often this occurs. The authors did create groups they call “Skip-1” (an author who didn’t publish in one year but otherwise would have qualified as a UCP author) and “Skip” (an author who skipped years and then resumed, and would never otherwise have qualified as a UCP author). These groupings still don’t address timing issues like the one outlined above. Counting the months between publication events would have produced more regular data.
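The calendar-year artifact is easy to demonstrate. Here is a small sketch (my own illustration, not the paper’s code) comparing month gaps with the calendar-year rule for the two hypothetical authors above:

```python
# Two authors with identical 23-month publication gaps: the calendar-year rule
# counts one as "consecutive" and the other as not.
def month_gap(y1, m1, y2, m2):
    """Months elapsed between two (year, month) publication dates."""
    return (y2 - y1) * 12 + (m2 - m1)

def publishes_every_year(dates):
    """Calendar-year rule: True if the publication years form an unbroken run."""
    years = sorted({year for year, _ in dates})
    return all(b - a == 1 for a, b in zip(years, years[1:]))

author_a = [(2009, 1), (2010, 12)]  # Jan 2009, then Dec 2010
author_b = [(2009, 2), (2011, 1)]   # Feb 2009, then Jan 2011

print(month_gap(2009, 1, 2010, 12))    # 23 months
print(month_gap(2009, 2, 2011, 1))     # 23 months
print(publishes_every_year(author_a))  # True  (2009, 2010)
print(publishes_every_year(author_b))  # False (skips 2010)
```

Identical publication cadences, opposite classifications: a month-based measure would not have this edge effect.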
This unevenly defined population is then subjected to counts of citations and h-index values, in an effort to determine whether UCP authors are more highly cited and generally have higher h-indices. Clearly, the answer should be yes, and it is. But are these UCP individual authors, or poorly disambiguated clusters of identifiers? Did the researchers just find the senior researchers across science, who accrue more honorary authorship? Did they just find the corresponding authors? Could the 1% simply be those with the hardest names to disambiguate? If Scopus has 31% (or even 90%) disambiguation accuracy, the authors could be measuring an artifact.
Finally, even if this were true, is it interesting? Not especially. Does it matter that every field has well-known authors who are natural writers and who publish scientific papers, editorials, book reviews, letters, and responses on a consistent basis? As noted above, most media skew to a small proportion of stellar performers. It may simply be unavoidable. But it also seems to me very likely that this study overstates the disproportionate contribution of these hyperproductive types. Not only might the estimates be off by a significant degree, but the implication about what type of content is being counted may be misleading.
This study raises a lot of interesting issues — disambiguation, workforce estimates, citation concentration, measurement to meaning, and social norms. However, after examining the paper in question, I’m 99% convinced it can’t support its claims.