Science is based on a meritocratic model that rewards those who have made significant contributions to our understanding and application of human knowledge. When resources are limited and competition for funds and positions is high, we are drawn toward performance indicators in order to rank the field of contenders. Numbers provide a more objective basis of comparison, or so we are told.
For the last four decades, measuring the incorporation of a piece of research into the corpus of scientific literature, through the act of citation, held near-monopoly status as a proxy for scientific impact. Indeed, the citation became reified as impact itself through indicators like the Impact Factor. Today, dozens of new services hover around the perimeters of science publishers, gathering, aggregating, and selling access to new data sources on how individuals (not just scientific authors) interact with the literature.
What these new data sources mean is still up for debate. Proponents of alternative forms of evaluation often avoid answering the question of meaning directly and instead argue that alternative (or alt-) metrics enlarge the community of conversation, tell different stories, or that it is simply too early to tell. The fact that NISO has entered the debate in an attempt to standardize these new metrics is a signal that alt-metrics may not remain so alternative after all.
While NISO may help standardize how things are counted and reported, it cannot determine what these new metrics mean. Whether tweets, for example, measure popularity or prestige or simply the noise of narcissistic authors, bored academics, and automated bots will need to be decided by individual communities for specific functions: for instance, by a research university for the purposes of promotion and tenure; by a teaching hospital for the purposes of education; or by a government health department for the purposes of public engagement. For each of these groups, “impact” means something very different. A dogmatic approach toward interpreting the meaning of metrics for users is only going to meet with widespread criticism and resistance.
For this reason, new metrics services should focus entirely on aggregating and displaying information, rather than interpreting results for their users. While I don’t expect much resistance to this claim, even the way a service constructs and displays indicators to its users can convey deep biases.
To start, the Journal Impact Factor, produced annually by Thomson Reuters, measures the average number of citations received in a given year by the articles a journal published in the preceding two years, and reports the result to three decimal places (e.g. 1.753). The rationale behind reporting three decimal places was to avoid ties within a subject collection and to allow journals to be ranked from highest to lowest. Yet, as critics have often voiced, this level of granularity conveys a false sense of precision.
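For readers who have not seen the calculation spelled out, the standard two-year formula is simple division; the counts below are hypothetical:

```python
def journal_impact_factor(citations, citable_items):
    """Two-year Journal Impact Factor for year Y.

    citations: citations received in year Y to items the journal
        published in years Y-1 and Y-2.
    citable_items: number of citable items published in Y-1 and Y-2.
    """
    return round(citations / citable_items, 3)  # three decimal places

# Hypothetical counts: 1,753 citations to 1,000 citable items.
print(journal_impact_factor(1753, 1000))  # 1.753
```

Three decimal places will indeed break most ties, but the underlying citation counts are nowhere near precise enough to justify them.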
Similarly, the Faculty of 1000 article ranking service, F1000 Prime, calculates a total score for each article that received a rating and displays the result in a star badge. Since an F1000 reviewer can give an article one, two, or three stars, a reader cannot separate the quality of the ratings from the number of reviewers using the score alone. One such badge, attached to an article published in the New England Journal of Medicine, was the result of six recommendations. One needs a subscription to F1000 Prime to view the individual ratings behind the badge.
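To see the ambiguity concretely, suppose (as the badge implies) that the total is a simple sum of stars. The short enumeration below is a hypothetical illustration, not F1000’s actual method, and it shows how many distinct review histories collapse into a single score of 6:

```python
from itertools import combinations_with_replacement

# Every combination of one-, two-, and three-star reviews summing to 6.
target = 6
for n_reviewers in range(2, target + 1):
    for ratings in combinations_with_replacement((1, 2, 3), n_reviewers):
        if sum(ratings) == target:
            print(f"{n_reviewers} reviewers: {ratings}")
# Two exceptional reviews and six lukewarm ones produce the same badge.
```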
Altmetric also provides a badge that displays the performance of an article based on a variety of statistics that the company collects. Euan Adie, its founder, told me that the Altmetric ‘donut’ (or more specifically, a French cruller) was purposefully designed to avoid creating a false sense of precision. Different types of users can focus on different color components of the donut: a press officer, for instance, could focus on red, which represents the number of news outlets, while an editor may focus on yellow, representing science blogs.
In reality, most articles receive little (if any) attention in the wider community, leading to donuts with little color variation, or no donut at all. While users can click on the donut to see the details behind the graphic, the metric falls into the same trap as any total score. What is more worrisome is that the algorithm behind the badge weights indicators in a way that is not transparent to users.
For example, this badge from a Scientific Reports article indicates that the paper was tweeted by 10 users but received an Altmetric score of just 5, suggesting that each tweet counted for just 0.5 Altmetric points. A couple of days later, the article had accumulated 14 tweets and an Altmetric score of 9, suggesting that each tweet now counted for 0.64 points. The next day, 15 tweets equaled 10 Altmetric points. The company’s description of how the Altmetric score is calculated reveals that the scoring algorithm involves much more than counting, weighting, and summing the scores of each impact component:
So all else being equal each type of content will contribute a different base score to the article’s total. For example, a tweet may be worth 1 and a blog post 5. In practice these scores are usually modified by subsequent steps in the scoring algorithm.
I contacted Euan Adie for more details on the algorithm. He responded that each data source starts with a base score, which is then weighted based on the author of the mention. The weight of each author is based on three components: bias, promiscuity, and reach. For Twitter, bias is measured by how often an author tweets about the same journal or DOI prefix; promiscuity by how often an author tweets about papers; and reach by an author’s number of Twitter followers as well as how many of those followers also tweet about papers. Add this calculation to the construction of the other component indicators that go into the Altmetric score and the result is a very data- and computation-intensive process.
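To make the description concrete, here is a minimal sketch of how such per-author weighting might work. The functional forms and constants below are my own assumptions; Altmetric’s actual algorithm is not public:

```python
# A purely illustrative sketch of the per-author weighting Adie describes.
# All formulas and constants are invented for the example.

TWEET_BASE_SCORE = 1.0  # "a tweet may be worth 1"

def author_weight(same_journal_share, papers_tweeted, followers, follower_paper_share):
    """Weight a tweeting author by bias, promiscuity, and reach.

    same_journal_share: fraction of the author's paper tweets pointing at
        the same journal or DOI prefix (higher -> more biased -> lower weight)
    papers_tweeted: how many papers the author tweets about
        (higher -> more promiscuous -> lower weight)
    followers, follower_paper_share: audience size and the fraction of that
        audience that also tweets about papers (higher -> more reach)
    """
    bias = 1.0 - same_journal_share
    promiscuity = 1.0 / (1.0 + papers_tweeted / 50.0)
    reach = min(1.0, followers * follower_paper_share / 500.0)
    return bias * promiscuity * reach

tweets = [
    # an engaged reader with a relevant audience
    dict(same_journal_share=0.1, papers_tweeted=20, followers=2000, follower_paper_share=0.3),
    # a bot-like account pushing one journal to an indifferent audience
    dict(same_journal_share=0.9, papers_tweeted=300, followers=150, follower_paper_share=0.05),
]

score = sum(TWEET_BASE_SCORE * author_weight(**t) for t in tweets)
print(round(score, 2))  # 0.64 -- two tweets, well under two points
```

Under any scheme like this, the number of points a tweet contributes varies from mention to mention, which is consistent with the shifting per-tweet values in the example above.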
An author skeptical of their score would need the complete dataset and algorithm from Altmetric to validate the result, which makes validation of index scores nearly impossible. In his defense, Adie responded:
I do sometimes worry that emphasising the scores and algorithm to our users any more than we do is actually a bad thing. Perhaps if you’re ever at the point where you’re worrying about the exact weighting (rather than the general principles) of the algorithm then you’re doing it wrong and should be focusing more on the underlying conversations instead.
This argument, that the Altmetric score needs to be used wisely and not taken out of context, is reminiscent of Eugene Garfield’s defense of the Journal Impact Factor. If the creators of an index are concerned that people will use it as a mechanism to rank articles, authors, and journals, they should avoid calculating numerical scores and focus instead on the descriptive presentation of the data. For Altmetric, this means saving the donut but dropping the (jelly-filled) score.
I want to end with an index that was designed to measure not scientific impact but human well-being. Well-being, like scientific impact, is a multidimensional construct that can be measured in many different ways. And like scientific impact, well-being has for years been dominated by economic indicators, most notably the Gross Domestic Product (GDP).
Recently, the OECD developed a Better Life Index for measuring well-being across countries, something Toby Green, the OECD’s Head of Publishing, referred to as an alternative to the GDP (an “Alt-GDP!”). The index uses a flower model: each indicator is represented by a separate petal, and the height of each flower reflects its total index score. Unlike the Altmetric score, however, the Better Life Index allows users to weight each indicator on a sliding scale, from least important to most important. Slide a topic dial and the graph adjusts to the new weighting: some petals get fatter, others thinner, and the heights of the flowers shift to their new values.
It’s not difficult to apply this model of index construction to the measurement of scientific impact. Such a model allows for the aggregation of many different indicators of impact but reserves the weighting and ranking for the user. In this way, the Better Life Index avoids the dogmatism of other multidimensional indices, whose creators have predefined the value of each indicator.
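As a thought experiment, here is what that reservation of authority could look like for article-level metrics; the indicator names and counts are invented for the example:

```python
# A toy illustration, in the spirit of the Better Life Index, of letting
# the user set the weights. Real indicators would also need normalization
# before being combined, which is omitted here.

article = {"news": 3, "blogs": 2, "tweets": 40, "citations": 12}

def weighted_score(indicators, weights):
    """A weighted sum in which the user, not the index provider, sets the weights."""
    total = sum(weights.values())
    return sum(indicators[name] * w / total for name, w in weights.items())

# A press officer might privilege news coverage...
print(round(weighted_score(article, {"news": 5, "blogs": 2, "tweets": 1, "citations": 1}), 2))
# ...while a tenure committee might privilege citations.
print(round(weighted_score(article, {"news": 1, "blogs": 1, "tweets": 1, "citations": 5}), 2))
```

The same article earns different scores under different weightings, and that is exactly the point: the judgment of what matters stays with the evaluating community.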
In the next several years, we’re likely to see competition for dominance in the publication impact industry. If we view the creation of an index as having two component parts, data aggregation and data visualization, there is room for multiple services to coexist within this space. Indeed, one company could focus on data aggregation (providing validity and reliability testing to ensure high-quality data) while others create various end-user tools to represent and manipulate the data.
In sum, the strategy for alt-metrics should not be to develop authoritative indexes that compete with the prominence of citation-based metrics like the Impact Factor, but to aggregate and display reliable data. For multidimensional data, the authority in the system should be held where it belongs: in the hands of the user.