The Impact Factor has long been criticized as a flawed measure of the quality and importance of scientific research. Some go to ridiculous lengths, blaming everything that is difficult about an overcrowded and underfunded profession on this one metric rather than on the economic causes that spur its misuse. But even if we did away with the Impact Factor altogether, funding and jobs would remain limited, and some system for measuring the quality and importance of a given researcher’s work would still be needed.
While a great deal of effort is going into the creation of new metrics and better measurements, we should be careful not to replace a flawed metric with one that is even worse.
Much comment has already been made regarding the Nature Publishing Group’s recent announcement of their very PLoS ONE-like new journal, Scientific Reports. Perhaps overlooked in the many and varied responses is this note in the press release, describing yet another PLoS ONE-like feature:
To enable the community to evaluate the importance of papers post-peer review, the Scientific Reports website will include most-downloaded, most-emailed, and most-blogged lists.
Are these measurements particularly meaningful? Do they really allow the community to evaluate importance? Is this instead just more misguided blind faith in the “wisdom of the crowds”?
First, we need to dispense with the notion that “most-blogged lists” are any sort of measurement of impact whatsoever. Making the reader aware that someone has written commentary on a particular paper is a nice feature, to be sure, but the mere fact of being mentioned in a blog provides no indication of quality. Articles that are spectacularly bad are much more likely to receive blog coverage than results that are very good but perhaps not earth-shattering. The recent paper on arsenic-based life forms is a great example: it was probably the most blogged-about article of the past year. Was it the most important article published last year? Probably not.
What, then, does knowing the number of blog mentions tell you about a paper?
It’s also important to recognize the tremendous variability among science blogs. They are written by a tiny minority of scientists (and the best science blogs often come from professional writers, not scientists), and a blog mention can be anything from an insightful point-by-point critique of a paper to a marketing effort by an author or journal editor looking to improve Google rankings.
Which then leaves us with metrics like “star” ratings (deeply problematic, and covered here), comments (underutilized and showing no signs of achieving traction in the community), and those mentioned by Nature: the most-downloaded and most-e-mailed lists.
Download and e-mail counts seem, on the surface, like useful pieces of data. I’m not sure how much downloads really tell you, though, other than that the abstract, or perhaps even just the title or author list, was sexy enough to get someone to download the PDF. A download doesn’t tell you whether anyone actually read the paper, or whether it was any good or had any significant impact. I know I have an enormous number of PDFs sitting on my hard drive that I will likely never get around to reading. Should those papers be seen as more important because I bothered to click on a link?
E-mail-driven recommendations at least give the impression that someone thought the article was of sufficient interest to notify a colleague. But how reliable are such systems? A recent report suggests that even the most robust, socially driven metric systems are easy to game.
Writing in The Daily Beast, Thomas Weber wondered what it would take to move an article onto the New York Times’ highly influential “most e-mailed” list. The Times attracts some 50 million unique visitors each month, and given those numbers, it would seem a much more difficult system to game than that of a smaller-circulation publication like a science journal.
The results of Weber’s efforts are thus somewhat shocking.
In his first trials, Weber went after a smaller target, the most e-mailed list of the Times’ Science section, and did so merely by recruiting a group of friends to e-mail an obscure article to one another. In one attempt, 48 people e-mailing an article over the course of several hours put it at #6 on the Science section’s list. A follow-up attempt required only 35 senders to drive an article to #5 on the list.
Weber then went after the Times’ main list:
So if a few dozen people can make a Science article “popular” on that channel, what does it take to get onto the big board, the one noticed by millions of readers, and even ardently followed, I’m told, by Times editors and writers themselves?
Rather than enlist friends, Weber turned to Amazon’s Mechanical Turk, the online labor marketplace, where for a small fee he hired workers to register for a Times account and then e-mail a selected obscure story. After a concerted effort, their story reached the #3 spot on the Times’ main list of most e-mailed stories. The total number of e-mails needed to pull it off? 1,270.
The fact that the No. 3 slot was attained with fewer than 1,300 emails also raises the question of whether others might seek to manipulate the list for their own gain, or may have already done so. Notably, the most-emailed list of virtually every other news site in the world surely requires a small fraction of that number, and thus would be far easier to game.
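The arithmetic here is worth dwelling on. A toy simulation makes the point (the distribution, its parameters, and the article counts below are pure assumptions for illustration, not data from the Times or from Weber’s experiment): when organic e-mail counts are heavy-tailed, most articles receive only a modest number of e-mails, so a coordinated burst of 1,270 can vault an obscure story past nearly everything else.

```python
import random

# Toy model, not real Times data: assume 1,000 articles whose organic
# e-mail counts follow a heavy-tailed (Pareto) distribution, then inject
# one coordinated campaign of 1,270 e-mails (Weber's figure).
random.seed(42)
organic_counts = [int(random.paretovariate(1.2) * 20) for _ in range(1000)]
campaign_count = 1270

# Rank = 1 + the number of organic articles that out-e-mailed the campaign.
rank = 1 + sum(count > campaign_count for count in organic_counts)
print(f"Campaign article ranks #{rank} out of {len(organic_counts) + 1}")
```

Under these invented numbers, only a handful of articles typically beat the campaign. The specific rank isn’t the point; the point is that the absolute counts behind such lists are small enough for a determined group to overwhelm.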
Given the circulation of most science journals, how hard would it be to get everyone in your network to e-mail your paper to a colleague? Where would such an action fall in terms of acceptable and moral behavior? Is it all that different from having a press conference and issuing a press release? If you honestly believe it’s an important paper, would you be wrong for letting colleagues know about it and asking them to spread the word? Or would this be considered fraud, an attempt to cheat the system?
Social ranking systems like this inherently favor the better-networked author, so it’s no surprise to see them highly favored by scientists active in online communities. A scientist who writes an entertaining blog or Twitter feed could easily boost their career standing by recruiting followers to download and e-mail a paper, regardless of its quality. If 50 Cent can double a stock’s value with a tweet, why wouldn’t researchers employ the same tactics for the sake of their careers?
Even Amazon’s much-lauded ranking and recommendation system seems vulnerable to manipulation, at least according to author Thomas Hertog. He claims that by posting positive reviews of his own book, voting those reviews up, and buying one copy of the book per day himself, he was able to drive a book that sold only 32 copies to third parties to the number one spot in Amazon’s rankings.
Clearly, even the most robust, socially driven metrics systems need further work before they’re reliable. Each field will need to decide what constitutes cheating, what’s allowed and what’s not. And game-proofing any such system, where careers are at stake in highly competitive fields, will be a continuous process of playing catch-up as researchers look for an edge.
But beyond the gaming factor, “social” may not be an appropriate answer to questions of impact. Popularity does not automatically equal quality or importance (a glance at YouTube’s most-popular list will affirm that notion).
As new metrics are developed, the most likely path forward lies in using a panel of different measurements, each selected for its strengths, with counterparts available to make up for its weaknesses. As we assemble this panel, though, it’s important that we don’t muddy the waters further with the impact equivalent of chart junk. Just because you can measure something doesn’t mean the measurement yields meaningful data.
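To make the panel idea concrete, here is a minimal sketch of one way such a composite might work; the metric names, field medians, and weights are hypothetical assumptions chosen for illustration, not an established or recommended standard.

```python
# A minimal sketch of a "panel of measurements": normalize each raw metric
# against its field's median, then combine with weights. All names, medians,
# and weights below are hypothetical, chosen purely for illustration.

def composite_score(article, field_medians, weights):
    """Weighted sum of field-normalized metrics for one article."""
    score = 0.0
    for metric, weight in weights.items():
        median = field_medians[metric]
        normalized = article[metric] / median if median else 0.0
        score += weight * normalized
    return score

article = {"citations": 12, "downloads": 850, "emails": 5}
field_medians = {"citations": 8, "downloads": 400, "emails": 3}

# Weight the harder-to-game metric most heavily; easily gamed counts
# contribute, but can't dominate the score on their own.
weights = {"citations": 0.6, "downloads": 0.25, "emails": 0.15}

print(round(composite_score(article, field_medians, weights), 2))  # 1.68
```

The design choice that matters is the weighting: a panel lets each measurement’s strengths cover another’s weaknesses, so no single gameable count decides the outcome.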