
The Impact Factor has long been criticized as a flawed measurement of the quality and importance of scientific research.  Some even go to ridiculous lengths to blame all that is difficult in an overcrowded and underfunded profession on this measurement, rather than on the economic causes that spur its misuse. Even if we do away with the Impact Factor altogether, with limited amounts of funding and limited numbers of jobs available, a system for measuring the quality and importance of a given researcher’s work would still be needed.

While a great deal of effort is going into the creation of new metrics and better measurements, we should be careful not to replace a flawed metric with one that is even worse.

Much comment has already been made regarding the Nature Publishing Group’s recent announcement of their very PLoS ONE-like new journal, Scientific Reports. Perhaps overlooked in the many varied responses is this note in the press release, describing yet another PLoS ONE-like inclusion:

To enable the community to evaluate the importance of papers post-peer review, the Scientific Reports website will include most-downloaded, most-emailed, and most-blogged lists.

Are these measurements particularly meaningful? Do they really allow the community to evaluate importance? Is this instead just more misguided blind faith in the “wisdom of the crowds”?

First, we need to dispense with the notion of “most-blogged lists” as actually being any sort of measurement of impact whatsoever. Making the reader aware when someone has written commentary on a particular paper is a nice feature, to be sure, but the sheer act of being mentioned in a blog provides no indication of quality. Articles that are spectacularly bad are much more likely to receive blog coverage than results that are very good but perhaps not earth-shattering. The recent paper on arsenic-based life forms is a great example — this was probably the most blogged-about article of the past year. Was it the most important article published last year? Probably not.

What, then, does knowing the number of blog mentions tell you about a paper?

It’s also important to realize the tremendously wide variability seen in science blogs. Science blogs are written by a tiny minority of scientists (and often the best science blogs come from professional writers, not scientists). Mention in a blog can vary anywhere from an insightful point-by-point critique of a paper to a marketing effort on the part of an author or journal editor to improve Google rankings.

Which then leaves us with metrics like “star” ratings (deeply problematic, and covered here), comments (underutilized and showing no signs of achieving traction in the community), and the two mentioned by Nature: most-downloaded and most-e-mailed.

Download and e-mail counts seem, on the surface, like useful pieces of data. I’m not sure how much downloads really tell you though, other than that the abstract, or perhaps even just the title or author list, was sexy enough to get someone to download the PDF. It doesn’t tell you if they actually read it, or if it was any good or had any significant impact. I know I have an enormous number of PDFs sitting on my hard drive that I will likely never get around to reading.  Should those papers be seen as more important because I bothered to click on a link?

E-mail-driven recommendations at least give the impression that someone thought the article was of sufficient interest to notify a colleague. But how reliable are such systems? A recent report suggests that even the most robust, socially driven metric systems are easy to game.

Writing in The Daily Beast, Thomas Weber wondered what it would take to move an article into the New York Times’ highly influential “most e-mailed” list. The Times attracts some 50 million unique visitors each month, and given those numbers, it would seem a much more difficult system to game than that of a smaller circulation publication like a science journal.

The results of Weber’s efforts are thus somewhat shocking.

In his first trials, Weber went after a smaller target, the Times’ Science section’s most e-mailed list, and did so merely by recruiting a group of friends to e-mail each other an obscure article. In one attempt, 48 people e-mailing an article over the course of several hours put that article at #6 on the Science section’s list. A follow-up attempt required only 35 senders to drive an article to #5 on the list.

Weber then went after the Times’ main list:

So if a few dozen people can make a Science article “popular” on that channel, what does it take to get onto the big board, the one noticed by millions of readers, and even ardently followed, I’m told, by Times editors and writers themselves?

Rather than enlist friends, Weber turned to Mechanical Turk, Amazon’s online labor marketplace, where for a small fee he hired workers to register for a Times account and then e-mail a selected obscure story. After a heated effort, their story reached the #3 spot on the Times’ main list of most e-mailed stories. The total number of e-mails needed to pull it off? 1,270.

The fact that the No. 3 slot was attained with fewer than 1,300 emails also raises the question of whether others might seek to manipulate the list for their own gain, or may have already done so. Notably, the most-emailed list of virtually every other news site in the world surely requires a small fraction of that number, and thus would be far easier to game.

Given the circulation of most science journals, how hard would it be to get everyone in your network to e-mail your paper to a colleague?  Where would such an action fall in terms of acceptable and moral behavior?  Is it all that different from having a press conference and issuing a press release? If you honestly believe it’s an important paper, would you be wrong for letting colleagues know about it and asking them to spread the word? Or would this be considered fraud, an attempt to cheat the system?

Social ranking systems like this inherently favor the better-networked author, so it’s no surprise to see them highly favored by scientists active in online communities. A scientist who writes an entertaining blog or Twitter feed could easily boost his career standing by recruiting followers to download and e-mail a paper, regardless of its quality. If 50 Cent can double a stock’s value with a tweet, why wouldn’t researchers employ the same tactics for the sake of their careers?

Even Amazon’s much-lauded ranking and recommendation system seems vulnerable to manipulation, at least according to author Thomas Hertog. He claims that by posting positive reviews of his own book, voting up those reviews, and buying one copy of the book per day himself, he was able to drive a title that sold only 32 copies to third parties to the number one spot in Amazon’s rankings.

Clearly, even the most robust, socially driven metrics systems need some further work before they’re reliable. Each field will need to decide what constitutes cheating, what’s allowed and what’s not. Game-proofing any such system, where careers are at stake in highly competitive fields, is going to be a continuous process of playing catch-up as researchers look for an edge.

But beyond the gaming factor, “social” may not be an appropriate answer for questions of impact. Popularity does not automatically equal quality or importance (a quick look at YouTube’s most popular list will affirm that notion).

As new metrics are developed, the most likely path for the future lies in using a panel of different measurements, each selected for its own strengths, and with counterparts available to make up for its weaknesses. As we assemble this panel, though, it’s important that we don’t muddy the waters further with the impact equivalent of chart junk. Just because you can measure something doesn’t mean the measurement yields meaningful data.
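To make that panel idea a bit more concrete, here is a minimal, purely hypothetical sketch in Python. The metric names, weights, and normalization baselines are illustrative assumptions only, not a proposal for any real system; deciding which measurements belong on the panel and how to weight them is exactly the open question.

```python
# Purely hypothetical sketch: a "panel" of article metrics combined into one
# weighted score. Metric names, weights, and baselines are assumptions made
# for illustration, not values from any real system.

ARTICLE_METRICS = {
    "citations": 14,      # slow to accrue, hard to game
    "downloads": 2350,    # fast, but easy to inflate
    "email_shares": 41,   # easy to game, as Weber's experiment showed
    "blog_mentions": 3,   # signals attention, not necessarily quality
}

WEIGHTS = {               # counterparts chosen to offset each other's weaknesses
    "citations": 0.6,
    "downloads": 0.1,
    "email_shares": 0.1,
    "blog_mentions": 0.2,
}

TYPICAL_VALUES = {        # rough baselines used to put raw counts on one scale
    "citations": 10,
    "downloads": 1000,
    "email_shares": 25,
    "blog_mentions": 2,
}

def composite_score(metrics, weights, typical):
    """Weighted sum of normalized metrics; choosing the weights is the hard,
    field-specific judgment call."""
    return sum(
        weights[name] * (metrics.get(name, 0) / typical[name])
        for name in weights
    )

print(round(composite_score(ARTICLE_METRICS, WEIGHTS, TYPICAL_VALUES), 2))  # 1.54
```

The design point is simply that no single number does the job: each metric earns its weight from its strengths, with its weaknesses covered by a counterpart on the panel.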

David Crotty

David Crotty is a Senior Consultant at Clarke & Esposito, a boutique management consulting firm focused on strategic issues related to professional and academic publishing and information services. Previously, David was the Editorial Director, Journals Policy for Oxford University Press. He oversaw journal policy across OUP’s journals program, drove technological innovation, and served as an information officer. David acquired and managed a suite of research society-owned journals with OUP, and before that was the Executive Editor for Cold Spring Harbor Laboratory Press, where he created and edited new science books and journals, along with serving as a journal Editor-in-Chief. He has served on the Board of Directors for the STM Association, the Society for Scholarly Publishing and CHOR, Inc., as well as The AAP-PSP Executive Council. David received his PhD in Genetics from Columbia University and did developmental neuroscience research at Caltech before moving from the bench to publishing.

Discussion

8 Thoughts on "How Meaningful and Reliable Are Social Article Metrics?"

“Some even go to ridiculous lengths to blame all that is difficult…”

Well, making snide comments about a thoughtful analysis from luminaries like Eve Marder and Sten Grillner doesn’t reflect well on you, to put it mildly. All they are saying is that in-depth review by people competent in the field is required for a real evaluation of a scientist. Impact factors are about as useful or as useless as any other attempt (so-called “new metrics”) to evaluate science without actually knowing anything about it. Whatever “panel of different measurements” you select will fail one way or another, unless the “metrics” include informed evaluation by people who actually take the time to read the stuff, and are competent to evaluate it. ‘Tis really that simple.

The authors’ lofty status does not place them above criticism. In fact, their lofty status is problematic here–they’re apparently out of touch with the pressures felt by today’s young scientists, and rose to their secure positions in a different era. The essay comes off as something of a call for mediocrity: don’t strive to do important or unique work, just do science. Contrast that with Doug Green’s recent essay on how to stand out from the crowd and succeed in science: do something astonishing. Green seems to have a better grasp on how competitive science is these days.

And even if you do eliminate all metrics like the Impact Factor and rely solely on “informed evaluation by people who actually take the time to read the stuff”, then the exact same pressures exist. Young researchers will just have to work on impressing those evaluation committees rather than journal editors and peer reviewers (seriously, do you think that journal editors and peer reviewers don’t “actually take the time to read the stuff” they’re reviewing?). To succeed with such a committee, you will need to do novel and important work–the exact same things you need to do to publish in a top journal.

A colleague at Johns Hopkins tells me they receive between 400 and 500 applications for every professorial job they advertise. Do you really think that the faculty there should spend the time to do an in-depth evaluation of 500 applicants for every job? Is that an efficient use of their time? Is it instead better to narrow down that group to a workable number who can then be evaluated in more depth? How would you propose to drop 90-95% of the applicants without looking at their publication record?

The Impact Factor is being used in the essay as a bogeyman, a scapegoat for the difficult economic situation faced by scientists. Eliminating it changes very little. Better solutions involve things like eliminating tenure, which would certainly open up more positions for young scientists, or creating more mid-level “staff scientist” positions at institutions, allowing those who do not excel to earn a living at a level above that of a postdoc. Better metrics would be an improvement, to be sure, but I don’t see any way around needing a system to efficiently evaluate a large group of candidates.

Good post and good reply. Just a few short remarks on the gaming of social metrics.
Citations are a social metric and gameable in many ways, not only by authors but also by publishers (we all know the prominent cases).
The larger a social group, the larger the social circle needed to game the metric. As you correctly point out, this doesn’t prevent gaming, but every additional member of this circle of gamers is an additional potential leak.
Clearly, a balanced and flexible panel of metrics is required, meaning that a lot of innovation still needs to happen on that front.

Citations as a metric do have their own set of problems. I think they fall higher on the “meaningful” scale than things like downloads or blog mentions though (see Kent’s post today as an example). A citation is a researcher publicly and for the record stating that a previous piece of research had an impact on his current work. That to me means more than the number of people interested enough in the abstract to download a PDF, or that the paper caught the attention of the small number of scientists and journalists who discuss such things online.

As you point out, the “reliability” is still problematic. I think citation again falls higher on the scale than the sorts of article-level metrics being offered by Nature as it is vastly more difficult to game. To fiddle with citations, you actually have to write a new paper and get it through an editor and a set of peer reviewers. That’s a much higher bar than clicking on a link or adding a link to a paper in a blog. Many metrics also account for things like author self-citation and journal self-citation, discounting these when compared to citations from other authors and other journals. This is possible because of the lack of anonymity: a journal author and publisher are clearly identified, whereas a downloader, an e-mailer, or a blogger remains anonymous.
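As a rough illustration of that kind of discounting, here is a minimal sketch assuming a simplified citation record; the 0.25 discount and all names in it are hypothetical, not drawn from any actual metric.

```python
# Minimal sketch of discounting author and journal self-citations, assuming a
# simplified citation record. The 0.25 discount and all names are hypothetical.

from dataclasses import dataclass

@dataclass
class Citation:
    citing_authors: set   # authors of the citing paper
    citing_journal: str   # journal the citing paper appeared in

def discounted_citation_count(cited_authors, cited_journal, citations,
                              self_cite_weight=0.25):
    """Count citations, down-weighting any that share an author or the journal
    with the cited paper (possible only because citers are not anonymous)."""
    total = 0.0
    for c in citations:
        is_self = bool(c.citing_authors & cited_authors) or \
                  c.citing_journal == cited_journal
        total += self_cite_weight if is_self else 1.0
    return total

# Two independent citations plus one author self-citation -> 2.25
cites = [
    Citation({"Smith", "Jones"}, "Journal A"),
    Citation({"Lee"}, "Journal B"),
    Citation({"Doe"}, "Journal C"),  # shares an author with the cited paper
]
print(discounted_citation_count({"Doe", "Nguyen"}, "Journal X", cites))
```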

But I think in general we’re in agreement that a panel of measurements appropriately weighted is the way to go. Selecting those measurements is still an open question though.

David,

No time now (grant deadline looming), so very short response.
1) I think you are misreading Marder et al. – they are not calling for mediocrity – far from it – they are making the point that rather than having endless fights with reviewers to get something into a higher-impact journal, ’twould be better to actually do new science. We working scientists have to deal with such dilemmas regularly. And if you think Eve Marder or Sten Grillner are out of touch, then clearly you don’t know them.
2) Doug Green – very cute essay; unfortunately it doesn’t tell us anything new. All scientists desperately want to do something astonishing…
3) Scanning piles of resumes is something all of us sentenced to search committee duty do. No, your friends at Hopkins don’t need to read 500 resumes in depth; but they will need to do that for the top ten or top twenty. No way impact factors or social metrics will do that job for them. Depending exclusively on such criteria will turn science into a fashion game…

I don’t know the authors personally, so I can’t speak beyond my impressions from their article. I agree that they did not deliberately set up a call for mediocrity. But in some ways, that’s the end result of their argument. Researchers must strive for novelty and importance in their work, regardless of whether that is measured by Impact Factor or by scanning resumes or the informed evaluation you spoke of earlier. That puts a huge amount of pressure on the young researcher, but it’s not a result of the Impact Factor or the whims of publishers; it’s a result of the economic realities of an overcrowded and underfunded occupation.

Researchers publish papers in high Impact Factor journals by doing sterling work. Yes, there are some instances where things get bogged down by difficult reviewers, but the bottom line is that the quality of the work is what matters most. Eliminating the Impact Factor does not alleviate the pressure to excel. To do that, you’d need to create more jobs or spread funding more democratically rather than doing it via a merit-based system.

And science is, and should be, a meritocracy. I do agree with you that the Impact Factor is flawed and is terribly misused. It should not be the sole determining factor in any decision whatsoever. In my blog entry above, I call for a (vague) panel of measurements that can provide a deeper picture of a researcher’s contributions to the field. You’re right in that the review committees for the top candidates should and are going to look at more than just Impact Factor. But many of the key things they’re looking for, such as a track record of doing novel and impactful research, are the same things the Impact Factor tries to measure (if imperfectly). Take it away as a formal metric and the same pressures exist.

In her feature article “Peer review: Trial by Twitter” (Nature, 19 Jan 2011), Apoorva Mandavilli explores post-publication review and gives equal say to its promoters as well as its critics. Some notable quotes:

“It makes much more sense in fact to publish everything and filter after the fact” — Cameron Neylon

“Most papers sit in a wasteland of silence, attracting no attention whatsoever” — Phil Davis

“Who in their right mind is going to log on to the PLoS One site solely to comment on a paper?” — Jonathan Eisen

“I think we do not want it [Twitter] to be just a commentary free-for-all as the only arbiter of quality.” — David Goldstein

Interesting read! And good that you say that “[i]t’s also important to realize the tremendously wide variability seen in science blogs,” though you also make a claim about the goodness of a blog without explaining what that means.

The other excellent point is that you link it to game theory, which is IMHO an accurate observation, and most certainly how I have been using it. And to all readers: this has been common scientific practice for many, many years; we used to use conference talks and posters for this, now we use blogs and tweets. Nothing has really changed, other than the communication channel.

Now, what does matter is that people (scientists and scholars in general) must realize that it is really not metrics, blogs, Nature Commentaries, or whatever that matters: it’s the actual arguments outlined that do. When we rank topics by e-mails received (or mere citation counts for journals), we no longer look at arguments; we look at popularity. But, as you nicely discuss, this is just the way it currently works.
