The impact factor has long been recognized as a problematic method for interpreting the quality of an author’s output.  Any metric that is neither transparent nor reproducible is fatally flawed.  The Public Library of Science (PLoS) is trying to drive the creation of new, better measurements by releasing a variety of data through their article level metrics program.  PLoS is taking something of an “everything but the kitchen sink” approach here, compiling all sorts of data through a variety of methods and hoping some of it will translate into a meaningful measurement.

There are, however, a lot of issues with the things PLoS has chosen to measure (and to their credit, PLoS openly admits the data are ripe for misinterpretation; see “Interpreting The Data”). Aside from the obvious worries about gaming the system, my primary concern is that popularity is a poor measure of quality.  Take a look at the most popular items on YouTube on any given day and try to convince yourself that this is the best the medium has to offer. Ratings based strictly on downloads will skew towards fields that have more participants.

While PLoS does break down their numbers into subject categories, these are often too broad to really analyze the impact an article has on a specific field.  A groundbreaking Xenopus development paper that redefines the field for the next decade might see fewer downloads than an average mouse paper because there are fewer labs that work on frogs.  Should an author be penalized for not working in a crowded field?

Probably the worst metric offered by PLoS is the article rating, a 5-star system similar to that employed by Amazon.

These rating systems are inherently flawed for a variety of reasons.  The first is that these systems reduce diversity and lead to what’s called “monopoly populism”:

The recommender “system” could be anything that tends to build on its own popularity, including word of mouth. . . . Our online experiences are heavily correlated, and we end up with monopoly populism. . . . A “niche,” remember, is a protected and hidden recess or cranny, not just another row in a big database. Ecological niches need protection from the surrounding harsh environment if they are to thrive.

Joshua-Michele Ross at O’Reilly puts it this way:

The network effects that so characterize Internet services are a positive feedback loop where the winners take all (or most). The issue isn’t what they bring to the table, it is what they are leaving behind.

Many people assume that readers/customers are more likely to leave a negative review than a positive one.  It seems only logical: if you had an adequate experience, why bother going online to write about it?  But if you’re angry and feel ripped off, a negative review is a form of revenge.  It turns out that in reality, this is not how things work.  According to the Wall Street Journal (article behind a paywall, but you can read it by following the top link from a Google search), the average rating a 5-star system generates is 4.3, no matter the object being rated:

One of the Web’s little secrets is that when consumers write online reviews, they tend to leave positive ratings: The average grade for things online is about 4.3 stars out of five. . . . “There is an urban myth that people are far more likely to express negatives than positives,” says Ed Keller. . . . But on average, he finds that 65% of the word-of-mouth reviews are positive and only 8% are negative.

The WSJ article’s author gives some insight into the psychology behind such positivity:

The more you see yourself as an expert in something, the more likely you are to give a positive review because that proves that you make smart choices, that you know how to pick the best restaurants or you know how to select the best dog food. And that’s what some research from the University of Toronto found. Specifically in that study they found that people generally gave negative reviews at the same rate, but people who thought of themselves as experts on topics were way more inclined to give positive reviews.

Amazon’s ratings average around that 4.3 point, and YouTube’s are even more slanted, with the vast majority of reviews giving 5 stars to videos.  A quick look at PLoS’ downloadable data (the most recent available runs through July 2009, so caveats apply because of the small sample size) shows the following:

  • 13,829 articles published
  • 708 articles rated
  • 209 = 5 stars
  • 324 = 4 to 5 stars
  • 122 = 3 to 4 stars
  • 33 = 2 to 3 stars
  • 19 = 1 to 2 stars
  • 1 = 0 to 1 star

Add up the individual ratings and the average comes out to 4.16.
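
As a rough sanity check on those numbers, here is a minimal Python sketch. It assumes the 33-article bin covers 2 to 3 stars and treats every bin as a closed range (both are my reading of the list, not something PLoS specifies), and the names `BINS` and `summarize` are made up for illustration. Since only binned counts are listed here, it can't reproduce the 4.16 exactly, only the range the true average has to fall in.

```python
# Binned star counts from the list above: (low, high, number of rated articles).
# ASSUMPTION: the 33-article bin is "2 to 3 stars", and each bin is a closed range.
BINS = [
    (5.0, 5.0, 209),
    (4.0, 5.0, 324),
    (3.0, 4.0, 122),
    (2.0, 3.0, 33),
    (1.0, 2.0, 19),
    (0.0, 1.0, 1),
]
PUBLISHED = 13829      # articles published, from the list above
REPORTED_MEAN = 4.16   # average computed from the individual ratings


def summarize(bins, published):
    """Return the rated count, rated share, and bounds on the mean rating."""
    rated = sum(count for _, _, count in bins)
    # Lowest possible mean: every article sits at the bottom of its bin.
    lowest = sum(low * count for low, _, count in bins) / rated
    # Highest possible mean: every article sits at the top of its bin.
    highest = sum(high * count for _, high, count in bins) / rated
    return {
        "rated": rated,
        "rated_share": rated / published,
        "mean_lower_bound": lowest,
        "mean_upper_bound": highest,
    }


summary = summarize(BINS, PUBLISHED)
print(f"rated articles: {summary['rated']}")
print(f"share of published articles rated: {summary['rated_share']:.1%}")
print(f"mean must lie between {summary['mean_lower_bound']:.2f} "
      f"and {summary['mean_upper_bound']:.2f}")
```

Under those assumptions the bins add up to the 708 rated articles (about 5% of everything published), and the average has to land somewhere between roughly 3.9 and 4.6, which is consistent with the 4.16 figure.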

Is PLoS really the publishing equivalent of Lake Wobegon, “where all the children are above average”? Or is this just another example where a rating system gives overinflated grades?

Unless detailed instructions are given, it’s difficult for a reviewer to know exactly what they’re ranking.  PLoS does give a set of guidelines, asking the reader to rank according to insight, reliability, and style.  But it’s unclear what’s being compared here.  Is one supposed to give a ranking based on a comparison to every other paper published?  To just papers within the same field?  To papers within the same journal?

The five stars available also do not allow for much nuance in a review.  While still not perfect, the recent redesign at Steepster, a site for tea drinkers, shows a better method (an example found by the authors of an upcoming O’Reilly book on online reputation).  Not only does the Steepster system include a 1-100 scale for ranking, it also allows the user to put their review in context with the other reviews they’ve written.  This would be helpful if a reviewer is supposed to be comparing the relative merit of different papers.

Though PLoS should be applauded for this experiment, it’s clear that some of the methods offered are not going to prove useful in getting a clear picture of article impact.

Five-star rating systems are proving unreliable in other venues, and the same is likely to occur here. If PLoS’ original complaint about the impact factor stands — that it is determined by “rules that are unclear” — then the solution surely can’t be creating a new system that is even more unclear.