Image: 4.5/5 stars, as used for ratings on the English Wikipedia (via Wikipedia).

The impact factor has long been recognized as a problematic method for interpreting the quality of an author’s output.  Any metric that is neither transparent nor reproducible is fatally flawed.  The Public Library of Science (PLoS) is trying to drive the creation of new, better measurements by releasing a variety of data through their article level metrics program.  PLoS is taking something of an “everything but the kitchen sink” approach here, compiling all sorts of data through a variety of methods and hoping some of it will translate into a meaningful measurement.

There are, however, a lot of issues with the things PLoS has chosen to measure (and to their credit, PLoS openly admits the data are ripe for misinterpretation — see “Interpreting The Data”). Aside from the obvious worries about gaming the system, my primary concern is that popularity is a poor measure of quality.  Take a look at the most popular items on YouTube on any given day and try to convince yourself that this is the best the medium has to offer. Ratings based strictly on downloads will skew towards fields that have more participants.

While PLoS does break down their numbers into subject categories, these are often too broad to really analyze the impact an article has on a specific field.  A groundbreaking Xenopus development paper that redefines the field for the next decade might see fewer downloads than an average mouse paper because there are fewer labs that work on frogs.  Should an author be penalized for not working in a crowded field?

Probably the worst metric offered by PLoS is the article rating, a 5-star system similar to that employed by Amazon.

These rating systems are inherently flawed for a variety of reasons.  The first is that these systems reduce diversity and lead to what’s called “monopoly populism”:

The recommender “system” could be anything that tends to build on its own popularity, including word of mouth. . . . Our online experiences are heavily correlated, and we end up with monopoly populism. . . . A “niche,” remember, is a protected and hidden recess or cranny, not just another row in a big database. Ecological niches need protection from the surrounding harsh environment if they are to thrive.

Joshua-Michele Ross at O’Reilly puts it this way:

The network effects that so characterize Internet services are a positive feedback loop where the winners take all (or most). The issue isn’t what they bring to the table, it is what they are leaving behind.

Many people assume that readers/customers are more likely to leave a negative review than a positive one.  It seems logical: if you had an adequate experience, why bother going online to write about it? But if you’re angry and feel ripped off, a scathing review is a form of revenge.  In reality, though, that’s not how things work.  According to the Wall Street Journal (the article is behind a paywall, but you can read it by following the top link from Google here), the average rating a 5-star system generates is 4.3, no matter what is being rated:

One of the Web’s little secrets is that when consumers write online reviews, they tend to leave positive ratings: The average grade for things online is about 4.3 stars out of five. . . . “There is an urban myth that people are far more likely to express negatives than positives,” says Ed Keller. . . . But on average, he finds that 65% of the word-of-mouth reviews are positive and only 8% are negative.

The WSJ article’s author gives some insight into the psychology behind such positivity here:

The more you see yourself as an expert in something, the more likely you are to give a positive review because that proves that you make smart choices, that you know how to pick the best restaurants or you know how to select the best dog food. And that’s what some research from the University of Toronto found. Specifically in that study they found that people generally gave negative reviews at the same rate, but people who thought of themselves as experts on topics were way more inclined to give positive reviews.

Amazon’s ratings average around that 4.3 point, and YouTube’s are even more slanted, with the vast majority of reviews giving 5 stars to videos.  A quick look at PLoS’ downloadable data (the most recent available runs through July 2009, so caveats apply because of the small sample size) shows the following:

  • 13,829 articles published
  • 708 articles rated
  • 209 = 5 stars
  • 324 = 4 to 5 stars
  • 122 = 3 to 4 stars
  • 33 = 2 to 3 stars
  • 19 = 1 to 2 stars
  • 1 = 0 to 1 star

Add up the individual ratings and the average comes out to 4.16.
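
To make the arithmetic explicit, here is a minimal sketch of that calculation. Since the downloadable data are binned into ranges, the sketch assumes each range is represented by its midpoint; the 4.16 figure above comes from the individual ratings in the raw data, so the binned approximation lands a little higher, around 4.3.

```python
# Approximate average rating from the binned counts above.
# Assumption: each range is represented by its midpoint; the exact
# per-article ratings in the raw PLoS data yield 4.16.
bins = [
    (209, 5.0),   # rated exactly 5 stars
    (324, 4.5),   # 4 to 5 stars -> midpoint 4.5
    (122, 3.5),   # 3 to 4 stars
    (33, 2.5),    # 2 to 3 stars
    (19, 1.5),    # 1 to 2 stars
    (1, 0.5),     # 0 to 1 star
]

total_rated = sum(count for count, _ in bins)           # 708
weighted_sum = sum(count * mid for count, mid in bins)  # 3041.5

print(f"Rated articles: {total_rated}")
print(f"Approximate average rating: {weighted_sum / total_rated:.2f}")  # ~4.30
print(f"Share of all 13,829 articles rated: {total_rated / 13829:.1%}") # ~5.1%
```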

Is PLoS really the publishing equivalent of Lake Wobegon, “where all the children are above average”? Or is this just another example where a rating system gives overinflated grades?

Unless detailed instructions are given, it’s difficult for a reviewer to know exactly what they’re ranking.  PLoS does give a set of guidelines, asking the reader to rank according to insight, reliability, and style.  But it’s unclear what’s being compared here.  Is one supposed to give a ranking based on a comparison to every other paper published?  To just papers within the same field?  To papers within the same journal?

The five stars available also do not allow for much nuance in a review.  While still not perfect, the recent redesign at Steepster, a site for tea drinkers, shows a better method (an example found by the authors of an upcoming O’Reilly book on online reputation).  Not only does the Steepster system include a 1-100 scale for ranking, it also allows the user to put their review in context with the other reviews they’ve written.  This would be helpful if a reviewer is supposed to be comparing the relative merit of different papers.
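
As a purely hypothetical illustration of what “context” could mean here (this is not Steepster’s actual method, and the function and numbers below are invented for the example), one could express a rating relative to the reviewer’s own rating history:

```python
# Hypothetical illustration (not Steepster's actual method): one way to put a
# 1-100 rating "in context" is to compare it against the same reviewer's own
# rating history, e.g. as a deviation from their personal mean.
from statistics import mean, pstdev

def contextualize(rating, reviewer_history):
    """Return how far a rating sits above or below this reviewer's own average."""
    avg = mean(reviewer_history)
    spread = pstdev(reviewer_history) or 1.0  # avoid division by zero for uniform raters
    return (rating - avg) / spread

# A generous rater's 85 may mean less than a harsh rater's 70.
print(contextualize(85, [90, 88, 92, 85, 95]))  # below this reviewer's own average
print(contextualize(70, [40, 55, 50, 60, 45]))  # well above this reviewer's own average
```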

Though PLoS should be applauded for this experiment, it’s clear that some of the methods offered are not going to prove useful in getting a clear picture of article impact.

Five-star rating systems are proving unreliable in other venues, and the same is likely to occur here. If PLoS’ original complaint about the impact factor stands — that it is determined by “rules that are unclear” — then the solution surely can’t be creating a new system that is even more unclear.

David Crotty

David Crotty is a Senior Consultant at Clarke & Esposito, a boutique management consulting firm focused on strategic issues related to professional and academic publishing and information services. Previously, David was the Editorial Director, Journals Policy for Oxford University Press. He oversaw journal policy across OUP’s journals program, drove technological innovation, and served as an information officer. David acquired and managed a suite of research society-owned journals with OUP, and before that was the Executive Editor for Cold Spring Harbor Laboratory Press, where he created and edited new science books and journals, along with serving as a journal Editor-in-Chief. He has served on the Board of Directors for the STM Association, the Society for Scholarly Publishing and CHOR, Inc., as well as The AAP-PSP Executive Council. David received his PhD in Genetics from Columbia University and did developmental neuroscience research at Caltech before moving from the bench to publishing.

Discussion

27 Thoughts on "How Meaningful Are User Ratings? (This Article = 4.5 Stars!)"

Another issue with the system is that so few articles are rated (barely 5 percent, by the data provided above). It is interesting to speculate on why this might be – no time? Not seen as a priority? Low community buy-in also severely hampers the utility of a rating system (beyond the other issues brought up).

It’s hard to say, given the timing of the data. PLoS made their big announcement about article level metrics in September, yet the only downloadable data available stops at the end of July. Participation may have increased as awareness was raised, although in a random sampling looking at the “most popular” articles in several PLoS journals, I couldn’t find any that were rated.

Another factor to consider is that the majority of readers read articles as PDF files. That means they’re disconnected from the journal’s site when they read the article and have to have strong motivation to go back to the site, find the paper again, and leave a rating.

May be I’m one of those odd kind. I never really like to read in pdf file.

Excellent post, David.

All indicators are “flawed” in that they attempt to measure something abstract like quality. We should therefore evaluate new indicators using a pragmatic approach, looking for indicators that are better than the ones we wish to replace.

Ideally, we want indicators that:

1) are more valid (stronger coupling to the concept they attempt to measure).

2) are more reliable (demonstrate consistency over time).

3) are more transparent (suspect data can be verified).

4) are less prone to gaming.

To be fair to PLoS, readers who wish to rate articles using their 5-star system must first register with PLoS.  While one could create hundreds of pseudonymous accounts, this does make gaming harder than in ratings systems that allow an individual (or robot) to rate over and over and over again.

David, I agree that user ratings are a poor indicator of quality. But I don’t see PLoS taking the “everything but the kitchen sink” approach. They are interested in usage data (HTML fulltext views and PDF downloads), and these are clearly valid scientific impact measures (see for example http://dx.doi.org/10.1371/journal.pone.0006022).

Your link doesn’t work; it has an extra parenthesis at the end, so here’s a correct version. By “kitchen sink”, I meant that they’re tracking as much data as they can for each article. The idea is to make it all available and let the community work out which data are useful for measuring impact. Fulltext views and PDF downloads are certainly valid measures, as are citations, which PLoS tracks as well. But the factors they track go well beyond that, to many things that are not well established or likely to be effective measures of impact. Given the small percentage of the community that participates in social networks or blogs, these measurements are likely to be skewed toward favoring the particular interests of the small but vocal online community. Furthermore, things like blog coverage are going to skew toward papers that inspire posts, not necessarily the best science. I’m sure something flashy like an article on dolphin sex would gather more blog coverage than something boring but probably more meaningful to researchers, like the discovery of a step in a signal transduction pathway.

The example of YouTube or Amazon isn’t quite valid because you don’t have peer review or an editorial process before something appears on either of those sites. Nonetheless, I’d agree that user ratings might not be all that useful. I didn’t find the star ratings on Amazon useful at all until just recently.

I’m enjoying this series of “How useful is X, really?” where X=any innovation in publishing occurring this decade. Can’t wait for the surely upcoming “Here’s my proposed solutions” series.

I’m not sure how much the thing being rated really matters–human behavior seems to be the same across the board, whether ranking dog food or treatises.

As for the second part, too true. It’s much easier to poke holes in bad ideas than it is to come up with a good idea. When I come up with my grand unified theory to revolutionize publishing, you probably won’t read it here, instead you’ll read about the island I’m purchasing with the proceeds from licensing my idea.

But for the record, in case it works, here’s my latest plan: Rupert Murdoch succeeds in his plan to start bidding wars for search engine spidering access. Google (or Bing or anyone, really) decides they want to be the source for scientific info, so they start paying large sums to journals for exclusive search engine access to their content (sorry, PubMed). The extra income covers the parts of the process that the author-pays model can’t quite cover for low-volume journals, allowing everyone (at least everyone who wants to) to go open access.

Not sure it would work, or if the ad sales against search engine traffic would warrant high enough payments to cover costs, but I figure there are probably enough VCs out there with funds to burn to give it a shot.

The PLoS article-level metrics combine usage data with other stuff. I wouldn’t call ratings, comments, social bookmarking, blog posts, etc. metrics, as they don’t really measure anything. But they are useful nevertheless, as they connect to the discussion that takes place around an article.

Perhaps Ilya Grigorik of PostRank would be able to say something smart and data-driven on what the effect of user ratings and bookmarks is?

That’s a really interesting piece, and it points out that, in their observations, less and less of the discussion and engagement with material happens on the site where it is published, and more and more happens elsewhere (Twitter, Facebook). This would argue against a rating system within the journal itself, and for the sorts of metrics (or non-metrics) PLoS is offering to track the conversation elsewhere.

Though I’d still add the caveat I mentioned in a comment above about dolphin sex. The science community using tools like blogging is still pretty small, and things that spur conversations and fun blog articles may not be the most impactful or important pieces of research.

I’m glad you liked that.

You make a good point that the number of scientists blogging is a fraction of scientists overall, but you could also note from that article that engagement overall is growing quite quickly, so the community of scientists online is growing to be more representative of the scientific community in general. Whether or not that’s a good thing, I’ll leave for debate.

From my vantage point, considering the activity around such initiatives as Linked Data, Contributor IDs, and Open Access, just to name a few, it certainly does look like this is the direction things are going.

I think engagement will grow, but it’s unclear how much will be directly related to the sorts of judgments that are being sought here. As we’ve discussed elsewhere, the “killer app” to drive scientists into these online communities has yet to be invented. It may have nothing to do with published results or their discussion, so it’s hard to say whether they’ll be useful for gauging impact.

One other thought: decentralized commenting may be particularly important in science, given how few people read papers online in HTML. If everyone downloads the PDF and reads it elsewhere, it’s probably easier to then go to a blog or Twitter or FriendFeed to comment rather than going back to the journal and finding the HTML version of the article.

I definitely agree that decentralization is going to be vital for the success of online commenting and I know there is some infrastructure being laid for this as we speak.

I think there’s a maximum nesting level for comments, so no telling where this will end up.

There are interesting arguments here, David, thanks. There seems to be a bit of a ‘Matthew Effect’: papers already rated will attract more ratings. You might expect Faculty of 1000 (f1000) to have a view on this, and we do.

We all know how bad the Impact Factor is, but the fact remains that people want a metric, preferably a single number, to assess research. Here in the UK the Research Excellence Framework and its predecessors depend on it. Whether good or bad, that is something we have to deal with. However, whether we end up with a single number or a basket of indicators it’s good to see other parties also trying to do something to break the hegemony.

So yes, we also have a number. We’re actually in the middle of re-designing that number, to make it more obvious and transparent. Our model is that we only allow evaluations (in the first instance) of ‘good’ papers: so only one to two percent of the literature ever gets a score at f1000. That means that everything in f1000 is automatically 4.5 stars 😉 — but within that we do see a wide range of scores. And yes, we do get dissenting opinions, which is where things really get interesting. We will be encouraging debate and discussion of the papers selected at f1000, and it would be really good if we could find some way of quantifying the discussion and throwing that into the mix. That might take a little more time and thought, though.

I should add that our rankings are made by named ‘experts’, whose reasons are published with the ratings they give. Each ranking is clearly ‘auditable’: we will soon be publishing the score each Faculty Member gives an article alongside their evaluation (and we’ve never hidden how they’re calculated anyway).

In the meantime it’ll be interesting to see how our rankings compare with PLoS’s metrics and the Impact Factor. I’ll keep you posted.

Great, and glad to hear that more experimentation is going on; it’s certainly needed. Your “Matthew Effect” is what I’ve called “monopoly populism” in the article: the idea that good ratings tend to snowball and garner more attention, leaving other, perhaps more worthy, articles that aren’t rated behind.

I worry, though, about any rating system that’s based on the opinions of an editorial board. Realistically, every journal, not just f1000, can say they do this, and that all of their accepted papers are at least a 4.5. But we all know that the International Journal of Nonsense’s 4.5 isn’t quite the same as Science’s 4.5. We need measurements that can span across journals, that we can plug any article into. If your faculty panel doesn’t pick an article, then it doesn’t get rated, and the researcher doesn’t get whatever credit (good or bad) for the research.

To me the issue with the Impact Factor is the undefined subjective adjustments that their editorial board supplies. Your system sounds like it adds some transparency, but it’s still a subjective judgment based on the opinion of a select board, so you’re really just replacing Thomson/ISI with your own folks. I want a system that’s more empirical, and that doesn’t rely on a private group’s opinions.

I look forward to hearing more about your system, but from the description here, I’d add one more thought. If you assume all papers in f1000 are of high quality (at least a 4.5), then your system either needs to run from 4.5 to 5 (rating the article against all of the other literature available), or it needs to be strictly defined as rating the papers against other papers selected for f1000 (then the scale can be whatever you’d like). The more clearly you can define the criteria for a rating, the more meaningful it becomes.

It is really against other papers in f1000 (we’ve considered ranking all of the rest 1-5 and calling everything in f1000 6+ but that might cause problems).

Our ‘folk’ are actually peer-selected, by the way. It’s community-driven: we just enable it. The other thing is that we’re not trying to rank the entire literature; rather we’re filtering it so that people can find the good stuff (as determined by their own community)–the rankings we have are more or less a by-product.

It’s definitely a worthy effort, given the overload of literature being published these days–filtering systems are becoming more and more valuable.

I’m just wondering if there’s a way to measure impact that doesn’t involve any sort of subjective judgment by an individual or group of individuals. Is there a way to take opinions and feelings out of the equation?

Johan Bollen has done that analysis of usage, creating some lovely network maps in the process.

But I’d maintain that the opinions of scientists, especially those who have been around the block a time or two, are important and should be incorporated into the equation.

You may well be right, and opinion may be necessary to truly understand impact. I just worry that it introduces so much bias and potential for conflicts of interest as well as making things essentially unreproducible.

Perhaps someone like Thomson/ISI is indeed the way to go, a neutral third party could make the judgments without logrolling for their colleagues or propping up their own research. Of course, it would need to be transparent and consistent, easier said than done.

We have not officially launched yet but we’d love to have your thoughts!

Some useful and relevant points on whether attention is a good metric for measuring quality in this excellent talk by danah boyd:

We may be democratizing certain types of access, but we’re not democratizing attention. Just because we’re moving towards a state where anyone has the ability to get information into the stream does not mean that attention will be divided equally. Opening up access to the structures of distribution is not democratizing when distribution is no longer the organizing function.

Some in the room might immediately think, “Ah, but it’s a meritocracy. People will give their attention to what is best!” This too is mistaken logic. What people give their attention to depends on a whole set of factors that have nothing to do with what’s best….

People consume content that stimulates their mind and senses. That which angers, excites, energizes, entertains, or otherwise creates an emotional response. This is not always the “best” or most informative content, but that which triggers a reaction…

In a world of networked media, it’s easy to not get access to views from people who think from a different perspective. Information can and does flow in ways that create and reinforce social divides.

Those interested in commenting systems and post-publication review for scholarly papers might want to take a look at this trainwreck, where the author of a book becomes irate over negative reviews of that book, first pretending to be a reader (“Niteflyer One”), then admitting to being the author in question. The author has since pulled her embarrassing comments off of Amazon (which apparently included threats of reporting negative reviewers to law enforcement agencies), but you can get the gist of it from Teresa Nielsen Hayden’s coverage and the comments on her site.

We are likely to see lots of this sort of thing if commenting on papers becomes a regular practice.

A brief study here using PLoS’ article level metrics, looking for correlations. The least valuable metric?

Ratings correlate fairly poorly with every other metric. Combined with the low number of ratings, this makes me wonder whether the option to rate papers on the journal web sites is all that useful.
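
For anyone who wants to run that sort of check against the downloadable PLoS spreadsheet themselves, a minimal sketch might look like the following; the file name and column names (average_rating, html_views, pdf_downloads, citations) are placeholders, not the actual headers, and would need to be adjusted to match the file.

```python
# Sketch: rank-correlate the ratings column against other article-level
# metrics in the PLoS spreadsheet. File and column names here are
# hypothetical placeholders; adjust them to match the actual headers.
import pandas as pd

alm = pd.read_csv("plos_article_level_metrics.csv")  # hypothetical filename

metrics = ["html_views", "pdf_downloads", "citations"]
rated = alm.dropna(subset=["average_rating"])  # only ~5% of articles carry a rating

for metric in metrics:
    rho = rated["average_rating"].corr(rated[metric], method="spearman")
    print(f"Spearman correlation, rating vs. {metric}: {rho:.2f}")
```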
