Post-publication review is spotty, unreliable, and may suffer from cronyism, several studies reveal.

Reporting last month in the open access journal Ideas in Ecology and Evolution [1], ecologist David Wardle analyzed over 1,500 articles published in seven top ecology journals during 2005 and compared their citation performance nearly five years later with initial Faculty of 1000 (F1000) ratings.

Faculty of 1000 is a subscription service designed to identify and review important papers soon after they are published. F1000 reviewers assign one of three ratings to journal articles (“Recommended” = 3, “Must read” = 6, or “Exceptional” = 9) along with a brief comment. For papers that receive more than one rating, an F1000 Factor is calculated.
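
As a rough illustration of how these numbers combine, here is a minimal sketch in Python. The exact F1000 Factor formula is not described in this post, so the simple averaging below is only an assumption made for illustration.

```python
# Illustrative sketch only. F1000 maps "Recommended" to 3, "Must read" to 6,
# and "Exceptional" to 9; how multiple ratings are combined into an F1000
# Factor is not specified here, so a simple average is assumed for illustration.

RATING_SCORES = {"Recommended": 3, "Must read": 6, "Exceptional": 9}

def f1000_factor(ratings):
    """Combine one or more reviewer ratings into a single score (assumed: mean)."""
    scores = [RATING_SCORES[r] for r in ratings]
    return sum(scores) / len(scores)

# Example: a paper rated "Must read" by one reviewer and "Exceptional" by another.
print(f1000_factor(["Must read", "Exceptional"]))  # 7.5
```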

Wardle reports that fewer than 7% of articles in his study (103 of 1,530) received any rating, with only 23 receiving a “must read” or “exceptional” rank.

Moreover, F1000 ratings were poor predictors of the citation performance of individual articles. The 12 most highly cited articles in his study received no rating at all, and many of the articles that did receive a high rating performed poorly. Wardle concludes:

If, as this analysis suggests, the F1000 process is unable to identify those publications that subsequently have the greatest impact while highlighting many that do not, it cannot be reliably used as a means of post-publication quality evaluation at the individual, departmental, or institutional levels.

Speculating on why an expert rating system that claims to involve the “leading researchers” in science performed so poorly, Wardle offered several explanations. First, coverage of the ecological literature in F1000 is spotty, with some subfields ignored entirely. Second, he sees evidence of cronyism in the system, whereby F1000 section heads appoint their collaborators, colleagues, and recent PhD graduates, many of whom share similar views on controversial topics. The appointment of F1000 raters, Wardle adds, also appears to suffer from geographical bias, with North American section heads appointing North American faculty who subsequently recommend North American articles. He writes:

The F1000 system has no obvious checks in place against potential cronyism, and whenever cronyism rather than merit is involved in any evaluation procedure, perverse outcomes are inevitable.

In a paper published last year in PLoS ONE [2], five members of the Wellcome Trust compared the citation performance of nearly 700 Wellcome-funded research articles with their F1000 scores. Each paper was also rated by two Wellcome Trust researchers for comparison.

As in Wardle’s study, F1000 faculty rated just under 7% of the cohort of papers. While there was a moderate correlation (R = 0.4) between Wellcome Trust ratings and F1000 ratings, the highly rated articles identified by one set of reviewers overlapped little with those identified by the other, and fewer than half of the most important papers identified by Wellcome raters received a score from F1000. Allen et al. write:

. . . papers that were highly rated by expert reviewers were not always the most highly cited, and vice versa. Additionally, what was highly rated by one set of expert reviewers may not be so by another set; only three of the six ‘landmark’ papers identified by our expert reviewers are currently recommended on the F1000 databases.

An alternative to Impact Factor?

When articles receive multiple F1000 reviews, their average rating does not differ substantially from their journal’s impact factor, a 2005 study concludes [3]. An analysis of 2,500 neurobiology articles revealed a very strong correlation (R = 0.93) between average F1000 rating and the journal’s impact factor. Moreover, the vast majority of reviews were found in just 11 journals.

In other words, F1000 ratings did not add any new information — if you are seeking good articles, you will find them published in good journals.
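
For readers who want to see the mechanics behind correlation figures like the R = 0.4 and R = 0.93 reported above, here is a small self-contained sketch; the data points are invented purely for illustration.

```python
# Sketch of the kind of calculation behind the reported correlations; the
# rating and impact factor values below are invented purely to show the mechanics.
from math import sqrt
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical data: average F1000 rating vs. the publishing journal's impact factor.
avg_f1000_rating = [3.0, 4.5, 6.0, 6.5, 8.0]
journal_impact_factor = [2.1, 4.8, 9.5, 12.0, 30.0]
print(round(pearson_r(avg_f1000_rating, journal_impact_factor), 2))
```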

While advocates of post-publication review may counter that this type of metric is still new and undergoing experimentation, F1000 Biology has been in operation since 2002, adding new faculty and sections ever since.  In 2006, the company launched F1000 Medicine.

Earlier this year, Lars Juhl Jensen, a computational biologist and author of several PLoS articles, analyzed the post-publication statistics made public by PLoS and reported that user ratings correlated poorly with every other metric — especially with citations — and wondered whether providing this feature was useful at all.

Unless post-publication review can offer something more expansive, reliable, and predictive, measuring the value of articles soon after publication may be more difficult and less helpful than initially conceived.

—-

[1] Wardle, D. A. 2010. Do ‘Faculty of 1000’ (F1000) ratings of ecological publications serve as reasonable predictors of their future impact? Ideas in Ecology and Evolution 3, http://dx.doi.org/10.4033/iee.2010.3.3.c

[2] Allen, L., Jones, C., Dolby, K., Lynn, D., & Walport, M. 2009. Looking for Landmarks: The Role of Expert Review and Bibliometric Analysis in Evaluating Scientific Publication Outputs. PLoS ONE 4: e5910, http://dx.doi.org/10.1371/journal.pone.0005910

[3] Editor, A. 2005. Revolutionizing peer review? Nature Neuroscience 8: 397, http://dx.doi.org/10.1038/nn0405-397

Phil Davis

Phil Davis is a publishing consultant specializing in the statistical analysis of citation, readership, publication and survey data. He has a Ph.D. in science communication from Cornell University (2010), extensive experience as a science librarian (1995-2006) and was trained as a life scientist. https://phil-davis.com/

Discussion

68 Thoughts on "Post-Publication Review: Does It Add Anything New and Useful?"

“Post-publication review is spotty, unreliable, and may suffer from cronyism, several studies reveal”

Couldn’t the same be said of pre-publication peer-review – the traditional process that we all know and love? Why single out post-publication peer-review for criticism? It’s all peer-review and is subject to the same limitations.

Absolutely. But many critics of the traditional process contend that post-publication review is far superior. Very little has been done to evaluate the validity and reliability of post-publication review. This blog post brings together four such studies.

To my knowledge pre-publication peer review does not attempt to predict winners like this. Journals do not rate the articles they publish, so far as I know, they just publish them. The only rating is “worth publishing here.”

It should come as no surprise that science is just as unpredictable as other human activities. Why should a few people be able to predict which papers will be found important by a large number of people over the next five years? It is like fund managers picking long term winners in the stock market. The presumption is unjustified.

David is correct. Moreover, what is also often forgotten in these discussions is that a crucial function of (pre-publication) peer review is to identify gaps and additional experiments a researcher needs to perform to justify their conclusions.

This is not possible post-publication. A reader may thus decide they don’t accept the conclusions because key controls are missing. Alternatively, and more dangerously, they may accept the results unaware that a major question mark remains.

[Note this is a criticism also frequently made (by researchers) of low-impact/’peer-review-lite’ journals].

In defense of Faculty of 1000,
1) I think you may be cherry-picking a bit from the findings of the WT study. Overall, they conclude that

these data do support the concept that mechanisms such as Faculty of 1000 of post-publication peer review are a valuable additional mechanism for assessment of the quality of biomedical research literature.

Not perfect, sure, but .45 correlation with something so subjective ain’t half bad. As David points out, picking a winner in advance is not easy. Remember the JIF isn’t so great at predicting future citations, either; Seglen finds that “The citedness of journal articles thus does not seem to be detectably influenced by the status of the journal in which they are published” [1]. Bayesian approaches using early citations are promising [2], but you still need at least a year for them to accumulate.
2) Moreover, the Nature neuroscience article only examines articles within the field of neuroscience. Wets et al. [2] examine multiple disciplines, finding that differences between sets of “core” journals as ranked by f1000 vary significantly between fields–a subtlety the monolithic Impact Factor ignores.

But I think the larger error is in equating “early post-publication evaluation” with “Faculty of 1000.” Increasingly, papers are discussed and evaluated in a diverse media ecosystem. The great potential of post-publication review is in aggregating indicators from across this system. We know that early downloads can be predictors of future citation [3]. What if we also include other forms of early activity around articles, like tweets, blog posts, bookmarks, and inclusion in libraries? Asur and Huberman show an amazing .97 R-squared for a model that predicts movie grosses based entirely on Twitter data [4]. What if we could do the same thing for academic papers?
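
To make the aggregation idea concrete, here is a toy sketch of combining several early, article-level signals into one composite score. The signal names and weights are invented for illustration and are not drawn from any of the studies cited above.

```python
# Toy sketch of the aggregation idea: combine several early, article-level
# signals into a single composite score. Signal names and weights are invented
# for illustration; a real system would calibrate them against later citations.

EARLY_SIGNAL_WEIGHTS = {
    "downloads": 0.4,
    "tweets": 0.2,
    "blog_posts": 0.3,
    "bookmarks": 0.1,
}

def composite_score(signals):
    """Weighted sum of whatever early signals were observed for an article."""
    return sum(weight * signals.get(name, 0)
               for name, weight in EARLY_SIGNAL_WEIGHTS.items())

# Hypothetical counts observed in the first weeks after publication.
papers = {
    "paper_A": {"downloads": 120, "tweets": 15, "blog_posts": 2, "bookmarks": 8},
    "paper_B": {"downloads": 40, "tweets": 3, "bookmarks": 1},
}
for name, signals in papers.items():
    print(name, round(composite_score(signals), 1))
```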

F1000 is not going to save the world, and it’s not a complete solution. It’s good to be realistic about that, and this post should be commended for reminding us. But it remains a useful approach–particularly as a component of a more comprehensive, aggregated set of early metrics. (For those interested, I make this argument in more detail in a paper I just published [5]).

[1] Seglen, P. O. (1994). Causal relationship between article citedness and journal impact. Journal of the American Society for Information Science, 45(1), 1-11. doi:10.1002/(SICI)1097-4571(199401)45:13.0.CO;2-Y

[2] Ibanez, A., Larranaga, P., & Bielza, C. (2009). Predicting citation count of Bioinformatics papers within four years of publication. Bioinformatics, 25(24), 3303-3309. doi:10.1093/bioinformatics/btp585

[3] Brody, T., & Harnad, S. (2005). Earlier web usage statistics as predictors of later citation impact. Arxiv preprint cs/0503020.

[4] Asur, S., & Huberman, B. A. (2010). Predicting the Future with Social Media. Arxiv preprint arXiv:1003.5699.

[5] Priem, J., & Hemminger, B. H. (2010). Scientometrics 2.0: Toward new metrics of scholarly impact on the social Web. First Monday, 15(7). Retrieved from http://firstmonday.org/htbin/cgiwrap/bin/ojs/index.php/fm/article/view/2874/2570

On your Scientometrics suggestions:
It’s difficult to see Web 2.0 metrics being particularly useful anytime soon. Uptake has been very low in most scientific communities and these tools are generally seen as untrustworthy (see tomorrow’s Scholarly Kitchen for a discussion of the latest study to reaffirm this, including this quote: “reader comments and ratings would be so open to abuse that it’s hard to imagine that people would interpret them as a valid [indicator] of the paper’s worth”).

Given that such a small portion of the community uses these tools, and given that their interests tend to lie in particular areas (computer science, bioinformatics), the data produced is likely to be heavily skewed, giving disproportionate influence to a very small subset of researchers, particularly a subset that devotes a significant amount of time to using these resources rather than doing actual research. This may change over time but for the near future, they’re unlikely to accurately reflect influence. They’re also likely to fall prey to even more gaming, bias and cronyism than current flawed metrics.

There’s also the problem of separating popularity from influence. There’s a famous quote from Brian Eno about the band The Velvet Underground:

“The first Velvet Underground album only sold 10,000 copies, but everyone who bought it formed a band.”

Compare the Velvets as an influence on subsequent music to their contemporaries like The 1910 Fruitgum Company who had records that sold well over a million copies. Popularity does not equal influence, hence just adding up numbers of downloads or bookmarks may not be reflective of the work having added inspiration to a scientist’s output. Even worse, you have papers like the one discussed here that received lots of attention and blog postings and such because it was not a great piece of scholarship and many were eager to refute it. Attention paid does not translate into quality.

There is certainly potential there for some of these metrics to be part of a larger system, but so far, most of the proposed ones fall short, and they certainly should not be used on their own.

One side note: the study discussed in tomorrow’s post noted that higher usage was seen among senior researchers rather than their younger cohorts.

David, I quite agree with you that usage of social media among scholars is pretty low and varies heavily along disciplinary lines. The academy, like any successful system, resists fundamental change well; Harley et al. [1] discuss how this slows the adoption of 2.0 tools. But it took time for the academy to adopt web pages, email–even the Impact Factor. I’m not saying social media are prevalent right now; I’m saying they’re clearly growing in importance, and we ought to be starting to pay attention.

I disagree that such measures will give disproportionate influence to researchers that spend “…a significant amount of time to using these resources rather than doing actual research.” By that logic, citation-counting gives disproportionate influence to those who publish a lot; communication is an essential part of scholarship, and social media are increasingly how we communicate. Moreover, the great thing about Scientometrics 2.0 is that we can actually look at impact apart from communication, tracking the inclusion of articles in personal bookmarking services like CiteULike or Connotea (Taraborelli [2] discusses this), and personal libraries like Zotero and Mendeley. These may reflect types of impact we’ve never been able to see before, giving us a more nuanced as well as a faster picture of influence.

And while it seems in theory really easy to game social metrics, in practice statistical filters and a vigilant community make it hard; as I point out in the First Monday paper, “Spike the Vote,” a service claiming to be a “bulletproof” way to game Digg ended up sold on eBay. Google manages to be pretty useful, despite extremely lucrative incentives to game PageRank. There will always be an arms race between the astroturfers and the measurers–but at least with social metrics, it would be a visible one, unlike the shady, back-room haggling fostered by the JIF [4].

I agree that whose and what kind of attention is as important as how much attention; this is a problem that the JIF has, too, along with the problem of negative citations. In the short term, Scientometrics 2.0 handles this by ranking types of mentions in different ways: a post on researchblogging.org probably means more than a shout out on 4chan. However, over time these metrics will be able to leverage the persistent identity inherent in most online systems to support collaborative filtering, a la Netflix. Ultimately, we’ll (someday) be able to connect scholars’ identities across different types of services, giving us Google-esque statistical power for personalized recommendation. Today, it’s hard to imagine a Web without Google; in the future, it may be an integrated social-metrics based system like this that plays the same role for scholars.

Of course, this is some ways in the future. And even today, though, I think these metrics may be valuable, when you consider the competition: the JIF is hemorrhaging credibility, and citations take years to accumulate. We recently did a study (should be appearing in August) finding that nearly half of Twitter links to peer-reviewed articles appeared in the first week after publication; that’s real, immediate data we shouldn’t be ignoring. Combined with other sources, this could be powerful data—and it’s likely to become more so over the coming years.

[1] Harley, D., Krzys Acord, S., Earl-Novell, S., Lawrence, S., & King, C. J. (2010). Assessing the Future Landscape of Scholarly Communication: An Exploration of Faculty Values and Needs in Seven Disciplines. Center for Studies in Higher Education, UC Berkeley. http://escholarship.org/uc/item/15x7385g

[2] Taraborelli, D. (n.d.). Soft peer review? Social software and distributed scientific evaluation. Academic Productivity. Retrieved November 25, 2009, from http://www.academicproductivity.com/2007/soft-peer-review-social-software-and-distributed-scientific-evaluation/

[3] Arrington, M. (2007, April). Next Service To Try Gaming Digg: Subvert and Profit. TechCrunch. Retrieved May 6, 2010, from http://techcrunch.com/2007/04/02/subvert-and-profit-next-service-to-try-gaming-digg/

[4] The PLoS Medicine Editors. (2006). The Impact Factor Game. PLoS Med, 3(6), e291. doi:10.1371/journal.pmed.0030291

I’m not sure how relevant the adoption speed of things like e-mail and web pages is to the adoption of Web 2.0 tools for a variety of reasons. Most importantly, e-mail and use of the web both offered new efficiencies, ways to save time in a researcher’s busy schedule. E-mail provided a faster and less time consuming method of exchanging information than snail mail or talking on the phone. Gathering information via web pages is more efficient than going down to the library and making Xerox copies. The vast majority of Web 2.0 is about investing more time, doing more work. It’s broadly a tremendous timesink and very few tools offer any actual increase in efficiency. Given the ever increasing demands on researchers’ time, more burdens are unwelcome.

According to the folks at Mendeley, there are at least 2-5X as many people who use their desktop software but never go online and don’t have accounts at the Mendeley site. There’s no incentive to share your reading list with anyone, so why do the extra work required? Those tools that provide any real benefit offer up the same level of benefit to lurkers that they do to active participants. There’s no incentive to spend time adding content. So while there were great efficiencies to be gained from things like e-mail, the same can’t be said for writing a blog or using Twitter.

Also, unlike e-mail or the web, some Web 2.0 tools are already viewed negatively in some ways–as being less serious and less trustworthy than peer reviewed means of communication. That’s a big hurdle to clear that’s going to slow if not stop any potential adoption. I take the opinion that the tools that are really going to catch on don’t yet exist. It’s hard to predict how useful they may be in assessing impact.

Even if uptake of current tools does suddenly spike, use will likely follow the use seen in every other category of life, best described by Jakob Nielsen’s 90-9-1 rule. If the vast majority of content is created by 1% of users, and that content is driving the measurement of achievement in science, doesn’t that put a huge amount of power in the hands of that 1%? The difference between this and counting citations is that the small group who are writing citations in their published papers are people who are actually advancing science and publishing new discoveries. Blogging requires no such achievement, and the more time one spends blogging instead of doing experiments, the more power one is likely to wield. If you want to measure the real impact a work has, doesn’t it make more sense to look at works it ends up inspiring, rather than conversations about it? This seems to welcome advocacy as a driving force for measuring science, rather than quality.

And that’s the really important question with any metric–what exactly are you measuring? There’s a huge difference between a citation in a paper that says, “I did this new work and I could not have done it without this previous work” and a download (“I am interested in this paper and may or may not read it, and may find it useful or may find it to be awful”) or a bookmark in Connotea. How often do you bookmark things you meant to come back and read later but never quite get around to reading? Again, popularity and interest does not translate into impact.

Until we have vastly better semantic tools, it will be impossible to rank different types of mentions in different ways. I can write a scathing review of a bad paper on ResearchBlogging just as easily as I can write a glowing review of a brilliant paper. Each counts the same as far as the metrics go.

The inherent problem with social metrics is that they are often a better measure of the social nature of the work or the person doing the work than the quality of the work itself. If I write an extremely popular blog, filled with humorous anecdotes and awesome pictures of my cats, and I have a huge following, then when I blog about my latest paper, odds are that it is going to get more downloads and comments and bookmarks than if I’m a quietly brilliant researcher who spends all his time at the bench and doesn’t even own a cat. Social networks select for social networking skills, not for quality science. That’s how the results get skewed and gamed, not necessarily through bots or hiring third parties, but through exploiting the very nature of social networking itself. I’m more interested in rewarding and funding the class genius rather than the homecoming queen.

I understand that the impact factor is badly flawed, but I’m not sure any of these metrics are any better, and I’m not willing to wait a decade before enough people might possibly maybe use them enough to make them meaningful. I do think there’s some really interesting work being done in creating new metrics and I do think the likely solution is going to be some combination of multiple metrics.

I’m afraid we’re hijacking the original post, so I’ll be brief: academics made Web pages and got email not because it saved them time, but because they didn’t want to be left out of the loop. I think it’s likely that some forms of social media will be the same. And conversation is by no means the gold standard of impact–you’re right, that takes years to determine–but it may be a good leading indicator. Neither of us will know until there’s more research, which is why it’s great for Philip to be discussing these early F1000 studies.

Thanks for your good thoughts, and the last word is yours, if you want it 🙂

Reading your post David C., it strikes me that the problems with citation are the same as those you describe with web 2.0, because citation is itself a social medium. In fact citations are often dug up after the research is done, simply to tie the results back to the community. As a measure of quality they are a poor proxy.

But I’d prefer the term importance to that of quality because we are usually talking about the results of the work, not how the work is done. One can’t control the importance of one’s results.

Fair enough David–the social nature of citations are indeed problematic. That’s why it makes no sense to replace that as a metric with something even more reliant on social ties. You don’t fix a problem by compounding it.

Just to quickly reply to the comment above: “There’s no incentive to share your reading list with anyone, so why do the extra work required?”

There’s quite a bit of incentive to create and share reading lists on Mendeley. Coordinating the efforts of large and widely distributed research organizations is a primary use case for sharing reading lists. Don’t take my word for it, though. Take the word of the users, who have created and shared hundreds of thousands of reading lists already.

Creating a directed reading list for a particular working group is a different use case than an individual spending lots of time sharing his reference list with the world.

It argues more for a focused collaborative tool for a known group than a wide open network that wouldn’t provide the same benefits in return. Sure, there’s no downside in that group letting others see the reading list, but likely the benefits will be reaped internally, not from outside of that working group. Doesn’t Mendeley keep those sorts of groups private/invitation-only?

“In other words, F1000 ratings did not add any new information — if you are seeking good articles, you will find them published in good journals.”

To be fair I think the point is that if F1000 can do the same job as good journals then why have good journals?

F1000 does not aspire to rate the entire corpus of scientific literature. Indeed the Wardle (ecology) and Allen (Wellcome) articles report coverage of under 7% of their study articles. And in both studies, F1000 was also unable to identify many of the highest performing articles.

Given these results, do you still wish to replace the journal system with F1000?

“Given these results, do you still wish to replace the journal system with F1000?”

I don’t personally but I also don’t think that it’s entirely reasonable to conclude that post-publication review as a whole is less helpful than initially conceived based on the first few years of F1000, as if anybody (except perhaps BMC) is pitching it as the be all and end all of post publication filtering. 😉

Bearing that in mind the 7% figure is a bit of a red herring. Presumably as an F1000 customer I only want to be alerted to good papers my other filters might miss – the bulk of the literature will be irrelevant.

Euan,
F1000 Biology has been around since 2002 and F1000 Medicine since 2006. At what point do we stop arguing that we are still in the testing period and start the evaluation?

Perhaps I misunderstood your point about replacing journals with F1000. The real value of F1000 is not replicating the brand signaling that is already going on in the top journals: It is alerting readers that there are important papers being published in specialist and archival journals.

Currently, F1000 is focused on the former and is highly deficient in the latter.

Euan,
It’s worth noting that Faculty of 1000 has not been part of the same group of companies as BioMed Central since 2008, when BioMed Central (but not F1000) joined Springer.

That said, I think that Phil Davis’s criticisms of F1000 are decidedly wrong-headed. For one thing, he argues that the F1000 ratings correlate too well with journal impact factor, so “contain no new information”, but they don’t correlate well enough with future article citations, so are unreliable.

He seems to want to have it both ways. It would certainly be very surprising if F1000 evaluations were *not* predominantly from high impact factor journals. The subjective evaluative judgments of expert faculty members and the subjective judgements of expert academic journal editors should be pretty highly correlated. But highly correlated does not mean ‘no new information’.

What is especially interesting about F1000 is (a) the unusual articles from non-high impact factor journals which nevertheless get exceptional ratings – the so called ‘Hidden Jewels’, and (b) the additional information gained when several faculty members express differing subjective opinions about an article – something which in contrast is rarely made visible when a journal editor makes a simple ‘yes or no’ decision to publish an article.

All this is extra information at the article level, despite the fact that *overall* article ratings correlate pretty well with impact factor.

As for the fact that F1000 Factor ratings do not correlate perfectly with future article citations… Phil’s post seems to imply that the citations of a single article are a perfect proxy for (or define the one true objective measure of) an article’s “quality”. But who really thinks that is the case? Citation stats for individual articles are affected by all kinds of extraneous factors, and if you ask a researcher to point to their best work, or a journal editor to point to the article they are most proud of having published, it is often not the most highly cited.

Matt,
If you read the papers I cite in detail, you will realize that their results are consistent with each other.

The Nature Neuroscience study started with 2,500 recommendations published in 200 journals, but focused its analysis on the journals receiving 50+ reviews. This subset of just 11 journals accounted for two-thirds of all recommendations.

In comparison, the Wardle and Allen studies report on their full set of study articles.

It shouldn’t therefore be surprising that the articles receiving a high number of recommendations were published in journals with high impact factors. Those “hidden jewels” are not included in the Nature Neuroscience analysis.

By the way, I’m a little unclear on your relationship with F1000, considering that your company is now owned by Springer. Could you elaborate?

“By the way, I’m a little unclear on your relationship with F1000, considering that your company is now owned by Springer. Could you elaborate?”

I have no formal relationship with Faculty of 1000, though I did play a role in its creation, back in the days when BioMed Central was part of the same group.

BioMed Central is one of a number of publishers and societies who are affiliated with F1000 and link to F1000 evaluations. See http://f1000biology.com/about/affiliates

Ah, I think I see what you’re getting at now.

IMHO when people use the term “post-publication review” nowadays they’re thinking about scenarios like:

“Authors publish in a high volume ‘as long as the science is sound’ repository like PLoS One and then use systems like F1000 (or commenting, or blogs, or…) to tell them which papers are actually worth reading”

… i.e. replacing the editorial selection function of top journals with something more distributed post-publication is exactly where they see F1000 being useful.

But you’re talking about here & now and whether or not post-publication *complements* the existing pre-pub quality review system.

I think I summed up my argument pretty well in my last statement:

Unless post-publication review can offer something more expansive, reliable, and predictive, measuring the value of articles soon after publication may be more difficult and less helpful than initially conceived.

Many of the proponents of the “publish everything and review it later” camp you describe view journals (and their editorial boards) as wielding too much power over the fate of manuscripts, providing evidence to show that their decisions are largely arbitrary. In essence, their argument is anti-hegemonic: It’s about destroying the power structure that allows individuals to use power inappropriately and eradicating the forces that perpetuate this system.

And yet, the selection of “experts” (various F1000 sites list 5000, 8000 and 9000 reviewers), most of whom are also editors and reviewers in the existing journal system, seems to reaffirm and legitimize this same power base.

Measuring the value of articles soon after publication is impossible, as any study of the history of science will show, so why do you all insist on even trying?

Replying to David Crotty:
“Creating a directed reading list for a particular working group is a different use case than an individual spending lots of time sharing his reference list with the world…Doesn’t Mendeley keep those sorts of groups private/invitation-only?”

No, it’s exactly the same scenario. Public collections on Mendeley are just that – collections (aka reading lists) which are public, appear on your profile, can be embedded on other sites, and even have RSS feeds.

A bit late to the party as I catch up with the feeds. Skimming through, the summary would appear to be that relative F1000 ratings do not perfectly match relative citation counts — therefore F1000 ratings are wrong and useless.

Lord protect us from the idea that an academic publication might have any value beyond its ability to accumulate citations.

To address the absence of significant online discussion– “post-publication peer review”–I founded The Third Reviewer. All the major publications in many journals are indexed, and anyone can leave comments–anonymous or otherwise–on every paper indexed. Thus, the back-scratching prevalent at F1000 is replaced by candid discussion.

We started out with just neuroscience and have recently expanded to microbiology, with more fields to come. I strongly believe that honesty is best achieved by (optional) anonymity, and that’s the only way to drive serious online discussion so that post-publication peer review can supplement, if not replace, the fancy-journal hierarchy (hence the name of our site).

I’d welcome any feedback you have (neuroreview, at, gmail).

Martha,
Is a system that allows anyone to comment on a paper — anonymous or not — really a form of “peer review”? Where is the “peer” in “peer review”?

I think “post-publication review” is much more appropriate. My 6-year old daughter could give my paper 5-stars with the comment “great article!!!!!” While this is a form of review (nepotistic in nature), I don’t consider her a peer — at least not yet.

Perhaps your daughter could, but would they? Most likely, the people with the desire and motivation to comment would actually be peers.

For what it’s worth, I recently had the chance to speak with Martha about the site. She struck me as someone who’s just trying to solve a simple problem in her discipline, that of poor public commentary, and doing the obvious thing to address that. So call it what you like, but don’t dismiss it before giving it a chance.

Mr. Gunn, You’re missing my point. This is not about Martha’s new website but about language.

Peer-review is a validation process undertaken by qualified peers, hence the term “peer review.” There is nothing about a process that allows anyone to comment (even those who wish to remain anonymous) that guarantees the evaluation was done by qualified peers, which is why “post-publication peer review” is not an appropriate term.

Why not just evaluate a comment for whether or not it contributes substantively? If it does, then the commenter is de facto your peer, regardless of whether s/he has the right degree from the “right” institution.

The simple fact is that commenting rates on science papers have been utterly dismal, with zero comments on the vast majority of papers even in high-profile journals (just check out the Nature website for evidence). So why not get a site rolling that’s getting a lot of comments, and *then* worry about moderation and quality issues? It seems strange to worry about possibly underqualified strangers damning your paper online when there’s nobody talking about your paper online at all.

The other issue is: people are having these conversations behind your back on an ongoing basis (“What’d you think of X’s paper?” “Oh, it sucked.”) I think most authors would prefer to have those discussions be open, so that the authors are able to respond to them, than have them be private, where the authors can’t defend themselves. And in most cases, people are only willing to speak unpleasant truths if they can be anonymous–or else we’re back to the back-scratching of F1000. (Didn’t mean to pick on it, it was just my only tie-in to your original thread!)

The whole issue of evaluating the quality of a review or comment has been dealt with on a lot of websites by including some sort of widget for rating others’ comments. You use just such a widget here, for example. Really popular sites, like slashdot, even use those ratings to make low-rated comments invisible.

So of course your daughter can write a comment (though I wouldn’t be as sanguine as you about her giving your article five stars….kids are a little devious like that!) but the value it adds to the conversation can then be rated by other users.

Like Mr Gunn, I just don’t imagine that a whole bunch of non-scientists are going to want to leave comments on a science article site–but frankly, so what if they do? If the comments are worthwhile, great, and if not, they’ll get voted down.

Earlier this year, the U.S. Office of Science & Technology Policy (OSTP) launched such a comment site, where readers could comment as much as they liked and give thumbs-up or thumbs-down ratings to other commenters. If you received two or more negative votes, your comment was collapsed from view (although not deleted).

Within a very short time, the site was dominated by Stevan Harnad, who appropriated the site as his personal forum.

Commenters with contrary views were pretty quickly voted down. I don’t see how this aided open discussion. If anything, it allowed a public meeting to be hijacked by a single individual with a deliberate, subversive mission.

Eh, that’s just bad site moderation. I’m unclear from your description how one person managed to vote down everybody else, but in any case–it’s really not difficult for a thoughtful site moderator to keep on top of these things.

Philip- Just as a disclaimer, I’m working with Martha on developing the Microbiology section of The Third Reviewer. This site is very different from Faculty of 1000, and actually I wouldn’t characterize it as post-publication peer review but instead as a broadened discussion.

In any case though, we already do many forms of anonymous review of scientific data- and although the chosen reviewers should ideally be peers, and we call them peer reviewers, this is not always the case. Everyone who has ever participated in peer review of papers or grants, or who has ever received a review knows this at some level. We simply do the best we can but the system is not perfect. There is no reason a priori why Third Reviewer should be any different than the current system- competent review, questions, and discussion most of the time.

I understand your example up there- but again, I’d have to say- a single example does not a data set make.

Helene and Martha,
In my last example, I was attempting to illustrate that post-publication review/ratings does not necessarily equate to “peer review.” And secondly, that review systems which allow individuals to be anonymous (or create pseudonymous identities) are prone to gaming and corruption.

I wish you luck with your new post-publication service.

Martha–many sites are choosing to let the community serve as the moderators on comments (some for ideological reasons, others to avoid legal liabilities). If what you’re seeing on a community-moderated site is bad moderation, then it’s a bad community, one looking to stifle the open interchange of ideas and instead hew strictly to one ideology.

That can be difficult to avoid without wielding a heavy, controlling, editorial hand.

David, agreed entirely. It’s also frankly hard to imagine how any one scientist could succeed at dominating the discussion in multiple topics within neuroscience, let alone across disciplines! I suspect that reader voting will take care of this without moderator intervention–but I just wanted to point out that moderation is always a tool of final resort if there are festering problems like the one Philip described.

Martha–the problem is a commonly-used system where when a few people object to or disagree with a comment, it becomes invisible. That allows a small group, or even one individual with multiple accounts, to erase any dissenting comments. And those with an agenda are likely to put in the work to do this, while those just commenting for fun won’t bother.

Yeah, but I think we’ll worry about it when we get there. Right now science discussion online just isn’t happening. Our theory is to set up a forum where it might actually take off, and then worry about these problems. Probably by then there will be some amazing new WordPress plug-in to deflect all comments originating from Monsanto/”Intelligent design” IPs! I hold out hope.

The problem though, is in establishing a new communication medium. If a system is set up and it ends up being seen as untrustworthy and dismissed by the community, it’s going to be hard to get a second chance.

The better you can make your system, the more reliable and trustworthy it is from the get-go, the better. Once credibility is lost, it’s almost impossible to win back.

Yes, our current plan is to leave all comments visible, regardless of the number of up or down ratings. Thus the ratings function to provide a little feedback but they cannot come to dominate the site’s discussion. If you’re slashdot and getting hundreds of comments, there’s much more incentive to create a comment hiding function, but frankly I’d be astonished if even the most hotly debated paper in microbiology were to reach a hundred comments. It’s a far far smaller user pool, and I think that works to our advantage in some respects.

The Third Reviewer is certainly a site worth watching to see if it can overcome the issues that have dogged other commenting efforts, and if it can achieve some traction in the community.

I’m not sure how it really differs from other efforts though, at least in substantive ways that will eliminate the barriers that currently inhibit commenting. As with all other efforts, there’s no actual incentive for participation. Scientists are busy people, and the job market and funding gets tighter and tighter every day. Spending a lot of time doing something that has merit but that does not help one’s career can be a dangerous path. As such, the best commenters, the people you’d really want participating, are likely far too busy with other aspects of their research to be active in the online community. Ask most researchers if they’d rather put a few hours into crafting a careful response to someone else’s paper or putting those same hours into writing their own paper or doing more experiments, and the answer is fairly obvious. Instead of the best and the brightest weighing in, you’re more likely going to see activity by those with nothing better to do.

Anonymity is an interesting wrinkle, as it allows the potential to overcome some of the social barriers to commenting. People are afraid of leaving negative comments that might hurt their careers if they are identified. But anonymity comes with a downside and you get the same problems seen on other rating sites like Yelp or Amazon. Given the skepticism with which scientists approach the world, it’s hard to see much credence given to anonymous comments. One would automatically start with assuming they’re either shills for the author or competitors with an axe to grind.

As noted in this recent study, online material that hasn’t been through a rigorous peer review process is not trusted by the scientific community. And I think that attitude is going to apply to anonymous comments as well.

I’m not so sure I see the problem with Amazon comments–I often use them to help decide between two products (just last night in fact, we selected a potty seat for my son on precisely that basis…). And what is the “incentive” of Amazon commenters? That they bought a product and want others to know about its qualities–not really so different than a scientist who has read a paper and wants others to know that it is or is not trustworthy.

You ask whether a scientist will want to put “a few hours” into crafting a careful response. Gracious, probably not. But if I’ve already read and thought about a paper, it takes about five minutes to write a decent, substantive comment: “The authors rigorously show X and Y. However I can’t interpret their claim Z without seeing controls for P.”

What’s my motivation for doing so? Surely you’re familiar with SIWOTI syndrome… Research is inherently competitive, and people who feel that a published paper is getting away with unmerited claims may quite well be willing to spend an extra 5-10 minutes to tell others that.

The issue of “trusting” anonymous comments seems to weigh more heavily on some people than others. I don’t need to trust a comment; I only need to read it, and then evaluate it myself. Such evaluation is best done, imho, without regard to the source–that’s the best way to eliminate entrenched biases against certain people (or genders, or ethnicities, for that matter).

Amazon reviews are not anonymous, nor are the comments. You have to purchase something on Amazon to review there, and while you can change your avatar’s name, Amazon knows who you are and can deal with problems because of this.

That said, I had an interesting experience recently there. I read a book everyone was raving about, and while the first 2/3 of it were terrific and riveting, the last 1/3 was a total disappointment — an unfulfilled promise. On Amazon, customers were routinely giving it 5 stars, but I gave it 3 stars. I was pretty let down by the last 1/3, and I think the author ultimately failed in her aim. Anyhow, I posted my review, which was fair and accurate from my perspective, and quickly got comments on my review basically telling me to shut my face about this exquisite book. I decided to stop the harassment, and took down the review. But it bugged me, so I tried again, posting another review a few days later. Once more, comments started appearing from others harassing me about my review. I bailed, and just let it go at that. But I was definitely beat up by a clique of fans. Actually, if I’d been anonymous (if Amazon hadn’t emailed me each time someone commented on my review, for instance), my review might still be up there.

As more people start using social media, we’re going to need more etiquette lessons. I think there’s an opportunity for a Miss Manners of Social Media.

Yeah, clique concerns are something we think about, though again the relative small number of scientists probably works in our favor here. (Admittedly, they can be vocal far out of proportion to their numbers…) But perhaps if the comment you’d left was truly anonymous, you’d have let it stay?

The problems with Amazon review are well documented. A glitch in their Canadian site a while back gave a great deal of insight into how many of their reviews for books were written by the authors themselves. And a recent scandal shows that academic authors are not immune from this sort of gaming. Given the high stakes of career advancement and funding, I find it hard to believe that scientists would somehow not act like all other humans on earth.

If I’m going to post a serious review of someone else’s work and have my name permanently attached to that critique, then I’m certainly going to spend a good amount of time making sure I get my facts straight. If all the site aims for is gossip or seat of the pants comments that aren’t well thought-through, then it’s not going to offer a great deal to the reader.

The idea of stubbornly arguing on the internet is only exciting to a fairly small percentage of the population, just as very few people blog or leave comments on blogs. Again, you’re biasing the conversation toward those who like to spend lots of time arguing online. No problem with that, as I’m one of those people, but it leaves out the vast majority of scientists, and particularly the scientists that you really want commenting. The smartest scientists are the busiest ones, and the joys of arguing with strangers is not enough incentive for the vast majority of the population.

Trusted voices are valuable filtering mechanisms in an age of information overload. As noted, anonymity has its place as it allows a freedom of speech but at the same time, it reduces credibility.

The issues of scientists trashing each other’s work from behind a cloak of anonymity, however, are found elsewhere as well: pre-publication peer review. It’s quite clear that scientists will sometimes trash or delay others’ work in order to bring out their own study on the same topic. So perhaps you’d like to abolish the current peer-review system too?

It is indeed a problem elsewhere. But you don’t solve a problem by implementing a system that makes bad behavior even easier.

In pre-publication review, the reviewers are not anonymous–the editor who solicited them knows exactly who they are. Problems can often be solved by a good editor, one who chooses reviewers wisely, is able to read between the lines, integrates the opinions of multiple reviewers well and takes author objections seriously. There’s no perfect solution, but it is an area where oversight can at times be helpful.

Ok, but if my son’s new potty totally rocks, you’re all disproven.

Fair enough. Personally, I’ll stake my reputation by giving a non-anonymous recommendation to the Baby Bjorn trainer seat. All the foam padded ones we used soaked up urine like a sponge. The Baby Bjorn is hard plastic and easy to clean. Good luck!

That’s the one we ordered!!!!!!!

Because a whole bunch of people we don’t know rated it highest.

One other obvious worry about the site is the disconnect from the actual published paper. Most people still download the pdf version of the paper, print it out, and read it that way. Even if they’re reading the html version online, most will still remain unaware that there’s a discussion going on elsewhere. Requiring readers to go through extra steps, to head to another site and then search for the paper they’ve just read to see if there’s any discussion is a fairly big hurdle.

Definitely an issue. Two points:

1) Right now if people want to comment on a paper, the options for doing so are totally decentralized (ie journal websites, of which most scientists visit a dozen or so). A central aggregator site like ours could lower that threshold to participation.

2) The site’s still in its infancy. We’re working on several tools that would give greater visibility both on social media platforms and on the journal websites themselves. Stay tuned..

I really appreciate all this feedback–it’s useful to help us prioritize site improvements.

Martha and Helene–

Just wanted to add a postscript–I do think what you’re doing is valuable, open discussion of scientific results is very important, and it’s something that’s disappeared from the culture over the last twenty years. Science is worse off because of this.

I’m just skeptical that this can be restored without sweeping changes to the culture, particularly changes to the way funding and careers impact a researcher’s behavior. Sadly, things seem to be going in the opposite direction, as pressures are increasing, rather than decreasing, and activities that are good for science but not necessarily career-advancing are becoming even less of an option. I’d love to see you succeed, I’m just not sure how to pull something like this off.

Thanks, David. I certainly understand the concerns about motivation for participation. We don’t know if this will work either, but we think it’s worth a try.

Thanks David-

I appreciate your saying that. You are right, we are looking for a big change. I feel like we have to start somewhere though, and sometimes big changes happen as the result of a little change instituted by someone who just wouldn’t take no for an answer.

As for career advancement and the things that get one there: I guess there are the traditional metrics for this in science (i.e., papers and grants), and sometimes that is all people see. I do know that every time I’ve done some activity like this that didn’t seem to have a clear benefit to my scientific career up front, there turned out to be some twist or benefit that I never could have predicted in advance.
