Looking through a crystal ball. (Photo credit: Vladimir Kush, Wikipedia)

Can experts accurately predict the future citation performance of articles just months after they are published?

A new study of F1000 recommendations suggests that article ratings do have some predictive power, but they are far less accurate than relying on the citation profile of the journal.

The study, “F1000 recommendations as a new data source for research evaluation: A comparison with citations,” by Ludo Waltman and Rodrigo Costas, both from the Centre for Science and Technology Studies at Leiden University in the Netherlands, was posted to the arXiv on March 15, 2013.

Waltman and Costas began with more than 100,000 article recommendations in the F1000 database, where F1000 faculty reviewers rate biomedical papers on a three-point scale (1 = good, 2 = very good, and 3 = exceptional). The vast majority of papers (81.1%) received just one recommendation, with the average paper receiving just 1.3 ratings.

The F1000 recommendation system was not designed to evaluate the entire biomedical literature but to guide readers to the most important papers. Based on various estimates, including its own, F1000 contains recommendations for about 2% of the biomedical journal literature.

The vast majority of F1000 recommendations take place soon after publication, with more than 80% being submitted within four months of article publication; fewer than 10% of recommendations are submitted six or more months after publication.

Limiting their analysis to papers that were published between 2006 and 2009, Waltman and Costas were interested in whether these early F1000 recommendations were able to predict article citations within three years of publication. To do this, they gathered citation data from Web of Science for all the articles that received a recommendation. They also gathered citation data for the 98% of articles that didn’t receive a recommendation, to provide a control group.
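To make the setup concrete, here is a minimal sketch of this kind of comparison (not the authors' code; the data and column names below are invented for illustration). It joins a set of F1000-style recommendations to three-year citation counts and summarizes citations by a paper's maximum rating:

```python
# Minimal sketch of the study's setup; not the authors' code.
# The data below are invented and the column names are hypothetical.
import pandas as pd

# One row per F1000 recommendation; a paper can be recommended more than once.
recs = pd.DataFrame({
    "article_id": ["a1", "a1", "a2", "a3"],
    "score":      [3, 2, 1, 2],   # 1 = good, 2 = very good, 3 = exceptional
})

# Citations accumulated within three years of publication, for all papers,
# including the ~98% that never received a recommendation.
cites = pd.DataFrame({
    "article_id":    ["a1", "a2", "a3", "a4", "a5"],
    "citations_3yr": [120, 15, 40, 3, 60],
})

# Collapse recommendations to one row per paper: how many, and the maximum score.
per_paper = (recs.groupby("article_id")["score"]
                 .agg(n_recs="count", max_score="max")
                 .reset_index())

# Left-join onto the full citation table; unrated papers get max_score = 0.
merged = cites.merge(per_paper, on="article_id", how="left").fillna(0)

# Mean three-year citations by maximum recommendation score (0 = never recommended).
print(merged.groupby("max_score")["citations_3yr"].mean())
```

The last line produces the kind of summary discussed in the next paragraph: average citations for papers grouped by their highest F1000 rating, with unrated papers as the comparison group.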

How did F1000 faculty reviewers do?

In general, higher-rated papers received more citations. Those papers that received a maximum rating of 3 (“exceptional”) did better than papers receiving a maximum rating of 2 (“very good”), which did better than papers receiving a maximum rating of 1 (“good”). Papers that received any rating generally did better than the 98% of papers receiving no rating. And papers with more recommendations generally did better than papers with fewer recommendations. Not surprisingly, papers with more recommendations were also published in more prestigious journals. The paper with the most recommendations was published in Nature.

One interpretation of this finding is that F1000 faculty were able to predict more successful and highly-cited papers. This finding would be interesting if the biomedical literature were published randomly across thousands of journals or deposited in a giant biomedical archive. However, comparing the predictive value of F1000 recommendations to the citation performance of each journal, Waltman and Costas found that F1000 faculty did considerably worse in predicting the future citation performance of articles. F1000 reviewers also did a worse job identifying the most highly-cited articles. They write:

Based on the results of our precision-recall analysis, we conclude that JCSs [Journal Citation Scores] are substantially more accurate than recommendations not only for predicting citations in general but also for the more specific task of predicting the most highly cited publications.

In other words, if you want to find good articles, you are better off using journal prestige as your guide rather than relying on a relatively small group of faculty experts.
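For readers unfamiliar with precision-recall analysis, the sketch below shows the basic idea: rank papers by each predictor, take the top of each ranking, and ask how many of the most highly cited papers it captures. The numbers, the cutoff, and the function name are my own toy illustration, not the study's data or code.

```python
# Rough sketch of a precision-recall comparison between two predictors
# of highly cited papers. All arrays are tiny, invented examples.
import numpy as np

def precision_recall_at_k(score, is_top_cited, k):
    """Precision and recall when the k highest-scoring papers are 'predicted' to be highly cited."""
    predicted = np.argsort(score)[::-1][:k]      # indices of the top-k papers by this predictor
    hits = is_top_cited[predicted].sum()         # how many of them really are highly cited
    return hits / k, hits / is_top_cited.sum()

# Hypothetical inputs: the journal citation score (JCS) of each paper's journal,
# the paper's maximum F1000 rating (0 = never recommended),
# and whether the paper ended up among the most highly cited.
jcs          = np.array([9.1, 4.0, 2.5, 7.3, 0.8, 5.6, 1.2, 3.3])
f1000_rating = np.array([0,   3,   0,   0,   0,   2,   0,   0  ])
highly_cited = np.array([1,   1,   0,   1,   0,   0,   0,   0  ], dtype=bool)

for name, score in [("JCS", jcs), ("F1000", f1000_rating)]:
    p, r = precision_recall_at_k(score, highly_cited, k=3)
    print(f"{name}: precision@3 = {p:.2f}, recall@3 = {r:.2f}")
```

A full precision-recall analysis of the sort the authors describe sweeps this cutoff across the whole ranking to trace out a curve, rather than fixing a single k.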

Waltman and Costas were not surprised by these results. They argue that F1000 recommendations cannot be expected to correlate very strongly with article citations in a system that ignores 98% of biomedical articles. An unrated article could mean one of two things — that it was a bad paper or that it was simply not rated by one of the F1000 reviewers. The second explanation is more likely, as nearly three-quarters of the Top 1% most-cited papers did not receive any recommendation.

So why is F1000 such a poor predictor of important papers? Uneven coverage of the literature may offer some explanation. For instance, coverage of cell biology was much stronger than coverage of surgery. F1000 faculty may also preferentially recommend papers promoting their own interests, such as papers published in their own journals. Waltman and Costas also wonder whether the process of selecting F1000 faculty, based on peer nomination, may lead to biased reviews.

An alternative explanation is that F1000 and citations are simply measuring two different phenomena — F1000 recommendations are measuring the opinions of a select group of experts while citations are measuring the collective behavior of authors.

Whichever explanation you find more convincing, the analysis strongly suggests that early expert ratings are a poor predictor of future citations — far worse than simply using journal citation metrics as a guide.

Phil Davis

Phil Davis is a publishing consultant specializing in the statistical analysis of citation, readership, publication and survey data. He has a Ph.D. in science communication from Cornell University (2010), extensive experience as a science librarian (1995-2006) and was trained as a life scientist. https://phil-davis.com/

Discussion

13 Thoughts on "Can F1000 Recommendations Predict Future Citations?"

My interpretation is that the journal system does a good job of sorting research results by importance. This may be its highest value. Good to know given that it consumes millions of peer review hours a year. Not surprising that F1000 cannot compete.

To me, F1000 would be a very useful tool if its faculty highlighted excellent articles published in marginal journals. At present, the vast majority of reviews are to articles published in highly-ranked journals, which doesn’t provide any additional information to readers. Nevertheless, I don’t think it is reasonable to ask volunteer F1000 faculty to troll the marginal literature to seek out and review exceptional articles. This is a structural limitation of their system.


Figure from: http://scholarlykitchen.sspnet.org/2012/01/27/size-and-discipline-bias-in-f1000-journal-rankings/

Yours is a very interesting idea, Phil. Discovery rather than ranking. It is the kind of thing faculty do well, each with their own special expertise. Make the focus narrow not broad. I love it!

“To me, F1000 would be a very useful tool if its faculty highlighted excellent articles published in marginal journals.”

THAT was their original idea (or one of their original ideas, anyhow); they used to have a highlight category called “hidden gems.” Alas, this was lost along the way…

Good point. Do you (or anyone at F1000) know why Hidden Gems went away? Was it because there were so few submitted by F1000 faculty?

We still have Hidden Jewel listings – both overall and by primary subject area. They’re included in our Article Rankings area (http://f1000.com/prime/rankings/hiddenjewels). They used to be featured on our homepage. Looks as though it might be helpful for our users if we made them more visible again.

Readers of this article might also be interested in the recently published article “Can Tweets Predict Citations?” http://www.jmir.org/2011/4/e123/. The study looked at articles in the Journal of Medical Internet Research, and for their sample the authors found that highly tweeted articles were 11 times more likely to be highly cited than articles receiving few tweets.

Of course, tweets are just a small slice of a much more complex emerging picture of assessment. The more post-publication data we gather, the better we’ll get at predictive modelling for author and article performance. It’s interesting to speculate on the potential impact of that on selectivity for publishers and readers in the future.

Speculation and breathless optimism are fine, but it is studies like Waltman and Costas’s that are required to give us a realistic sense of how these services are working. As for the tweets article, we covered it here in some detail.

Based on Priem et al. (arXiv:1203.4745v1, Altmetrics in the Wild), I tend to view F1000 recommendations as not necessarily “citable” references. Some articles are important for understanding a research area but don’t fit the usual pattern of a citable reference. “How To Choose A Good Scientific Problem” by Uri Alon, for example, has many readers in Mendeley but only 11 citations. It’s a nice piece of reading that has helped me to plan my career, but I can’t cite it in any research article.

The value of citation as a metric varies quite a bit from field to field. For some engineering fields, which are more about solving problems than exploring hypotheses, a correct and complete answer to a real world problem may be more of an end than a springboard to further research, hence not a lot of citations. For clinical practice and methodological articles, impact may be felt by millions of patients if the article changes practice for the better, but since it’s affecting patients and clinicians, rather than researchers, citations may not show this.

That said, in many fields, particularly those covered by F1000, citation serves as the key career advancement and funding metric. For many fields, it remains the best available measurement of whether the work inspired and informed further discovery.

Presumably the F1000 reviewers have limited access to full-text articles, so their reviews will be skewed towards the more prestigious titles, which they are more likely to have access to, and will largely ignore more niche or specialised titles to which they have no access.

Interesting presumption. I would have thought that an elite faculty residing at top research schools would have rather good access to the specialized literature.

I would think they can always get a promising looking article from the author. That is what I do.
