Can experts accurately predict the future citation performance of articles just months after they are published?
A new study of F1000 recommendations suggests that article ratings do have some predictive power, but they are far less accurate than the citation profile of the journal in which an article is published.
The study, “F1000 recommendations as a new data source for research evaluation: A comparison with citations,” by Ludo Waltman and Rodrigo Costas, both from the Centre for Science and Technology Studies at Leiden University in the Netherlands, was posted to the arXiv on March 15, 2013.
Waltman and Costas began with more than 100,000 article recommendations in the F1000 database, where F1000 faculty reviewers rate biomedical papers on a three-point scale (1 = good, 2 = very good, and 3 = exceptional). The vast majority of papers (81.1%) received just one recommendation, and the average paper received only 1.3 ratings.
The F1000 recommendation system was not designed to evaluate the entire biomedical literature but to guide readers to the most important papers. Based on various estimates, including its own, F1000 contains recommendations for about 2% of the biomedical journal literature.
The vast majority of F1000 recommendations take place soon after publication, with more than 80% being submitted within four months of article publication; fewer than 10% of recommendations are submitted six or more months after publication.
Limiting their analysis to papers that were published between 2006 and 2009, Waltman and Costas were interested in whether these early F1000 recommendations were able to predict article citations within three years of publication. To do this, they gathered citation data from Web of Science for all the articles that received a recommendation. They also gathered citation data for the 98% of articles that didn’t receive a recommendation, to provide a control group.
How did F1000 faculty reviewers do?
Higher-rated papers generally received more citations. Papers that received a maximum rating of 3 (“exceptional”) did better than papers receiving a maximum rating of 2 (“very good”), which in turn did better than papers receiving a maximum rating of 1 (“good”). Papers that received any rating generally did better than the 98% of papers receiving no rating. And papers with more recommendations generally did better than papers with fewer recommendations. Not surprisingly, papers with more recommendations were also published in more prestigious journals. The paper with the most recommendations was published in Nature.
One interpretation of this finding is that F1000 faculty were able to predict more successful and highly-cited papers. This finding would be interesting if the biomedical literature were published randomly across thousands of journals or deposited in a giant biomedical archive. However, when Waltman and Costas compared the predictive value of F1000 recommendations against the citation performance of each journal, they found that F1000 faculty did considerably worse at predicting the future citation performance of articles. F1000 reviewers also did a worse job of identifying the most highly-cited articles. They write:
Based on the results of our precision-recall analysis, we conclude that JCSs [Journal Citation Scores] are substantially more accurate than recommendations not only for predicting citations in general but also for the more specific task of predicting the most highly cited publications.
In other words, if you want to find good articles, you are better off using journal prestige as your guide rather than relying on a relatively small group of faculty experts.
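To make the precision-recall comparison concrete, here is a minimal sketch on invented toy data. It is not the authors’ code or data; the citation counts, ratings, and journal citation scores below are hypothetical, and the top-20% cutoff is chosen only to keep the example small (the study used finer thresholds such as the top 1%). The sketch simply shows what it means for one predictor to achieve better precision and recall than another at identifying highly cited papers.

```python
# Hedged sketch of a precision-recall comparison between two predictors
# of highly cited articles. All data below are invented for illustration.

def precision_recall_at_k(scores, is_highly_cited, k):
    """Rank articles by `scores` (descending), predict the top k as
    "highly cited," and score that prediction against the true set."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    predicted = set(ranked[:k])
    actual = {i for i, hc in enumerate(is_highly_cited) if hc}
    true_pos = len(predicted & actual)
    return true_pos / k, true_pos / len(actual)  # precision, recall

# Toy data: ten articles with 3-year citation counts, a maximum F1000
# rating (0 = unrated; 1-3 as in the study), and a hypothetical journal
# citation score (JCS) for the journal each appeared in.
citations = [120, 95, 60, 40, 30, 12, 10, 5, 3, 1]
f1000     = [0,   3,  0,  1,  2,  0,  1,  0, 0, 0]  # unrated != bad
jcs       = [30, 25, 20, 18, 10,  6,  5,  3, 2, 1]

# Call the top 20% most-cited articles "highly cited" (here: 2 of 10).
threshold = sorted(citations, reverse=True)[1]
highly_cited = [c >= threshold for c in citations]

for name, predictor in [("F1000 rating", f1000), ("JCS", jcs)]:
    p, r = precision_recall_at_k(predictor, highly_cited, k=2)
    print(f"{name}: precision={p:.2f}, recall={r:.2f}")
```

In this toy setup the most-cited article is unrated by F1000, so the rating-based prediction misses it while the journal score does not, which mirrors the coverage problem the authors describe: an absent recommendation says little about an article’s quality.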
Waltman and Costas were not surprised by these results. They argue that F1000 recommendations cannot be expected to correlate very strongly with article citations in a system that ignores 98% of biomedical articles. An unrated article could mean one of two things — that it was a bad paper or that it was simply not rated by one of the F1000 reviewers. The second explanation is more likely, as nearly three-quarters of the Top 1% most-cited papers did not receive any recommendation.
The most plausible explanation for their results is that F1000 is a poor predictor of important papers. Uneven coverage of the literature may offer some explanation: coverage of cell biology, for instance, was much stronger than coverage of surgery. F1000 faculty may also preferentially recommend papers that promote their own interests, such as papers published in their own journals. Waltman and Costas also wonder whether the process of selecting F1000 faculty, which is based on peer nomination, may lead to biased reviews.
An alternative explanation is that F1000 and citations are simply measuring two different phenomena — F1000 recommendations are measuring the opinions of a select group of experts while citations are measuring the collective behavior of authors.
Whichever explanation you find more convincing, the analysis strongly suggests that early expert ratings are a poor predictor of future citations — far worse than simply using journal citation metrics as a guide.