Peer review, the process whereby new results are scrutinized by competent peers before publication, forms the heart of most scientific journals.
In recent months, investigations and allegations have questioned what some journals consider to be a “peer” and what exactly constitutes a “review.” The scrutiny manuscripts receive varies so widely across journals that the term “peer review” may hold very little value in itself. For some journals, the phrase has been used to create entirely false pretenses about what takes place between submission and publication.
Perhaps not surprisingly, some see a market in measuring and rating the peer review process across journals. I recently interviewed Adam Etkin, Director of Publishing at the Academy of Management and frequent commenter on the Scholarly Kitchen, about his new venture, called preSCORE, which is described as a new metric that “measures the level of peer review conducted prior to the publication of scholarly material.”
Disclaimer: preSCORE was recently acquired by STRIATUS/JBJS, publisher of the Journal of Bone & Joint Surgery, whose CEO is Kent Anderson, fellow blogger and current President of the SSP. To avoid conflicts of interest, Kent has recused himself from participating in the interview and the comment section below.
Q: What problem do you hope to solve with preSCORE?
A: The short answer is that we will help to verify that a journal which claims to be conducting peer review is in fact doing what it claims. The slightly longer answer is that preSCORE aims to support journals and publishers who value ethical, rigorous peer review, as well as those who read scholarly publications.
Q: When did you come up with the idea of preSCORE?
A: While out for a jog on a Saturday morning during the fall of 2009. I do not remember the exact date or time. Over the years I’d revisit the idea, tweak things, and get feedback from people in the industry. I was very fortunate that while Director of Publishing for the Academy of Management I was able to pursue my master’s degree in publishing, and I wrote my thesis about peer review and the preSCORE concept.
Q: A solution to the problem of trust in peer review has been transparency. The EMBO Journal, for example, publishes the entire peer review process, including all referee comments, editorial decision letters, author responses, and the timelines of submissions, decisions, revisions and publications. You have decided to come up with a single numerical ranking that summarizes the quality of peer review. Can you explain the benefits of your approach?
A: First, I think efforts such as your EMBO example are great. However, I don’t think most journals are ready to go that far, at least not yet, and they may lack the resources to do so even if they wanted to. Second, I want to stress that preSCORE is more than just a metric or a “single numerical ranking that summarizes the quality of peer review.” When we started to show people our idea, they were too focused on “What’s a good number? What’s a bad number?” Our view is that ANY participation in preSCORE is good. ANYTHING we as a scholarly community can do to promote and support legitimate, ethical, rigorous peer review helps everyone. As I said earlier, at the most basic level we will let users know that an article was peer reviewed. If the user wants more details, including the metric, they can drill down for that. At the journal level we consider other factors such as COPE membership, plagiarism screening practices, transparency of retraction policies, and other best practices which we think should be followed by legitimate scholarly journals.
Q: The preSCORE algorithm is based on the h-indices (a numerical index of the productivity and citation performance of an author’s papers) of a journal’s editors and reviewers. Can you explain how the h-index measures the quality of peer review a paper receives?
A: At first as I developed the idea I was trying to figure out a way to indicate how many “eyeballs” looked at something prior to publication. That’s where the first algorithm came from. As I thought about it I wanted to come up with a way to also tell what “types of eyeballs” or how “expert” the people involved in the peer review process were. By factoring in the h-index it seemed to be a way we could try to do that.
Q: The preSCORE algorithm assigns three numerical values to editors (0.4 to the Editor in Chief (EIC), 0.3 to the Associate Editor (AE), and 0.2 to each of the reviewers). How did you come up with these weights? Does the EIC (who may spend just a few minutes reading the abstract and assigning an AE) provide twice the value of someone who spends two to five hours reviewing a paper?
A: Not the first time someone has asked me that! We think of an EIC as the Captain of a ship. They focus on the mission of the journal and help make sure that the ship (the journal) is on course; an EIC makes sure that the editorial board members, the AEs and reviewers are performing their duties properly and effectively. Ultimately a journal EIC is responsible for what gets published, so the buck stops there, so to speak. That’s why we’ve weighted them the highest in our algorithm. Also, there are generally more reviewers involved so I think that balances things out a bit.
Q: There is research suggesting that the quality of review declines with age (Callaham, 2011); at the same time, the h-index of authors increases with age. A graduate student (or post-doc) may provide an excellent quality review yet have a very small h-index. At the other end, an emeritus professor may provide a very poor review yet have a very high h-index. Is it possible to distinguish quality from longevity in these cases?
A: Another good question and something we’re thinking about. I’m aware of these types of studies, but I’m not certain older reviewers are necessarily poor reviewers. Having said that, it is absolutely true that a younger reviewer can do a great job. H-index seems to be the most widely accepted and available metric for what we are trying to accomplish. We are looking into whether or not the m-index, which takes into account the length of someone’s publication history, might be an alternative.
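For readers unfamiliar with the two metrics under discussion, here is a minimal sketch in Python. The definitions are the standard ones (the h-index as introduced by Hirsch; the m-index as the h-index divided by years since first publication); the citation counts in the example are invented purely for illustration.

```python
def h_index(citations):
    """h-index: the largest h such that h of the author's papers
    each have at least h citations."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h

def m_index(citations, years_since_first_paper):
    """m-index: h-index divided by career length, which evens out
    the comparison between junior and senior researchers."""
    return h_index(citations) / years_since_first_paper

# Hypothetical examples: the senior researcher has the higher h-index,
# but the junior researcher's m-index shows a steeper trajectory.
senior = [40, 35, 22, 18, 15, 9, 7, 6, 5, 4]   # 20 years of papers
junior = [12, 9, 7, 5, 3]                      # 4 years of papers
print(h_index(senior), m_index(senior, 20))    # 7 0.35
print(h_index(junior), m_index(junior, 4))     # 4 1.0
```

This is exactly the tension the question raises: ranked by h-index alone the junior reviewer looks weaker, while a career-length-normalized metric tells a different story.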
Q: Similarly, a journal that employs professional editors, like Nature, may score much more poorly than one that relies entirely on researchers, like eLife. Care to comment?
A: I’d just repeat what I said earlier. We don’t want to focus on “scores” as much as participation. It’s amazing when an athlete wins a gold, silver, or bronze medal, but it’s also amazing for any athlete simply to compete in the Olympics.
Q: Your algorithm also includes, in the denominator, the square-root of the article version in its score. Why does a paper that has gone through multiple revisions indicate poorer peer review? In my thinking, it suggests just the opposite.
A: I don’t think it indicates poorer peer review at all. A really well-written paper by a researcher who is tops in their field might come in to a journal and need only one or two revisions. Is that “bad?” Of course not. We use the square root because typically earlier rounds of review are much more rigorous than later rounds. By the final rounds of review, some reviewers may have dropped off, or they’re basically just saying, “Yeah, they’ve addressed my concerns. Accept!”
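Putting together the details given in this interview — role weights of 0.4 (EIC), 0.3 (AE), and 0.2 (each reviewer) applied to participants’ h-indices, with the square root of the article version in the denominator — the calculation might be sketched as follows. The actual preSCORE algorithm is proprietary and almost certainly more involved; the function and field names here are my own invention.

```python
import math

# Role weights as described in the interview (an assumption that they
# multiply each participant's h-index; the real formula is not public).
ROLE_WEIGHTS = {"eic": 0.4, "ae": 0.3, "reviewer": 0.2}

def prescore(participants, version):
    """Illustrative preSCORE-style calculation.

    participants: list of (role, h_index) pairs for everyone involved
                  in the review of this article.
    version:      1-based article version (number of submitted revisions).
    """
    weighted_sum = sum(ROLE_WEIGHTS[role] * h for role, h in participants)
    # Dividing by sqrt(version) discounts later, typically lighter,
    # rounds of review less steeply than dividing by version itself.
    return weighted_sum / math.sqrt(version)

# A hypothetical second-version paper reviewed by an EIC (h=30),
# an AE (h=18), and two reviewers (h=12 and h=8):
score = prescore([("eic", 30), ("ae", 18), ("reviewer", 12), ("reviewer", 8)],
                 version=2)
```

Note how this sketch makes the critiques in the surrounding discussion concrete: the score is driven entirely by inputs (who reviewed), not outputs (what the review caught), and a revision round can only lower it.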
Q: How does your algorithm adjust for missing data? Do you impute values?
A: The system would flag instances like those so staff could go in via an admin to see what’s going on and manually enter the appropriate info.
Q: What is the difference between preSCORE and preVAL?
A: While the preSCORE metric is the “heart” of the system, as I mentioned, we did not want people to be put off or confused by numbers and rankings. That’s why we came up with preVAL. PreVAL answers the first, most basic question: “Was this peer reviewed?” Just the appearance of preVAL lets the user know the answer to that question is “yes.” We want to give the user an indicator of rigor, but we also want to try to avoid that idea of conflict you mention from a journal POV. Users who want more details about the peer review process behind an article can click the preVAL link and open a preSCORE window that displays additional information such as how many rounds of review were conducted, which roles participated in the review, the preSCORE metric, and more. Some journals may elect to share the reviewer comments for each round. Some may want to publish reviewer names while others may not. Right now preVAL is an article-level, not journal-level, tool. Having said that, as we see wider adoption we may expand this to the journal level.
Q: From your FAQ page, it appears that the publisher is the customer of your services. As we’ve seen in the financial market, there is a risk when ratings agencies have a financial relationship with those they intend to evaluate. How does your service address this conflict of interest?
A: If I understand what you’re asking, I don’t see a conflict of interest. Our evaluation and services must be 100% on the up and up. Anything less will destroy our brand as well as that of the journals who participate. Our entire philosophy and mission is grounded in ethical behavior and rigor for all involved.
Q: Can you explain the business and management relationship between preSCORE and the Journal of Bone & Joint Surgery?
A: STRIATUS/JBJS, Inc., publishes the Journal of Bone & Joint Surgery, along with other journals. JBJS is committed to “Excellence Through Peer Review” so our philosophies align. They also offer educational products and as their first data product preSCORE makes sense for them. With the acquisition of preSCORE by STRIATUS/JBJS, Inc., I will have more resources at my disposal in terms of management specialties and organizational infrastructure in order to bring preSCORE services to the market and sustain and build the preSCORE products. As far as management, I’m confident that I’ll have the level of autonomy and support I’ll need to make preSCORE very effective, and will also be able to bring my team’s perspectives to the management team at STRIATUS/JBJS, Inc.
Q: What do you hope to accomplish in 2014?
A: We’ve reached an agreement with Thomson Reuters that will allow us to use the h-index and data from Web of Science, and the work on that will be completed in Q1. We’re also working on custom development with ScholarOne, Aries, BenchPress and more. Following that will be proof of concept. We have several publishers who will be participating. At the same time we are developing APIs and web services which will allow platform providers such as HighWire, Atypon and others to display preSCORE information. We expect to go live by Q4. Of course, in addition to continuing to develop and build our services, we will be spreading the word about preSCORE!
29 Thoughts on "A Metric for the Quality of Peer Review: Interview with Adam Etkin of PreSCORE"
A few obvious questions spring to mind:
First, how is verification done? If a disreputable journal is claiming to have an academic as the Editor in Chief, and claiming to do peer review, how do you know if they’re really doing it?
Second, and perhaps more importantly, it’s one thing to ask academia not to use a specific numbered scale to rate performance and quite another to have it actually happen. As we know from the Impact Factor, the intent of its creator is quite far from how it is used in practice. Does the use of a numerical scale increase the risk for abuse? Even worse for the researcher, their rating is going to be largely out of their own control. Researcher A submits a poor paper to Journal X, Researcher B submits a superb, groundbreaking paper to Journal Y. Journal X has an EiC with a high h-index score and sends it to one senior peer reviewer who gives it a cursory read and accepts it. Journal Y has a young rising star in the field as EiC, but her h-index isn’t yet strong because she’s still early in her career. The paper is sent to three expert postdocs who do a thorough review and require two rounds of very helpful revisions before the paper is published, where it inspires an entirely new line of inquiry. In this case, Researcher A has a higher score than Researcher B, and may get potentially more career credit for lesser work.
Hi David. Will try to give a quick answer now and come back later after my morning meetings for more details.
As to your first question…the first, low level of verification is done when the journal exports metadata to us via their submission/peer review system. These systems tag the roles of those who participated in the review process. Second, as I mention in the interview, we are doing more than just running calculations of a metric. We look at membership in COPE, plagiarism screening, etc., and other best practices I think we all agree that “disreputable” journals don’t follow. We’ll have other “guardrails” in place to weed out anyone who is trying to game anything or not doing what they claim.
As to your other points, preSCORE is attempting to measure the peer review process, not the research itself, although we do believe that in most (not all) cases a more thorough review process results in better research. Having said that, your point is well taken, which is why we try to get away from this “good number/bad number” type of thinking. As for the scenario you describe, I’d question why the EIC of Journal X and the senior peer reviewer in your scenario would publish a “poor paper” in the first place.
I think there’s a certain risk one takes in turning loose a metric on the world. Just as you’re not trying to measure the research itself, so the Impact Factor was not meant to measure the individual researcher or paper, and we know how that worked out. Human brains seem to crave numerical ranking systems, and I worry that if you give them one, they will use it.
As for the scenario you describe, I’d question why the EIC of Journal X and the senior peer reviewer in your scenario would publish a “poor paper” in the first place.
It happens for all kinds of reasons. Maybe the author was a big name in the field, as some journals are happy to publish even incremental results if they have a big name attached. A recent study suggests that big name scientists draw more citations:
I view preSCORE as one more tool the research community can use when trying to filter through all the material that is out there. The intended use of the tool is to further validate and support journals that conduct legitimate peer review. The Bohannon sting was not really about OA; it was about peer review. For argument’s sake, would that sting have even been needed had a service such as preSCORE existed and been widely adopted? At some point in history someone needed to drive a nail into a piece of wood, so the hammer was invented. Unfortunately, at another point in history someone decided to pick up a hammer and smash people in the head with it. This is a silly way for me to say we need the “right tool for the job.” I am of the opinion that more tools/filters are better, but let’s use them for the job they are intended.
You’re right that journals sometimes publish things they should not. I realize it may be a naive stance, but shouldn’t a really good editor not publish based on name alone? I’ve worked with many editors who reject submissions from “big name” people in their field because the work was not up to standard. That’s how it should be, isn’t it? It’s hard to balance how we’d like things to be with the real world, but we are trying to work towards something better with preSCORE.
I see a fair amount of irony in the sentence “shouldn’t a really good editor not publish based on name alone?”, when preSCORE precisely gives big points to big names (measured by h-index).
We all agree that famous researchers do not necessarily produce only great papers, so a good editor should not accept a paper based on the authors’ reputation. We should then all agree that high h-index researchers do not necessarily make good and rigorous editors, so a measure of the quality of peer review (and, I would add, the editorial process) should not be based on the h-index of the editors and reviewers.
rigor = h-index ? Really? Thomas Kuhn would know exactly how to categorize this.
I was reading about this service the other day following the press release from JBJS and I had to read through the web site several times to get my head around it so thanks for this interview. I really do appreciate that there are people willing to experiment with our vastly changing business models and practices.
That said, I too am concerned about using an h-index for scoring whether or not a paper received quality peer review. In a world where editors are volunteers, it is often true that once they hit department chair, they no longer have time to serve as an editor. There is a sweet spot in finding people who are known and yet not so popular that they are too busy to answer queries from the editorial office. Further, for fields in which there is a certain level of practitioner participation as editors and reviewers, the h-index is completely irrelevant.
I am also trying to figure out what problem this solves. Are authors really concerned or confused about the level of peer review being done? Do readers really care and will libraries only subscribe to journals that have a good score? What about journals who define peer review differently? There is a vast difference in say the PLOS One model versus what many traditional journals are doing. Does PLOS One score lower because the reviews are less rigorous?
I do think this is an interesting experiment and I kind of like that it is countering some other movements attempting to devalue peer review by making it as fast as possible at any cost. Good luck.
Thank you for the well thought out response and the questions. I will try to address as many as I can.
The concern over how we are using the h-index is something we’ve heard, even before this interview was published. As I discussed, what we are trying to do is not just answer the question of “was this peer reviewed?” but also give an indication of the level of the people who were involved in the peer review process. Is the h-index (or any metric, for that matter) perfect? Nope. But it is currently the most widely accepted and available metric we could use for this purpose. We are looking into whether or not we will be able to use the m-index instead of the h-index for our calculations. The m-index considers the length of a researcher’s publication history and evens out the scoring so younger researchers are not ranked as “lower” compared to those with longer publication records. It’s still unclear if we will be able to get the m-index from WoS or not, but if we can and we change our approach, we will certainly let everyone know.
As to your other questions, unfortunately I do think there is growing concern about the level of peer review conducted by journals. We’ve got more and more predatory publishers (note I did not say OA here) claiming to conduct peer review when they are not. Even well-established journals are having peer review called into question (arsenic-life etc.). Non-peer reviewed material is appearing in peer reviewed journals and creating confusion, often with very negative results (see Kent’s SK post from last year about The Economist paper: http://scholarlykitchen.sspnet.org/2013/06/04/austerity-research-when-ideology-and-polemicism-overwhelm-facts-and-logic/).
Will PLoS ONE “score” lower? We don’t know yet. I’d love to have them participate! On the one hand, some “mega journals” have eliminated the role of the EIC or overseeing editor, a role preSCORE places extra value on, and they have recently had their peer review called into question. On the other hand, they place great emphasis on rigorous peer review, and it is possible we may find that they rank very well in the preSCORE system.
These are all questions we can have answers to as we get more participation and adoption of preSCORE. As you say, we are trying to work to improve things and support quality peer review. Appreciate your thoughts and questions!
I think the explosion of metrics for nearly everything generates the need for a lot of effort with possibly diminished returns. This focus on ornate metrics is perhaps a sign that the best way to decide whether a journal (and its peer review) is of good quality is to take the time to read some of the papers in it. This whole discussion seems to be predicated on the assumption that researchers don’t regularly read journal articles, when in fact many of them still do (http://www.nature.com/news/scientists-reading-fewer-papers-for-first-time-in-35-years-1.14658).
Reblogged this on Envision, Educate, Elucidate and commented:
Updates in scholarly communication & peer review.
Before I continue commenting in this discussion, a disclosure: judging quality of peer review is central to my company’s services, so I have a competing interest.
With that out of the way: Labeling this particular metric as measuring “quality of peer review” is obviously wrong, and possibly harmful.
It is immediately clear that the age of the editors, followed by age of the reviewers, has by far the strongest factor loading on the metric. Secondarily, the scientific impact of papers written by them also correlate, but less than age. So, “high quality peer review” = “journals edited and reviewed by old people who are well known for their own research”? I strongly disagree.
There needs to be a measure that is able to ruthlessly expose sub-standard peer-reviewing (either by individual or by a journal), even when the editor is a highly respected, extensively cited authority. Especially when.
Obviously I am going to strongly disagree that preSCORE’s approach is “obviously wrong, and possibly harmful.” What’s wrong and harmful are publications which claim to conduct peer review but don’t. How often do we hear “peer review is broken?!” “Peer review doesn’t work?!” I actually think that traditional pre-publication peer review works pretty darn well WHEN IT IS DONE PROPERLY. The problem is that it’s been misused too long by too many. This is not to say I do not see value in alternate methods. Pre-print review, post-publication review, and new services such as those offered by Rubriq and, yes, Peerage of Science all have a place. Why limit the types of tools at our disposal? The goal is to support research and to assist those ethical editors, reviewers, and publishers who work hard to publish the very best material.
“High quality peer review” = “journals edited and reviewed by ethical people who are working hard to publish good research.”
I’ve addressed the issue of h-index in other responses here as well as in my original response in the interview.
I agree that we need to shed light on sub-standard peer review, which what preSCORE is trying to do. I don’t think our goals are too far apart. I hope as things progress we might find common ground and work together towards improved, more transparent peer review for all engaged parties.
My problem with preSCORE is that it measures inputs, not outputs, and just presumes there’s a relationship of input to output. Highly regarded editors and reviewers (input) asking for minimal changes in review will get a good score. More junior people working their tails off through multiple revisions to produce a good paper will score lower. PreSCORE does not measure how often serious errors are caught (output) or how much a manuscript improves in review (output).
Giving so much weight to the EIC’s publication record is silly. The EIC is a management position, and publication record does not translate well to management skill. Harold Ross, who founded The New Yorker, was an undistinguished writer but a superb EIC.
From an editor’s view, there’s so much disincentive in aiming for a high preSCORE, it becomes a joke. Compliance with COPE’s ethical standards is a far better indicator of responsible peer review.
Thanks for the comments. Hopefully I can clear up what I think are some misconceptions.
1. If I understand your first point, a paper which goes through multiple revisions with “junior people” will actually have a score which evens out or might be higher than a paper reviewed less by others. Yes, the later rounds of review are weighted less heavily than earlier rounds, but they still count. If you re-read my responses to Phil’s questions you’ll see that we acknowledge some of your concerns.
2. I don’t agree that a scholarly journal EIC is just a “manager.” Again, I think I’ve explained why we feel that way in my response to Phil’s question.
3. Again, you’re so focused on the metric etc. I’d ask you to re-read my responses and you’ll see that I state that there are other factors we look at, including if a journal is a member of COPE.
On “3. Again, you’re so focused on the metric etc. I’d ask you to re-read my responses and you’ll see that I state that there are other factors we look at […]”, one big problem is that, as pointed out by David Crotty, giving a metric along with a detailed analysis is the best way to have the analysis forgotten completely.
Moreover, your answers about the h-index being not perfect but acceptable are unconvincing. The h-index is an imperfect measure of scientific impact; it is in no way a measure of the quality of an editor’s or reviewer’s work. You are trying to measure temperature with a barometer here!
Speaking of outputs: Are the number of retracted papers from a journal taken into account in these metrics? Doesn’t this reflect on the quality of peer review? There have been some high profile concerns in journals that, in my opinion, should have been easily caught by high quality peer review.
Great point. As I mentioned briefly in my response to one of Phil’s questions, we do consider the journal’s retraction policies in our evaluation. Right now we are not counting the number of retractions, but I think you make a valid point, and this is something we need to consider. Thank you!
I apologize, but I do not understand the answer to “Similarly, a journal that employs professional editors, like Nature, may score much more poorly than one that relies entirely on researchers, like eLife. Care to comment?”. Let us presume that the world figures out how to ignore the metrics that preSCORE provides. What exactly does the preSCORE provide other than these metrics? I’m not quite sure that I understand how the service accounts for the noted issues between, as the question asks, how Nature and eLife approach who they use in their editor pool.
Hi Julia. I don’t think it’s a given that those journals will score lower. I do think a good EIC is very valuable. Metrics aside, as I said in the interview and some of my other responses, the other service we hope to provide is validation for those journals who are conducting peer review in the manner in which they say they are. Similar to organizations such as COPE, Sense About Science, and others, we want to raise awareness around the importance of good, ethical peer review. I think there are things we can do to increase transparency around the process while still preserving the anonymity many desire. At the same time we can support those who want more open peer review. If a participating journal wants to display reviewer comments and/or names, we can support that. It’s up to the journal. We won’t share any info they don’t want to. I think some of this will become more clear when we update our website and are able to share some examples of how this all works.
For subscription journals the “price signal” has been an effective (although not perfect) way to measure/reinforce editorial process quality and associated output. Journals with poor editorial process and output don’t get subscribers, and die.
OA journals don’t benefit from the same market mechanism, so substitutes like PreSCORE are sought.
Another alternative might be to give libraries a “virtual budget” to “spend” on “virtual subscriptions” to OA journals thereby simulating one of the positive aspects of the price signal.
I guess my most basic disagreement with Dr. Etkin is that I don’t want to “fix” the traditional peer review system. I want to replace it. The technological developments of recent decades make it possible to do so very much more. As an example, have a look at the concept of Open Evaluation, cf. http://bit.ly/TXztRY
First, while I am not a doctor, my mother will certainly be overjoyed that you referred to me as one. 🙂
I’m familiar with Open Evaluation, and after reading the info you’ve provided I’m struck by how many similar ideas there are between that and what we are proposing with preSCORE. I do disagree that the traditional peer review system needs to be replaced. It still works pretty darn well when conducted correctly. Let’s work to improve it rather than throw the baby out with the bath water. Technological developments have improved automobiles, but I still drive a car to work every day. I also have a variety of “metrics” (MPG, crash test ratings, customer reviews, etc.) which help me decide what kind of car to buy.