Among the steady stream of journal articles criticizing peer review, a recent publication, “Comparing published scientific journal articles to their pre-print versions”, stands out for the number of major problems it contains. It’s perhaps ironic that a paper finding no value in peer review is so flawed that its conclusions are untenable; the fact that it was nonetheless published in a journal is itself an indictment of peer review.
The article’s premise: journal peer review is meant to improve manuscripts, thereby justifying the cost of APCs and subscriptions. The authors hypothesize that those improvements should lead to measurable changes in the article text. They obtained preprints from arXiv and bioRxiv and located their published equivalents, then used a suite of text divergence metrics to compare them.
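The paper’s own metric suite isn’t reproduced here, but as a minimal sketch of what a text similarity comparison of this kind involves (assuming simple length and character-sequence similarity; the function names and example texts are my own, not the authors’), it could look something like this:

```python
# Minimal sketch, not the authors' actual pipeline: score how similar a
# preprint's body text is to the published version. Both measures return
# values in [0, 1], where 1 means the two texts are effectively identical.
from difflib import SequenceMatcher

def length_similarity(preprint: str, published: str) -> float:
    """Ratio of the shorter text's length to the longer's (1.0 = same length)."""
    longer = max(len(preprint), len(published))
    return min(len(preprint), len(published)) / longer if longer else 1.0

def sequence_similarity(preprint: str, published: str) -> float:
    """Character-level similarity from difflib's SequenceMatcher."""
    return SequenceMatcher(None, preprint, published).ratio()

# A lightly copyedited pair scores close to 1 on both measures.
pre = "We measured the effect of x on y in fourty samples."
pub = "We measured the effect of x on y in forty samples."
print(length_similarity(pre, pub), sequence_similarity(pre, pub))
```

Binning scores like these across thousands of article pairs is what produces histograms like the paper’s Figure 5, discussed below.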
There are three serious problems right away:
- The assumption that ‘more rigorous peer review = more text changes’ sounds plausible, but the authors never test it. It’s easy to imagine plenty of exceptions (e.g. a very good paper that needs few changes during review). So, no matter what level of text differences they find, the authors can’t link those differences to the ‘value’ of journal peer review. This oversight fatally undermines their “journal peer review does nothing” conclusion.
- The authors assume that a posted preprint has never been through peer review. A significant fraction of preprints (particularly the later versions) may well have been reviewed and rejected by a journal, and then revised before being reposted to arXiv. Comparing these preprints with the published versions will greatly underestimate the effects of peer review on the text.
- Rejected articles often do not end up in print, so the authors cannot examine the other key value proposition of peer review, which is identifying and rejecting flawed articles. Moreover, being restricted to accepted articles biases the dataset towards articles that required fewer revisions during peer review.
Their main analysis focuses on the arXiv corpus of articles, and here we encounter another major problem with the paper. The authors find that the body text of the preprints and of the published articles is (by their metrics) almost identical. Their Figure 5 is below, with the bars showing the number of articles within each metric value bin. (For example, just over 6000 articles had a length comparison score between 0.9 and 1.)
The authors conclude:
“The dominance of bars on the left-hand side of Fig. 5 provides yet more evidence that pre-print articles of our corpus and their final published version do not exhibit many features that could distinguish them from each other, neither on the editorial nor on the semantic level. 95% of all analyzed body sections have a similarity score of 0.7 or higher in any of the applied similarity measures.”
However, these data are based on a comparison of the last version posted to arXiv with the published article. It’s well known that authors often update their arXiv entries with the version that the journal reviewed and accepted for publication. Figure 6 suggests that this is precisely what is happening: most of the final versions were posted to arXiv less than 90 days before the article was published, and some were even posted after publication.
As a Scholarly Kitchen reader, you’ll be (painfully) aware that it’s rare to get an article submitted to a journal, peer reviewed, revised, resubmitted, re-reviewed, accepted, typeset, and published in under 90 days, or even 180.
Most of these final arXiv versions have therefore already been through peer review, and any observed text differences between them and the published article are due to trivial changes during copyediting and typesetting. The authors’ main analysis thus tells us nothing about the value of journal peer review.
Next, the authors move to the dataset they should be analyzing: the body text of the first version posted to arXiv compared to the published article. They still don’t take any steps to remove articles that have already been through peer review, by, for example, excluding those that were posted less than 180 days before the published article came out, or those that thank reviewers.
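To be concrete about the filter I have in mind, here is a rough sketch (the record fields and the reviewer-acknowledgement check are hypothetical, not from the paper’s dataset, and 180 days is simply the threshold suggested above):

```python
from datetime import timedelta

# Hypothetical record layout for each preprint/published pair; these field
# names are illustrative, not taken from the paper's dataset.
# record = {'first_arxiv_date': date, 'published_date': date, 'acknowledgements': str}

def plausibly_unreviewed(record, min_gap_days=180):
    """Keep only preprints unlikely to have already been through peer review:
    posted well before publication and not thanking reviewers."""
    gap = record['published_date'] - record['first_arxiv_date']
    thanks_reviewers = 'reviewer' in record.get('acknowledgements', '').lower()
    return gap >= timedelta(days=min_gap_days) and not thanks_reviewers

# filtered_corpus = [r for r in corpus if plausibly_unreviewed(r)]
```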
Although the figures in this part are very hard to understand (Figure 9 shows the % difference between Figure 5 and the equivalent, unavailable, bar chart for the first arXiv version), it seems there were more text differences between preprints and published articles in this comparison. Nonetheless, as noted above, we have no benchmark for the level of text difference expected for an article that has had a thorough peer review process versus one that has not, so it’s still impossible to make any connection between the text changes seen here and the value of the peer review process.
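For what it’s worth, my reading of Figure 9 is that it reports, bin by bin, the percentage change in article counts relative to Figure 5 – something like the toy calculation below (the counts are invented purely for illustration):

```python
# Toy illustration of a per-bin percentage difference between two histograms
# of similarity scores. The counts below are invented, not the paper's data.
last_version_counts = [50, 80, 120, 300, 900, 6000]     # last-arXiv-version bins (Figure 5 style)
first_version_counts = [90, 150, 210, 450, 1200, 5100]  # hypothetical first-version bins

def percent_difference(baseline, comparison):
    """Percentage change in each bin relative to the baseline histogram."""
    return [100.0 * (c - b) / b for b, c in zip(baseline, comparison)]

print(percent_difference(last_version_counts, first_version_counts))
```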
Lastly, the authors repeat their analyses for articles from bioRxiv. Since bioRxiv prevents authors from uploading accepted versions, differences between the preprint version and the published version should more reliably represent the effects of peer review (even though they are still analyzing the last preprint version). The authors do not attempt to identify preprints that have already been through peer review, either at the publishing journal or at another journal that reviewed and then rejected the article. As noted above, this biases the study towards finding a weaker effect of peer review.
The equivalent of Figure 5 (above) for bioRxiv looks like this:
Here, there have clearly been substantial changes to the body text between the preprint and the published stage. The authors explain their different results for arXiv and bioRxiv as follows:
“We attribute these differences primarily to divergent disciplinary practices between physics and biology (and their related fields) with respect to the degrees of formatting applied to pre-print and published articles.”
This explanation seems like an attempt to get the data to fit a preconceived conclusion (I would be interested in hearing thoughts from any physics journal editors reading this). A more plausible reason is that the bioRxiv dataset really does (mostly) compare articles from before and after peer review, and that peer review has driven substantial changes in the text. The authors seem curiously reluctant to embrace this explanation.
Does it matter that an article in an obscure journal suffers from both flawed methodology and a presentation heavily slanted towards the authors’ preferred result? Yes – it matters because articles in peer-reviewed journals have credibility beyond blog posts and Twitter feeds, and this article will doubtless be held up by some as yet another reason why we don’t need journals and peer review.
We do need more research into peer review: we have to understand when it works and when it doesn’t, and how the motivations of the various parties interact to drive rigorous assessment. The PEERE program is doing exactly this.
However, useful assessment of peer review requires high quality, objective research. The authors here are doubly guilty: their article is so poorly conceived and executed that it adds nothing to the discussion, yet the authors are so keen on a particular outcome that they are willing to retain flawed datasets (the final version arXiv dataset) and brush contrary results (the bioRxiv dataset) under the carpet.
Some of these issues were brought to the authors’ attention when they posted this work as a preprint in 2016. Unfortunately, they chose to submit the article to a journal without addressing them. Somehow, the journal accepted it for publication, warts and all. This tells us three things:
- The preprint commentary process is not sufficient to replace formal peer review. With no editorial oversight, and nothing to be gained from correcting errors that invalidate their conclusions, authors are under no obligation to fix their article.
- Formal peer review remains a flawed process. The journal’s editor and reviewers could (and should) have forced substantial revision – because they didn’t, yet another weak article about peer review has joined the public record.
- Post-publication peer review is not an adequate safety net. This piece itself should count as post-publication peer review, but regardless of whether I’ve convinced you of the paper’s flaws, the article will still be sitting, unchanged, on the journal’s website.
Given the many flaws identified, particularly the lack of a clear link between text changes and peer review quality, this article is mostly of interest to people who like text analysis. We can only hope that the community is able to resist using it as evidence that we ought to abolish journals.