PubMed Central (PMC) is by far the largest public repository of biomedical articles. This is made possible, in part, by federal mandates that require authors who received funding from the National Institutes of Health (NIH) to deposit copies of their final author manuscripts into the archive. The bulk of articles in PMC, however, are deposited by publishers on behalf of their authors.

This leads us to wonder whether publishers, by depositing research articles into PMC, may be collectively creating an über publisher that competes against them for readership of their own articles.

Alternatively, PMC could be satisfying the needs of readers traditionally underserved by scientific journals. In this sense, PMC would be complementing publishers by increasing their scope of dissemination.

I was fortunate enough to work on a study recently that attempted to answer the question of whether PMC is complementing, or competing with, journal publishers. The article, “The Effect of Public Deposit of Scientific Articles on Readership,” is published in the October issue of The Physiologist.

The study focuses on readership (as measured by full text HTML article downloads) for 3,499 articles that appeared in 12 research journals of the American Physiological Society (APS) between July 2008 and June 2009. Articles published in APS journals are available only to subscribers for the first 12 months from final publication, after which they become freely available to all readers from the journals’ websites.

A total of 1,886 (or 54%) of these articles declared some form of NIH funding, were deposited by the publisher into PubMed Central, and were made freely available from PMC 12 months after final publication. These papers formed the treatment group, the treatment in this case being deposit into PMC. The remaining 1,613 articles (46%) formed the control group.

Figure 1 describes how each of these groups performed over the first 24 months of publication.

Figure 1. Mean full text downloads (±95% C.I.) as measured on the journal websites for the first 24 months after final publication.

After final publication, both treatment and control articles received about 85 downloads per article, on average, in their first month. Between months 1 and 12, treatment articles behaved similarly to control articles, which is to be expected, considering that treatment articles remained embargoed in PMC until month 13.

Between months 12 and 13, something dramatic took place. Control articles saw a major boost in full text downloads when they became freely available from the publisher’s site. Treatment articles also received a boost, but the increase was about 14% smaller than that of control articles. Since the treatment articles became freely available from the journal websites at the same time, the reduction in reader traffic cannot be explained by differential access barriers. We are testing free against free.
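To make the comparison concrete, here is a minimal sketch of how a relative shortfall like the 14% figure could be computed from per-article download counts. It takes one reading of the figure (the treatment mean expressed as a fraction below the control mean). The function name `relative_shortfall` and the sample numbers are hypothetical illustrations, not the study's code or data.

```python
from statistics import mean


def relative_shortfall(control_downloads, treatment_downloads):
    """Fraction by which the treatment mean falls below the control mean."""
    control_mean = mean(control_downloads)
    treatment_mean = mean(treatment_downloads)
    return (control_mean - treatment_mean) / control_mean


# Hypothetical month-13 downloads per article (NOT data from the study).
control = [120, 95, 110, 75]    # mean is 100
treatment = [90, 80, 100, 74]   # mean is 86

shortfall = relative_shortfall(control, treatment)
print(f"Treatment articles received {shortfall:.0%} fewer downloads")
```

Because both groups are free on the journal site at this point, a gap computed this way reflects where readers go, not who is allowed in.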

Unless you can come up with an alternative explanation, it is pretty clear that PMC is responsible for that 14% reduction in usage statistics from the journal websites. The publisher isn’t trying to sell access to those articles — they are given away for free — but the publisher can report these statistics back to their subscribing institutions as evidence that their materials are being used and have value.

As librarians base cancellation decisions, in large part, on publisher-provided usage data, a significant reduction in usage data as a result of PubMed Central may have unintended consequences for those publishers who participate in the PMC direct article deposit program.

Business model aside, drawing readership away from the journal to a repository may have more serious long-term consequences for society and association-based publishers. When readers turn from a journal to a repository, the publisher may be unable to point readers to related articles, editorials, and commentary surrounding the article of interest. It may also represent a lost opportunity to deliver news, educational material, advertisements (job announcements, grant and travel opportunities, products and services), and society events (conferences and workshops) to the reader. In sum, the publisher loses some ability to create a community of interest surrounding the journal.


From a network science standpoint, the discovery that PubMed Central has a negative effect on publisher-side article downloads should not be surprising. Google favors large, interlinked nodes in its search results, and readers searching PubMed would view the abstract with a FREE icon pointing them to the PMC article version. (Some publishers feel strongly that PubMed preferentially promotes the PMC copy over the journal copy.)

This first study of the effect of public deposit of NIH-sponsored articles in PubMed Central strongly suggests that PubMed Central is competing, in part, with journal publishers for readers of the scientific literature. It does not answer:

  • Whether the results of physiology journals are generalizable to other disciplines
  • Whether public access reduces PDF views in the same way as full text (HTML) views
  • Whether final author manuscripts are a substitute for the final published article
  • Whether the effect of PMC is growing as the repository increases in size

In the next few months, I hope to answer these questions as well.

Phil Davis


Phil Davis is a publishing consultant specializing in the statistical analysis of citation, readership, publication and survey data. He has a Ph.D. in science communication from Cornell University (2010), extensive experience as a science librarian (1995-2006) and was trained as a life scientist.


31 Thoughts on "Is PubMed Central Complementing or Competing with Journal Publishers?"

Phil: Will future studies also aim to see how much PMC usage is outside of academia entirely, and by whom? I recall very informal discussions over the years that biomedical articles were used heavily by patient populations seeking items to show their physicians, while non-biomedical articles had less popular usage in general.

PubMed Central also normalizes content, apparently in ways that many readers find useful. It is easier for readers to access, search, retrieve and read content in one uniform way compared to the hundreds of distinct publisher and journal sites. As much as every journal editor may believe that their particular fonts, layout, site organization, login protocol, etc. are “the best,” the reality is that users find the variation across sites to be an annoyance.

Aggregators like Google Scholar may negate this difference. Why have two different copies in physical repositories?

The article fails to include PMC download data. Does the number of downloads of an article through PMC just make up for the 14%, or does PMC in fact help the journal serve its primary purpose of distributing scientific information by enhancing the total distribution of content?

Yes, you are correct. PMC does not provide IP-level usage data, so I was unable to compare the unique user communities served by each content distribution node. I surmise that PMC is reaching a broader user community than the journals, but remember that the journals are giving away the same content at the same time as it becomes freely available in PMC. The immediate and sustained loss of 14% of journal-site downloads has to be explained and the only reasonable explanation is that readers are being drawn to PMC instead of the journal sites.

Great post Phil. Views on PMC are probably more than compensating for the 14% drop in views at the journal, so having papers on PMC should increase the total number of eyeballs on the paper, which in turn will increase citation. If Impact Factor carries more weight than usage in cancellation decisions, maybe PMC is having an overall positive effect?

Moreover, the eyeball boost at month 13 will prompt readers to cite the paper in their own work, and the majority of papers in progress then would be published the following year (i.e. after month 24). It’s therefore possible that the extra PMC boost in month 13 has a bigger effect on IF-relevant citations than would happen with journal-based free access alone.

Thanks Tim. In a randomized controlled trial using the very same APS journals, providing immediate open access did indeed increase readership (as measured by article downloads) from a larger audience (as measured by unique IP addresses), and yet had no measurable effect on citations 3 years after publication, see:

Davis PM. 2011. Open access, readership, citations: a randomized controlled trial of scientific journal publishing. The FASEB Journal 25: 2129-34.

Good stuff Phil. Perhaps the most puzzling feature to me is that the downloads barely drop off with time, in both series. This may be telling us something important, but perhaps it is well known to the metrics community. The striking parallel between the two cases seems to reflect this dynamic.

That PMC is stealing eyeballs seems clear. An interesting policy issue. PMC was conceived to be the only portal but the publishers have countered it. One of the big rules is that the US Government is not supposed to compete with the private sector. Who hoo.

I am also puzzled as to what an HTML download is? Is that a pageview or something new? In either case one may need to consider the impact of robots crawling the site repeatedly, which might factor into the persistence of the numbers. This is why I regard PDF downloads as a better metric. Looking forward to those results.

In the study, an HTML download is a full text page view. HighWire filters out downloads from known indexing robots. I hope to report on PDF views from a similar study.

I am a little confused by the emphasis on PubMed having a negative effect on publisher-side downloads or “stealing eyeballs.” It looks to me like publishers get more hits after open access kicks in. The data show that when access becomes open after month twelve, downloads go up for everyone. Is this not a case of a “rising tide lifting all boats”? Would publishers prefer to continue on with their lower level of pre-open access hits or have more hits overall, even if the new amount of hits is less than what PubMed Central is getting?

Actually, I just reread the thing and see what you are saying now. I still am not in 100% agreement that there is a real problem here. Do publishers have stats on how many of those extra web site features are actually used?

Usage of the different web features varies from journal to journal and from article to article. We do know that readers find notices of retractions and corrections to articles to be important. We also know that usage is one of the most important metrics, if not the most important, that librarians use for their subscription decisions. We also know that advertising is sold by impressions and/or pageviews.

An important recent driver is the fact that PubMed search results display links only to the PMC version of an article – in contrast to PubMed abstracts, which display links to both the PMC version and the version on a publisher’s website.

Currently the predominant user behavior is to read the PubMed abstract and then make a choice whether to view the full text (which would in theory give the publisher version at least a 50% chance of being read). This, however, is mainly due to habit, and it’s reasonable to assume that over time readers intent on reading the full text will save themselves a click and go directly from a PubMed search result to the PMC version (a course that gives the publisher version a 0% chance of being read). This decision to display only PMC links in PubMed search results is one of many ways in which NLM uses PubMed/Medline’s near monopoly on (biomedical) academic search to strengthen PMC.

The phenomenon may concern developers and users of alt metrics: an additional download source means one or more additional parameters that not only complicate analyses but, as Phil points out above, make the usage data less comprehensive.

Two things.

1. PMC only admits English language journals. Bad policy for the emerging world, which struggles to get its research acknowledged.

2. It would have been interesting to read about the dual publishing problem. What happens with Errata when articles are published in two different repositories, the journal’s and PMC? How would Cross Mark work in that context?

Let me add that we have a similar problem in Latin America with SciELO, a repository, often treated as a database, of learned societies’ journals. SciELO is low standard but has gained widespread acceptance because of the free publishing and free access.

Are users not aware that the PMC version is NOT necessarily the version of record, whereas the article available from the publisher’s website is? Why would anyone prefer the former to the latter when both are freely available?

Phil, what do you make of the shorter error bars on the PMC content vs. the journal set? Anything to interpret there? It seems like the journal-based content might be getting some heavier use within the set, which averages out as shown.

The confidence interval of the mean is based on the variance in article downloads within each monthly cohort. When the numbers are smaller, you get a smaller C.I. I don’t think there is anything to read into this.
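As a rough illustration of how such an interval depends on the spread of downloads within a cohort, here is a minimal sketch of a 95% confidence interval for a mean. It assumes the common normal approximation (mean ± 1.96 standard errors). The study's exact method is not stated here, and the function name `mean_ci95` and the sample numbers are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def mean_ci95(downloads):
    """Return (low, high) for an approximate 95% C.I. of the mean.

    Uses the normal approximation: mean +/- 1.96 * s / sqrt(n).
    This is an assumed method, not necessarily the one used in the study.
    """
    n = len(downloads)
    center = mean(downloads)
    half_width = 1.96 * stdev(downloads) / sqrt(n)
    return center - half_width, center + half_width


# Hypothetical monthly cohorts with the same mean but different spread.
tight_cohort = [98, 100, 102, 100]
wide_cohort = [60, 100, 140, 100]

tight_low, tight_high = mean_ci95(tight_cohort)
wide_low, wide_high = mean_ci95(wide_cohort)
print(f"tight: {tight_low:.1f} to {tight_high:.1f}")
print(f"wide:  {wide_low:.1f} to {wide_high:.1f}")
```

With the same number of articles, the cohort whose downloads vary less from article to article gets the shorter error bars.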

Fascinating work, thanks Phil. I would be curious to know whether PMC is stealing eyeballs or capturing new ones.

If those journals were suddenly not available in PMC, would there be a commensurate 14% jump in publisher downloads or would it remain relatively unchanged? Any way you could test that?

Thanks Bill. I think there is some misunderstanding of the study. Imagine two groups, approximately equal in size, similar in all respects except that one group gets deposited into PMC and waits there, under embargo, for 12 months. For those first 12 months, both groups perform almost identically, as they should. Then, the embargo is lifted and we observe both groups for another 12 months. That 14% decline in readership for the articles sent to PMC is the amount of readership that is shifted from the publisher’s site to the PMC site. PMC may be capturing new readers, but this study isn’t measuring that. Does this help clarify what the study is reporting? –Phil

This is an interesting study that raises important questions. There is a potential limitation to your study which your reply here suggests you have not considered. Do you test whether the articles with NIH funding and those without are truly identical, or do you assume that they are identical? There is the possibility that the topic, the research community, or some other characteristic of the content/authorship of the articles funded by NIH is systematically different from those without NIH funding and explains your results. This systematic difference may be related to a characteristic of end users such that end users of NIH funded research are less tightly bound to the community of practice you hypothesize the publisher creates. If so, those end users may systematically favor an NIH sponsored source. They may always wait until the article shows up in PMC before they access it. If so, PMC is not stealing eyeballs because the trends post month 12 would not simply project backward to months 1 to 12 as you assume. It is possible that the character of the article and end user, not the download source, determines the differences seen after month 12. It would be possible to examine citation patterns (not numbers but interrelatedness of institutions, keywords, etc.) of the articles between NIH funded and non-NIH funded studies to see if there are systematic differences. That’s a lot of work, and would not be conclusive evidence for or against the assumption of identity, but until it is done, it remains a limitation of your findings.

Indeed, we assume that the control cohort and the treatment cohort are similar in all respects except that the treatment group is deposited and made freely available in PMC after a 12-month embargo. This was not a randomized controlled trial and therefore I cannot completely disregard systematic differences between the two groups. However, these cohorts do perform remarkably similarly to each other from months 1 through 12, and they were published in journals which (I hope) make editorial decisions based on the quality of the paper and not the funding source. I know of no studies of readership in which funding affects readership later in the life of an article. Taken together, I think we can agree that it is very unlikely that the observed pattern is some artifact of the group and not the treatment. Thank you for this comment. If you wanted to verify the results by conducting some tests for systematic differences between the groups, I’d be happy to comply by providing you with the dataset. –Phil

Phil, excellent study. If I understand it correctly, these are articles that are not freely available on the publisher web site either before or after the embargo. After 12 months the experimental group becomes freely available on PMC but still not on the publisher site.

If this is true, the publisher may be losing people to PMC who have access through their universities. They would lose me. It is much easier to access it through PMC than the publisher site. I have to log in, go through the library portal, find the journal and work through the publisher site to the appropriate volume and issue to find the article and download it. It takes me a couple of minutes and is a pain.

I don’t know what the impact of this is, but it may account for some of the loss.

Thanks David. All articles become freely available from the journal websites 12-months after final publication so there is absolutely no barrier to access to either the control group or the treatment group. What is being measured here is the effect on publisher-recorded article downloads when NIH-sponsored articles move out of their embargo in the PMC repository. Sorry if this was confusing. –Phil

Thanks Phil. No, it is clear. I just missed the sentence when first reading the article. What is impressive to me is the huge spike in access as the articles become freely available.

Comments are closed.