A mirror, reflecting a vase. (Photo credit: Wikipedia)

Recently, traffic from a set of journals from the American Physiological Society (APS) was studied, and it was found that PubMed Central’s (PMC’s) version of APS content decreased HTML views at the journal sites by about 14%. Later, in writing a separate post about the potential costs the PMC traffic drag might be creating for publishers, I did a quick comparison with PLoS data, and found that PLoS journals overall lose 22% of their traffic to PMC.

The fact that PMC draws traffic away from PLoS was really puzzling to me — why would journals that are always free for readers to access give up nearly 1/4 of their traffic to another free resource?

The puzzle grew more complex after I spent an hour or two looking further into the PLoS data. It turns out that the 22% overall average is part of a fairly wide spread of traffic migrations, ranging from 13.5% to more than 30% among PLoS titles.

Traffic is important to PLoS, since they sell advertising against every title included in the dataset they publish. In 2010, they generated more than $280,000 in online advertising revenues, a figure that has increased in the two years since, judging from their current media kit’s claims. If nearly one-quarter of their traffic is being lost to PMC, the economic cost could be deep into the five figures, and may by now be approaching six figures.
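
To make the arithmetic behind that estimate concrete, here is a minimal back-of-the-envelope sketch. It rests on assumptions that are not in the PLoS data: that advertising revenue scales linearly with page views, and that the 22% figure is PMC's share of combined (PLoS plus PMC) views. The function name and parameters are hypothetical, chosen only for illustration.

```python
def estimated_lost_revenue(ad_revenue: float, share_lost: float) -> float:
    """Revenue the diverted views might have earned on the publisher's site.

    Assumptions (not from the source data): revenue scales linearly with
    views, and `share_lost` is the fraction of combined publisher + PMC
    views that land on PMC.
    """
    if not 0 <= share_lost < 1:
        raise ValueError("share_lost must be in [0, 1)")
    retained_share = 1 - share_lost
    # Revenue earned per unit of traffic share actually retained.
    revenue_per_share = ad_revenue / retained_share
    return revenue_per_share * share_lost


# 2010 figures from the post: more than $280,000 in ad revenue, 22% lost.
lost = estimated_lost_revenue(280_000, 0.22)
print(f"Estimated revenue lost to PMC: ${lost:,.0f}")
```

Under these assumptions the result lands in the high five figures, which is consistent with the rough range given above. Different assumptions about how the 22% is defined would shift the number somewhat.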

Again, there was the basic question: Why would one free resource take traffic from another free resource? But now, there was more — Why would one resource take traffic differentially from a set of free resources? What economic theory, behavioral theory, or traffic theory supports that finding? What might influence the degree of attrition? What factors might contribute to a lower or higher rate? And why would there be any difference at all?

Here’s how the PLoS journals’ traffic losses break down, from launch through October 2012:

  1. PLoS Biology: 13.5% lost to PMC
  2. PLoS Computational Biology: 15.0% lost to PMC
  3. PLoS Medicine: 17.1% lost to PMC
  4. PLoS Genetics: 21.7% lost to PMC
  5. PLoS Pathogens: 28.4% lost to PMC
  6. PLoS Neglected Tropical Diseases: 28.6% lost to PMC
  7. PLoS ONE: 30.6% lost to PMC

Is there something in Google diverting people? I tested a few articles, and couldn’t find a discernible pattern. Are there big spikes in the data driving the average for each title? There are some spikes, but they aren’t enough to skew the overall effect, and why would one free resource be preferred in any case?

There was another interesting phenomenon in the data — there were occasionally articles for which PMC provided the bulk of the traffic. That is, more views (sometimes by a factor of 1.5-2.5) occurred on the PMC version of the article than on the PLoS version of the article.

I asked my Kitchen Cabinet (never before used that term here, feel like it’s overdue) what they thought, and there were multiple ideas and recommendations. But we’d really need time-series data and clickstream data to definitively answer the question. Also, because the data are just numbers and have no demographic dimensions, there’s no way to know if the users differ between the venues. Some speculation circulated around use of PubMed as the search engine of choice, which has been designed to point to the PMC version in the results list while suppressing the publisher’s version. Some speculation involved social media pointers. But there was one line of reasoning I found compelling — branding.

Looking down that list above, the stronger PLoS brands — more distinctive, more clearly matching a domain of knowledge, and, in the cases of Medicine and Biology, more entrenched — are less subject to traffic migrations than the weaker brands, like PLoS ONE, which is undifferentiated and therefore less entrenched.

Brands make promises, with relevance being a promise audiences look for. The promises of the stronger brands are clearer: focused editorial content for an addressable domain. By contrast, brands like PLoS Pathogens or PLoS Neglected Tropical Diseases aren’t clear about exactly what they are. Basic science, clinical, or a mix of both? For researchers, infectious disease specialists, microbiologists, virologists, or others? Hence, they are weaker brands with a weaker implicit promise of relevance.

In fact, looking over the list above, if I had to rank the brands by specificity and clarity, I’d put them in something resembling the same order.

Lacking enough brand punch to promise relevance, the next level that can work is the article level, where specific keywords can deliver on the promise of relevance. Hence, there is more article-level activity off-site for these weaker brands than for the stronger brands with the clearer promises.

Of course, if branding carried value, you’d expect price differentials around APCs. Here is the list of PLoS APC pricing:

  • PLoS Biology — US$2900
  • PLoS Medicine — US$2900
  • PLoS Computational Biology — US$2250
  • PLoS Genetics — US$2250
  • PLoS Pathogens — US$2250
  • PLoS Neglected Tropical Diseases — US$2250
  • PLoS ONE — US$1350

There seems to be a decent correlation between traffic leakage and APC pricing, suggesting again that branding is playing a role in establishing value.
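
One way to put a number on that impression is a Pearson correlation between the two lists above. The following is a minimal sketch using only the figures quoted in this post; the variable and function names are illustrative. With just seven journals, four of them sharing the same APC, the result can only be suggestive.

```python
from math import sqrt

# (percent of traffic lost to PMC, APC in US$), as listed in this post.
journals = {
    "PLoS Biology": (13.5, 2900),
    "PLoS Computational Biology": (15.0, 2250),
    "PLoS Medicine": (17.1, 2900),
    "PLoS Genetics": (21.7, 2250),
    "PLoS Pathogens": (28.4, 2250),
    "PLoS Neglected Tropical Diseases": (28.6, 2250),
    "PLoS ONE": (30.6, 1350),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


leakage = [loss for loss, _ in journals.values()]
apc = [price for _, price in journals.values()]
r = pearson(leakage, apc)
print(f"Pearson r between traffic leakage and APC: {r:.2f}")
```

A negative coefficient here means that the journals charging higher APCs tend to lose a smaller share of traffic to PMC, which is the direction the branding argument predicts.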

In addition, based on the data above and its likely connection with branding, it seems PLoS might want to revise its pricing, as it is underpricing PLoS Computational Biology and overpricing PLoS Neglected Tropical Diseases. The former attracts readers based on brand and content, while the latter leaks readers to other venues because of its weaker brand.

And what about those outliers, those cases where PMC actually outperformed the PLoS sites for traffic? I think they’re the results of links from media or social media coverage, since the articles for which this occurred seemed to be on topics that would generate such linking, either by virtue of topic or headline. Once a link is made, it can get amplified through social media especially (retweeting, copy and paste), and if an initial link were made to a PMC version, it might persist throughout the Interwebz for a long time.

There is another aspect to these findings — namely, that the mere presence of PMC diverts traffic from publisher sites and harms the associated businesses even if those sites are free from the moment of publication. There is no clearer evidence I can think of indicating that PMC is both competitive and redundant.

Kent Anderson

Kent Anderson is the CEO of RedLink and RedLink Network, a past-President of SSP, and the founder of the Scholarly Kitchen. He has worked as Publisher at AAAS/Science, CEO/Publisher of JBJS, Inc., a publishing executive at the Massachusetts Medical Society, Publishing Director of the New England Journal of Medicine, and Director of Medical Journals at the American Academy of Pediatrics. Opinions on social media or blogs are his own.


49 Thoughts on "The Hall of Mirrors — Trying to Explain Why Users Value Free Content Differently"

As you are alluding to, the difference probably reflects how many readers come directly to the journal’s homepage to browse the ToC. Personally, I only do this for a small selection of high impact journals (such as PLOS Biology and Medicine) or very focused journals relevant to my specific area of research (I suspect PLOS Comp Bio plays this role). Browsing the ToC of PLOS ONE hoping to find something useful is pretty futile due to the wide range of areas covered. Therefore PLOS ONE’s traffic probably comes almost exclusively from search engines and databases. It would be interesting to compare the click pattern from PubMed. I would guess that the statistics would be pretty similar for all journals and be divided close to the 70/30 values of PLOS ONE.

Good point. I do think much of this reflects the way people find the article, with a Google search pointing one toward the journal version and a PubMed search pointing one toward the PMC version. But bringing impact and selectivity into the question is likely important. There are some journals where one subscribes to the electronic table of contents (or alerts) and others one just counts on stumbling across in a regular literature search. I’m not sure how well specialization correlates though, given that journals focused on pathogens and tropical diseases fare worse than broad journals on biology and medicine. Editorial selectivity for value seems to be playing a more significant role than subject-specific focus.

I have to question the diversion assumption that every PMC viewing of an article would have still happened at the publisher’s site if PMC did not exist. This seems like claiming that search engines do not aid discovery which is unlikely. While PMC may be diverting some eyeballs it is probably also bringing some new ones to the articles, possibly many. This is a complex issue.

Discoverability between versions didn’t seem to be different, so I don’t think there’s any difference to be found there. Both versions seemed equally discoverable, and both are free. So I think the answer lies elsewhere. Also, because PMC hosts a version of each article, it probably isn’t driving traffic to PLoS, but to itself. In fact, the design of the PMC search interface puts the PMC version first and foremost, over the publisher’s version. PMC is competing for traffic, and putting the publisher’s version behind theirs, which makes it less likely that PMC is driving traffic for PLoS.

I do not understand your response so you may have missed my point. PMC covers many publishers so its discoverability is probably far greater than any single publisher. Discoverability is measured by the effort required to find what one finds, or the probability of finding it, or some such. We do not yet have a good science of discoverability, or what I prefer to call findability since this is not scientific discovery in any case.

The basic point is that you cannot assume that an article found by searcher x on PMC would have been found instead by x on the publisher’s site if PMC did not exist. Many would simply not have been found in which case there is no diversion. I am sure there is some diversion but how much is not known.

Of course if PMC did not have its own articles and simply directed people to the publisher’s version there would be no diversion, but that is a separate issue.

This is why I tested a few articles. I couldn’t find a pattern of PMC providing any advantage in discoverability. You say “its discoverability is probably far greater.” That thought occurred to me, too, and I tested it. I couldn’t detect any advantage or pattern. If you can, please do tell.

What pattern are you referring to, that measures discoverability? I do not understand what you are saying so we must be using different concepts of discoverability.

Do we need to ask why people are looking for content? What questions are they seeking to answer? That might go some way to answering this as well.

For example if a clinician is looking for the latest, or the obscure, on some disease then he/she probably doesn’t care at the outset where it comes from and will go to PMC (or Google) to search precisely because they don’t know where to look. Branding doesn’t help here (and searching sequentially at Nature, then NEJM, then Lancet etc is not a realistic approach for quick answers).

Researchers in the field might well be attuned to the branding and know where to go. But they aren’t the totality of the users of content. Perhaps this ‘missing’ traffic is from outside the primary market.

I agree, this is essentially a restatement of the branding angle — after all, the primary market is going to identify itself with a publication via branding, and use it preferentially because of that identification. If the brand doesn’t identify with a community as strongly as “medicine” or “biology” (e.g., “pathogens”), then users aren’t as likely to seek it out by affinity.

Google may be one reason readers get siphoned off to the PMC version, but a more obvious one is PubMed itself: when you run a PubMed search and display the results as abstracts, the bottom of each abstract shows a link to the journal’s webpage, if available, and a link to the PMC article, again if available. Since many of the journals shown do not have free content, the reader may not distinguish between the two and may assume that PMC is always the best option.

PMC searches are a little more nefarious than that, because PMC puts a link to its version on the list of search results, but you need to go into an abstract to see the publisher’s link. PMC is driving traffic to itself first and foremost, and that competes for traffic. This has commercial implications for every publisher that depends on traffic, including OA publishers, and isn’t an appropriate role for the US government — competing with digital publishers.

This is the key as far as I’m concerned here, that PMC deliberately favors its own version over that of the version of record.

Do a Google search for “dynamic transmission of dengue fever”. First you get a list of scholarly articles that match, which takes you directly to PLoS Neglected Tropical Diseases for an article titled, “Modeling the Dynamic Transmission of Dengue Fever: Investigating Disease Persistence”. Next you get the actual first result, which is the PLoS NTD article itself. Then you get the PubMed entry, and the link takes you to the abstract in PubMed, which features an equally prominent link for the PLoS NTD version alongside the PMC link. Google searchers are thus more likely to go to the PLoS NTD version than the PMC version.

Then do the same search on PubMed. You get a list of results, number 8 of which is this same article. On the results list, there’s no link to the PLoS NTD version, only a highlighted link to the PMC version. Searchers have a choice of going directly to the PMC version or going to the PubMed abstract version where they will find a link to PLoS NTD, adding another step to the process.

Clearly, a significant number just go with the promoted link in the PubMed search results, draining traffic away from the journal of record. I don’t think this has anything to do with whether the journal has free content; it’s a matter of convenience, and a way that PMC has set things up to drive their own traffic preferentially over that of the journal.

Convenience. PMC is a large department store, if the shopper can find everything s/he wants in the large store why go to another store to find the same thing?

In this analogy, Google is Amazon, then, so why would people on the computer go to PMC rather than PLoS? There is no difference in convenience.

A librarian’s 2 cents, based upon a patron interaction yesterday: researchers, particularly those in the medical field, simply think that PMC is a fantastic resource…considered a “Times Square” for their interests. Some never venture far from the PMC search page.

Sorry to pick nits, but I think you’re confusing PMC (PubMed Central) with PubMed, the searchable citation index. This is a common problem that Kent has mentioned frequently, but has regularly been denied by many in blog comments here.

It’s an important thing to note–there’s MedLine, which indexes journals and has a rigorous acceptance period. This takes a long time and the journal must prove a high level of quality. This used to be the only way to get an article listed in PubMed search results.

Now there’s PubMed Central, which has a much lower standard for acceptance of journals (and apparently a variable set of requirements). Journals can now bypass the rigorous MedLine process, deposit articles in PMC and have them turn up in PubMed search results.

For most users, this difference is not known. They assume that if a journal turns up in PubMed search results, that it has been vetted by MedLine. Most are unaware of the shortcut that PMC provides and many journals use this to their advantage.

MD/Researchers trust PMC and know how to use it. One could call it branding, but PMC has been around a lot longer than Google and I would guess most of those who use it are used to using it and know its idiosyncrasies.

Well, a few things here. MDs and researchers use PubMed, and probably don’t know the distinction between PubMed and PubMed Central. But over the past decade, based on what I’ve seen of search behavior of MDs and researchers across many titles, Google kicks PubMed’s behind on usage. Because it’s so comprehensive and quick, it’s actually preferred by a ratio as high as 20:1. So I think this is not borne out by the facts.

Having just pulled those numbers, we see about a 10:1 ratio of traffic from Google: traffic from PubMed across all our journals.


Interesting. Do you know if it is MDs and researchers who are using Google, or are the Google hits the result of the general public using it?

I don’t think we correlate referrer with reader. One could possibly do this by looking at the domain from which the reader is accessing (is it a .edu for example, or is the reader at a subscribing IP address?). It might give a rough estimate but it’s not something we’ve undertaken.

Very interesting David, how about Google Scholar? I have assumed perhaps quite wrongly that researchers use GS not G, but that is just because I do.

I don’t know if we can jump to that conclusion. Is the Google preference a reflection of general population usage? Do we know that MDs and researchers are using Google more than PubMed?

Yes, that’s what we’re saying. I have it via data, focus groups, surveys, and interviews — all confirm the same thing, which is that Google is much more heavily used than PubMed.

Kent: So you are saying that MDs and researchers are using Google to find information on topics they are interested in more so than PubMed. That is interesting.

Yes, I first saw it about six years ago, and we were incredulous. But in every setting I’ve been in, and in discussions with other publishers, it’s clear that it’s true. Focus groups confirm it. When one doctor or researcher admits it in a focus group, there’s typically a flood of relieved confirmation, as if it’s everyone’s dirty little secret. Kind of like using Wikipedia — it’s becoming more common for professionals to cop to it now, as well.

And for the record, another example of confusion between PMC and PubMed itself.


Commented on this earlier. I basically only use PubMed. I wonder how many are like me and really don’t consider PMC when doing some research but think or consider PM and PMC to be the same thing.

Reminds me of the book rep, the college prof, and the student being in a room and each referring to the same book as:

Book Rep – the McGraw book

Prof – The Smith book

Student – the one with the bee on the cover.

PMC can only promise that it hosts some version of the article, not necessarily the version of record. Thus it seems puzzling that so many people would prefer to use the PMC site in preference to the publisher’s. I guess many people simply don’t care about the difference between Green OA and version of record?

Or don’t have access to a research library and can’t afford to pay $35 for the version of record, particularly when you can’t always tell from the abstract whether it’s what you need. Do you all have any idea how arrogant condescending you sound?

Yes, and if you read the thread it is clear it broadens out to include subscription journals. I was specifically referring to Sandy Thatcher’s post and David Crotty’s reply. PLoS uploads the version of record to PMC as I suspect every other OA publisher. It’s only subscription journals where you are likely to get the accepted version.

Creating another point of confusion. Sometimes when one downloads from PMC one gets the final, published version, sometimes a draft version of a paper. If I download a paper from a journal that uploads the final published version which I prefer, I may think that’s the case for all journals (and fall into the “don’t know” category, rather than the “don’t care”).

For the record, OUP automatically uploads the final, published version of the paper for our authors, not the accepted manuscript version, so one can’t even draw the line here at OA/subscription types of journals. Regardless, even the final published version uploaded to PMC will not include any retraction notices, corrigendums, or corrections, and will lack related content like commentaries and editorials that is often linked to the journal version.

David: Can you provide a list of just who publishes the final version of record for a paper. Just what does PMC provide, PubMed provide and Med Line provide and the publisher provide? There seems to be much confusion.

Publishes? That would be the journal and many articles are updated, corrected, retracted, etc. The version that’s in PMC may not reflect any of those changes.

Not sure on which journals upload the initial published version rather than the author’s accepted manuscript (pre-editing, pre-typesetting). This might give you some information, but I have no idea how accurate it is:

To give a quick overview of Medline, PMC, PubMed, mostly cribbed from Wikipedia:
MEDLINE (Medical Literature Analysis and Retrieval System Online) is a bibliographic database of life sciences and biomedical information. It includes bibliographic information for articles from academic journals covering medicine, nursing, pharmacy, dentistry, veterinary medicine, and health care. MEDLINE also covers much of the literature in biology and biochemistry, as well as fields such as molecular evolution.

More than 5,500 biomedical journals are indexed in MEDLINE. New journals are not included automatically or immediately. Selection is based on the recommendations of a panel, the Literature Selection Technical Review Committee, based on scientific scope and quality of a journal.

PubMed Central is a free digital database of full-text scientific literature in biomedical and life sciences. It grew from the online Entrez PubMed biomedical literature search system. PubMed Central was developed by the U.S. National Library of Medicine (NLM) as an online archive of biomedical journal articles.

The full text of all PubMed Central articles is available free.

PubMed is a free database accessing primarily the MEDLINE database of references and abstracts on life sciences and biomedical topics.

To sum up very roughly, MEDLINE is a database of information about medical and life sciences articles. Inclusion in MEDLINE requires a rigorous and lengthy review process. PubMed Central is an online archive of freely available research articles, with lesser review criteria than MEDLINE. These articles are deposited either by the author of the article, or on the author’s behalf by the journal that publishes the article. PubMed is a tool for finding information about articles. PubMed results used to just draw from articles that had passed muster with MEDLINE, but now it also brings up any articles that are in PubMed Central as well.

Given that many, if not most subscription journals make the version of record freely available at the same time it is made available in PMC, that can not account for the entirety of the traffic that’s seen there. Do you have any data that offers insight into PMC usage by those without subscriptions versus those at research institutions with subscriptions to the journal in question? How does this apply to a fully OA journal like those described in this blog posting?

Sandy’s original comment suggested that researchers don’t care whether they’re looking at the author’s accepted but unedited manuscript or the final, and often updated, version of record published in the journal. I’m suggesting that readers actually do care, but given the enormous amount of confusion over what exactly is PMC, PubMed and MedLine, and a lack of understanding of the differences between the version submitted to PMC and that in the journal, it is unclear to many exactly what they’re choosing. The comments on this very post show the continued level of confusion, and I can’t tell you how many times I’ve been asked by an editorial board member if we deposit articles automatically in “PubMed” (we do automatically deposit, but in PMC).

Pointing out this confusion is not “arrogant condescending”. It is a confusing setup with deliberately blurred lines that could use clarification.

“Is there something in Google diverting people? I tested a few articles, and couldn’t find a discernible pattern.”

I don’t trust any of the references to Google search results in the preceding comments, for the following reason. In my apartment, there are three computers connected through a router to one link to my ISP. I am the almost exclusive user of my computer. The majority of my Google searches are related to my work. My wife is the almost exclusive user of her computer. One morning she asked me a question related to her work. In researching how to answer, I did some Google searches on my computer. Subsequently, I went to her with my proposed answer. She had some objections. I said, “Look, I did a google. I’ll show you.” And I quickly typed the relevant search string in the Google search box in her browser. I remembered the first three hits in my search results very well. Those three did not show up on the first page of ten hits on my wife’s computer, nor on the second page.

The point is that Google search results on a given computer, at least in my experience, depend heavily on the pattern of use on that particular computer. I have reason to suspect that if PMC is frequently accessed from a certain computer, then Google will put PMC links high on its list of search results; on the other hand, if certain journals are frequently accessed from that computer, then links to those journals will tend to be placed high on the list of Google search results.

Google has been widely criticized for creating what is called the “Filter Bubble” (http://www.searchenginejournal.com/the-google-filter-bubble-and-its-problems/29879/). Essentially, Google tries to return search results that are like the results you have clicked on in the past. It’s the main reason (among many) why I’ve switched to DuckDuckGo as my main search engine (http://www.duckduckgo.com) because they don’t track you and don’t modify your results based on past behavior–they just give you the most relevant results each time.

That said, if you are not signed in to any Google service, you should, at least theoretically, get a fairly pure set of results that is common across all computers. For what it’s worth, I’ve just repeated my search for “dynamic transmission of dengue fever” across three different browsers on three different computers and received the same results each time.

But you’re right in that if Google is tracking user behavior, and the user in the past has used PMC quite a bit from Google search results, those are going to be ranked higher in future results.

My view is that PMC should be a portal directing people to publisher’s pages not a repository, provided the publisher makes the document freely available in a timely fashion. This seems like a relatively simple policy change that solves the diversion problem.

This hasn’t been mentioned yet, that I’ve seen, but one reason why PMC gets used over the publisher’s version is that the PMC version is more stable, has a familiar layout for each item, and doesn’t lose things like appendices and data supplements the way publisher versions are wont to.

I was going to make this same point — from a user’s perspective, the predictability of the PMC page layout and interface shouldn’t be taken for granted.

In addition, page load times can be critical in terms of making users happy and more likely to come back. It could be that once someone has had a good experience with PMC, they don’t feel a need to click through to the publisher’s site. I always found PLOS’ pages quite slow to load in comparison to other publishers and PMC.

(This doesn’t discount the effect of PMC’s link placement in PubMed search results; of course this affects users’ behavior.)

As for myself, I was often rather annoyed at the various idiosyncrasies of publisher-run websites while doing literature research in grad school. It was a bit of a relief when I happened upon an article that was deposited in PMC, as I knew where everything would be on the page and what would happen when I clicked on references.

Of course, it’s impossible to actually disentangle any of these potential causative factors given the paucity of data to analyze. But it’s still interesting to think about.

You both raise an interesting point, and the standardization of format/layout is an advantage that aggregators have over original sources of content. I’d throw in the caveat that each journal has a different style and different types of content, so there’s still a lack of uniformity for what one finds for each paper.

It’s also a phenomenon that’s specific to the html version of the article. As far as I can tell, PMC uses the pdf version that’s uploaded by the user/journal. Is there a correlation then, between the journals that see higher PMC usage and journals that see a higher ratio of html/pdf usage among readers?
