Pubget is software for searching the biomedical literature.  It’s different from other search engines in that it automatically downloads full-image (PDF) articles from the publisher and displays them alongside the search results.  The product is an interface that eliminates the step between finding and retrieving, something the creators of Pubget believe saves researchers time they “could better spend curing disease and building the future.”

While the Pubget service is free, the company’s business model is based on selling advertisements to lab equipment and pharmaceutical companies.

You could view Pubget as a news aggregator, framing publisher content in its own interface and adding value.  However, Pubget’s revenue sources are the same as those of many medical journals.  I asked Ryan Jones at Pubget whether he considered their service to be in conflict with journal publishers. Since Pubget helps connect readers to relevant articles, he replied, the response of publishers has been favorable.

[Journal] publishers are generally receptive to us because, by driving usage, we support their core subscription business model. They can see that content consumption rates are higher for Pubget users.

The combination of a search engine with an institution-based authentication system and article link resolver is unique to Pubget.  Cornell University is currently listed as one of the institutions, along with about 90 others.  While the speed and convenience of having articles downloaded automatically as I display my search results is welcome, I do wonder how this feature ultimately saves the time of the researcher.

The Pubget search interface is relatively simple compared to NIH’s PubMed (the source of much of Pubget’s metadata), and does not allow post-search narrowing and refining, common features of most literature databases.  So while I may save time by not having to click on a link to the full-text article, I certainly spend more time skimming through my search results.

I also question whether providing the PDF view of articles is really a time-saver for browsing.  Publishers have spent years trying to understand how scientists read the literature and have developed features that scientists want to see on a digital platform.  Publishers have brought us linked references, embedded tables, images, video, and in-text citations, among other improvements. If the image of an article was what scientists really wanted, publishers could have stopped developing somewhere around 1995.

Librarians have expressed other concerns with Pubget.

Writing on liblicense-l, Andrea Langhurst, Licensing and Acquisitions Librarian at the University of Notre Dame, questioned whether mass downloading of journal articles would be considered a violation of publisher licensing agreements.

Many publishers have systems that prevent systematic downloading of digital content, blocking users when they exceed thresholds.  The impetus behind these systems is not to prevent use but to prevent wanton abuse.  In the latter case, subscription-access publishers worry about wholesale theft of content and a loss of control over their business and distribution models.  An open access publisher may not make any such distinction in how its content is used.

Langhurst also questioned whether participating with Pubget would make collection analysis more difficult:

I’m also concerned that this would dramatically increase our usage statistics and make it difficult to do any sort [of] accurate cost/use analysis if the service is providing a mechanism for mass downloading of content from provider sites.

Mass downloading may also obscure attempts to develop value metrics such as the JISC-supported Usage Factor, and possibly lead to an arms race among publishers to deliver as many articles as possible.

While I have no doubt that Pubget will become more sophisticated and useful as a literature tool, I see this service as the first of many upcoming attempts to aggregate, repackage, and redistribute the academic literature.  And unlike Google Books, no one had to do any scanning.  I would not be surprised by the appearance of similar services which aggregate peer-reviewed manuscripts deposited into PubMed Central as part of the NIH Public Access mandate, or a commercial service which scrapes institutional repositories for journal manuscripts, removing the institutional branding in the process.

Phil Davis

Phil Davis is a publishing consultant specializing in the statistical analysis of citation, readership, publication and survey data. He has a Ph.D. in science communication from Cornell University (2010), extensive experience as a science librarian (1995-2006) and was trained as a life scientist.


9 Thoughts on "Pubget: Time-saver or Content Aggregator?"

I think there is a fundamental conflict here with journal publishers. Journals generate traffic, and they sell ads based on that traffic. PubGet wants to take that traffic away from the journals and redirect it to PubGet, and sell the same ads to the same companies. I expect to see most journals blocking this sort of hijacking attempt.

The direct line to the pdf also stops many interesting new ventures like article ratings or commenting systems, as those are done on the html versions of papers, not the static and disconnected pdf.

There are good points raised here; however, I think publishers’ efforts to expand the functionality of HTML versions don’t necessarily reflect what scientists want(ed) (‘If the image of an article was what scientists really wanted, publishers could have stopped developing somewhere around 1995.’). As an extension of David Crotty’s comment, it needs to be considered that traditional models of online revenue rely on web traffic around articles, and this is an important source of potential bias when comparing the efforts of scholarly publishers with the interests of their readers.

I’ve certainly advocated new online tools for publishers, but frankly have never seen (and probably more importantly, never looked for) research on the reading preferences/habits of representative samples of scientists. If current online platforms were designed with a bias towards increasing traffic (or more benignly, to test out new tools), the usage/download patterns from these sites are not a source for clear answers…

It is interesting to consider future radical changes in connectedness in which access to the tools around articles is not centralized at one publisher’s website – and local ‘image copies’ are seamlessly connected to the same features now restricted to online/html versions.

As a health care practitioner who frequently accesses the professional literature, I would find such a tool very useful. However, I am concerned about the broader issue here, which is that tools like this are likely to ultimately make the subscription- and advertising-based scientific journal a thing of the past. As more and more of us stop subscribing to journals, and access what we want from them over the web instead, the subscriptions and associated advertising revenue of these journals will dry up, and so will they.

The problem is that professional journals offer more than information; they offer information that has to some degree been assessed for reliability and accuracy. Articles are routinely peer reviewed by independent experts in the field who know the existing literature in the field as well as the principles of scientific research design. They use that information to determine the value of a given paper.

As anyone who has published in such journals knows, far more papers are rejected than are accepted, the usual reasons for rejection being that the research design is flawed or the results are trivial. When results are unexpected or contrary to existing information, replications of the findings are often requested from the paper author or from independent sources. Editors will also, in many journals, append editorials or summaries of reviewers’ comments to allow the reader to have a more balanced context in which to evaluate the value of a study.

Further, most journals require verification that the study was conducted in an ethically acceptable way. Likewise, journals typically require a statement as to where the funding for the research came from so that readers can consider whether factors such as funding from a pharmaceutical firm might have had an impact on the results. As studies move directly onto the web without such review, it will become increasingly impossible for readers to determine the quality of the research, whether it was conducted in an ethical way, and whether it was paid for by individuals with vested interests. There have been numerous disclosures of subversion of the scientific literature by pharmaceutical firms in the recent past, and as the editorial process is reduced and ultimately eliminated we can expect these abuses to increase.

It took many long years to develop the editorial oversight system we have with the published journal. Sadly, I see little being done to transfer this process to the web and think this must become an overriding priority if we are indeed to move all of our scientific reporting to the web. Otherwise, the term “virtual reality” will take on a new and much less favorable meaning as we substitute “virtual truth” for actual fact.

Isn’t there going to be a copyright issue here shortly?

PubGet wants to make money from advertising, and the logical way to do that is to mine the text of the articles in order to place relevant ads next to the research results.

That approach would, I think, be in breach of copyright, unless they are going to enter into agreements with publishers for a revenue split on the ad income.

I agree with David up at the top – this looks and smells awfully like a hijacking program.

Any copyright issue would be the same here as for any search engine. I don’t think anyone has challenged Google’s ability to spider the web and sell ads against content in their search results. The lawsuit that brought about the proposed Google Books Settlement did ask that question, challenging whether Google indexing book content (even if that content was never displayed) was fair use. Unfortunately, the case never made it to court, as the settlement was worked out instead.

Most journals allow search engines to spider and index their content, and ads are sold against searches that bring up the content in results. The big difference is that the search engines send traffic to the journals, where PubGet would not.

At the risk of being ‘orribly pedantic, I’ll explain why I think the Google approach is a bit different and thus doesn’t attack copyright directly in the manner that I see PubGet might possibly do…

OK, for search results (AdWords), the ads are actually set against the incoming search terms, not the actual text that can be found in the search results. Now, clearly there is a connection between the search, Google’s spidering of the web, and the ability to lay ads alongside snippets of content, but you can opt out (with robots.txt), and the ads are shown not as a direct result of the text in the search results. I think this is why it has never been seriously challenged.
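[For readers unfamiliar with the opt-out mechanism mentioned above: robots.txt is simply a plain-text file served from a site’s root that well-behaved crawlers consult before spidering. A minimal sketch, with hypothetical crawler names and paths, might look like this:

```
# Hypothetical robots.txt at the root of a publisher's site
# Block all crawlers from the PDF archive...
User-agent: *
Disallow: /content/pdf/

# ...but let a specific, trusted crawler index everything
User-agent: Googlebot
Disallow:
```

Compliance is voluntary: a crawler that ignores the file can still fetch the content, which is why publishers also use the rate-limiting systems described in the post.]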

AdSense does use site text for context-based adverts, but again, the nature of AdSense is such that the responsibility for copyright adherence lies with the person signing up to the AdSense scheme.

I am speculating, of course, but my original comment was based on the idea that PubGet would have to use the journal texts to develop a context-based advertising model, and thus, unlike Google, they would be running directly up against copyright restrictions. Also, over here at least, copyright extends into databases, where the process of harvesting a database can sometimes also be an issue, regardless of what’s in it.

Thanks for the clarification. I’m not sure how PubGet is generating their ads, whether they’re based on the content they’ve spidered or on the search term used (as Google does). It’s an interesting question though, perhaps one could argue that it’s fair use, and in particular, PubGet is asking publishers for permission to spider their content, so that may be part of the deal.

One thing also to note is that Google is currently facing a trademark lawsuit over selling companies’ trademarked names as adwords, so it’s unclear if their approach is wholly legal.

I, for one, would like to applaud the work of Pubget. My university spends an enormous sum gaining access to the medical literature but when it comes time for me to actually access a relevant article, the process of finding the PDF is incredibly burdensome and actually a deterrent to the perusal of the literature. This barrier is likely part of the reason why many in the health sciences make decisions about an article solely based on the abstract – the full text is just too hard to get to, even if one has an institutional subscription.

Pubget has come up with a solution to a significant problem that the publishing industry and academia should have solved years ago.

If they need to generate income from contextual advertising, we should be pleased that this service has been made available, rather than question the validity of their approach.

If there are copyright issues, I suggest we leave that to the publishers and Pubget to sort out. In any event, no one has actually provided any documentation that Pubget is actually spidering PDF content. I suspect that the search terms, names of journals/articles, and user search history may provide adequate content to create good contextual ads without even needing to review the PDF itself.

Much luck to Pubget and I hope they succeed in revolutionizing access to the underutilized scientific literature.

NB I have no relationship with Pubget, other than being a dedicated user.
