Guest Post: HighWire's John Sack on Online Indexing of Scholarly Publications: Part 1, What We All Have Accomplished

Editor’s Note: this is the first of two posts from HighWire Press’s John Sack. John is the Founding Director of HighWire, which facilitates the digital dissemination of more than 3000 journals, books, reference works, and proceedings. HighWire started a blog this summer, and it has already become a valuable source of information and critical thought. It is increasingly vital for our industry to bring more voices to the table and for the stakeholders in scholarly communication to have their voices heard. We are cross-posting this piece, along with next week’s follow-up, on both blogs, so please do take a look at what the HighWire blog has to offer.

“Let no one tell you that ‘Scholarly communication hasn’t changed’”

HighWire conducted its first extensive user studies in 2002. Since then, several things have completely altered the workflow of the researcher:

full text of most current journal articles is centrally indexed;
back archives of a significant fraction of the full text research literature are online, and centrally indexed as well.

“Centrally indexed” was a watershed point. In 2002, Google’s web search (i.e., google.com) started indexing the full text of journal literature, including the portion behind paywalls, starting with HighWire and its publishing partners. HighWire saw the use of journal article content go up by one and in some cases two orders of magnitude following this! And then, in 2004, Google Scholar (scholar.google.com) was born, recognizing that the workflow and goals of a researcher are not best supported by a general-purpose internet search engine, no matter how good its ranking algorithms are.

Now, a decade after our first user studies, users report to us that, “Finding is easy; reading is hard.”

This transformation in discovery – and its consequences – was the topic of the opening keynote at the September 2015 ALPSP Annual Meeting. Anurag Acharya – co-founder of Google Scholar – spoke and answered questions for an hour. That’s forever in our sound bite culture, but the talk was both inspirational (about what we had collectively accomplished) and exciting and challenging (about the directions ahead). Anurag’s talk and the Q&A are online as a video and as audio in parts one and two.

This post is in two parts: the present Part One covers Anurag’s presentation of what we have accomplished. Part Two, to be posted on Monday, October 12, covers the consequences. Anurag has agreed to address questions that readers put in the comments.

Here is my take on the key topics from Anurag’s talk.

Search is the new browse

Prior to the introduction of web-based search engines in the 1990s, researchers would select their reading by looking at issue tables of contents – an article list selected by the journal editor. Reference lists in an article were also browsed; this was an article list selected by the article’s authors. A few fields featured primitive indexing via publications like Current Contents.

In the mid-1990s, tables of contents (TOCs) began to be emailed, which saved many trips to libraries. And even today “eTOC” reading is still a common part of the researcher workflow. The eTOC allows researchers to skim perhaps 8-10 journals regularly, where earlier we heard anecdotally that researchers were able to cover only 3-4 journals regularly in print.

By the mid-2000s, the editor- or author-assembled list of articles in TOCs and references was replaced by a list assembled dynamically in response to a search engine query that suited the individual’s requirement at that time. Search became the new browse.

In the mid 1990s, the scope of what you could cover in your current-awareness was essentially limited to what you could scan and recall from what your library, and your personal or departmental/lab subscriptions, made available. Ten years later, these limits were gone. Now you browse relevance-ranked (not just date-ranked) search result lists. It is possible (though Anurag did not speak to this) that the “Just in Case” scanning of journal TOCs to stay informed on your subject is now being replaced generationally by “Just in Time” scanning of search results.

High-quality relevance ranking that understands ‘scholarly filtering’ was a huge step forward. But relevance ranking is not all the story.

Full text indexing of current articles plus significant backfiles joined with relevance ranking to change how we searched and what we did with the results.

Sometimes it takes a combination of factors to change a workflow. (E.g., Uber would not have made much of a difference in urban transportation unless it had mobile phones to run on.) Broad search engines of the 1990s indexed only abstracts for the most part; full text indexing allowed details such as methods, conclusions, specific assertions (“x catalyzes y in the presence of z”), and drug interactions to be found. As Anurag said, “full text indexing allows all parts of articles to rise.”
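The difference between abstract-only and full-text indexing can be sketched with a toy inverted index. This is only an illustration, not how any real search engine is built: the two sample documents, the field names "abstract" and "body", and the helper functions are all invented for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs, fields):
    """Map each token to the set of document ids whose chosen fields contain it."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        for field_name in fields:
            for token in tokenize(doc.get(field_name, "")):
                index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every query token (boolean AND)."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical sample documents, invented for this sketch.
docs = {
    "paper-1": {
        "abstract": "We study enzyme kinetics in yeast.",
        "body": "In our assays, x catalyzes y in the presence of z.",
    },
    "paper-2": {
        "abstract": "A survey of catalysis in industrial chemistry.",
        "body": "We review reactor designs and cost models.",
    },
}

abstract_index = build_index(docs, ["abstract"])
full_text_index = build_index(docs, ["abstract", "body"])

query = "x catalyzes y"
abstract_hits = search(abstract_index, query)    # the assertion is not in any abstract
full_text_hits = search(full_text_index, query)  # the assertion is in paper-1's body
print(abstract_hits, full_text_hits)
```

With only abstracts indexed, the specific assertion is not findable at all; once the body text is indexed, the article that makes the assertion surfaces even though its abstract never mentions it.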

Huge backfiles of a significant number of journals went online in the early/mid 2000s, in part because their discovery became possible. This, combined with scholarly relevance ranking, effectively allowed historical portions of the research literature to rise from the previous bias of most-recent-first ranking.

One wonders whether this combination of backfiles and full text indexing would have made a difference in some fatal situations in clinical trials in 2001. We don’t often think of our work as involving life and death; but perhaps it does.

“Articles stand on their own merit”

The ‘democratizing’ effect goes beyond full text allowing articles to ‘rise above their abstracts’ and back literature to rise above current articles. This effect also enables “articles to stand on their own merit” in a search result list, not primarily on the merits of the venue (journal) in which they appear – the “distribution pyramid has flattened,” as Anurag said.

“Bring all researchers to the frontier”

A further ‘democratizing’ effect of the freely-available and comprehensive scholarly-content search engine is that it “helps bring all researchers to the frontier”. That is, scholars beyond the world’s premier research institutions can now discover what they should read.

“So much more you can actually read”

Of course, finding isn’t reading (even Cliff’s Notes don’t claim that, much less Google’s snippets). There is “so much more you can actually read”, as a benefit of free back archives (which about 300 HighWire-hosted journals provide), preprints, repositories, open access journals and open access articles within subscription journals, and ‘big deal’ licenses. And where there are multiple copies of a work online, Scholar’s “subscriber links” and OpenURL “library links” can help an institution-based reader find the available copy.

Anurag concluded his historical view by saying, “Let no one tell you that ‘scholarly communication hasn’t changed’”.

In Part Two of this post, I will cover Anurag’s view of “What Happens When Finding Everything is So Easy?”

As noted above, Anurag has agreed to address questions that readers put in the comments.

John Sack

John Sack. John is the Founding Director of HighWire, which facilitates the digital dissemination of more than 3000 journals, books, reference works, and proceedings. John is also the Co-Director of the International Congress on Peer Review and Scientific Publication.

Discussion

56 Thoughts on "Guest Post: HighWire’s John Sack on Online Indexing of Scholarly Publications: Part 1, What We All Have Accomplished"

I am a huge fan of GS. In fact I use it for scientometrics, because (unlike Google) the hit counts provide meaningful information about the size of the research community looking at a given topic. I have several papers in progress that describe GS searches instead of providing citations. Of course this screws up the impact factor, but that is not my problem. Why point to specific papers when you can point to a community instead? More flattening.

By David Wojick
Oct 5, 2015, 8:34 AM

Thank you for this really (!!!) interesting post. I shall be watching that recording. The ease of the discovery layer may be one of the most significant changes in scholarly publishing since movable type. It certainly put a great deal more information in front of readers, students and researchers whether or not they were associated with a library (or a library with a subscription to A&I databases). Very much looking forward to part 2!

By Collette Mak
Oct 5, 2015, 9:37 AM

About the superseding of “‘Just in Case’ scanning of journal TOCs to stay informed on your subject … by ‘Just in Time’ scanning of search results”: this might produce more shallow scholarship.

By pbrown
Oct 5, 2015, 10:05 AM

One of our interviewees in HighWire’s interview series called this “the tragedy of the shallows” (in a nod to “the tragedy of the commons”).

By John Sack
Oct 5, 2015, 11:30 AM

Another of our interviewees commented “Because of the power of keyword searching, I find only what I’m looking for”. And this wasn’t a positive thing for this person. He described the serendipity of an editor’s selection of articles for an issue, vs. a keyword search.

Of course, we probably all have had the feeling of getting some pretty “serendipitous” results from a keyword search…

John

By John Sack
Oct 5, 2015, 12:47 PM

Supporting serendipity at scale is indeed a challenge. And an important one. How can we help everyone to operate at the research frontier without them having to scan over hundreds of papers? This is one of the areas we would like to do more in.

There is an inherent problem to giving you information that you weren’t actively searching for. It has to be relevant — so that we are not wasting your time — but not too relevant, because you already know about those articles.

Our first step in this direction was the introduction of recommendations based on authors’ public profiles. This takes into account the author’s recent publication history as well as her “invisible college” — co-authors, colleagues, citation relationships etc. This usually works quite well. Feedback from users indicates that it often helps them find articles relevant to their work that they would not have been looking for.

This is but the first step. There is much more to be done – researchers with a short publication history, practitioners who don’t publish much any more (e.g., chemists, doctors, and of course engineers like me), shorter-term exploratory interests.

By Anurag Acharya
Oct 5, 2015, 3:22 PM

Dear Anurag,

I’m really appreciating the opportunity to ask you a question directly. I think Google Scholar is great, and I use it every day. However, there are a few minor issues with Scholar that bug a lot of people, mostly related to the handling of preprints. In particular, there is a bug where a preprint can completely shadow the later journal publication, to the point where the publication literally doesn’t seem to exist in the Scholar database even though Scholar clearly has indexed the journal issue. See this blog post for details: http://serialmentor.com/blog/2014/11/1/the-google-scholar-preprint-bug/

A related issue is that Scholar often links to the pre-publication version of the article, even if the final article has long been published. As an example, consider this paper from 2012:
https://scholar.google.com/citations?view_op=view_citation&hl=en&citation_for_view=Nc8U6E4AAAAJ:mvPsJ3kp5DgC
If I click on “[HTML] from oxfordjournals.org” I’m not sent to the most recent version of that paper but an earlier one.

Do you think that these issues can be fixed? The community would really appreciate it!

Best wishes,
Claus

By clausowilke
Oct 5, 2015, 12:56 PM

I want to echo Claus’s sentiment. I’m currently experiencing this bug. I’ve exhausted every possible means to get in touch with the Scholar team.

As Claus has previously pointed out the bug is detrimental to open science by discouraging preprints. And it has been going on for years. Journals regularly receive complaints from their authors who are experiencing this bug.

What is going on? Is this bug going to be fixed? I have reached a breaking point: either you, Anurag Acharya, comment on this issue and work to fix it, or I, Daniel Himmelstein, will let my true feelings regarding this bug and Scholar’s muteness be known.

By Daniel Himmelstein
Oct 5, 2015, 1:31 PM

Dear Claus: thanks for the kind words!

Google Scholar indexing is designed to fit the dichotomous structure of scholarly publishing – new articles appear frequently, can have multiple versions and are of very high interest; older articles are archival and are expected to not change. Accordingly, we scan for newly published articles frequently and add them to the index several times a week. Since recently written articles can have newer versions or may transition from preprint to formally published or conference version to journal or ahead-of-print to final-version, we recrawl and reindex recently published articles _much_ more frequently. This usually handles the version transitions that occur early in the life of some articles.

We recrawl all articles and rebuild the entire index periodically to deal with all the changes that happen for older articles. This includes changes in article presentation, platform transitions, host transitions, grouping updates with new versions. This approach also optimizes the use of server resources on publisher sites since archival articles are recrawled less frequently. A slower rate of crawl has long been an explicit request from most of our publishing partners.

Overall, this approach works really well given that a large fraction of articles go through multiple versions (eg ahead-of-print) at this point. Occasionally, however, it can result in a few articles that were in an early-stage version for a while being indexed in the archival mode. This usually gets cleared up on the next rebuild.

As for most problems with conflicting requirements, workable solutions have to achieve a balance. One approach that would avoid all versioning problems would be to index final versions only. That isn’t desirable since it would make it harder to find new results as soon as possible. Another approach would be to recrawl all URLs aggressively. However, given the limited crawl rate we work with for many journal sites, this may keep newly published articles from being indexed in a timely manner – which isn’t desirable.

Any time there are multiple versions, it is possible authors may get concerned about what happens to the citations to the different versions. As you probably know, Scholar groups all versions of an article. It also groups citations to all versions. Citations that any of an article’s versions receive — preprint, conference, ahead of print, final journal version, anthology etc — are included in its citedby count in search results as well as on its authors’ profiles. Citation stats for articles as well as profiles are computed & updated multiple times a week.

cheers,
anurag

By Anurag Acharya
Oct 5, 2015, 4:12 PM
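The version grouping described in the comment above, where citations to any version count toward a single cited-by figure, can be illustrated with a small sketch. It is not Scholar's actual data model: the class names, the example.org URLs, and the choice to count a citing article only once even if it cites two versions are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Version:
    kind: str  # e.g. "preprint", "ahead-of-print", "journal"
    url: str   # hypothetical location of this version
    cited_by: set = field(default_factory=set)  # ids of citing articles


@dataclass
class ArticleGroup:
    title: str
    versions: list = field(default_factory=list)

    def cited_by_count(self):
        """Count distinct citing articles across every version in the group.

        De-duplicating citing articles is an assumption of this sketch.
        """
        citing = set()
        for version in self.versions:
            citing |= version.cited_by
        return len(citing)


# Hypothetical article with a preprint and a later journal version.
group = ArticleGroup(
    title="An example article",
    versions=[
        Version("preprint", "https://example.org/preprint/1", {"c1", "c2"}),
        Version("journal", "https://example.org/journal/1", {"c2", "c3"}),
    ],
)

print(group.cited_by_count())
```

Here the preprint and the journal version share one cited-by count of three distinct citing articles, rather than showing two and two separately.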

Hi Anurag, thanks for your comment. Crawling the web for scholarly content is definitely a complex task, one that Scholar has done a laudable job with. However, given the constraints you’ve mentioned, I still don’t understand the necessity of the preprint bug.

For example, when an archived title appears on a new venue, could a reindexing automatically be triggered? Or could preprints receive an extended honeymoon in the pre-archival mode? Even if you just fixed the problem for *bioRxiv*, *arXiv*, and *PeerJ PrePrints* that could go a long way.

Ultimately it seems bizarre to the general user that journal publications without preprints get indexed almost immediately, while journal publications with preprints are undetectable.

By Daniel Himmelstein
Oct 5, 2015, 5:00 PM

Anurag,

thanks for your comments. I completely understand the difficulties you face with when and how you crawl. However, I am absolutely convinced that there is a bug in how Google Scholar handles records, and this bug is unrelated to crawling frequency or whether or not you recrawl.

Let me try to explain. Let’s say three groups write three papers, A, B, C. Of those three, paper B is submitted as a preprint, while the others are not. So, at some point, the world looks like this:

Biorxiv
paper B has been published

Some time later, the three papers come out in PLOS ONE back-to-back (as an example). Now the world looks like this:

Biorxiv
paper B has been published

PLOS ONE
papers A, B’, C have been published (B’ is a slightly modified version from B).

However, the world according to Google Scholar often looks like this:

Biorxiv
paper B has been published

PLOS ONE
papers A, C have been published

In other words, even though Google Scholar clearly has crawled PLOS ONE, it somehow has lost the record of B’. And that happens *only* if a preprint of B’ was previously in the Scholar database.

If you don’t believe me that this happens, please try to find this article:
http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004259
in the Scholar database. As of today (Oct. 5, 5pm Central) it doesn’t exist, even though Google Scholar clearly has indexed the appropriate issue from PLOS Comp. Biol.

Kind regards,
Claus

By clausowilke
Oct 5, 2015, 6:06 PM

Dear Claus: Most preprints/ahead-of-print versions are indexed in early-version mode — as they should be. Articles that are indexed in early-version mode are recrawled and reindexed frequently. Changes to their location, their content, their format, their versions are expected to be frequent; this allows changes to be picked up soon.

Occasionally, a preprint that has been in that state for a while can get indexed in the archival mode. When that happens, updates to that article (location, content, format, versions etc) take longer. Articles that are indexed in an archival mode are reindexed less frequently – as they must be, if the indexing system is to use the limited crawl capacity at the journal sites effectively.

Any process that tries to handle both modes of articles needs to have a boundary somewhere. Given the diversity of publication processes worldwide, the occasional article can end up on the wrong side of the boundary. What the indexing system can and does do is to make this case infrequent and to clear up the mismatch at the next update.

cheers,
anurag

By Anurag Acharya
Oct 5, 2015, 7:47 PM
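The two indexing modes and the boundary between them, as described in the comment above, can be sketched as a recrawl schedule. Everything concrete here is hypothetical: the thread gives no actual intervals, and an age-based cutoff is only one guess at how "in that state for a while" might be decided. It is not Scholar's implementation.

```python
from datetime import date, timedelta

# All three values are invented for illustration; the thread gives no numbers.
EARLY_RECRAWL = timedelta(days=3)       # recent articles: recrawl often
ARCHIVAL_RECRAWL = timedelta(days=180)  # older articles: recrawl rarely
EARLY_WINDOW = timedelta(days=365)      # assumed boundary between the modes


def indexing_mode(first_seen, today):
    """Classify an article by age (an assumed rule, not a documented one)."""
    if today - first_seen <= EARLY_WINDOW:
        return "early-version"
    return "archival"


def next_recrawl(first_seen, last_crawled, today):
    """Return the date of the next scheduled recrawl for an article."""
    if indexing_mode(first_seen, today) == "early-version":
        return last_crawled + EARLY_RECRAWL
    return last_crawled + ARCHIVAL_RECRAWL


today = date(2015, 10, 5)

# A recently posted article stays in the frequent-recrawl mode.
recent = next_recrawl(date(2015, 9, 20), date(2015, 10, 4), today)

# A preprint that sat for a long time has crossed the boundary, so a later
# journal version would only be picked up at the next, much later, recrawl.
old_preprint = next_recrawl(date(2014, 1, 10), date(2015, 9, 1), today)

print(recent, old_preprint)
```

Under these assumed numbers, the long-lived preprint is the case the thread argues about: it sits on the archival side of the boundary, so changes to it surface slowly.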

Anurag,

I’m sorry, but I need to continue challenging you on this. What you’re saying simply makes no sense. Let’s say a new issue of PLOS ONE comes out. When that happens, I’m sure Google Scholar crawls the entire issue and finds all articles. *However*, if now some of those articles already exist in the Scholar database in preprint format, and are stored in archival mode, then Scholar simply discards the version it finds in PLOS ONE. And that’s a really bad decision. If the previous version is in archival mode, and that means you can’t merge the old entry with the new one (*), then it would be better to count the new version as a separate publication. Discarding a journal publication because a preprint exists is simply not acceptable.

Also, this case is not infrequent. I’d estimate it happens about 30%-50% of the time that I post an article as preprint. And every time I talk about it, online or in person, other people report that they have experienced the same issue. See the other comments in this thread.

Best wishes,
Claus

(*) Even though I truly, honestly don’t understand why you couldn’t. But I’ll accept you can’t.

By clausowilke
Oct 5, 2015, 11:33 PM

Hi Claus: I understand why you suggest splitting versions. However, splitting versions comes with negative consequences.

First, splitting versions would result in search result pages with duplicate entries. Most users are unhappy with search results with duplicate entries. As it happens, most of the new versions/locations seen for articles in archival mode are in fact due to journals moving platforms and not preprints. Introducing duplicate entries for articles already in archival mode will result in systematic duplication in search results. Which isn’t desirable.

Second, splitting versions would result in citation counts being split. Which is seen as undesirable by many authors.

Third, splitting versions would muck up the citation counts & cited-by lists for papers that the article in question itself cites. If/when the versions are merged at a later point, the merger would cause turbulence in cited-by counts for all cited articles – which would be hugely unpopular among the authors of those articles 🙂

While I understand you may have seen this occur multiple times, looking across the entire corpus, this is pretty infrequent. Keep in mind that a substantial fraction of new articles at this point go through some form of public versioning (preprint, tech report, working paper, conference article, ahead of print, final journal version etc – depending on the field of study).

I realize you would like the indexing system to avoid this infrequent case completely. Ideally, I would like to avoid trade-offs too. But, given conflicting requirements, we do have to make trade-offs. I expect you have similar situations in your work. All endeavors that try to solve hard & messy problems do.

cheers,
anurag

By Anurag Acharya
Oct 6, 2015, 11:37 AM

I still don’t understand. Do we agree that reindexing when an article appears in a new location does not require recrawling? Therefore, the bug should be fixable without a crawling/indexing tradeoff.

The only issue seems to be whether a new location for an archived article justifies reindexing. If the article’s original location is a preprint server, then yes!

Preprints are becoming more prevalent. The preprint bug is not infrequent. Every journal I (and others) have communicated with over this issue is aware and frustrated. I’ve even overheard wet lab biologists on my floor complaining. If you don’t think this is a major issue, it’s because Scholar doesn’t effectively collect user feedback.

Given that you created the service, I’m not going to presume that I understand the technicalities behind the bug better than you do. But I am going to say, please use your intricate understanding of the system and exceptional ability to solve complex problems to fix the preprint bug.

By Daniel Himmelstein
Oct 6, 2015, 1:07 PM

Daniel: Bear in mind that a preprint and a published article are seldom the same article. Changes almost always occur due to peer review and editing. If a single word is changed then it is not the same article to the computer. If you want the computer to guess that the published article is the successor to the preprint, that may require some fairly heavy duty artificial intelligence. And the computer would certainly have to index both versions to make that call.

By David Wojick
Oct 6, 2015, 2:23 PM

David,

what you’re saying is actually not relevant to the problem. The problem is that Scholar *does* recognize that the journal article and the preprint belong together, and then it *hides* the journal article from the database. A current example of this issue is this paper: http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004259 which you cannot (as of today) find in the Google Scholar database, even though Scholar has indexed that journal issue. All you can find is the preprint.

If Scholar kept the article and the preprint separate, without linking them, that’d be much preferable to linking them and then hiding the official, authoritative paper. In fact, journal publishers should be up in arms. I’m surprised they aren’t making more noise about this.

By clausowilke
Oct 6, 2015, 6:03 PM

How many cases would it take for Google to treat this as a problem instead of a tradeoff? And it’s not clear why the just-printed copy needs to disappear until the system “figures it out”.

By Ivan Baxter
Oct 6, 2015, 2:19 PM

Just to chime in, I don’t think it is really so uncommon for this particular use case; the case where a preprint was posted on a totally different system from the eventual publication.

I have stopped posting to indexed preprint servers altogether because younger scientists really can’t afford to wait 6 months before GS fixes the order. It seems the versioning for GS works fine so long as all versions of the manuscript appear on the same system. For the better, GS has become a critical service for much of the Computational Biology world and we as young scientists rely on it to represent our work. So much so that it is worth temporarily not releasing work to deal with the idiosyncrasies of GS.

Given the fact that GS has already crawled the publication where the final publication appears, it is not clear to me why it is such a computationally intensive task to simply internally reindex the entry in the GS database to show the final publication as the top level entry. It’s not clear to me why it would require recrawling. As Claus showed, the final publication was crawled already. It’s just that GS chose not to internally reindex the entry.

By Austin
Oct 6, 2015, 8:17 PM

It seems like you and Claus are now merely complaining about the fact that GS continues to present the preprint instead of the published article, for some time after the article is published. There is a lot of delay here. Live with it. This is not a real time process. Be thankful that GS is real and free.

By David Wojick
Oct 6, 2015, 9:56 PM

I am curious about this Claus, because GS offers multiple versions of many articles. I just did a search on Wojick and GS offers 12 versions of the first paper listed. Clearly it is not throwing away duplicates. Perhaps something else is going on.

By David Wojick
Oct 6, 2015, 3:13 PM

Just because GS has correctly indexed and grouped some papers doesn’t mean it gets it always right. I have about a paper a year where the preprint keeps the official publication from showing up in GS. As an example, check out the paper I linked to in the other response. It is currently hidden from GS because a preprint exists.

By clausowilke
Oct 6, 2015, 6:06 PM

What do you mean by hidden, Claus? Is it not listed in the versions? Or just not listed in the initial search results?

By David Wojick
Oct 6, 2015, 9:58 PM

Hi David, your 2008 article was not affected by this bug because none of the 12 versions are preprints (versions posted before the peer reviewed version came out in Scientometrics). Furthermore, the bug generally resolves within a year.

I suggest reading Claus’s original and follow-up blog posts on the bug.

By Daniel Himmelstein
Oct 6, 2015, 9:49 PM

If you are still not convinced, see the host of Twitter reports: 1, 2, 3, 4, 5, 6.

By Daniel Himmelstein
Oct 6, 2015, 9:51 PM

I have the same thing with a PLoS One article that was deposited in ArXiv. here: http://arxiv.org/abs/1405.0518 and here: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0132184

Scholar does not identify the manuscript in PLoS One, grouped or ungrouped.

By Rebekah Rogers
Oct 5, 2015, 8:02 PM

Something I don’t quite “get” about Google is the difference between regular Google and Google Scholar as far as what is being indexed (Abstracts versus Full Text of articles). My general sense is that regular Google just indexes Abstracts, while Google Scholar indexes Full Text. But this is wildly inconsistent and some articles can be found by searching for text from deep within the article, while others cannot, and there’s no rhyme or reason that’s obvious or consistent (no clear differences in subject area, open/subscription access, etc.).

But I assume the general reasoning is that fewer users of regular Google want to read the scholarly literature, while Google Scholar users have deliberately chosen a tool to do just that.

What’s confounding then, is that regular Google does seem to consistently index the Full Text of scholarly articles that are posted (often illegally) on sites like ResearchGate and Academia.edu.

So my question is why Google doesn’t filter these out just as they filter out the same papers on legitimate journal websites? If the notion is that the general public doesn’t want to read papers, then why keep the Full Text info for the copyright infringing versions at all?

By David Crotty
Oct 5, 2015, 12:57 PM

For a given search, Scholar only indexes content that the user has access to. If the user has access to the full-text, the full-text is indexed. Otherwise, only the abstract is indexed (and then, only if the publisher has presented the abstract just-so, according to Scholar’s requirements). Google remains committed to the fallacious idea that all information ought to be freely available to all. Scholar penalizes content if it is behind a paywall, and Scholar assumes that researchers only care about content that they don’t have to pay for.

By E. Briggs
Oct 6, 2015, 2:55 PM

I believe that is not correct. Google Scholar does not change what it indexes based on the user. It indexes a corpus of information and provides results based on that singular index. From what Anurag said during his talk, Scholar does not track users (as Google does) and does not give them predictive answers based on previous searches (as Google does). My understanding is that while Google generally (but not always) does not index the full text of articles, Google Scholar indeed does, even those articles under subscription control.

By David Crotty
Oct 6, 2015, 3:04 PM

They do indeed index the fulltext, but a search result doesn’t reflect a fulltext index of anything the user doesn’t have access to. Scholar doesn’t track individual users (yet), but they require publishers to share the full list of their library subscribers including the titles the library subscribes to, and the library’s IP addresses. The guiding principle is that Google doesn’t want to send a user to content that the user doesn’t have access to. When a user clicks on a result in Scholar, the keyword match MUST appear on the page that the Scholar link takes them to. If the keyword match is in the fulltext, then the link has to take the user directly to the fulltext – not to a paywall. I’d be happy to have Anurag tell us I’m wrong about this!

By E. Briggs
Oct 6, 2015, 3:20 PM

Where did you find this information? It’s certainly the first I’ve ever heard about it. As far as I know, publishers do not give out their subscription lists to anyone–this is treated as confidential, proprietary data. Similarly, many journals still have a serious base of individual subscribers, and many research society journals work via RBAC and SSO systems, that aren’t IP range dependent. That wouldn’t seem to work in the scheme you’ve described above.

By David Crotty
Oct 6, 2015, 3:24 PM

Conflating issues here. Google Scholar will show users at a subscribing institution the material that they have access to first if the publisher is participating in the program. To participate, you do have to share your subscription information with Google.

By Angela Cochran
Oct 8, 2015, 11:47 AM

Thanks Angela. I think some of the struggles many of us have with Google Scholar come from how secretive they are as far as discussing how things work. The academic research world is all about transparency and openness, so it’s difficult in some ways to cope with a major tool that won’t release any information about its processes. Clear, consistent and publicly available rules would be really helpful.

By David Crotty
Oct 8, 2015, 12:15 PM

Proof of concept:
If I search for text from deep within the fulltext of an OUP article from a computer that is on the OUP network, it brings me to that article.

If I search for that same text on my phone, which is not on the OUP network and has no subscription access to the journal, I am brought to that same article.

Hence, Google Scholar is not limiting my search results only to papers to which I have access. Try to search using this text below:

“shows the abundance changes of differentially changed proteins between the control and MeHg-treated marmoset cerebellum samples. The triplicate samples in the control group and the the MeHg treatment group were clearly separated and a good consistency in the protein pattern was found within each group”

It should bring you to this article:
http://toxsci.oxfordjournals.org/content/146/1/43.abstract

By David Crotty
Oct 6, 2015, 3:40 PM

A minor correction – not having any information to confirm or contradict your broader point – but it’s the library, not the publisher, that provides Scholar with its IP range and subscription details.

By Deborah Fitchett
Oct 12, 2015, 3:31 PM

Scholar has two programs (to the best of my knowledge): “Library links” and “Subscriber links”. The former is for libraries to provide these details; the latter is for publishers to provide these details. The combination can fill in a lot of gaps for the end users.

By John Sack
Oct 12, 2015, 4:34 PM

So far as I can tell, GS indexes all the major subscription journals and presents those results to all users. Subscription has nothing to do with it.

By David Wojick
Oct 6, 2015, 3:30 PM

Dear Anurag,

In your presentation, you cite your own study of top-10 journals and report their declining share of top papers. This seems to ignore the fact that there are many more papers published today than in 1995, see: http://scholarlykitchen.sspnet.org/2014/10/20/growing-impact-of-non-elite-journals/

If you increased the number of top journals (i.e. kept it proportional to the size of the growing literature), would your conclusions still hold?

By Phil Davis
Oct 5, 2015, 1:33 PM

Dear Phil: if the effect was primarily due to a growth in the number of articles, one would expect that the change would be similar in fields that have seen the same level of growth in the number of articles.

Looking at Table 1 of the paper, Physics & Mathematics sees an increase of 204% in the number of top-1000 articles in non-elite journals whereas Life Sciences & Earth Sciences sees an increase of 18%. If the changes were due to a process proportional to the number of articles, this would require the growth in the number of articles for Physics & Math to be 10 _times_ the growth for Life Sciences & Earth Sciences. As it happens, and as one would expect, the growth in the two areas is roughly comparable; in fact, Life Sciences & Earth Sciences sees a larger growth. Given this, it is extremely unlikely that the effect is driven largely by the growth in number of articles.

I do want to mention that the study considered each of the 261 categories independently. Which means it covered ~2610 top journals overall (10 in each category) and ~261000 top articles (1000 in each category). I say this since the phrase “elite journals” can often be taken to mean 8-10 widely known journals (eg Science, Nature, NEJM, Cell and others).

As I had mentioned in my comment on your blog post, we had considered the idea of using percentiles to determine the top-cited journals and articles. We picked the fixed numbers approach since it fits most publishing authors’ notion of elite journals and articles. If you ask ten colleagues across the campus about the “top” journals in their field, you will usually get a small number. When hiring committees or letter of reference writers mention “top” journals, again, they usually have the same small set in mind. It is _this_ shared perception of a small number of “top” journals that causes authors to seek out the elite journals in their field in the first place.

If you expand the number of “top” journals enough, the effect would of course not be seen. But, it would no longer reflect the model authors/committees or pretty much everyone else have when they refer to “top” journals. Eg, consider trying to convince a hiring/tenure committee that your specific subfield has, say 100 “top” journals…

cheers,
anurag

By Anurag Acharya
Oct 5, 2015, 4:40 PM

It would be interesting to see the results done on a percentage basis as well. There are academic programs that offer rewards to authors for publishing in the top X% of their Impact Factor category, for example, and these are more common than programs requiring a publication in the top 10 journals of a category.

I suspect also that the raw number of “elite” journals in any field has expanded over the last decade or two, think about all of the Nature spin-off journals that have been launched, the high end PLOS journals, the Cell spin-offs, etc., etc. I’m not sure if that expansion has been great enough to dilute out the findings though.

By David Crotty
Oct 5, 2015, 5:07 PM

Hi David: The overall conclusion from the two citation studies (citations to non-elite journals and older articles) was that there is a clear and persistent spread of attention: from a smaller number of journals in each field to more journals, and from recently published articles to all articles.

The collection of changes that have occurred over the last two decades has made it easy for researchers to find almost everything. Things that are easy to do, people do a lot more of. This is not specific to scholarly communication or researchers. We all do this and we do this in all parts of our lives: eg, consider how many more messages you write every day, how many more facts you look up, how many different sources you get your news/information from, the diversity of activist campaigns you add your signature to, etc.

A study that considers a much larger number of journals per category as “elite” would in effect come to much the same conclusion about the change in citation patterns — citations (and implicitly attention) are spread out more than they used to be and that this spread is growing.

anurag

By Anurag Acharya
Oct 5, 2015, 7:05 PM

The other common request from advancement committees is to publish X number of papers in journals with an Impact Factor above Y. It would again be interesting to see the results viewed in light of the number of journals each year that had surpassed a few different threshold levels (Y).

By David Crotty
Oct 5, 2015, 5:26 PM

Hi Anurag,

Thanks so much for answering questions here. A few times I have performed a Google Scholar search and have found intelligent design creationist blogs listed in the search results. I have reported these in the past. Is there an organized effort to keep these non-scholarly sources out of Google Scholar?

Michael

By michaelmhoffman
Oct 5, 2015, 9:17 PM

Hi Michael: Scholar uses automated mechanisms to determine what should be included in the index. As it must – to be able to index scholarly articles web-wide. It usually works pretty well.

If you notice items that you believe shouldn’t be included, do let us know via the feedback form that you can find at the bottom of search result pages. We update indexing algorithms fairly frequently. If you already have sent us feedback, thanks a lot!

anurag

By Anurag Acharya
Oct 5, 2015, 10:43 PM

Thanks Anurag! I appreciate your participation here and, indeed, all your team’s work to improve access to scholarly articles.

By michaelmhoffman
Oct 6, 2015, 8:01 AM

Hi Anurag: Is there information available about how the Related Articles algorithm works? I ask because I have developed a procedure that uses RA to cluster and rank articles by conceptual closeness. Is it some form of term vector similarity?

I developed this procedure for the Energy Department’s Office of Science. Metaphorically it finds concentric circles of articles of different conceptual distance from a core, so I call it the inner circles method. It may not provide serendipity, but it does find closely related clusters that do not use the same keywords. These can be pretty surprising.

By David Wojick
Oct 6, 2015, 7:22 AM

Hi David: That is a good point. Related articles can definitely jump across keywords and concepts (as it happens, Scholar’s related articles can also jump across the languages the articles are written in). However, I am not sure we should consider that serendipitous – a user clicking on the “related articles” link for a given article usually comes with a search question in mind. She is already looking for something and what we find there is, well, related to that. While this is definitely useful, and related articles get used a lot, there is a yet broader need for things I did not know to look for.

anurag

By Anurag Acharya
Oct 6, 2015, 11:44 AM

Anurag: I would need a clearer concept of serendipity before I could design a system to provide it. I suspect that serendipity is actually a wide variety of different concepts mashed together, each of which might not be that hard to achieve individually. Perhaps a good taxonomy is the first step. (I have a taxonomy of 126 kinds of confusion causing factors. See http://scholarlykitchen.sspnet.org/2013/02/05/a-taxonomy-of-confusions/) As a concept analyst, serendipity strikes me as being as messy a concept as intelligence is, maybe messier.

That said I repeat my initial question: Is there any information available about how the GS Related Articles algorithm works? As an engineer, I dislike using a black box for my clustering and ranking procedure.

By David Wojick
Oct 6, 2015, 1:01 PM

Hi David: a public description of the Scholar related articles algorithm isn’t currently available.

anurag

By Anurag Acharya
Oct 6, 2015, 2:27 PM

I know Anurag; I just wanted you to say it. A bit of fun.

By David Wojick
Oct 6, 2015, 3:31 PM

Dear Anurag,

Thank you for taking questions here. Mine is the following:

As you know, the scientific community has become utterly dependent on Google Scholar for one of the most important things that we do: finding the literature that we ought to be reading. With its depth of full-text indexing and its sophisticated search, there is no adequate substitute for Google Scholar, subscription services included. Obviously you should be very proud of this — you’ve made a huge improvement in the efficiency with which science is done.

But many of the other Google services that my colleagues and I have enjoyed or even depended on for our scholarly work — Google Wave, Google Reader, Google Knol, Google Buzz, and so on — have been canceled. These losses were all minor inconveniences. (Well, except for Wave, which was our preferred system for discussing our projects and keeping an organized record of these conversations, and which still lacks an adequate replacement).

By comparison, the loss of Google Scholar would be an absolute disaster. What assurances do we have that Google (or Alphabet?) will not pull the plug on Google Scholar at some point in the relatively near term?

And if Google is able to offer any such assurances, to what degree do these depend on the whim and fancy of the major scholarly publishers who currently grant you access to index their full text? This latter question strikes me as a real and serious concern, given that some of these publishers, most notably Elsevier, are heavily invested in competing scientific search technologies.

Thank you in advance for your thoughts,
Carl

By Carl T. Bergstrom (@CT_Bergstrom)
Oct 7, 2015, 1:14 AM

Dear Carl: thank you for the kind words! Much appreciated.

Let me say this as simply and as unequivocally as I can. Scholar isn’t going anywhere.

As a service, Scholar is used widely and continues to grow. Organizationally, the Scholar team is the largest it has yet been. Scholar is popular internally — as you know, so many of us come from academia.

We continue to work closely with our publisher partners. We have worked together on many indexing improvements and features. The transformation in scholarly communication that I described in my talk is something all of us, publishers, libraries, aggregators and search services, working together have achieved.

cheers,
anurag

By Anurag Acharya
Oct 7, 2015, 12:40 PM
