The Farnsworth House by Mies van der Rohe offers a primer in information design

Over the last few months, the Scholarly Kitchen has featured a number of posts exploring the new world of data-driven tools: ways to better enable reader discovery (here and here also), to identify emerging areas of scholarship and to customize content to meet reader needs. Each piece makes a compelling argument for taking advantage of digital technologies for the benefit of users. But each suggestion raises as many questions as it answers. As we explore these services, we must ask basic questions about their utility, their trustworthiness and what impact they will have on the creative process.

Discovery versus filtering: are we barking up the wrong tree?

I’m not sure whether this is just a semantic argument, with different people meaning different things by “discovery”, but frankly, finding stuff to read is not a major problem for most researchers. We live in an age of abundance with powerful search tools at our fingertips. The days of scouring through the latest print edition of Current Contents and heading over to the library with a copy card are, thankfully, long over. But those stacks of unread papers (either living as printouts on a researcher’s desk or as PDFs on their hard drive) have grown exponentially, and given the increasing workload foisted upon researchers, this overload has become an unconquerable challenge.

Every time I’ve discussed recommendation systems with researchers (“…more articles like this…” or “…readers who read this also read this…”), the most common response is, “no thanks, I already have enough to read.” I don’t know of any researcher who wants more discovery, who wants an even bigger pile through which to wade.

What they want is not discovery, but instead filtering mechanisms. Find a way for me to reduce my pile of papers to read, help me know which papers to prioritize. Most researchers are experts in collecting information about what’s going on in their fields. They don’t need help with this. What they need is help in processing the overwhelming amount of information that is available.

This is one reason why journal brands persist, and are perhaps more important now than they ever have been. Forget about the Impact Factor for a moment–if you know your field, you have a very clear sense of which journals are most relevant, and the relative quality and importance of the research they publish. It’s an imperfect system, but one that helps readers prioritize. Journal X only peripherally touches on my research and publishes a lot of low-level, incremental work. Journal Y is the key place where research in my sub-specialty is available, and they have very high standards. So papers from Journal Y go to the top of the stack and X down to the bottom.

Perhaps instead of “discovery”, we should be emulating Mies van der Rohe and thinking along the lines of “less is more.”

Can we trust these services?

If you’re like most people, when you want to learn about a new subject, you hop into your web browser and do a Google search. You assume that Google will give you search results that are trustworthy and that best reflect the nature of the question you’re asking. But given Google’s secrecy around their search algorithms, can you trust those results?

The Wall Street Journal made a Freedom of Information Act request to the Federal Trade Commission (FTC) to see the FTC’s staff report that recommended filing an antitrust lawsuit against Google. The FTC inadvertently sent the newspaper an unredacted version of the report, and the revelations are startling.

First, the report showed that Google illegally took content from competitors such as Yelp, TripAdvisor and Amazon and used it to improve the content of its own services. When competitors asked Google to stop doing this, Google threatened to delist them from search results.

That abuse of power is scary enough on its own, but what’s really relevant here is evidence of how Google cooked the books to favor its own sites:

In a lengthy investigation, staffers in the FTC’s bureau of competition found evidence that Google boosted its own services for shopping, travel and local businesses by altering its ranking criteria and “scraping” content from other sites. It also deliberately demoted rivals.

We are in the midst of an era of market consolidation. The large commercial publishers are gobbling up any interesting startup, many of which are the very companies that we’re turning to for help with content discovery and filtering.

The question then must be asked–do you trust commercially-driven companies to play fair? How long did Google’s “don’t be evil” pledge last after their IPO? Would it surprise you in any way if the recommendations coming out of a service owned by Publisher X favored articles in journals from that same publisher? Can these tools only be trustworthy if they are run by neutral third parties freed from profit motives (think CrossRef, ORCID, etc.)?

There are also important questions to ask about whether algorithms can truly determine trustworthy information. One really interesting recent development is seeing Google moving away from automation of search results and toward good old-fashioned editorial oversight. For medical information, Google is essentially admitting that popular and well-linked information is not the same thing as accurate information. They are proposing to use a panel of experts to curate information rather than crowdsourcing and relying on data collection. So does that mean that automated recommendation systems will eventually come around to employing editorial boards and peer review for suggestions?

What does “spoonfed” information do to the creative process?

While Roger Schonfeld recently wrote about the idea of building “serendipity” into automated discovery tools, I remain somewhat unconvinced that intellectual leaps can be pre-programmed at the press of a button and that the information assimilation process can be approached by broad, generic tools.

For creative endeavors, whether choreographing a new dance or making a scientific breakthrough, we rely on the vision of the individual. Joe Esposito suggests that books could be improved by paying attention to reader data and tailoring content based on their usage patterns. Personally, I find this concept somewhat horrifying.

If you rely on user data and focus groups, and your goal is to appeal to the broadest section of the bell curve, then you will much more likely end up with Two and a Half Men rather than Breaking Bad. By many measures, Two and a Half Men could be seen as one of the most successful creative endeavors ever. But when one looks back on the current “golden age” of television, I suspect that it will not likely enter the conversation.

Further, the main reason I avoid Google these days is not so much concerns about privacy as it is wanting to stay out of the “filter bubble”. Google’s algorithms are trained over time to give you answers that are like the things you have clicked on in the past. Google’s goal is to anticipate what you want to know.

That may make sense for something like shopping (David prefers skinny ties and metallic colored combat boots so let’s show him more of those), but it’s harmful when you’re trying to break new ground or learn something new. I don’t want my previous behaviors reinforced; I want my beliefs challenged. I don’t want to see research that’s like what I’ve already read or what I’ve already done; my job is to make a breakthrough into something unknown.

There is a homogenization of culture that comes through feeding everyone through the same algorithm. If everyone working on Hirschsprung’s disease is fed the same discovery cues pointing to the same papers, then this limits the scope of approaches taken and potentially slows research progress. The job of a researcher is to make new connections. We know that there is tremendous power in interdisciplinary work. Researchers who are not deeply invested in a field’s dogma are often able to bring in new viewpoints that would not have occurred to someone thoroughly enmeshed in that field.

If we leave researchers to find their own roads, does that increase the number of roads taken, and does that make all the difference?

David Crotty

David Crotty is a Senior Consultant at Clarke & Esposito, a boutique management consulting firm focused on strategic issues related to professional and academic publishing and information services. Previously, David was the Editorial Director, Journals Policy for Oxford University Press. He oversaw journal policy across OUP’s journals program, drove technological innovation, and served as an information officer. David acquired and managed a suite of research society-owned journals with OUP, and before that was the Executive Editor for Cold Spring Harbor Laboratory Press, where he created and edited new science books and journals, along with serving as a journal Editor-in-Chief. He has served on the Board of Directors for the STM Association, the Society for Scholarly Publishing and CHOR, Inc., as well as The AAP-PSP Executive Council. David received his PhD in Genetics from Columbia University and did developmental neuroscience research at Caltech before moving from the bench to publishing.


36 Thoughts on "Discovery Versus Filtering and Other Questions Raised by Data-driven Services"

Interesting stuff. An observation though. It seems you’re arguing filters are good (journals) and filters are bad (Google). This might be a case of needing new language.

Or might it be a case that the old world and the new world are still in collision, and we’re using old equipment (theoretically and practically) to look at new phenomena that the equipment is perhaps not built for?

I’m arguing that filters are good, provided that they are not perverted toward purposes other than serving the needs of the user. If a journal only accepted articles based on its ability to sell ads against them, or because they help prop up another business venture, then that’s a journal I won’t read.

Also there are visibility issues. No one reads/searches just one journal. But online behavior tends to coalesce around one dominant service (one Google, one Facebook, one eBay, etc.). If the majority of users all see the same results because they all use the same tool, that may harm diversity.

But if journals covering a sub-specialty are where people in that sub-specialty go to for their filtered content, aren’t they all using the same tool? Martin’s point seems to still stand.

I suppose to me the difference is the single path approach versus the many roads approach. If you have multiple journals and each researcher weights them independently and likely uses a different combination of them, you get a varied set of results, rather than having all using the same service giving the same consistent result.

Can you give an example of a single path approach to knowing what is going on in a subfield? I am having trouble imagining it. For that matter just browsing the top journals will not do it. I suspect that no one actually keeps up systematically.

As you’ve pointed out, there are very different complex sets of behaviors we’re talking about here–exploring a new area versus keeping up with your field as just one distinction. But people are creatures of habit and tend to set up routines for their weekly/monthly catch-up with the literature sessions. I know many biomedical researchers whose primary methodology is to do a PubMed search for their keywords and look at the results chronologically, essentially scrolling down until they see articles they’ve already read.

But as you note, it’s part of a broader strategy, reading electronic tables of contents via email, hearing talks at meetings or from visiting researchers, perhaps even thumbing through the print copy of their society’s journal. The question isn’t whether anyone does single path approaches now, the question is whether it is a goal we want to push toward and something valuable that we should spend our development funds to create.

We know that the internet tends toward network effects and consolidation (one Google, one Facebook, one eBay, etc.). Are we sure that we want to drive the dominant information seeking structure of the day into the creative realm as well? Does having one dominant algorithm making recommendations lead to a replacement of the scattered, multiple paths in use now, and is this a beneficial thing?

My inner circles algorithm may solve the problem, or at least make progress. That is the goal.

I’m not sure what you consider to be “the problem” and what would be considered “progress”.

The problem is finding a single path approach to discovering what is going on in a subfield. At present it is extremely difficult to see what is going on, requiring many laborious paths. Progress means removing a significant amount of that difficulty.

Terms like discovery and filtering are far too vague and general for a useful discussion of what I call the logic of searching scholarly content. For search engines the primary questions are what does it look at, what does it collect for a given query, how does it rank that collection and how does it present the results? Every step involves filtering and all of it is part of discovery. Moreover, we use a variety of tools and methods that are not search engines, such as authorship and citations, in unpredictable combinations. In short the logic is very complex.

I published a set of critical review journals. They all had very high IFs. I asked various folks why and what I could do to make sure they maintained their rankings. They all said, Keep a good person as EIC, maintain a strong international editorial board and select topics of high interest and reviewers who were leaders in the field. Additionally, to a person they all said that they read the reviews as a filtering process to save time. It was mentioned that there was a lot of stuff of little value out there and to have someone parse the wheat from the chaff was a real time saver.

I tend to think that a good review journal is much better than an algorithmic search engine. The problem is that there are not enough review journals, and it really is rather easy to use key words to make an indiscriminate list of articles.

Review articles and search engines are solving different problems, so neither is better than the other.

No idea why WordPress has decided to put my picture next to David Wojick’s comment above, but to be clear, he is the author of it, not me.

I like to think you’ve been editorially enhanced. I consider this a “value add”.

This is a critical discussion and is worth more than one serving out of the kitchen along with comments from the diners or a one day exchange with the “cooks”:

a) There is increased pressure to “publish” which often means leaner articles, often in different journals and, probably, different publishers. Thus we have an intellectual Easter egg hunt to find all the pieces without a reliable map of the territory. And one must not forget that most of the materials have an increasingly shorter half-life, particularly in the STM area.

b) STM publishers all have their own maps and search engines that are confined to their territories while the pieces are scattered far and wide. Even if there is a library capable of pulling in all relevant journals for all their intellectual scavengers, the system still has boundaries which require crossings.

c) There are “reviewers”, usually consulting firms, that troll the area and provide a service by playing spotter for obvious reasons; and there are “Watson” type search engines that can scan and interpret material from the title down to words in the text as well as having the ability to compose narrative. Much of this type of access is based on the ability to pay and thus not readily available to most who produce these articles or artifacts.

There are changes in the area, some are only weak signals and thus there is no timeline and no final form. The multicolored forms of OA are like bright colored distractions in the field. Thoughts on near and long term changes?

Are researchers really using Google over the ProQuests and EBSCOs of the world? Are the relevancy rankings in such databases less likely to ‘learn’ to limit discovery results over time?

Nearly every journal I’ve looked at sees between 40% and 60% of its referral traffic coming from one form of Google or another.

Definitely. Of course, many researchers don’t have any access to the ProQuests and EBSCOs! But, even those that do regularly choose Google, Google Scholar, etc. as their database searching tool of choice – especially as a starting point.

For this author, absolutely, but differently for discovery and filtering. Google Scholar has far broader reach than the subscription-based services that I have access to (Scopus, WoS, and EBSCO). More importantly, GS’s algorithms sometimes do bring those nearly serendipitous findings, where even though I don’t have the precise search terms, if I’m willing to poke around a bit, I usually discover more than I knew to search for. For the subscription services I’ve tried, I have to know what I’m looking for fairly precisely to find it (specific author, exact title, or exact keywords). “Related articles” in GS often turns up something of interest. Further, in my area of interests (water pollution ecology), the grey literature is an important information source. This material is not found by the subscription-based services. Then there’s the convenience of not having to futz with tedious logins or VPN IP authentication to get into the Scopus etc. services.
However, a generous description of Google Scholar’s filtering functions would be “limited.” No search within searches, and no sorting by anything other than how Google determines the sort shall be (except the last year). Amenities such as exports to bibliographies, saving searches, are crude or missing. For these sorts of filtering and refined searches, Scopus does a lot better. WoS has similar depth but is less functional than Scopus. After one exploration, I never bothered to use EBSCO again.
I do share David Crotty’s concerns about getting too GS dependent. Google Scholar was once prominently displayed on Google’s home page, but now one has to drill down a ways to find it. This makes me wonder how committed Google is to maintaining a service that provides no revenue. Once upon a time there was a neat feature called iGoogle for setting up personalized home search pages, until one day Google got tired of it and turned it off. With the introduction of $750 expedited peer reviews, maybe GASPs (Google Article Search Placement charges) will join APCs in the list of author publishing costs.
Thanks for another thoughtful posting and discussions.

Thanks for raising some important issues here David. The issue of how much we can trust these sorts of services is key, and extends, I think to library discovery tools as well – something I’ve heard a number of librarians express concern about.

There is much here to continue pondering. Thank you.

Your observation that there is too much to read is exactly why we need better discovery services. The problem is not that there is too much material of equally high value arriving in the researcher’s to-read queue. Rather, the problem is that existing discovery practices and services (all of which serve to some degree as filters also) are not sufficiently able to help the researcher distinguish what is of greatest value.

I believe personalization can help with this challenge to a significant degree without overly sacrificing serendipity. But it would be interesting to think together if there are other approaches to addressing what I think we agree is the underlying challenge here.

This speaks to the semantic question–does “discovery” just mean “finding stuff” or does it mean carefully separating the wheat from the chaff?

There is always some of both going on. Any list of search or anticipatory results is in some way limited relative to the index from which it is generated and then presented in some order. How effective these are at separating the wheat from the chaff is a matter of design.

I’m reminded that I once suggested that we define success of a discovery service with two factors: “first, the share of needed items that are discovered, where 100% is optimal. Second, the ratio of items discovered that are needed to those that are not needed, where a higher ratio is better.” Today I might replace “needed” with “valuable” but otherwise I think this remains a reasonable definition of success. (Quoted in http://www.sagepub.com/repository/binaries/pdf/improvementsindiscoverability.pdf )
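As a rough illustration of the two-factor definition above, here is a minimal Python sketch. The function name, the item identifiers and the idea of treating "needed" as a known set are all assumptions made for the example; in practice, knowing which items a researcher needed is the hard part.

```python
def discovery_success(discovered, needed):
    """Return the two factors from the definition above.

    share_found: share of needed items that were discovered (1.0 is optimal).
    ratio: needed items discovered divided by unneeded items discovered
           (higher is better; infinite when nothing unneeded is returned).
    """
    discovered, needed = set(discovered), set(needed)
    hits = discovered & needed    # needed items the service surfaced
    noise = discovered - needed   # items surfaced that were not needed
    share_found = len(hits) / len(needed) if needed else 1.0
    ratio = len(hits) / len(noise) if noise else float("inf")
    return share_found, ratio


# Hypothetical example: four papers were needed, five were returned.
needed = {"paper_a", "paper_b", "paper_c", "paper_d"}
discovered = ["paper_a", "paper_b", "paper_x", "paper_c", "paper_y"]

share, ratio = discovery_success(discovered, needed)
print(f"share of needed items found: {share:.0%}")
print(f"needed-to-unneeded ratio: {ratio:.1f}")
```

The first factor corresponds to what information retrieval calls recall; the second is a ratio form of precision. The items that were needed but never returned (paper_d here) are the ones a service cannot easily count for itself.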

False positives are easy to measure. What about false negatives: articles the researcher would have liked to have seen but were not returned by the discovery service? If you get up to 100%, surely you’re excluding valuable items.

These are aspirational definitions. I agree with you – it’s hard to imagine squeezing out all inefficiencies!

Note that concepts like needed and would like to have seen imply artificial intelligence, which is what makes this all so difficult (and prone to hype).

Maybe this is a special case but there is at least one category where it is easy to measure false negatives – that is searching for a known item. In our log analysis, approximately 50% of the searches done by our users are known item searches, which is pretty consistent with what the literature reports in general. Using a locally developed system that logs and tracks usage paths, we are able to compare how well each target system answered this need (though we admittedly rarely go past the first page of results in a target to check since users don’t either!) and also whether users seem to persist past one or two targets if they don’t find what they want. Overall, I’d observe that it is disappointing how often even a very specific search with multiple title keywords and an author last name doesn’t result in the item being retrieved. I recognize this doesn’t address topical searching but I have to admit that when a system doesn’t succeed with known item retrieval that is direct matching it is challenging to have faith in its keyword/topical search and relevancy ranking.

Regarding “reader needs,” one (this one, at least) is reminded of Columbia Pictures’ Harry Cohn: “The audience always knows what they want. Right after they’ve seen it.” As for “filtering,” it seems better suited for coffee than for scholarship. And, if less really is more, then I’m a buyer for all the Tiffany lamps at 50 percent less than their market value; now when can I pick them up?

Thought provoking indeed. I wonder if the rise in post publication review (e.g. Merlot) might eventually become a countervailing force.

A most useful summary piece – sending it along to our library’s discovery and delivery study team. How to “keep up” and “what can I find on xyz topic” seem significantly different use cases to me. Using my experience in the past month, I’m often overwhelmed by the inflow of things to read but I also retrieved pretty much everything that exists on a topic I was writing a paper on and it turned out there wasn’t much (of course as a scholar, discovering an unexpected and significant gap in the literature was actually quite a find!). Continuing to develop a more robust understanding of the different kinds of discovery seems to have the potential for clarifying when stemming the tide is valued, when expanding the retrieval results would be useful, and when enabling comprehensive discovery is crucial and developing different functions that serve each well. I’ll add to the list of user needs as well – once retrieved, read, and filed … discovery within one’s personal collection! Managing a personal collection can be as challenging as managing the literature at large. Thanks again!

Finding a gap is fun indeed, Lisa, and it can lead to a proposed project filling the gap. I am just writing one up. When it comes to being comprehensive I find Google Scholar’s Related Articles search very useful. It uses term vector similarity which means it will find closely related articles that do not use your search terms. In fact I have developed an algorithm that finds all and only those articles within a series of distances from a topic, using a complex combination of hundreds of related article searches. I call it the Inner Circles method, but only a computer can execute it.
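For readers unfamiliar with the term, here is a minimal Python sketch of term vector similarity in general, using cosine similarity over raw word counts. It is not Google Scholar’s implementation, which is not described here, and it is not the Inner Circles method; the function names and the sample titles are made up for illustration.

```python
import math
from collections import Counter


def term_vector(text):
    """Represent a text as a bag of lowercase word counts."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors (0.0 to 1.0)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical titles. The seed is what a search for "nutrient runoff" found.
seed = term_vector("nutrient runoff drives algal blooms in shallow lakes")
related = term_vector("phosphorus loading drives algal blooms in shallow lakes")
unrelated = term_vector("peer review delays in humanities journals")

sim_related = cosine_similarity(seed, related)
sim_unrelated = cosine_similarity(seed, unrelated)
print(sim_related > sim_unrelated)
```

The second title contains neither "nutrient" nor "runoff", yet it scores as close to the seed because the rest of its vocabulary overlaps. That is the sense in which a related-articles search can surface material a keyword search would miss.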

This isn’t only an issue with secondary lit. searches but for digital cataloguing of archival material– manuscripts, print, visual and material culture. In the same way that nineteenth-century library and archival arrangements reflected contemporary hierarchies of knowledge (edited versions of letters excising any women’s correspondence, for example, as “not of general interest”), and thus shaped scholarship produced from those arrangements, the way we are arranging information reflects particular twenty-first century values– and the scholarship resulting will, too. Google’s creepy control over hierarchies of information value being just one example.
