[Image: Bags of money. Spot art from the booklet “New Orleans – City of Old Romance and New Opportunity,” circa 1921.]

What began as a simple question revealed itself to be a simplistic one. This is because the allegedly simple question was presented to a group of thoughtful and experienced people, who quickly demonstrated its complexity and made me doubt whether it was answerable at all. That thoughtful group was, of course, the participants in the liblicense mail group, a list that is one of my media mainstays. I read it every day just after the New York Times, news about the Yankees, and the Google alerts set up on my own name.

The question was, “What would it cost if you wanted to buy everything?” The “everything” I had in mind was all formally published academic journals. That means peer-reviewed material, but it excludes open access publications (since you don’t have to pay to read them). Also excluded for this exercise are books, databases, and other content types that mostly sit outside the bulk of discourse about scholarly communications. You will see in a minute just how complicated the original question is, but I had imagined what I thought was a fairly straightforward use case.

Here is that use case. Let’s imagine a new company that seeks to create a business around data analytics of research material. To do this it needs to have access (including the rights for text- and data-mining) to all the research material. One way to get access is to seek permission from all the publishers, but as there are thousands of journal publishers, this would be arduous and take forever. Another way, which may not be legitimate (I have no opinion about the copyright issues on this, so please, spare me in the comments), would be to cobble together access to a number of research libraries; presumably these libraries, when aggregated and with some specialist outfits thrown in, would be able to provide access to everything. But that also would be hard to do, not only because of determining which libraries would be necessary to put into the pool, but also because of the need to hold a legitimate registration with all of them, and to be able to download literally millions of articles without someone putting up a STOP! sign.

Theoretically, you could work around these problems if you had a big enough checkbook; you could simply buy (or lease, as many people reminded me) access and then text-mine to your and your robot’s hearts’ content.

And this is where it gets complicated. The first question is: When you say everything, would you please submit a list of titles? I had assumed that the number would be around 25,000, a number that gets tossed around all the time, but whose authority I should have challenged. After posing this question on liblicense, I got answers that ranged from 13,000 to close to 50,000. I am making no attempt to reconcile this, but it is clear that we can’t talk about buying everything if we don’t know what everything consists of.

Intriguingly, some respondents (I got feedback both directly on the liblicense list and privately) noted that no one would want to acquire everything because there is no institution that has programs in every field and sub-field; and beyond that there is the matter of the quality of the publications, even if they all claim to be peer-reviewed. This opens up the question of who the audience is for a large collection of journals. If the audience is an individual or group of individuals, then surely there are some publications that are simply out of bounds, either because their quality is poor, because the subject area is irrelevant, or because they simply add mere poundage to the already terrible burden of trying to keep up with one’s field. But if the audience is a bot, doesn’t everything change? Why stop at 10,000 journals, even if they are the best and most relevant, when you can scoop up 25,000? 50,000? Isn’t more better, when the consumer is a machine? This leads me to believe that we are likely to see a rethinking of library collections with a machine audience in mind. Yes, this is unsettling, but as anyone who has tried to build a Web site knows, the audience for content on the Internet is other machines, which spider and index ostensibly “human” expression; and Google is the Grand Poobah, the auditor with pride of place, for which all communications are at least initially intended.

A second complication in determining how much everything would cost is that backfiles are important, too, so the cost of accessing those documents has to be added in. It’s pretty clear that it would be hopeless to try to add all this up, though across-the-board subscriptions to EBSCO, ProQuest, and JSTOR will get you a good part of the way. But to these aggregators you would need to add the backfiles of such major publishers as Elsevier and the newly constituted Springer Nature. And some backfiles are simply not available in digital form at all, and not every publisher has a plan in place for digitizing that material. So to the already challenging question of the cost of current issues we have to add some unknown figure for the available backfiles.

As one drills into this question, the number of variables grows:

  • Recency. This is the question of current issues vs. backfiles.
  • Class of account. I began by asking about the cost to a new commercial technology company, but pricing may vary by class. Academic libraries, for example, may pay a different amount from commercial start-ups.
  • Size of account. Would a start-up pay the same amount as IBM or Hewlett Packard? Wouldn’t a big user pay more? This applies to libraries as well: would a small library pay as much as a big one?
  • Range of materials. As noted above, we don’t have a consensus on the number of peer-reviewed journals.
  • Extent of rights. Depending on what you want to do with the material, the cost may vary.

And then there is the interesting matter of the pricing of Big Deals. As some commentators have noted, pricing can be opaque–in part because of confidentiality clauses in contracts, but also because much pricing for Big Deals was based on historical pricing for print publications. An academic library that signed onto such a package years ago may be paying quite a bit less than a library that entered the market today. There is, in other words, no single answer to the question.

Still, people tried to come up with reasonable estimates. The lowest was $13 million, the highest $85 million. The number I found most persuasive, in large part because I personally know the individual who offered it up, was a range of $25-$30 million to pay for current issues (no backfiles) of about 30,000 journals. But in truth we don’t know. We could as easily opine on what the weather will be like in October.
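For a rough sense of scale, here is the arithmetic that estimate implies. The simple averaging is my own assumption, not anything the respondents supplied:

```python
# Back-of-the-envelope arithmetic: what the $25-$30 million estimate
# implies per title for roughly 30,000 journals. The flat average is an
# assumption; real prices vary enormously by title.
low, high = 25_000_000, 30_000_000   # estimated total annual cost, USD
titles = 30_000                      # estimated count of journals
print(f"Implied average price per title: ${low / titles:,.0f}-${high / titles:,.0f}")
# -> Implied average price per title: $833-$1,000
```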

So I think my hypothetical tech company is going to have to go back to making a large number of house calls. To index everything is a big, big job, and one that is not easily solved with money.

This little experiment has made me want to run another one. When librarians say that they cannot afford to purchase (or lease!) all the materials they require, what is the absolute number they have in mind? If the budget were exactly equal to what librarians want to acquire, what would that budget be? Now, this again is not a simple question. Libraries vary by institution, of course, and some of the other complications of the first experiment (e.g., different base rates for Big Deals) would still apply. But I wonder how much more money libraries would like to spend. Ten percent more? Fifty percent? Five times their current budget? When we talk about a shortfall in a library’s budget, just how large is that shortfall?

No one is ever going to fund that full budget, of course, and no one should. The number of publications available for purchase is not fixed: increase the budget and the number will rise. And we all know that prices have a way of moving up, which a larger budget would guarantee. Nonetheless, I am still surprised that for all the discussion of tight library budgets, we don’t really know what a loose-fitting one would look like.

 

Joseph Esposito

Joe Esposito is a management consultant for the publishing and digital services industries. Joe focuses on organizational strategy and new business development. He is active in both the for-profit and not-for-profit areas.

Discussion

20 Thoughts on "What Would It Cost to Buy Everything?"

Joe, interesting thought. But everything is for sale at the right price.

Talk to the folks at the CCC about the challenges of aggregating everything. It probably comes as close as any company to having a “complete” collection, but it will be the first to admit that there are still many journals outside its orbit. The CCC also has a pilot project under way on data- and text-mining.

What it would cost to buy everything is a largely misleading concept because libraries don’t really want to buy everything. We would like to have access to everything related to the fields of research and instruction supported by our institution, but that isn’t everything. Notre Dame, for example, does not have a nursing program, so there are many titles that are clearly out of scope for our collections. IU-Dentistry probably has little to no use for the content we have on structural engineering or high-energy physics. It’s also the wrong term because we don’t buy the most expensive journals we offer; we license them and lose all the rights that would come through the first sale doctrine. Of course we also are able to provide access much faster than in that print world where issues had to be printed, posted, received, recorded, shelved, and so on.

It’s a fun discussion question, but the issue is access to specific content. For nearly all disciplines the importance of the journal issue vanished long ago; it’s a relic of a print-based distribution model where an issue might have one or two articles that were needed locally, but we couldn’t know which of the dozen in the issue were the ones we would need, nor was there a way to acquire just those articles. What we really want are discovery tools and immediate access to the specific articles our users want, and we want to know that will still be true for our users next year and fifty years from now.

Collette is right. But I think the point needs to be made more clearly. Why would you want to pay $25 to $30 million for subscriptions to everything? This makes no sense. The only reason a subscription to a journal makes sense is if it is the cheapest way to acquire articles. Otherwise the articles should be purchased one at a time. We have known since Bradford articulated his Law of Scattering that the use of journals follows a long-tail distribution. For the few journals that get lots of use, subscriptions make sense. For the long tail, individual article purchases make sense. I would bet that you could get all of the articles even a large university needs for far less than $25 to $30 million.
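A minimal sketch of that logic, with a Zipf-shaped usage curve standing in for Bradford’s distribution; every number here (prices, journal count, top-title usage) is a made-up assumption, not real market data:

```python
# Compare two strategies: subscribe to every journal, or subscribe only
# where the subscription beats buying the articles one at a time.
N_JOURNALS = 30_000      # assumed size of "everything" (figure from the post)
SUBSCRIPTION = 900.0     # assumed average annual subscription price, USD
PER_ARTICLE = 35.0       # assumed pay-per-view price per article, USD
TOP_JOURNAL_USE = 2_000  # assumed downloads/year for the most-used title

def annual_use(rank: int) -> float:
    """Zipf-style long tail: usage falls off as 1/rank."""
    return TOP_JOURNAL_USE / rank

subscribe_all = N_JOURNALS * SUBSCRIPTION

mixed = 0.0
subscriptions = 0
for rank in range(1, N_JOURNALS + 1):
    ppv_cost = annual_use(rank) * PER_ARTICLE
    if SUBSCRIPTION < ppv_cost:   # heavily used head: subscribe
        mixed += SUBSCRIPTION
        subscriptions += 1
    else:                         # long tail: buy articles singly
        mixed += ppv_cost

print(f"Subscribe to everything: ${subscribe_all:,.0f}")
print(f"Mixed strategy:          ${mixed:,.0f} ({subscriptions} subscriptions)")
```

Under these invented numbers the “subscribe to everything” route costs about $27 million, while the mixed strategy subscribes to only a few dozen heavily used titles and comes in well under $1 million, which is the commenter’s point in miniature.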

I think Joe mentions mining the entire corpus as a goal. I would be interested in mapping the flow of ideas, which often jump from one field to another. There are already people doing total science mapping using citations and co-authors, because that data is available. But full text for all of science could be extremely valuable. Everyone claims that they want to understand what is going on in science, and this is a way to see that.

Access to content is not enough. For efficient text and data mining, the content needs to be normalized. That is a service the CCC provided for its pilot project.

“When we talk about a shortfall in a library’s budget, just how large is that shortfall?”

Speaking for myself in my library, our shortfall is measured in terms of faculty frustration. (Student frustration too, but when it comes to journal subscriptions the frustration is more often and more loudly expressed by faculty.) We’re forced by the limitations of language to be imprecise when we talk about what we “can’t afford.” Obviously, there is no journal that we couldn’t afford if we were willing to cancel everything else. Generally, I think, we say that we “can’t afford” something when we can’t buy it without having to cancel something else of equal or greater value.

The question of how much more we’d like to spend is a very interesting one, and as you note, the answer would vary from library to library. I couldn’t answer it for mine without some analysis: we do have a wish list of journals that we know we’d like to pick up but “can’t afford to” at the moment, and there are surely others we’d add to that list if we received a windfall of recurring money.

A fascinating thought-experiment! Indeed, a string of thought-experiments that all circle around the idea of “value.” First, in trying to define “everything” you’ve surfaced the idea of the audience (library, researcher or robot) – “everything” for a biologist could easily leave out vast swaths of physics and philosophy – depending on the biologist, it could also leave out parts of biochemistry, chemistry, even medicine. “Everything” for the biological sciences library would be broader, but might exclude poor science or weak journals, unless they’re local and/or local-language? Your robot reader poses the most interesting idea of “everything” because of the potentially very different purpose.
So, everything depends on audience, but also on intention.

Which leads to part 2 of your experiment – “cost.” If you consider “cost” one part of the determination of “value,” then things get really interesting. What if money were no object – who would really want “everything”? Who goes through ALL the pages of Google results (I just got 3.8+ million for “Alessandro Volta”)? Perhaps your robot wants, and could read, everything – and that for the purpose of mining only those aspects that are of value to its analytics – so even the robot isn’t reading everything of the everything.

My experience as one who uses content and one who spent many years filtering content for others’ use is this: Most readers want to know that “everything” was searched, but they only want to see what’s “valuable” to them.

But if you’re the one providing the search results, you do need access to everything. The use case Joe presents above is for a company doing text/data-mining, where the more comprehensive your source material is, the better your results will be. That’s different than the use case for a reader or a library, but there are places where “everything” indeed means “everything”.

Agreed, generally – different users, different use-cases.

So it seems I wasn’t clear in my comment. Joe does touch on both use-cases – the library (acting presumably on behalf of groups of human users) and the robot (acting on behalf of a mining algorithm). I meant to indicate that these are interestingly different, so I am glad he mentioned both.

I would still argue for keeping the real world in mind when defining “everything.” If you are the one providing search results, your cost-benefit hat is still in place in the real world. What if your users, in their hunt for what’s valuable, never identify results from content set X in that equation? The fact that you’ve searched it and served it up over and over without any takers will feed back: how much am I paying (cost) for this thing almost no one is using (value)? Yes, it’s a different question – not unrelated – but somewhat less abstract.

I am not sure that “everything” means EVERYTHING even in text/data-mining. Mining means you’re plowing through “everything” to pick out something – relationships, concept-maps, trends, linguistic development. If you’re mining words, do you need the “everything” of the images? Or the author names? Or the journal titles?

What about the part of everything published too long ago, too obscurely, or too recently to be included? Does that de-value your entire effort?

Even if you’re dredging every single bit of content from all inputs, there would still be a question of the value extracted from the outer edges of the “everything,” and of the time and computing (cost) it takes to include content that produces little or no inflection in the results (value).

Since it is, in practical terms, impossible to have the idealized Everything – what does close-enough-to-everything look like? That depends strongly on the use-case – and that’s where the question of cost and the question of value intersect.

“I would still argue for keeping the real world in mind when defining ‘everything.’ If you are the one providing search results, your cost-benefit hat is still in place in the real world. What if your users, in their hunt for what’s valuable, never identify results from content set X in that equation?”

Case in point: Google no longer indexes the full text of scholarly articles in its general search results, doing so only for Google Scholar, based on users either not clicking on academic papers or clicking and then rapidly going back to the search results for something else.

David – curious whether you can point to something from Google on the issue of no longer indexing full text of scholarly articles in general search? This points in a different direction for the future of GScholar than other things I have seen.

Google does not publicly discuss their algorithms. But it’s fairly easy to demonstrate if you have access to a recent journal article that’s limited to subscription access. Do a search on a sentence from the abstract and then do a search on text from deep within the article. Compare between Google and Google Scholar.*

*Comment redacted to not expose a private communication

Thanks David! I did a bit of searching to try and see what I could compare, but it helps to have a more specific case to focus on. I appreciate you sharing… hopefully you don’t get in trouble for revealing! 🙂

I’m now going to sit quietly in a corner and ponder the implications that librarians – who may or may not like that their users rely on Google – are out of the loop by and large on this sort of information. Personally, I’m fine with users using whatever discovery tools they want but I’d also like to be able to help them figure out what the heck is going on with results sets when they ask!

Seems a better approach is to look at the budgets of top research libraries, e.g., Harvard, and see what they pay for journal access. From what I remember, their total budgets are less than what you stated here for journal access. But your hypothetical company might face different prices than a university library. Normally, these things are based on the number of users…

Approaching this from a slightly different direction, perhaps the cheapest way to access the largest amount of content for text mining would be for designates from the theoretical company to enroll in as many online degree programs as possible. This assumes that employees of the theoretical company could get into the range of programs needed for complete coverage (maybe they’d need to hire a few students to fill in some gaps), but if they could, the cost of one class in maybe 100 or fewer programs seems likely to bring them pretty close to access to most of what’s out there. And the cost of that one online class times 100 or even 200 is likely to be far less than $13M. I’m probably missing something obvious with this proposal, but I thought I’d throw it out there as an interesting thought exercise. Maybe it would make more economic sense to consider paying for access to the education than paying for access to the journals.
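A rough sanity check on that comparison; the tuition figure is an assumption, not a quoted price:

```python
# Compare the commenter's enrollment route against the lowest
# "buy everything" estimate from the post. Tuition is assumed.
cost_per_class = 1_500         # assumed cost of one online class, USD
programs = 200                 # upper end of the commenter's estimate
enrollment_route = cost_per_class * programs
lowest_estimate = 13_000_000   # lowest "buy everything" figure from the post

print(f"Enrollment route: ${enrollment_route:,}")                    # $300,000
print(f"Lowest estimate:  ${lowest_estimate:,}")                     # $13,000,000
print(f"Ratio: {lowest_estimate / enrollment_route:.0f}x cheaper")   # ~43x
```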

Joe asks an amusing question. One of the reasons it is so hard to quantify is that the current system organizes itself in the opposite direction. That is, scholarly communication in the form of peer-reviewed articles is very hierarchically ranked in order to, among other things, save scholars and researchers the labor of looking at articles below a certain rank. And yes, I know this has selection-bias problems, but collecting everything and reporting on it would amplify this problem, not correct it. Only a machine with an algorithm would ever be involved in such a project.
An actual start-up business would have to make a lot of calls and set up a lot of deals, but this start-up would likely be a champion of Green Open Access.
