What began as a simple question revealed itself to be a simplistic one. This is because the allegedly simple question was presented to a group of thoughtful and experienced people, who quickly demonstrated the complexity of the original question and made me doubt whether it was answerable at all. That thoughtful group was, of course, the participants in the liblicense mail group, a list that is one of my media mainstays. I read it every day just after the New York Times, news about the Yankees, and the Google alerts set up on my own name.
The question was, “What would it cost if you wanted to buy everything?” The “everything” I had in mind was all formally published academic journals. That means peer-reviewed material, but it excludes open access publications (since you don’t have to pay to read them). Also excluded for this exercise are books, databases, and other content types that mostly sit outside the bulk of discourse about scholarly communications. You will see in a minute just how complicated the original question is, but I had imagined what I thought was a fairly straightforward use case.
Here is that use case. Let’s imagine a new company that seeks to create a business around data analytics of research material. To do this it needs to have access (including the rights for text- and data-mining) to all the research material. One way to get access is to seek permission from all the publishers, but as there are thousands of journal publishers, this would be arduous and take forever. Another way, which may not be legitimate (I have no opinion about the copyright issues on this, so please, spare me in the comments), would be to cobble together access to a number of research libraries; presumably when aggregated, these libraries, with some specialist outfits thrown in, would be able to provide access to everything. But that also would be hard to do, not only because of the difficulty of determining which libraries to put into the pool but also because of the need to hold a legitimate registration with all these libraries–and to be able to download literally millions of articles from them without someone putting up a STOP! sign.
Theoretically, you could work around these problems if you had a big enough checkbook; you could simply buy (or lease, as many people reminded me) access and then text-mine to your and your robot’s hearts’ content.
And this is where it gets complicated. The first question is: When you say everything, would you please submit a list of titles? I had assumed that the number would be around 25,000, a number that gets tossed around all the time, but whose authority I should have challenged. After posing this question on liblicense, I got answers that ranged from 13,000 to close to 50,000. I am making no attempt to reconcile this, but it is clear that we can’t talk about buying everything if we don’t know what everything consists of.
Intriguingly, some respondents (I got feedback both directly from the liblicense list and privately) noted that no one would want to acquire everything because there is no institution that has programs in every field and sub-field; and beyond that there is the matter of the quality of the publications, even if they all claim to be peer-reviewed. This opens up the question of who is the audience for a large collection of journals. If the audience is an individual or group of individuals, then surely there are some publications that are simply out of bounds, either because their quality is poor, their subject area is irrelevant, or they simply add mere poundage to the already terrible burden of trying to keep up in one’s field. But if the audience is a bot, doesn’t everything change? Why stop at 10,000 journals, even if they are the best and most relevant, when you can scoop up 25,000? 50,000? Isn’t more better–when the consumer is a machine? This leads me to believe that we are likely to see a rethinking of library collections with a machine audience in mind. Yes, this is unsettling, but as anyone who has tried to build a Web site knows, the audience for content on the Internet is other machines, which spider and index ostensibly “human” expression; and Google is the Grand Poobah, the auditor with pride of place, to which all communications are at least initially directed.
A second complication in determining how much everything would cost is that backfiles are important, too, so the cost of accessing those documents has to be added in. It’s pretty clear that it would be hopeless to try to add all this up, though across-the-board subscriptions to EBSCO, ProQuest, and JSTOR will get you a good part of the way. But to these aggregators you would need to add the backfiles of such major publishers as Elsevier and the newly constituted Nature/Springer. And some backfiles are simply not available in digital form at all, and not every publisher has a plan in place for that material. So to the already challenging question of the cost of current issues we have to add some unknown figure for the available backfiles.
As one drills into this question, the number of variables grows:
- Recency. This is the question of current issues vs. backfiles.
- Class of account. I began by asking about the cost to a new commercial technology company, but pricing may vary by class. Academic libraries, for example, may pay a different amount from commercial start-ups.
- Size of account. Would a start-up pay the same amount as IBM or Hewlett Packard? Wouldn’t a big user pay more? This applies to libraries as well: would a small library pay as much as a big one?
- Range of materials. As noted above, we don’t have a consensus on the number of peer-reviewed journals.
- Extent of rights. Depending on what you want to do with the material, the cost may vary.
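To see how quickly these variables compound, here is a minimal sketch of a toy pricing model. Every figure and multiplier in it is invented for illustration; none is drawn from any actual price list or contract.

```python
# Illustrative only: a toy model of how the variables listed above
# (recency, class and size of account, range of materials, extent of
# rights) could compound into a total price. All numbers are made up.

def estimated_cost(num_titles, avg_price_per_title,
                   include_backfiles=False, backfile_multiplier=1.5,
                   account_class_multiplier=1.0, size_multiplier=1.0,
                   tdm_rights_multiplier=1.25):
    """Toy estimate: base spend scaled by each variable in turn."""
    base = num_titles * avg_price_per_title   # range of materials
    if include_backfiles:
        base *= backfile_multiplier           # recency: add backfile access
    base *= account_class_multiplier          # class of account
    base *= size_multiplier                   # size of account
    base *= tdm_rights_multiplier             # extent of rights (e.g., TDM)
    return base

# A commercial start-up buying current issues of 30,000 titles at a
# guessed $900 each, with text-mining rights:
print(estimated_cost(30_000, 900))  # 33750000.0
```

The point of the sketch is not the output but the shape: each variable is a multiplier on an already uncertain base, which is why the estimates people offered varied by a factor of six or more.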
And then there is the interesting matter of the pricing of Big Deals. As some commentators have noted, pricing can be opaque–in part because of confidentiality clauses in contracts, but also because much pricing for Big Deals was based on historical pricing for print publications. An academic library that signed onto such a package years ago may be paying quite a bit less than a library that entered the market today. There is, in other words, no single answer to the question.
Still, people tried to come up with reasonable estimates. The lowest was $13 million, the highest $85 million. The number I found most persuasive, in large part because I personally know the individual who offered it up, was a range of $25-$30 million for current issues (no backfiles) of about 30,000 journals. But in truth we don’t know. We could as easily opine on what the weather will be like in October.
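For scale, the favored range implies a per-title figure that is easy to work out; the division below simply restates the $25-$30 million estimate for 30,000 journals on a per-journal basis.

```python
# Back-of-envelope arithmetic from the estimates above: a $25-30 million
# range for current issues of roughly 30,000 journals.
low, high, titles = 25_000_000, 30_000_000, 30_000

per_title_low = low / titles    # roughly $833 per title per year
per_title_high = high / titles  # $1,000 per title per year

print(round(per_title_low), per_title_high)  # 833 1000.0
```

That is, the most persuasive estimate amounts to something in the neighborhood of $900 per journal per year, which gives a rough sense of what each additional title in "everything" would add to the bill.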
So I think my hypothetical tech company is going to have to go back to making a large number of house calls. To index everything is a big, big job, and one that is not easily solved with money.
This little experiment has made me want to run another one. When librarians say that they cannot afford to purchase (or lease!) all the materials they require, what is the absolute number they have in mind? If the budget were exactly equal to what librarians want to acquire, what would that budget be? Now, this again is not a simple question. Libraries vary by institution, of course, and some of the other complications of the first experiment (e.g., different base rates for Big Deals) would still apply. But I wonder how much more money libraries would like to spend. Ten percent more? Fifty percent? Five times their current budget? When we talk about a shortfall in a library’s budget, just how large is that shortfall?
No one is ever going to fund that full budget, of course, and no one should. The number of publications available for purchase is not fixed: increase the budget and the number will rise. And we all know that prices have a way of moving up, which a larger budget would guarantee. Nonetheless, I am still surprised that for all the discussion of tight library budgets, we don’t really know what a loose-fitting one would look like.