Data-mining the Library

Information in a library is of two kinds — there is the content, the collection, all that stuff that resides in books and journals and special collections; and there is the information about that content, the metadata: information about where things are located, how they relate to other things, how often they circulate (but, rarely, for privacy reasons, about who actually accesses and reads the content). It’s that latter kind of information, the metadata, I am interested in, as it may provide value to certain organizations, value that libraries may seek to tap.

I have been thinking about library circulation for some time, but my interest grew when I began to study patron-driven acquisition (PDA) last year, as eliminating books that don’t circulate is one of the reasons librarians are interested in PDA in the first place. PDA programs implicitly ask two questions: Why would libraries want to acquire books that don’t circulate? and, Why would publishers want to publish books that nobody reads? The risk with questions like these is that they can turn scholarly publishing into a popularity contest. After all, if the measure of a successful scholarly book is how many copies are sold and how often that title circulates in a library, then the big trade publishers would become the model for all publishing, drowning out the specialized, intellectually serious work that is the business of a research university. But surely there is a middle ground between bestsellerdom and total obscurity. Information about how books circulate in libraries would help publishers evaluate their lists and provide guidance for future editorial acquisitions.

Publishers do, of course, have some limited information coming back to them from the marketplace. No publisher fails to study the Amazon rankings of a title, for example (some authors do nothing else), and there are services that provide a modicum of information about the sales of books in retail outlets. Scholarly publishers have a special problem, however, in that their titles are sold disproportionately to libraries, so the absence of circulation data affects them more seriously than it does trade houses.

You would expect to be able to go online and look all this stuff up, even if some of it resides behind a paywall. But you can’t: there is no place to go to get aggregate data on library circulation. So, for example, Stanford University Press published a book called “Monopolizing the Master: Henry James and the Politics of Modern Literary Scholarship,” by Michael Anesko, a book I am choosing at random (though it sounds interesting). Is it not reasonable to ask what libraries purchased a copy and how often it has circulated? You can get an answer to the first question by looking at WorldCat, but the second question is unanswerable at this time.

Individual libraries study their circulation records carefully. I have previously cited on the Kitchen a rigorous study done at Cornell, and I imagine many libraries have something of this kind in hand; librarians being librarians, one assumes that such studies get passed around informally. But there is a place for a full service, one that aggregates the circulation data, properly anonymized, of all library collections, and that can generate management reports for interested parties.

So let’s imagine a new library service called BloodFlow, which sets out to aggregate the circulation records of all the world’s libraries. The libraries themselves would have to be tagged by type (e.g., by their Carnegie classifications or by a different taxonomy) so that one could distinguish among the major ARLs, liberal arts college libraries, and the libraries of community colleges, as well as, of course, school, public, and corporate libraries. Circulation data from all these libraries would be uploaded to BloodFlow, which would aggregate the data in a form that allowed it to be packaged according to the needs of any particular user. For example, a librarian at the University of Michigan may contemplate whether to purchase a revised edition of a book first published by Rutgers University Press 10 years ago. What is the demand among research universities for this title? If the circulation in the aggregate is strong, Michigan may decide to purchase the book. Or a librarian at a public library may look at the circulation records for a book that is already in print from Palgrave Macmillan. If the records show that virtually all of the book’s circulations were at the top ARLs, the librarian may pass on that title as not a good fit with a public library’s collection.
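
To make the aggregation step concrete, here is a minimal sketch in Python of how a service like BloodFlow might roll up loans by library type. Everything in it is an assumption made for illustration: the `Loan` record, its field names, the library-type labels, and the function names are hypothetical and do not describe any existing library system. The patron identifier is read but never copied into the output, which is the sense in which the aggregate is anonymized.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Loan:
    """One checkout as a member library might upload it (hypothetical schema)."""
    title_id: str      # placeholder identifier for the book
    library: str       # contributing library
    library_type: str  # e.g. "research", "public" (assumed taxonomy)
    patron_id: str     # present in the raw record, dropped on aggregation


def aggregate_by_library_type(loans):
    """Count loans per title and library type.

    Patron and individual-library identity are never written to the result,
    so only the anonymized totals leave this function.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for loan in loans:
        counts[loan.title_id][loan.library_type] += 1
    return {title: dict(by_type) for title, by_type in counts.items()}


def share_by_type(counts, title_id):
    """Fraction of a title's loans that came from each library type."""
    by_type = counts.get(title_id, {})
    total = sum(by_type.values())
    if total == 0:
        return {}
    return {lib_type: n / total for lib_type, n in by_type.items()}
```

A public librarian weighing a purchase could call `share_by_type` on the aggregated counts and see at a glance whether a title's loans are concentrated in research libraries.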

Publishers would make different uses of this data. Should I bring a book back into print? Let’s check the circulation records. Or, we have a submission here on Byzantine studies; how can we assess the market opportunity? Publishers would also be interested in trends: Are books in Women’s Studies circulating more or less strongly over the past decade, and how do these circulation records compare to that of collections as a whole? Or how about economics, or physics? Once you begin to study data like this, the number of new questions that arise can be mind-boggling. Mix a curious mind with a large data set and the tools to manipulate it and suddenly you find that you have given birth to a new Edison or Tesla.

One way to get this service to work would be to set up a membership organization, the BloodFlow Partnership. Any library could join, with the following conditions: there is a membership fee, scaled by size and type of library, and the library must make all its circulation records available to the partnership. A member would then have unrestricted access to the data, including the report-generation feature. (An interesting question is whether information about the reports requested, the meta-metadata, would be part of the service as well.) Non-members would have to pay a fee, which would once again be scaled by type and size. Suppose that, for whatever reason, Colby College decides not to participate but subscribes to the service; the price for Colby would be far less than that paid by Oxford University Press or Simon & Schuster. Thus the business model is a combination of membership and toll-access publishing. Ideally, the circulation records would be available in real time (How many copies of “Administrative Law: The Informal Process,” by Peter Woll and published by the University of California Press, are in circulation right now?), but this may be hard to achieve technically. The more granular the data, the better, but even annual circulation figures from libraries without the technical means to publish an API to their circulation records would have some value.
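
The point about granularity can be sketched in a few lines of Python. The idea is that libraries able to supply individual transactions and libraries able to supply only annual totals can still be combined, provided everything is reduced to the coarsest common grain, here one count per title per year. The input shapes and the function name below are hypothetical, chosen only for this illustration.

```python
from collections import Counter
from datetime import date


def yearly_totals(transactions, annual_reports):
    """Merge two kinds of contribution into per-title, per-year counts.

    transactions:   iterable of (title_id, checkout_date) pairs from
                    libraries that can export individual loans.
    annual_reports: iterable of (title_id, year, count) triples from
                    libraries that can only report yearly totals.
    """
    totals = Counter()
    for title_id, checkout_date in transactions:
        totals[(title_id, checkout_date.year)] += 1
    for title_id, year, count in annual_reports:
        totals[(title_id, year)] += count
    return dict(totals)
```

The cost of this design is that the finer detail in the transaction-level data (dates within the year, loan lengths) is discarded in the merged view; a fuller service would presumably keep it alongside the yearly rollup.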

There is a corollary to this argument, and that is that with more and more libraries getting into the publishing business in some way, usually with various kinds of open access services, there is an unanswered, even unasked editorial question: What is the right kind of content for a library to publish? In my view, the best new publishing enterprises focus on new and growing content areas. A library that seeks to publish material in European history must contend with the program at OUP; a library interested in American history will have strong competition from Harvard University Press; and, most obviously, a library interested in STM journals will find such organizations as Elsevier, Springer, and Wiley Blackwell fiercely defending their turf. But aggregate library metadata is another matter. This information is proprietary to libraries; only they have access to it, only they can publish it. It’s a great competitive position to be in. The beautiful irony is that the paying customers for such services will in part be traditional publishers.

Joseph Esposito

@josephjesposito

Joe Esposito is a management consultant for the publishing and digital services industries. Joe focuses on organizational strategy and new business development. He is active in both the for-profit and not-for-profit areas.

Discussion

15 Thoughts on "Data-mining the Library"

Joe, this is a wonderful post, but I want to play devil’s advocate for a minute. Should a publisher really care who is checking out their books, or should they only care that they are purchased by libraries? Most book selection in university libraries is done through purchasing plans set up by book vendors, so the decision to acquire a book is based on whether it fits into a particular classification (e.g., American history published by a university press under $50). In this sense, a publisher already has a good idea of how many copies will be sold. What cannot be known, however, is when a scholarly book will jump from the world of academic readers into the realm of general readership. Often this is facilitated by getting a few excellent book reviews, an NPR interview by Terry Gross, or being recommended by a diva celebrity (e.g., Oprah). A publisher cannot predict these events. Success begets success, and a hardcover with a print run of 200 suddenly becomes a million-copy softcover.

My point is that most publishers have a good sense of the academic marketplace, which is dominated by small runs dictated by automated purchase plans. Circulation data cannot help you here. Indeed, it would require waiting even longer than the accumulation of citation data. What cannot be predicted are these small, unlikely events that propel a title from the academic to the broader reader community. Many editors like to believe, in hindsight, that they knew this was coming, but in reality, the fate of these stars is written in the heavens.

By Phil Davis
Jul 24, 2012, 10:36 AM

Should a publisher really care who is checking out their books, or should they only care that they are purchased by libraries?

In the days before PDA, that may have been all a scholarly publisher needed to worry about. But PDA connects reading with buying, which means that in an increasingly PDA-driven library market, publishers have an increasing need to worry about whether their books will actually be used, not just whether they will fit into libraries’ patterns of speculative purchasing.

By Rick Anderson
Jul 24, 2012, 8:46 PM

Princeton U.P. has a classic example of this on its backlist. When first published in the Bollingen Series, Richard Wilhelm’s translation of the I Ching sold maybe 500 copies a year, as it was considered a fairly esoteric text. Then in the late 1960s it was discovered by hippies, and suddenly the press was shipping a thousand copies a month to Haight-Ashbury. Rarely has a book leapt so swiftly from scholarly obscurity to mass popular success. (The same thing happened later with Joseph Campbell’s “The Hero with a Thousand Faces,” originally published in 1949, when it got onto the NY Times best seller list after Bill Moyers held a series of interviews with Campbell on PBS.)

By Sandy Thatcher
Jul 26, 2012, 5:01 PM

This is indeed a question of risk. “Why would libraries want to acquire books that don’t circulate? and, Why would publishers want to publish books that nobody reads?” One could keep going: “Why should a researcher/writer produce a manuscript that may not be read?”

By bcohen99
Jul 24, 2012, 11:15 AM

Certainly in the Humanities and some Social Sciences, libraries have traditionally purchased books that are not immediately borrowed, but might have their first circulation a few years later (I’ve done micro studies in my own library and my own subject areas which show this). In specialized sub-fields this might be a function of when seminars are taught or when a new specialist is hired or when the next dissertation on a second-tier author or topic is written. Such purchasing for the future might be a luxury libraries can no longer afford (especially since most academic libraries are buying fewer books each year). PDA might provide a way of waiting until items are in demand before libraries purchase them, but it would not surprise me to see far fewer monographs published. Will publishers want to invest money in titles that show little or no return for a few years? And what will be the impact on tenure decisions in fields that still expect or demand monograph publication for the successful candidate?

By Kevin Mulcahy
Jul 24, 2012, 5:06 PM

I’ve borrowed books and journal issues from my university library that have not been borrowed (or even opened!) since they were bought in the 60s. I’m glad they bought them.

By DR
Jul 25, 2012, 2:30 AM

Of course, with POD available, a publisher does not have to invest as much capital up front in inventory waiting for future sales. Indeed, PDA would not have worked well in the days before digital printing because waiting for a few years to order might have meant the book was out of print by that time!

By Sandy Thatcher
Jul 26, 2012, 4:56 PM

Hello Joseph
Do you know about smartsm? http://www.smartsm.com/ It’s used in quite a few of the public library services in the UK and taps into a few of the areas you touch upon. They’ve produced a whitepaper on the areas it covers.
Gary

By garygre
Jul 24, 2012, 5:53 PM

Thank you for the link. I did not know about Smartsm.

By Joseph Esposito
Jul 24, 2012, 8:01 PM

Intriguing idea, Joe. I expect librarians would have significant privacy concerns, though. They are rightly fierce about guarding patrons’ data. I also wonder whether collective data could be tailored enough to satisfy individual institutions’ particular needs. You’d need some excellent filters to get good matches among colleges/universities. I’m not saying it’s impossible, just wondering.

By Jennifer Howard
Jul 25, 2012, 10:40 AM

Privacy is one of the two big questions. The other is whether the dogs will eat the dog food; that is, will anyone pay for these reports? I don’t know the answer to the privacy question. It is one of the things that would have to be studied carefully. I hope it’s clear that I was referring to fully anonymized data. That is, you would know that “Crime and Punishment” was placed on the reshelving cart at Clemson U., but you wouldn’t know that I put it there.

By Joseph Esposito
Jul 25, 2012, 12:41 PM

I’m intrigued by the notion of a central database of library circulation information, but knowing the library profession as I do, I would predict that by the time the standards were created and the data definitions unified across the several library system vendors to aggregate comparable data, libraries will be circulating far fewer physical volumes than they do now. It’s already a long-term downward trend. What’s ironic to me is that the circulation (use) data for academic ebooks is already in the hands of the publishers or their distributors, and where it’s available it’s not very useful.

By Ned Quist
Jul 25, 2012, 1:11 PM

Aggregated library circulation data would have value, in much the same manner that aggregated holdings data in WorldCat does. However, there are numerous obstacles that would have to be overcome before meaningful data of this sort could be compiled and relied upon for decision-making.

A major problem is the lack of standards for what counts as circulation. Libraries have different policies. For example, should renewals be counted? Some libraries allow unlimited renewals absent a recall request, while others impose absolute limits on renewals, meaning an item used for long-term study has to “re-circulate” periodically. Is interlibrary use counted the same as use at the home institution? Is internal use counted (placement on a new book shelf, binding/repair, temporary loan for large-scale digitization, etc.)? What happens to the circulation data for items that are withdrawn from the collection, which occurs for both high- and low-use titles, but for entirely different reasons?

There are also interpretation issues. Loan periods vary by constituency and intended use. Should a two-hour reserve loan to an undergraduate for exam preparation count the same as a multi-year loan to a graduate student using a book to write his or her thesis? What will we know about how much of an opportunity a book has had to circulate? For example, can we distinguish books that are in non-circulating locations, or that are lost or missing? These should not be penalized for lack of circulation. More and more items at ARL libraries are stored in non-browsable, off-site facilities. Can their circulation be fairly compared with titles in browsable stacks? Will we even know how long a book has been in the collection? Publication date and acquisition date do not necessarily match. (See Jeff Luzius, “A Look at Circulation Statistics,” Journal of Access Services, Vol. 2(4) 2004, pp. 15-22, for a study of some of these issues.)

Aggregation will smooth out some of the anomalies, but won’t eliminate them.

Data recording and retention policies also vary. Some libraries simply record a running total of circulation transactions for each item, with no indication of when the circulation took place, how long the loan was for, or what the status of the borrower was. Others have much more detailed individual transaction data, but they may have limits on how long they retain it.

There are other technical issues. One of the unresolved problems with WorldCat is the fragmentation of a single bibliographic entity into several master records, which often scatters what should be a single holdings figure among multiple records. OCLC has put a lot of effort into mitigating this problem, with only modest success. Aggregated circulation data would face the same issue.

The ability of libraries to produce usable data to contribute to the repository could be an obstacle. I can understand why publishers might be interested in paying for circulation data, especially in the context of PDA, but many libraries can’t afford the staff to generate more than rudimentary circulation reports about their own collections. In fact, I could see some libraries wanting to be paid for supplying this data, rather than supplying it for free and then having to pay a membership fee for access to the aggregated data.

I agree with Joe that there is significant untapped potential for mining library data. But it will take a sizable commitment of resources to move the state of the art forward.

Richard Entlich
Collections Analyst
Cornell University Library

By Richard Entlich
Jul 25, 2012, 2:20 PM

A number of countries, mostly European, recognize public lending rights, whereby authors are compensated based on estimates of how frequently their books were loaned from libraries. I’m guessing that these countries must have something like the library circulation data that you speak of. Wouldn’t a publisher who’s unable to obtain such data from US libraries have the option of using data from other countries and making some reasoned extrapolation?

By Books, Libs, Scripts
Aug 22, 2012, 7:33 PM

The Scholarly Kitchen
