In a recent piece in Library Journal’s “Digital Shift” section, Michael Kelley pointed out what looks like an alarming and growing problem:
A recently released study of e-journal preservation at Columbia and Cornell universities revealed that only about 15 percent of e-journals are being preserved and that the responsibility for preservation is diffuse at best.
The article goes on to point out that libraries and publishers are aware of this problem and some are taking concrete steps (evidenced by projects like LOCKSS and Portico and the Cornell/Columbia initiative 2CUL) to solve it. However, even those research libraries that participate in such initiatives usually archive only a few of their eligible holdings, and not all publishers allow their e-journals to be archived by third parties.
This article, and the study it cites, together raise a couple of interesting and difficult questions.
First, are the data accurate? This is a good question, but probably not a really contentious one. Even if the 15% figure is way off, the fundamental issue remains: lots of scholarly content is not being preserved in any kind of rigorous or even reasonably systematic way. I don’t think anyone would dispute that.
Second, how big a deal is this? That question is tougher and more fraught.
During the print era, scholarly publishers weren’t generally expected to perform a robust and reliable archiving function; they produced books and articles, sent them out into the world, and left it to others to worry about ensuring those products’ permanent curation. It was understood by everyone in the scholarly information chain that the fact that Yale University Press published a book in 1945 didn’t mean the press would necessarily still be making it available in 1965, let alone 2005. For the most part, archiving the book and ensuring its long-term availability to scholars was simply not part of the publisher’s remit. The same was generally true for scholarly journals.
The archiving-and-access function was performed by libraries—more specifically, by very large academic research libraries. But today, research libraries increasingly pay for online access (usually hosted by the publisher or a third-party aggregator) rather than purchasing physical copies of documents and curating them locally. Such an approach solves lots of problems for students and scholars by making access available remotely, around the clock, and to multiple simultaneous users, and by making it possible for libraries to offer access to far more content than they ever could have provided during the print era. But it also creates problems, among them the one pointed up by this report: a diffuse and ambiguous archiving mandate.
The report raises obvious and fairly urgent operational questions, and in largely setting them aside for the rest of this post I hope I don’t give the impression that I’m dismissing them. It’s not that I think they’re unimportant—it’s just that a) I have no answers to those questions and b) I know there are lots of very smart people (like Vicky Reich and Kate Wittenberg and the folks involved in the wonderful 2CUL initiative) working on them.
What I want to do here, instead, is back up and ask a larger and maybe even more troubling question: how important is it that we archive all of the scholarly record?
I realize this question may sound crazy. How could any reasonable person (a librarian, no less) suggest that the scholarly record doesn’t need to be robustly and fully archived? I’m not saying that it doesn’t, but I am suggesting that we should stop and think before we automatically assume that it does—and that if we do decide that it does, we need to make ourselves fully aware of the scale of the project we’re talking about.
Because let’s be clear about this: to say that we must archive 100% of the scholarly record is to propose an unbelievably monstrous undertaking. In 2010, the University of Ottawa’s Arif Jinha estimated that roughly 50 million scholarly articles had been published since 1665, and that about 1.5 million more would be published during the year in which he was writing. Citing Mark Ware, he predicted that this annual output would keep growing at a rate of 3%. If these numbers are accurate, then simply identifying and tracking the creation of all scholarly articles is a gargantuan task, and it will be dwarfed by the project of systematically capturing, describing, and robustly archiving them.
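To make that scale a little more concrete, here is a quick back-of-the-envelope sketch (my own illustration, not anything from Jinha’s paper): it simply takes his 50-million backlog and his 1.5-million-per-year figure and compounds the latter at 3%.

```python
# Rough projection of how many scholarly articles would need archiving,
# assuming Jinha's 2010 estimates (a backlog of ~50 million articles since
# 1665, and ~1.5 million published in 2010) plus the 3% annual growth rate
# he cites from Ware. Purely illustrative; the inputs are themselves estimates.

backlog = 50_000_000   # articles published from 1665 through 2009 (estimate)
annual = 1_500_000     # articles published in 2010 (estimate)
growth = 0.03          # assumed steady annual growth rate

total = backlog
for year in range(2010, 2031):
    total += annual
    if year in (2010, 2020, 2030):
        print(f"{year}: ~{annual / 1e6:.1f}M new articles, ~{total / 1e6:.0f}M total to archive")
    annual *= 1 + growth
```

Even on those assumptions, the pile of material needing to be captured, described, and preserved passes 90 million articles before 2030, and that counts journal articles only.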
Now obviously, no one expects that this project would be taken on by a single organization. The only way a comprehensive archive could possibly be created would be as a coordinated effort on the part of many entities. And in that word — “coordinated” — lies a challenge far greater than the already massive one of simply identifying and tracking 1.5 million+ articles per year.
One of the nice things about the old approach to archiving was that it was pretty much inadvertent — it happened organically and mostly without coordination as thousands and thousands of libraries around the world independently built their local collections. But that organic inadvertence hid enormous cost and terrible inefficiency. It also provided only an illusion of completeness and robustness; since there was no coordination, there was never any guarantee that the distributed archive resulting from all that collecting was truly comprehensive, or that if it was comprehensive today, it would remain so next year. If a well-coordinated, robust, and comprehensive scholarly archive was illusory in the print realm, it’s little more than a pipe dream in the online era, given the explosion of new documents and the wild and expanding variety of scholarly products.
Okay, so maybe we just have to accept the fact that an incomplete scholarly archive is inevitable. But this leaves us with another problem, because to say that it’s okay to archive less than 100% of the scholarly record is to reject a (probably impossible) program of comprehensive collecting in favor of an (overwhelmingly difficult) program of discrimination. Who will decide what will be robustly archived and what will not? What are the criteria, and who will determine them? Who will manage the process of discrimination? Who will pay for it?
The older I get, the more impatient I become with people who approach difficult issues with the attitude of “I have no answers; I bring only questions.” (I always want to respond “Whoa, dude, that’s really deep. But thanks for nothing.”) Honestly, though, I don’t know what else to say about this issue. The only really constructive proposal I can make is this: before we try to tackle the logistically daunting problem of comprehensive e-journal preservation, we’d better make sure we’ve addressed the politically daunting problems of deciding — in a rigorous and rational way — exactly how much of that problem we’re able to tackle, and then how we’re going to choose what gets left out. Because make no mistake: there is no way to avoid leaving something out. Better we should make that decision consciously (and painfully) than leave it (more comfortably, but less usefully) to chance and inertia. Do we have the guts to do that?