Editor’s Note: Today’s post is by Kate Wittenberg, Managing Director, Portico, and Sheila Morrissey, Senior Researcher, Portico. Previously, Kate served as project director of client and partnership development for Ithaka S+R, and worked at Columbia, where she was the editor-in-chief of Columbia University Press until 1999. Sheila came to Portico with many years’ experience in the design and development of complex software systems and in the management of large-scale software engineering projects, including projects for higher education and for print and electronic publishing.
There’s been a lot of focus in The Scholarly Kitchen lately on the infrastructure of scholarly communications, both in seeing key pieces of that infrastructure purchased by individual publishers, and in suggestions for building out new infrastructure. Here, we want to call attention to preservation, an often overlooked yet essential element of publishing, and to think beyond merely building infrastructure. It’s one thing to create a new tool or service; maintaining that infrastructure and allowing it to evolve to best suit user needs requires a long-term commitment. We thought Portico’s recent rebuild would make an interesting case study in how conditions change and how infrastructure needs continuous investment to thrive.
In the last three decades there has been a tremendous increase in the amount of digital content created by libraries, publishers, cultural institutions, and the general public. There are great benefits to having content available in digital form. However, unlike print objects – which, when they have been printed on acid-free paper and are held in reasonable conditions, can last for many decades with only minimal attention – digital objects can be extremely short-lived unless proper attention is paid to preservation. Long-term preservation of this digital content is a key concern for the scholarly community as we continue to make the print-to-digital transition.
In 2005, a report from the Andrew W. Mellon Foundation, “Urgent Action Needed to Preserve Scholarly Electronic Journals,” was released with wide support from the academic library community. Third-party preservation services such as Portico emerged in this context, in response to the needs articulated by academic libraries, whose increasing adoption of e-journals and other online resources provided enhanced ease of access to content, but required giving up traditional physical control over maintaining it. In the decade since Portico launched, both the scale and type of preservation work we do have continued to evolve. The amount of content that Portico preserves is enormous – our archive today is 1,600 times the size it was in 2006. We realized it was time for a major reinvestment to ensure that Portico has the technical capacity in place to support our ongoing growth.
Portico: The Early Years
The scope of Portico’s work for the first several years of its existence was the preservation of e-journal content, with the ultimate goal of providing the academic community with access should that content no longer be available online through the publisher or a successor (in Portico parlance, a ‘trigger event’). The cost of doing the important work of preservation was shared by the academic library and scholarly publishing communities, thereby ensuring that a broad range of content could be preserved, and that the service would be sustainable over the long term. In 2009, Portico began preserving e-book content, also on a community model, and launched a service specific to digitized historic collections, which is exclusively publisher supported. In 2006, Portico was supported by 27 publishers and 245 libraries. Currently, there are 572 publishers (representing over 2,000 learned societies and associations) participating in Portico. Committed content from these publishers comprises 28,468 e-journal titles, 1,231,039 e-book titles, and 187 digitized collections. Library support has also grown steadily over the last decade. Today, more than 1,000 libraries around the world are Portico participants.
The original drivers for third-party preservation services still exist. Indeed, they have, if anything, intensified with the dramatic increase in both the breadth and depth of scholarly content available online. Libraries are increasingly opting for online-only access to journals and books, forgoing print altogether. Many resources are now available only in an online format from content providers. The shift from digital content as a secondary access mode for print, to digital content as the version of record and primary access mode, makes more acute the necessity for robust, third-party preservation and the assurance of future access that it provides.
Ultimately, preservation services exist to ensure that scholarly content made available online today will always be available in the future. Library support for, and participation in, third-party services is predicated on the promise of ongoing usability and future access when necessary.
The ever-increasing rate of growth in the volume of content ingested into the Portico archive (from 160,000 archival objects in 2006, to over 98.5 million in 2018; from roughly 1 million files in 2006 to nearly 1.6 billion files in 2018) is an indicator of the sheer scale of the preservation challenge. And, of course, software continues both to advance and to obsolesce. This includes the software put to the service of collecting, curating, and preserving scholarly and other content, as those engaged in creating, maintaining, and, ultimately, replacing it are only too acutely aware. As the phrase “software development life cycle” indicates, the very act of systems development itself is predicated on an endlessly recurring process of design, development, maintenance, and ultimately, disposal, as the transition begins to another round of the cycle.
A Major Rebuild
In 2016, having taken stock both of the capacities and deficiencies of its then more than ten-year-old digital preservation technology infrastructure, Portico invested in a two-year project to rebuild that infrastructure from the ground up, retaining the preservation “intelligence” of the original system, but performing “gut rehab” on its lower-level hardware and software underpinnings. We undertook the project to create a stable, scalable, elastic architecture that will enable Portico to continue providing its existing level of services and products while keeping pace with ever-increasing content. It will also eventually allow us to provide preservation for new, complex, and dynamic forms of scholarly publishing.
In our development process, we wanted to leverage the upside of the rapid rate of change in both hardware and software since the original architecture design eleven years before. We took advantage of cheaper commodity hardware, commodity cloud computation and storage capabilities, and well-developed free, open-source software projects for automated workflows and for data capture and analytics. Portico’s technical group was expanded for the course of the project to include two new teams (two developers and one quality assurance engineer for each), with additional help from an outside consulting group to re-engineer the graphical user interfaces used to track Portico production workflows.
The rebuild project, completed at the end of September, has already resulted in significant increases in the speed and capacity of content processing, management, and storage, scaling easily to accommodate a more than 50% increase in the number of files and a more than 100% increase in the size of archival storage. In addition, we have achieved increased productivity as a result of the automation of archive management tasks such as replication, storage migration, and content delivery. In the coming year we plan to leverage the new architecture to achieve additional benefits, including the ability to manage content that arrives in incomplete form, accommodate more “digitally native” content that might require emulation of older software for effective delivery in the future, and enable cross-institutional and cross-corpora discovery and use of content for such things as text and data mining. These initiatives are part of our ongoing work to develop best practices that align with and support the work of our colleagues in the preservation community.
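Automated tasks like replication and storage migration typically depend on fixity checking – recomputing a cryptographic checksum of each replicated file and comparing it against a stored manifest – to confirm that no copy has been corrupted or lost. As a minimal sketch of that general technique (the function names and the simple path-to-digest manifest here are illustrative assumptions, not Portico’s actual system):

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 digest of a file, reading in 1 MB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_replica(manifest: dict[str, str], replica_root: Path) -> list[str]:
    """Return relative paths whose replica copy is missing or fails fixity.

    `manifest` maps a file's relative path to its expected SHA-256 digest,
    as recorded when the archival copy was first ingested.
    """
    failures = []
    for rel_path, expected in manifest.items():
        copy = replica_root / rel_path
        if not copy.is_file() or sha256_of(copy) != expected:
            failures.append(rel_path)
    return failures
```

A periodic job running a check like this over each storage replica lets an archive detect silent corruption early and re-copy only the affected files, which is what makes replication and migration safe to automate at scale.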
What we have learned in the course of our work over the years is that the nature of the content we preserve and the technology that supports its preservation will constantly evolve, that responsible preservation requires active management over time, and that we must have the ability to invest in and improve our systems and capabilities in order to keep up with the constantly changing environment.