There is a great series of photography books published by Arcadia Publishing, focused on how cities appeared more than a century ago. They’re not the only ones publishing in the genre. A book in my own collection looks at my home town, Baltimore, and how its landscape has changed over time. It’s not so much that I’m interested in architecture, but more so in history. One of the most amazing things I find with vintage photography is the sense of change and how things that seemed so familiar can be so foreign. Some vestiges of the city’s architecture remain, but it’s fascinating to see and think about how much else has changed. Without the photos or drawings, though, we would be left only to imagine what it might have been like.
In the world of information and scholarship, libraries and archives play a similar role: not simply collecting and circulating content, but preserving it for future generations and providing a window into how the world used to look and how the people in that world interacted and engaged with each other at that time. These cultural institutions have long preserved our heritage and history. They have been adept stewards, caring for things whose future value we, in the present, might not recognize.
As more and more of our lives move from physical media, physical items, and physical meetings toward digital representations of those things, those windows into the past are fading. The connection to our digital past is ever more fleeting, even though resources to capture and preserve it are expanding. All the computer storage that existed in the world a few decades ago would fit onto a hard drive that can be purchased today for a few hundred dollars. One person, Brewster Kahle, saw the possibility of a global digital library and matched that vision with a pool of resources to make it happen, and in 1996 he did.
This week, the Internet Archive is celebrating the 25th anniversary of its first crawl of the World Wide Web and its first snapshot of our collective digital lives. Since that first crawl, the scale of the Internet Archive’s collection has grown astoundingly. Its regular snapshots of the internet now comprise some 588 billion web pages in total. So much flourishes on the internet: both the amazing and the frightening, the mundane and the radical. The Internet Archive preserves it all. While most know of its work capturing and preserving snapshots of the internet, the Internet Archive also preserves an amazing amount of the rest of our digital heritage. The Archive has digitized and archived more than 28 million books and texts, 14 million audio recordings, 6 million video recordings, 3.5 million images, and more than a half million software programs.
Unfortunately, nothing lasts forever. This is a reality of life. This is doubly true of the internet and web pages. For all its strengths, digital content, and in particular the World Wide Web, is a fundamentally fragile and ephemeral thing. It is easily changed, and content is constantly created, edited, modified, and deleted. Websites were designed to be simple to update and simple to manage, but that has led to ever greater challenges for the network as a whole. It is one of the reasons that links deteriorate at the rate of about 0.5% per week, a rate that was consistent for several years. Systems like the DOI have sought to address this, but another approach has been to use archived versions of the web in citations, often pointing to the Internet Archive.
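To get a feel for what a loss of about 0.5% per week adds up to, here is a rough back-of-the-envelope sketch. It assumes (my assumption, not something the underlying studies necessarily state) that the rate applies each week to whatever links are still working, so the losses compound. The function names are purely illustrative.

```python
import math

# The "about 0.5% per week" figure quoted above.
WEEKLY_LOSS_RATE = 0.005


def surviving_fraction(weeks: int, weekly_loss: float = WEEKLY_LOSS_RATE) -> float:
    """Fraction of links still working after `weeks`, assuming the loss compounds weekly."""
    return (1.0 - weekly_loss) ** weeks


def half_life_weeks(weekly_loss: float = WEEKLY_LOSS_RATE) -> float:
    """Weeks until half of the original links are broken, under the same assumption."""
    return math.log(0.5) / math.log(1.0 - weekly_loss)


if __name__ == "__main__":
    for years in (1, 5, 10):
        share = surviving_fraction(52 * years)
        print(f"after {years:>2} year(s): {share:.0%} of links still resolve")
    print(f"half-life: about {half_life_weeks() / 52:.1f} years")
```

Under that assumption, a bit more than three quarters of links would survive their first year, and half would be gone in under three years, which is why citing an archived copy alongside the live link is so appealing.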
Another challenge is that fundamentally everything on the internet is rented. We may own the rights to domain names, but only so long as the rents are paid to ICANN and a domain registry. Those domains only resolve as long as we maintain a server somewhere and pay for electricity and internet connectivity. Beyond this, someone needs to keep the servers up to date, lest the operating system be hijacked by any of a variety of skilled hackers and the site get pulled down for being “owned” by malicious actors who use it to spread spam or worse.
One could simply pull the plug when the time comes: just kill the power, or stop paying the bills, and let a site go dark. Some people have tried to wipe information from the internet, for good and bad reasons. But even when an organization wants to do the right thing and seeks to preserve its legacy, this isn’t always a simple thing to do.
My own experience with this began when NISO merged with NFAIS in 2019. As part of that merger, we began rationalizing and dismantling some of the systems that existed separately within each organization. For example, it didn’t make sense to maintain two member management systems. NISO used Salesforce, while NFAIS had used a system called MemberMax. Without going into the details, we decided to keep the Salesforce CRM over MemberMax. Yet this decision wasn’t as simple as it might seem at the outset, because a feature of MemberMax is that it can integrate the organization’s website, its member portal, and its document management systems. For a small organization, having a single system where these various tools are integrated can be extremely helpful and allows for easier systems management. Where there are limited staff resources, and even more limited technical skills to manage these systems, integration makes a lot of sense.
However, following the merger, our decision to cease using MemberMax created problems for the old NFAIS website. How does one preserve a website when its underlying infrastructure gets turned off? How do you easily separate the content from the system? Despite the considerable investments in digital preservation, in file conversion, in preservation metadata, and in other supporting tools and services, there is only one place that has a good handle on providing this service: the Internet Archive. In researching this question and the best path forward, I reached out to several of the preservation organizations that exist in our space. It turns out that archiving a journal article or data set is a much more concrete problem, and a far easier one. Thankfully, the Internet Archive provides a handy service, less well known than its Wayback Machine, called Archive-It.
This is just a tiny window into one small organization and how it faces the complex issues of digital preservation. When I think of expanding that to cover all organizations around the world, I am in awe of what the Internet Archive is and what it has achieved in the past 25 years.
The Internet Archive has been an influential force on the development and shape of digital archiving. The service most people are familiar with is the above-mentioned Wayback Machine, with its regular crawls of the internet, but the Archive is responsible for so much more. Less well known is its work in preserving television broadcasts, preserving software, and digitizing music and book content. In 2020, at the outset of the pandemic, the Internet Archive launched a project to provide digital access to some of its scanned collection, which it called the National Emergency Library. As institutions around the world were closed to in-person visits, the Internet Archive made available digital versions of the content in its collection. This led several publishers to sue, claiming copyright infringement, a case which is still ongoing. The Internet Archive has not only pushed the boundaries of what is possible technologically, it has also been a strong advocate for libraries, pushing the boundaries of library, archiving, and patron rights in ways many other organizations have been reluctant to match. It has also been pushing the boundaries in how content is distributed and preserved, helping to support accessibility, file transfer protocol standards, and of course preservation standards.
It took a tremendous amount of foresight to preserve the web, but it also took a lot of resources. I recall one story I was told of someone who needed disk storage to support an Internet Archive project and called up a retail chain in San Francisco, asking to buy every hard drive the company had to sell. Only when the Archive’s staff person started talking about arranging a semi-truck to have the disks delivered did the retailer’s customer service representative realize that the staff person was serious.
Initially funded by Kahle’s sale of internet services that he helped to found, the Internet Archive has also been supported by philanthropic and government grant projects, as well as donations from individuals. By comparison, the Internet Archive’s annual budget (while it fluctuates year to year) is about the size of that of a large research university library. Given its relative size and staff, it certainly is maximizing its impact.
This Thursday, October 21, at 6:00 pm PST (GMT 02:00), the Internet Archive will be hosting a free virtual celebration of its 25 years of work. We should all celebrate this important cultural institution and the work its team has pioneered since its founding in 1996. Twenty-five years from now, we will all be able to revisit this post, likely because it will have been preserved by an Internet Archive web crawl of The Scholarly Kitchen. I thank them for that.