Celebrating 25 Years of Preserving the Web

There is a great series of photography books published by Arcadia Publishing, focused on how cities appeared more than a century ago. They’re not the only ones publishing in the genre. A book in my own collection looks at my home town, Baltimore, and how its landscape has changed over time. It’s not so much that I’m interested in architecture, but more so in history. One of the most amazing things I find with vintage photography is the sense of change and how things that seemed so familiar can be so foreign. Some vestiges of the city’s architecture remain, but it’s fascinating to see and think about how the landscape has changed over time. Without the photos or drawings though, we would be left to only imagine what it might have been like.

In the world of information and scholarship, libraries and archives play a similar role: not simply collecting and circulating content, but preserving it for future generations and providing a window into how the world used to look and how the people in that world interacted and engaged with each other at that time. These cultural institutions have long preserved our heritage and history. They have been adept stewards caring for things that we, in the present, might not recognize as valuable in the future.

Internet Archive Server — An image of the Internet archive server stack

As more and more of our lives move from physical media, physical items, and physical meetings toward digital representations of those things, those windows into the past are fading. The connection to our digital past is ever more fleeting, even though resources to capture and preserve it are expanding. All the computer storage that existed in the world a few decades ago would fit onto a hard drive that can be purchased today for a few hundred dollars. One person, Brewster Kahle, saw the possibility of a global digital library and matched that vision with a pool of resources to make it happen, and in 1996 he did.

This week, the Internet Archive is celebrating the 25th anniversary of its first crawl of the World Wide Web and its first snapshot of our collective digital lives. Since the first crawl, the scale of the Internet Archive’s collection has grown astoundingly. Its regular snapshots of the internet have grown to comprise some 588 billion web pages in total. So much flourishes on the internet: both the amazing and the frightening, the mundane and the radical. The Internet Archive preserves it all. While most know of its work capturing and preserving snapshots of the internet, the Internet Archive also preserves an amazing amount of our digital heritage. The Archive has digitized and archived more than 28 million books and texts, 14 million audio recordings, 6 million video recordings, 3.5 million images and more than a half million software programs.

Unfortunately, nothing lasts forever. This is a reality of life. This is doubly true of the Internet and web pages. For all its strengths, digital content, and in particular the World Wide Web, is a fundamentally fragile and ephemeral thing. It is easily changed, and content is constantly created, edited, modified, and deleted. Websites were designed to be simple to update and simple to manage, but that has led to ever greater challenges in keeping the network of links intact. It is one of the reasons that links deteriorate at the rate of about 0.5% per week, a rate that was consistent for several years. Systems like the DOI have sought to address this, but another approach has been to use archived versions of the web in citations, often pointing to the Internet Archive.
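To get a feel for what a loss of about 0.5% of links per week adds up to, here is a rough back-of-the-envelope sketch in Python. It assumes the rate is constant and compounds week over week, which is a simplification of my own rather than anything the underlying studies claim, and the function names are purely illustrative.

```python
import math


def surviving_fraction(weekly_loss: float, weeks: int) -> float:
    """Fraction of links still working after `weeks`, assuming a constant,
    compounding weekly loss rate (an illustrative simplification)."""
    return (1.0 - weekly_loss) ** weeks


def half_life_weeks(weekly_loss: float) -> float:
    """Number of weeks until half of the links are gone, under the same assumption."""
    return math.log(0.5) / math.log(1.0 - weekly_loss)


if __name__ == "__main__":
    rate = 0.005  # the "about 0.5% per week" figure cited above
    for years in (1, 5, 10):
        share = surviving_fraction(rate, 52 * years)
        print(f"After {years:>2} year(s): {share:.1%} of links still resolve")
    print(f"Half-life: about {half_life_weeks(rate) / 52:.1f} years")
```

Under those assumptions, roughly three quarters of links survive the first year and about half are gone within three years, which is why a citation that points to an archived copy ages better than one that points to the live page.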

Another challenge is that fundamentally everything on the internet is rented. We may own the rights to domain names, but only so long as the rents are paid to ICANN and a domain registry. Those domains only resolve as long as we maintain a server somewhere and pay for electricity and internet connectivity. Indeed, even beyond this, someone needs to keep the servers up to date, lest the operating system be hijacked by any of a variety of skilled hackers and the site get pulled down for being “owned” by malicious actors to spread spam or worse.

One could simply pull the plug when the time comes: just kill the power, or stop paying the bills, and let the site go dark. Some people have tried to wipe information from the Internet, for good and bad reasons. But even in a world where an organization wants to do the right thing and seeks to preserve its legacy, this isn’t always a simple thing to do.

My own experience with this began when NISO merged with NFAIS in 2019. As part of that merger, we began rationalizing and dismantling some of the systems that existed separately within each organization. For example, it didn’t make sense to maintain two member management systems. NISO used Salesforce, while NFAIS had used software called MemberMax. Without going into the details, we decided to maintain the Salesforce CRM over MemberMax. Yet this decision wasn’t as simple as it might seem from the outset, because a feature of MemberMax is that it can integrate the organization’s website, its member portal, and its document management systems. For a small organization, having a single system where these various tools are integrated can be extremely helpful and allows for easier systems management. Where there are limited staff resources, and even more limited technical skills to manage these systems, integration makes a lot of sense.

However, following the merger, our decision to cease using MemberMax created problems for the old NFAIS website. How does one preserve a website when its underlying infrastructure gets turned off? How do you easily separate the content from the system? Despite the considerable investments in digital preservation, in file conversion, in preservation metadata, and other supporting tools and services, there is only one place that has a good handle on providing this service: the Internet Archive. In researching this question and the best path forward, I reached out to several of the preservation organizations that exist in our space. It turns out that archiving a journal article or data set is a much more concrete problem and a far easier one. Thankfully, the Internet Archive provides a handy service, less well known than its Wayback Machine, called Archive-It.

This is just a tiny window into one small organization and how it faces the complex issues of digital preservation. When expanded to cover all organizations around the world, it leaves me in awe of what the Internet Archive is and what it has achieved in the past 25 years.

The Internet Archive has been an influential force on the development and shape of digital archiving. The service most people are familiar with is the above-mentioned Wayback Machine with its regular crawls of the internet, but the Archive is responsible for so much more. Less well known are its work to preserve television broadcasts, its software preservation, and its digitization of music and book content. In 2020, at the outset of the pandemic, the Internet Archive launched a project to provide digital access to some of its scanned collection, which it called the National Emergency Library. As institutions around the world were closed to in-person visits, the Internet Archive made available digital versions of the content in its collection. This led several publishers to sue, claiming copyright infringement, a case which is still ongoing. The Internet Archive has pushed the boundaries of what is possible technologically, and it has also been a strong advocate for the rights of libraries, pushing the boundaries for library, archiving, and patron rights in ways many other organizations have been reluctant to match. It has also been pushing the boundaries in how content is distributed and preserved, helping to support accessibility, file transfer protocol standards, and of course preservation standards.

It took a tremendous amount of foresight to preserve the web, but it also took a lot of resources. I recall one story I was told of someone needing disk storage to support an Internet Archive project, who called up a retail chain in San Francisco and requested to buy every hard drive the company had to sell. Only when the Archive’s staff person started talking about arranging a semi-truck to have the disks delivered did the retailer’s customer service representative realize that the staff person was serious.

Initially funded by Kahle’s sale of internet services that he helped to found, the Internet Archive has also been supported by philanthropic and government grant projects, as well as donations from individuals. By comparison, the Internet Archive’s annual budget (while it fluctuates year to year) is about the size of that of a large research university library. Given its relative size and staff, it certainly is maximizing its impact.

This Thursday, October 21, at 6:00 pm PST (GMT 02:00), the Internet Archive will be hosting a free virtual celebration of its 25 years of work. We should all celebrate this important cultural institution and the work its team has pioneered since its founding in 1996. Twenty-five years from now, we will all be able to revisit this post, likely because it will have been preserved by an Internet Archive web crawl of The Scholarly Kitchen. I thank them for that.

Todd A Carpenter

Todd Carpenter is Executive Director of the National Information Standards Organization (NISO). He additionally serves in a number of leadership roles in a variety of organizations, including as Chair of the ISO Technical Subcommittee on Identification & Description (ISO TC46/SC9), founding partner of the Coalition for Seamless Access, Past President of FORCE11, Treasurer of the Book Industry Study Group (BISG), and a Director of the Foundation of the Baltimore County Public Library. He also previously served as Treasurer of SSP.

Discussion

24 Thoughts on "Celebrating 25 Years of Preserving the Web"

I’m astonished that the Scholarly Kitchen would publish such sycophantic rubbish. Brewster Kahle is not a folk hero. He’s a book pirate. Most of those millions of “heritage” books on his site are under copyright. He believes that if he purchases one copy of a book and scans it he can make it available to anyone in the world without compensating the author or publisher. Meanwhile, his own intellectual property is sold to the highest bidder.

See also this comment to an equally problematic post on this site last week about book digitizing:
https://scholarlykitchen.sspnet.org/2021/10/11/book-review-along-came-google-a-history-of-library-digitization/#comments

Have even the readers of this site got the piracy virus? If nothing else, all your librarian jobs are on the line.

By Horatio
Oct 19, 2021, 6:51 AM

The author of this post links to but does not quote from an article about the Internet Archive’s (IA) scanning and illicit distribution of books without compensation to authors or publishers.

Just to give a little balance (whatever happened to balance?), here are a few paragraphs from that article. Note the disingenuous comment by Brewster Kahle. I welcome any explanation, per Kahle’s comments, of how the IA’s scanning activities support authors and publishers. If the readers of this site are content to let him compare his activities to a library, then I guess I’m in the wrong place.

——————-

In its release, the Authors Guild took aim at IA’s argument that its scanning activities are supported under the controlled digital lending legal theory. “Internet Archive’s wholesale scanning and posting of copyrighted books without the consent of authors, and without paying a dime, is piracy hidden behind a sanctimonious veil of progressivism,” said Douglas Preston, author and president of the Authors Guild. “The Internet Archive hopes to fool the public by calling its piracy website a ‘library,’ but there’s a more accurate term for taking what you don’t own: it’s called ‘stealing.’”

Preston continued by noting that while authors want the public to have access to books, free e-books are available through libraries. “Legitimate libraries pay for those e-books, and a portion of that flows back to authors as royalties, helping ensure they can continue to write,” Preston said.

In an email, Brewster Kahle, founder of IA, said “As a library, the Internet Archive acquires books and lends them, as libraries have always done. This supports publishing and authors and readers. Publishers suing libraries for lending books, in this case, protected digitized versions, and while schools and libraries are closed, is not in anyone’s interest.”

By Horatio
Oct 19, 2021, 8:07 AM

I am not an author, but in a blog by the Authors Alliance this week, this authors’ organization outlines ways that the Internet Archive’s work enables and enhances the work of authors:
https://www.authorsalliance.org/2021/10/19/happy-25th-birthday-to-the-internet-archive/

By Wendy Hanamura
Oct 20, 2021, 12:56 PM

Hi Horatio, just a reminder that the Scholarly Kitchen is not a singular entity, and the opinions expressed in any particular post are solely those of the author of that post. Personally, I’m a huge fan of the Wayback Machine, but like you, I see the “Emergency Library” as a ruthless land grab.

In your follow-up comment, you ask for balance, and as always, we welcome a wide variety of viewpoints and commentary through guest posts. If you’re interested in writing one up, details are here:
https://scholarlykitchen.sspnet.org/2018/06/07/be-our-guest-author/

By David Crotty
Oct 19, 2021, 8:53 AM

I think it is incumbent on the author of a general post to provide a little balance (except in rare cases where someone is invited to present a particular opinion) and not rely on readers to provide this. I also think it is incumbent on TSK to ask authors to provide this (I did not say or assume that TSK is a “singular entity” and find your comment rather disingenuous). Finally, it is not just the “Emergency Library” that is illegal and immoral, it is the entire archive of books, or at least those not in the public domain.

By Horatio
Oct 19, 2021, 8:59 AM

Sorry, but we don’t force our authors to conform to sets of rules imposed by particular readers, nor is “balance” an automatic good. I personally think that the insistence on balance has been a major contributor to so many of the problems we face as a society, particularly the spread of misinformation. In the US, the right wing has continually forced a message that the media has a liberal bias, and so every news outlet feels a need to bend over backwards to present “both sides” of every issue, even when one side is factually correct and the other is obviously mendacious. We’ve reached a point where we have school officials telling teachers that if they have a book in their classroom about the Holocaust, they must also present the children with “opposing views” (https://www.washingtonpost.com/education/2021/10/15/holocaust-texas-school-books-opposing/).

TSK offers a variety of posts, some meant to be informative, others are flat-out editorials offering an opinion from a particular point of view (and many, like this post, are a mix of the two). Any author expressing such an opinion is under no obligation whatsoever to also include support for the opposite opinion. Again, if you want to write a piece stating your point of view on this subject (or any other), we welcome guest posts (and you would not be under any obligation to say nice things about Kahle).

By David Crotty
Oct 19, 2021, 9:08 AM

Oh jeez. Deflect and hijack the discussion with completely beside the point comparisons. Did you expect me to disagree with your view of the right-wing “balance” debate in the mainstream media? How is it relevant to my complaint that a general article about the anniversary of the Internet Archive gives short shrift to legitimate complaints about it? Is that the same as asking that creationism be taught in schools?

By Horatio
Oct 19, 2021, 9:21 AM

You insisted that all authors be required to provide balance in their writing. I made the point that 1) this is not something we require of authors, and 2) strict requirements as you are suggesting can lead to problems and drive the spread of misinformation. To me, that’s a direct response to your statement and I’m sorry if you see it as a deflection.

If you want to write an article stating your complaints about the Internet Archive, then go for it. This author did not have those same complaints.

By David Crotty
Oct 19, 2021, 9:25 AM

There’s a flip side to your right-wing “balance” examples of course. Perhaps I inadvertently chose a term fraught with meaning in American social discourse. But leftists have long complained that politicos and the like are able to make outrageously false, misleading, self-serving, hypocritical etc etc statements to the press and have them quoted verbatim without question. With Trump, I think that has begun to change, and mainstream journalists are beginning to add: “In fact . . .” after such statements. So it would have been entirely appropriate and desirable, after a statement such as this in the post

The Archive has digitized and archived more than 28 million books and texts, 14 million audio recordings, 6 million video recordings, 3.5 million images and more than a half million software programs.

For the author to add

In fact, much of this material is under copyright and is being made available without permission and without benefit to the creator.

Anything less is propaganda of the kind often seen regurgitated in the lazy mainstream media you defend against attacks from the right. There is a progressive way of looking at the same issue. It involves presenting all relevant facts around the topics you discuss and not, in this case, just those you like. Leave that to people in the comments section. A site like this is relied on, for example, by teachers, and thus students, for background and context on topical issues. The present post is presented as a general information piece, but is laden with contentious opinions not fully identified as such and not contextualized by relevant facts. What survives is the post, not the reader comments. The post should have a modicum of intellectual honesty.

By Horatio
Oct 19, 2021, 10:49 AM

I’m not sure that we have the same definition of the word “fact”. As noted in the post, the very things you state have been raised by publishers and there is a court case pending. It has not been legally established if the Internet Archive has infringed in this particular instance, so claiming it to be a “fact” is jumping the gun.

Again, this post states the author’s opinion, written from a viewpoint with which you may disagree. So it goes. When I read the editorial page of the NY Times or the Washington Post, I often see editorials with which I disagree, and that are presented from a viewpoint that I think is incorrect, or that leaves out other information that I might use in my own contrary argument. That does not make the author dishonest, nor should they be obligated to include every other possible viewpoint or every other potentially relevant “fact”. Both of those publications offer space to authors of varying positions, rather than restricting what those authors write. This is the nature of opinion pieces.

By David Crotty
Oct 19, 2021, 11:00 AM

And it is outright false to claim, as this post does in its comments on the site’s books, that the site preserves “our digital heritage.” In most cases it purchases a single copy of a print book and scans it. The site has factories for this. It is not digital heritage. It is printed and bound books turned into illicit digital scans.

My proposed addendum did not say that there was infringement as you imply I said; it said that the materials are under copyright and were copied without permission.

By Horatio
Oct 19, 2021, 11:09 AM

Nothing in the following paragraph, quoted in its entirety, identifies it as an “opinion piece” or “editorial.” It is presented as a brief *factual* summary of the site in question. The “opinions” are conveyed, subliminally if you will, by what is left unsaid, not by what is said.

By Horatio
Oct 19, 2021, 11:05 AM

I’ll refer you to the blog’s “About” page:
https://scholarlykitchen.sspnet.org/about/
Here it specifically states that the contents include differing opinions and ideas, and that they are meant to be presented both in a “balanced” way or deliberately in a “provocative” way.

By David Crotty
Oct 19, 2021, 11:08 AM

But not in a way which appears neutral and balanced, rather than provocative, but which is not.

By Horatio
Oct 19, 2021, 11:10 AM

Your guidelines state that the site, through its blog posts, will:

Interpret the significance of relevant research in a balanced way (or occasionally in a provocative way)

And yet you swore up and down that the site does not “do” balanced. You’re the editor-in-chief, the guidelines are only a hundred words long. Are you not familiar with them?

And so, after speaking down to me, deflecting, tossing out red herrings, putting words in my mouth, insisting that the site is not interested in balance, only opinions, it turns out that you either didn’t know your policies or were bluffing and stonewalling.

This post is not “provocative.” It is, as it says in its headline, a “celebration” of the baseball-and-apple-pie Internet Archive, something we can all love and adore. Not being “provocative,” it is supposed to be “balanced,” something you acknowledged over and over it isn’t. Of course I won’t hold my breath for an apology.

And now here’s the man himself, Brewster Kahle, telling us all about his open source bona fides and how he was never really interested in money. I’m not sure where the tens of millions from AOL and Amazon for his intellectual property – such an outmoded concept! – fit into that little fictional narrative, but whatever.

By Horatio
Oct 19, 2021, 8:02 PM

Wow. For a post that wasn’t at all provocative, you sure seem provoked.

By David Crotty
Oct 19, 2021, 8:07 PM

I was expecting precisely that flippant response. I was provoked not by the provocativeness of the post, as it is plain it was a mom and apple pie post, not a provocative post, but by its lack of balance, and then by your truly unprofessional response, which is only compounded by this openly insulting and belittling last word, with nary a mention of the guidelines you misrepresented. Truly inappropriate from the official representative of the site, especially given how you blatantly misrepresented your own policies. Disgusting.

By Horatio
Oct 19, 2021, 8:36 PM

I do apologize for that last flippant remark, but let’s be fair, I’ve tried my best to respond to your comments and there seems little I am able to do to satisfy you. You have accused me of deflecting, hijacking, being a propagandist, running a blog that is intellectually dishonest, throwing out red herrings, putting words in your mouth, bluffing, stonewalling, and being disgusting, among other implied insults. All, even though I largely agree with your sentiments on the subject matter of this post and have stated so, because I refuse to force an author to bend to your requirements for how we are apparently “allowed” to write on this site.

I will try one last time:
This is largely an opinion blog, where authors voice opinions. We do not have formal categories of posts (“balanced” versus “provocative”). The guidelines on our “About” page are just that — guidelines, not hard and fast rules we enforce to strictly limit our authors. We try to put as few restrictions on those authors as possible. When our authors voice an opinion (many of which I disagree with), we do not require them to also voice all other possible opinions on the subject they are discussing. We do invite those with different opinions to voice them in our comments (as you have done) or in a guest post (as you have been invited to do). I am truly sorry if you do not find this an acceptable process, and would suggest you start your own blog that better represents the expression of ideas in a manner you feel is more appropriate.

By David Crotty
Oct 19, 2021, 8:50 PM

It does give pause to think that the vast majority of the present scholarly body of knowledge is stored in a rental. I recall reports that a few years ago, the Crossref DOI administrators forgot to make an annual payment to keep its domain registration and the whole system briefly went dark. Can only imagine the disruption if a domain-name buccaneer had snatched it while available.

By Chris Mebane
Oct 19, 2021, 5:33 PM

Yes but then again, nothing is really permanent and all infrastructure requires constant upkeep.

In some respects the rental is even more reliable. By providing the ISP with a constant stream of revenue, we ensure that they have the resources to provide us with servers in a properly maintained and secured data centre. If I just bought a computer and plugged it into the internet, I would own it but it wouldn’t have routine maintenance and a 99.9% uptime SLA.

By Anonymous
Oct 19, 2021, 10:09 PM

You’re getting into an issue that academic librarians definitely think about a lot, as we’ve moved heavily from print serials (and now books) ownership to electronic serials (and books) licensing. Many of those licenses include “perpetual access rights” but access does depend on the continuing payment of expenses keeping those servers running. That’s why many libraries like my own think it’s a worthwhile investment to support projects like Portico, LOCKSS, and CLOCKSS, which provide independent “dark” archives of scholarly publications, so that if the main publisher platform is lost, the scholarly work is not lost to the world. The lesson here is that the solution to the “rental” concern is redundancy, and not just backup copies by the one company, but copies by lots of different organizations in different parts of the world (and Internet). In a sense, isn’t that what we’ve always done to preserve print? After all, one library may have a flood or fire, but except for Special Collections, there are surely more copies of everything in it in other libraries far away, so the scholarly knowledge is never lost due to one big disaster. You can frame the work of the Internet Archive as another example of redundant copies to preserve information.

By Melissa Belvadi
Oct 20, 2021, 9:41 AM

Absolutely! Lots of Copies Keeps Stuff Safe is not just a technology, but a way of life. The Internet Archive can even be named as an archive location in Crossref metadata, even though it is not a member of the Keepers Registry. And to reinforce your point about redundancy, I believe the Keepers even recommend using multiple archiving agencies as best practice.

By Anonymous
Oct 20, 2021, 8:18 PM

Todd–

Thank you ever so much for your post. It is most satisfying, especially working in a non-profit, to hear such things.

Todd, I want to tip my hat back to NISO for a moment– when I was getting other organizations involved in a publishing system for the Internet in 1989 (Wide Area Information Server WAIS), I wanted to base it on open protocols rather than private, licensed protocols. This caused me to not work with one up-and-coming computer maker and to work with Apple Computer (along with Thinking Machines, Dow Jones, and KPMG Peat Marwick) where using NISO’s Z39.50 was acceptable and encouraged. NISO was helpful in supporting this new Internet service with their protocol designed for libraries. As it turns out, the World Wide Web came up with the winning formation, but again it was an open system and an open protocol, so we could all get behind it. NISO helped then and is still helping.

May we continue to build open systems where there are many winners.

-brewster

By Brewster Kahle
Oct 19, 2021, 6:31 PM

I in turn would like to thank YOU ever so much for the work you’ve done! As a librarian, I have found your Archive invaluable more times than I can count. You probably have the server logs to prove it without testimonials, but I’m sure hundreds of my colleagues would say the same. And I am not in the least threatened professionally by your work with books, completely the opposite – I have used it and shown patrons how to use it, and never more so than during the pandemic, when it was an absolute lifesaver when our library print collections were inaccessible for many months. Anyone who thinks otherwise clearly has no idea what librarians actually do, and apparently thinks we are clerks managing the stereotypical “warehouse of books”.

By Melissa Belvadi
Oct 20, 2021, 9:45 AM
