Last week, the GLAM-E Lab published the results of an investigation into reports that servers and collections were straining – and sometimes breaking – under the load of swarming bots. Such bots attempt to scrape all of the data from a server or collection to build datasets for training AI models. This activity is overwhelming the systems designed to keep those collections online. It appears to be a growing problem, one that degrades access to the very collections the bots depend on. COAR has also reported that such bots are negatively impacting open repositories.

Today, I interview Michael Weinberg, Executive Director of the Engelberg Center on Innovation Law & Policy at NYU Law and Co-Director of the GLAM-E Lab, about the study and what this phenomenon portends for information access and sustainable infrastructures.

[Cover of the GLAM-E Lab report, “Are AI Bots Knocking Cultural Heritage Offline?”]

What first alerted your team to the potential issue of AI bots overwhelming digital collections, and why did you decide this was important to investigate further? 

We started to see initial reports about bots towards the end of last year. Bridget Almas flagged it in a post for Lyrasis in November. Then, in February, there was a Bluesky post about the Perseus Digital Library experiencing something similar. At that point, it seemed like something was happening, but we weren’t sure what, or how widespread it was. If these incidents were early indicators of a broader change, we knew they could have a real impact on how institutions approach digital collections. If that was the case, the specifics of the experiences were going to matter. We knew the only way to get those was to start talking to people.

You found that bots tend to “swarm” collections. Can you describe what these swarms typically look like in terms of volume and duration? Are there any surprising patterns in when or how bots began targeting collections (e.g., correlation with dataset updates or external events)?

We see large numbers of bots visiting collections in a short period of time. A bot will land on a page, download everything it can, and then follow every link on the page in search of new data. Unlike people, who tend to focus on predictable parts of a collection, the bots just want everything.
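To make that pattern concrete, here is a minimal sketch in Python of the breadth-first crawl described above. It is illustrative, not drawn from any actual scraper; the point is what is missing from it: no rate limiting, no robots.txt check, and no preference for any one part of a collection.

```python
# Illustrative sketch of the crawl pattern: fetch a page, keep its contents,
# then queue every link found on it. Real scraper swarms run this logic in
# parallel across many machines, which is what overwhelms servers.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collect every href on a page -- the bot wants everything."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, max_pages=1000):
    queue, seen = deque([start_url]), {start_url}
    while queue and len(seen) <= max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue  # a swarm doesn't retry politely; it just moves on
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            absolute = urljoin(url, href)
            # No rate limiting and no robots.txt check -- every link is fair game.
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return seen
```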

Many people compare the swarms to Distributed Denial of Service (DDoS) attacks, in the sense that they quickly overwhelm the site with volume. However, unlike a traditional DDoS attack, they are not intended to knock the site offline. Instead, the site being knocked offline is a side effect of harvesting the data at scale. These incidents tend to last for relatively short periods of time (measured in minutes, maybe an hour or two). Then the swarm moves on. 

We did not find any correlation with external events, except in the general sense that they tend to be increasing over time. That increase is likely a combination of more swarms floating around the internet and existing swarms moving more aggressively. 

I’m hoping you can help us understand a bit more about the impact of the bot swarms. What were the most common infrastructure failures or slowdowns experienced due to bot activity? Are these impacts temporary or longer-lasting?

The swarms tend to overload servers. That results in slower response times for normal users. Eventually, the volume will knock the server offline entirely. The good news is that the damage is not permanent. Once the swarm moves on, the servers can return to normal operations. Of course, the team operating the collection may suddenly find themselves trying to plan for the next swarm and responding to incoming messages about the outage.

One of the interesting things we learned was that staff supporting collections may not have noticed their initial contacts with these bots. Many analytics packages (like Google Analytics) intentionally screen out bot traffic. If the first swarm that lands on a collection increases server load from 15% to 35%, the impact may not be noticeable on the site’s operation. It is only that second, or twentieth, time a swarm visits and brings the load up to 90% that things start to break. That can make it hard for some collections staff to see problems coming. The early warning signs are too faint to be seen or are obscured by how analytics tools screen out bot traffic from reports. The staff only know they might have a problem when things completely break. 
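Because analytics packages filter out bot traffic, the raw web server access log is often the only place those early warning signs remain visible. Here is a hypothetical sketch, assuming logs in the standard combined format used by Apache and nginx, that counts requests per client per minute so a swarm stands out against human browsing rates (the threshold is illustrative).

```python
# Hypothetical sketch: surface swarm activity directly from a raw access log,
# since analytics dashboards screen bot traffic out of their reports.
# Assumes the common/combined log format, e.g.:
#   203.0.113.7 - - [10/Jun/2025:14:22:03 +0000] "GET /item/42 HTTP/1.1" 200 ...
import re
from collections import Counter

LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[([^:]+:\d+:\d+):\d+ [^\]]+\]')


def requests_per_minute(log_path):
    """Count requests per (client IP, minute) bucket."""
    counts = Counter()
    with open(log_path) as log:
        for line in log:
            match = LOG_LINE.match(line)
            if match:
                ip, minute = match.groups()
                counts[(ip, minute)] += 1
    return counts


# Flag clients far above any plausible human browsing rate.
for (ip, minute), n in requests_per_minute("access.log").most_common(20):
    if n > 300:  # illustrative threshold
        print(f"{minute}  {ip}  {n} requests/min -- possible swarm member")
```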

Which bot mitigation strategies appeared to be the most promising, and why do they still fall short?

There is no single solution to this problem. In the short term, many collections are turning to third-party anti-bot solutions that operate at the firewall level. Those seem to work reasonably well, although they are locked in an arms race with the bots themselves.

In the longer term, I think it is in everyone’s interest to find a way for tools like robots.txt to help navigate this relationship. That could form the basis of a common set of signals that collections and bots could use to set expectations around behaviors. Unfortunately, it seems unlikely that we will be able to build that consensus soon enough to mitigate some of the harms happening right now. 
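For readers unfamiliar with the mechanism: robots.txt already lets a collection express preferences that a cooperating crawler can honor. The sketch below shows such a file and how a well-behaved bot would consult it with Python’s standard library (GPTBot and CCBot are real crawler user-agent tokens); the catch, as Weinberg notes, is that compliance is entirely voluntary.

```python
# Illustration of the existing signal layer: a robots.txt that asks known AI
# training crawlers to stay out, and how a cooperating bot would check it.
# Compliance is voluntary -- which is exactly why shared norms are needed.
from urllib.robotparser import RobotFileParser

EXAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Crawl-delay: 10
Disallow: /search
"""

parser = RobotFileParser()
parser.parse(EXAMPLE_ROBOTS_TXT.splitlines())

# A well-behaved bot asks before fetching; a swarm simply doesn't.
print(parser.can_fetch("GPTBot", "/collection/item/42"))        # False
print(parser.can_fetch("SomeOtherBot", "/collection/item/42"))  # True
```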

Many institutions hesitate to restrict access even in the face of overloads. How do they navigate the tension between open access and infrastructure protection?

This is a real tension. It speaks to the strength of the ethos of accessibility developed over the last few decades that collections staff are not responding by immediately restricting access. Over and over, the institutions we talked to were reluctant to restrict access because they recognized that doing so would create barriers for exactly the users they hope to invite into their collections. They also recognized that restrictions might end up blocking welcome users more effectively than they prevent bots from overwhelming the site.

One bit of “good” news, at least from an open access perspective, is that there is no evidence that moving away from open access licensing will help mitigate bot damage. If an object is available online, the bot will make a copy, regardless of whether the object is also openly licensed. Indeed, even a metadata-only resource such as a library’s online catalog can be swarmed by bots. Fear of bots is not a reason to move away from true open access collections that are licensed for reuse.

What risks do you foresee if bot-related overloads continue unchecked, especially for smaller or underfunded institutions? How might this situation affect the public mission of GLAMs in the next 3 to 5 years?

I worry that bots will drive up the cost of hosting current collections, and that they will make institutions reluctant to create new open access, digitally accessible collections.

That would be a bad outcome for everyone involved, including the entities deploying the bots in the first place. Everyone benefits when there is a sustainable way to share collections online.  

This concern was one of the significant reasons we decided to move forward with the report. There is a wide range of conversations happening around AI these days. One of the primary goals of this report is to isolate the technical impact that training bots are having on collections. That is an issue that needs to be understood independently of any policy concerns one might have about AI, or about the relationship between AI models and the works used to train them. I hope that in 3 to 5 years, we will have come together to build a common set of behavioral expectations around AI bots, the same way we built common expectations around search engine indexing bots and other types of bots.

What kind of collective action do you think is needed to address this problem on a structural level?

Increasing server capacity and developing ever-more-sophisticated technical countermeasures is not a sustainable long-term response. The only sustainable way to address this problem is to develop a set of norms and signals that everyone can operate under. That conversation needs to involve collections staff, the teams deploying the bots, and standards-setting organizations. I think there is an equilibrium that is technically sustainable for everyone. The only way to find it is to actually talk through everyone’s goals and constraints.

Fortunately, we have examples of this working very well online. Robots.txt has helped sites send signals to bots for years. I’m optimistic that efforts to evolve the protocol for these types of issues will be successful, although I am realistic that the process may take time. I hope that we are able to work together to mitigate the damage until then.

What do you hope this report will influence — technically, institutionally, or politically?

We have a few hopes for this report. First, we hope that it will help people understand that there are technical aspects of the current AI debate that are (at least somewhat) independent of the law and policy aspects.  These bots are having an impact on technical infrastructure, and that will ultimately require some sort of technical solution. That’s true regardless of your broader opinions about AI. 

Second, we hope it will help the people who are running the infrastructure talk to their institutions about what resources they need in the medium and long term. I recognize that the solution to this problem cannot be an infinite budget for collection hosting.  However, these are small teams with a lot of responsibility. Many of them need more support from their institutions.

Third, and finally, I hope that it helps the teams deploying these bots understand that indiscriminately swarming collections has consequences. Anyone building datasets to train AI models has an interest in these collections staying online and continuing to grow. The organizations hosting these collections want them to stay online and grow as well! We need to find a way that everyone can operate sustainably.  

Fortunately, those conversations are starting to develop in a few different places. Anyone interested in getting more involved can reach out to me directly so that I can connect them with the ongoing work. 

Lisa Janicke Hinchliffe

Lisa Janicke Hinchliffe is Professor/Coordinator for Research Professional Development in the University Library and affiliate faculty in the School of Information Sciences, European Union Center, and Center for Global Studies at the University of Illinois at Urbana-Champaign. lisahinchliffe.com

Discussion

7 Thoughts on "Are AI Bots Knocking Digital Collections Offline? An Interview with Michael Weinberg"

We have had the same issues on PSIref, an openly available metadata library of scholarly literature (www.psiref.com), during the last year. Our robots.txt is completely ignored by these bots. Cloudflare has been a reasonably good and cost-effective solution, with the ability to target bots and AI bots. It’s not perfect and still misses many of them, but it at least provides platform administrators with a number of tools to stave off many of these bad actors, as well as to take desperate measures like creating firewall rules banning entire countries and regional ASN networks. For example, we finally had to ban most ASNs in and around China because the bots were shamelessly hitting our servers at over a million requests per second to scrape anything and everything, repeatedly knocking our servers offline or making quality of service so poor for normal users that the platform became unusable. As soon as we take those drastic firewall rules down, within 24 hours we are back to square one with the bots. What we are now seeing is that these very same bots are disguising their origins (for example, by using a VPN service) and coming out of other countries, particularly Singapore, Hong Kong, Brazil, and Germany.
In addition to robots.txt and Cloudflare solutions, we recommend your administrators/programmers create “honeypot traps” for bots on your platform. This significantly helps limit the deleterious effects these bots have on server performance. It is certainly a game of cat and mouse, and dealing with this scourge has been an unforeseen hit on budgetary resources.
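For readers who want to see the shape of such a trap: here is a minimal sketch using Flask, with hypothetical route names. The trap URL is disallowed in robots.txt and hidden from human visitors, so the only clients that ever request it are crawlers ignoring both signals, and requesting it gets them blocked.

```python
# Minimal honeypot sketch (hypothetical paths; Flask used for brevity).
# /trap is disallowed in robots.txt and invisible to humans, so any client
# that requests it has ignored both signals -- and earns itself a block.
from flask import Flask, abort, request

app = Flask(__name__)
blocked_ips = set()  # in production this would live at the firewall/CDN layer


@app.before_request
def reject_flagged_clients():
    if request.remote_addr in blocked_ips:
        abort(403)


@app.route("/robots.txt")
def robots():
    # Polite crawlers read this and never visit /trap.
    body = "User-agent: *\nDisallow: /trap\n"
    return body, 200, {"Content-Type": "text/plain"}


@app.route("/trap")
def trap():
    blocked_ips.add(request.remote_addr)
    abort(403)


# Elsewhere in the site template, a link no human will see or click:
#   <a href="/trap" style="display:none" rel="nofollow">ignore this</a>
```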

+1 for Cloudflare. One issue we encountered was that many proxy servers used by libraries are housed in the same infrastructure (namely, AWS) used by bots, so a process for adding exceptions is needed.

Thank you for bringing attention to this issue, Lisa and Michael. We’ve been having an awful time with our online repositories of ETDs, the scholarly work of our researchers, digitized archival collections, and more. The bot traffic has been super heavy for months now. For example, during the ETD spring submission season, everyone (our submitting students, our partners across campus tasked with reviewing and approving submissions, and my office staff) had a lot of trouble accessing the site and getting it to work the way it’s supposed to. The ETD site was sluggish, sometimes unresponsive, and frequently became completely unavailable due to “heavy load” errors. So this issue isn’t just affecting external users who want to access our collections — it’s interfering with the requirements some students have to graduate (i.e., submitting their thesis or dissertation), and the increased traffic is costing us more because we are billed for server usage. Our technical support staff are doing everything they can to mitigate the problem, but it seems like it’s a lawless, wild-west-type situation, and there’s certainly no magic bullet. I think the best we can all hope for is that AI developers realize how bad it’s gotten and rein in their bots. They are collectively shooting themselves in the foot, after all.

At OAPEN, we have also had this experience of being swarmed by AI bots, and we have been looking at ways to remain available to ‘legitimate’ users (human or well-behaved bot). As we are hosted at CERN, we worked with the team at the CERN Data Center to come up with several measures that partly block known offenders and partly manage the extra traffic. For now, this solution works. More details can be found here: https://oapen.hypotheses.org/1450.

I am much less optimistic than Michael Weinberg is about finding an agreement with the people running the bots. That agreement already exists: it’s called robots.txt and it is already expressive enough to communicate everything a server might want a crawler to know about how to crawl politely. The problem is that the people writing this software simply don’t care about any of that. It’s as though we’ve put up picket fences around our houses, and they’ve decided to ignore the social signal that the fences send, and step right over them. Why would we think they would show any more respect for a putative future convention?

Part of the underlying problem here is the VC-based model that funds most of the AI companies. Each of them desperately needs to show its investors something very quickly, so they don’t believe they have time to wait around for a polite harvest to finish. This is an inherently toxic situation: a tragedy of the commons. Any VC-funded AI company that respects crawling conventions is just asking to be outcompeted by rivals who ignore them. And I’m sorry to say I don’t see any route out of it.
