Editor’s Note: Today’s post is by Kate Dohe. Kate is the Director of Digital Programs and Initiatives at the University of Maryland Libraries. She oversees a wide portfolio that includes management of Libraries’ platforms and applications, web presence and discovery strategies, digital initiatives, and the production and preservation of digital content.
Imagine a block party or a flash mob occupying your library, day after day, making it impossible for patrons to use resources or do any work. More and more library staff need to be diverted to manage these flash mobs, but they have almost no mechanisms for keeping the crowds out without also limiting access for regular patrons. Every in-house solution your access team tries works only temporarily at best, pushing your team to consider investing in expensive new turnstiles and independent security guards…from a company that happens to have close industry relationships with the flash mobs. This might seem absurd, but it is effectively what is happening to digital library properties, encompassing digital collections, institutional repositories, catalogs, archival systems, and discovery platforms.
AI web harvesting bots are emerging as a significant IT management problem for content-rich websites across numerous industries. This is a byproduct of exploding market demand as well as the technical choices and tremendous resource consumption of AI harvesters compared to traditional web crawlers. To train AI models effectively, their human operators need to collect and maintain a massive corpus of digital content. Much of that data is aggregated indiscriminately, without regard for the rights or wishes of the original creators, or of the web publishers and platforms that offer the content. This activity is widely known, and it is a complex legal and ethical topic in and of itself.

Web crawlers, also known as “bots” or harvesters, have been an established and essential part of the internet for decades; in particular, they are what support search engines and web archives. Those bots have historically followed predictable rules, often stipulated in a site’s “robots.txt” file, about the rate of harvesting, which pages to exclude, and even which bots are allowed or denied. The number of these traditional bot operators has been limited as well.
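As a concrete illustration, a minimal robots.txt file might look something like the sketch below. The paths, crawler names, and sitemap URL are hypothetical placeholders, and directives such as Crawl-delay are purely advisory: they only matter to crawlers that choose to honor them.

```
# Hypothetical robots.txt for a repository at repository.example.edu

# Welcome a well-behaved search engine crawler, but keep it out of
# dynamically generated search result pages
User-agent: Googlebot
Disallow: /search

# Ask all other crawlers to wait between requests and skip admin pages
User-agent: *
Crawl-delay: 10
Disallow: /search
Disallow: /admin/

Sitemap: https://repository.example.edu/sitemap.xml
```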
Libraries have benefited enormously from allowing harvesters from Google, Microsoft, DuckDuckGo, the Internet Archive, and other well-behaved indexers over the past thirty years, and many open access repositories rely heavily on search engine referral traffic, as well as on the Internet Archive’s Wayback Machine and Archive-It services. This mutually beneficial arrangement is the product of many years of open communication and collaboration between the digital library practitioner community and engineers at those organizations, which has led to the harvesting norms and data sharing standards that sustain open library systems.
AI Bot (Mis)Behavior
AI harvesters are different from these traditional crawlers in a few key ways. First, there are simply more of them: more individuals, AI researchers, and private companies are in the AI training business than have ever run search engines. Because that large and diverse group of harvesters all have different motives and methods for harvesting content, web properties experience substantially higher volumes of crawl events. That increases the network strain caused by bot harvesting and puts system administrators in the position of endlessly playing “whack-a-mole” with bots. Without someone to contact about calibrating a crawler to stay under site limits (as would be the case with Google or Microsoft), the next option available to site administrators is blocking badly behaved harvesters, which only works until the next one pops up.
Second, those bots also violate many of the established rules for web crawlers; the worst-behaved bots disregard any site instructions in the robots file. In addition, rather than using a single crawler to poll periodically for updates, AI operators deliberately send networks of bots to a single web property to download as much content as possible, as quickly as possible, placing substantial stress on server resources and making the site or its services unstable for legitimate patrons. In the process, they attempt to get around traditional traffic management techniques by impersonating human site visitors and developing increasingly sophisticated strategies to evade system-level blocking, so attempting to block bots means potentially blocking our human users.
Not all of these actors are actively looking to circumvent limits; many are simply ignorant of the historical standards and practices that have evolved over time to manage web crawling. The ease with which anyone can now develop and deploy an AI crawler and stand up their own AI service means there are thousands of new crawlers, run by everyone from researchers to curious aspiring technologists to after-school script hackers, all entering a technology space formerly limited to large corporations. This new open frontier makes it much more difficult to negotiate and enforce good practices that impose reasonable limits on crawling without blocking legitimate use or endangering existing relationships and agreements that have worked for years.
Bots as a Denial-of-Service Attack
As a result, when these crawlers swarm a website, the effect mimics a distributed denial-of-service (DDoS) attack, a common malicious technique designed to take a website offline by overwhelming the server with requests. While site administrators can respond to occasional incidents in the moment, recurring swarms are exceedingly difficult to manage and demand a great deal of attention from systems teams.
Depending on an organization’s infrastructure, this may affect a single application, or it may ripple across an entire group of systems, either by diverting resources away from other applications in an effort to keep the harvested site online or by exhausting computing resources upstream or downstream of the affected website. These crawls are expensive, especially in cloud infrastructure like Amazon Web Services, because they can grossly inflate an application’s bandwidth and memory usage incredibly quickly. They render site analytics virtually useless. They demand a large amount of attention from IT administrators, diverting them from other essential security work. Site administrators from all corners of the web, from old-school discussion forums to e-commerce platforms, have struggled with AI harvesting, and an entire cottage industry of enterprise IT solutions has sprung up to combat the problem.
This is a pernicious problem for academic libraries in ways that differ from other industries. Libraries have invested a great deal of money, time, and personnel in the open digital ecosystem over the last few decades, and an individual repository might easily contain many terabytes of publicly available, unique content. We want those resources to be found and used, and we have historically encouraged bot harvesting for search and discovery of open digital content by generating structured, crawl-friendly metadata. These professional investments now make us a uniquely appealing target for AI harvesters, since they can extract much higher quality data for model training from our collections than from most other web properties.
Libraries also tend to generate large numbers of links, both in original content like research guides and in our search interfaces, so crawlers that would ordinarily be “in and out” of a more traditional website instead linger for hours, consuming library resources, downloading large corpora of text and media, and following generated links to search queries whose possible combinations are nearly infinite. Library IT teams and resources tend to be much leaner than those of commercial organizations, and we rely heavily on open source infrastructure to support and maintain our systems. Between AI harvesting and the rise of cybersecurity attacks on cultural heritage and higher education institutions, those teams are effectively in an unwinnable arms race. A thorough analysis of this problem and the extent of its impact can be found in Michael Weinberg’s recent white paper, “Are AI Bots Knocking Cultural Heritage Offline?”
Finally, libraries and the digital systems we maintain must fundamentally offer permanence and stability to our communities and users. We tell them that the Handle will work, that the document is available, and that the search index is responsive. Failing to deliver on that promise erodes patron trust in our systems and pushes them to other sources and services that do not have their best interests at heart.
Strategies for Response
Specific technical strategies to manage these problems are outside the scope of this article, and in fact are counterproductive to share widely, since the crawler community can analyze those methods in order to circumvent them. However, solutions tend to fall into a handful of broad categories, all of them well known to AI crawlers.
- IP Address Blocking: Large-scale IP blocking at the time of the event is one of the most common mechanisms, often resulting in millions of IP addresses being blocked temporarily. This “just in time” approach is frustrating and taxing for systems personnel and does little for network cost management. A variety of “AI block lists” are available (some free, some paid), but these and other mass-blocking measures can also result in a higher incidence of false positives, meaning legitimate visitors are blocked from access with no recourse. This method is now considered mostly ineffective, as AI bot networks can easily spoof addresses or be deployed from anywhere in a geographically distributed fashion.
- Humanity Checks: Vendor-provided “humanity checks” such as CAPTCHAs and Cloudflare’s Turnstile product ask end users to demonstrate they’re human, either by clicking something on the page or through additional analysis of their web behavior and browser. These services can be costly for libraries, detrimental to the end-user experience, and a source of digital accessibility barriers, and they increase the staff time required for troubleshooting. They also have the undesirable effect of blocking legitimate crawlers, affecting Google site analytics tools and other automated agents we rely on.
- Firewalls: Enhanced firewall services offered by major web infrastructure providers like Amazon can require considerable financial investment and are often out of reach of library budgets. Ironically, these services often tout their use of AI to detect AI bots and respond to harvesting events more effectively. They also frequently sell services to the AI crawlers themselves (and often run their own crawls), helping to enable the flood of new harvesters in the first place. It can feel more than a little like paying protection money for a company to defend us from the activities of its other customers.
- Honeypots: Some web developers set traps for crawlers, often by hiding a link on a webpage that human visitors never see; any client that follows it triggers an IP ban (the configuration sketch below includes a simplified example). This can be temporarily effective but requires continual maintenance. Special care also needs to be taken to keep these honeypots from inadvertently affecting users of screen readers and other assistive technologies.
- Authentication/Whitelisting: Fully restrict access to content and systems to authenticated or whitelisted users, preventing open access to materials. Of all methods listed, this is the cheapest and most effective solution at this time, even though it is antithetical to the public access principles of many libraries.
Most institutions implement a mix of these techniques and are constantly adjusting their strategies in a never-ending arms race with crawlers. There is no one-size-fits-all solution, and as is the case with cybersecurity practices in general, the most effective approach follows the defense-in-depth model: employ strategies at all levels of your IT environment, from the routers that direct traffic to your institution to the applications that manage and deliver your content.
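To make a few of these layers concrete, the sketch below shows what a simplified web-server configuration might look like in nginx. Every address, user-agent string, path, and threshold is an illustrative assumption rather than a recommendation, and determined bot networks can evade each of these measures by spoofing their identity.

```nginx
# Illustrative nginx sketch of several techniques from the list above.
# All values are placeholders; real deployments tune these constantly.

# Flag a few self-identified AI crawlers by user agent (easily spoofed)
map $http_user_agent $is_ai_bot {
    default      0;
    ~*GPTBot     1;
    ~*CCBot      1;
    ~*ClaudeBot  1;
}

# Allow each client IP roughly two requests per second
limit_req_zone $binary_remote_addr zone=per_ip:10m rate=2r/s;

server {
    listen 80;
    server_name repository.example.edu;   # hypothetical repository host

    # IP blocking: deny a troublesome range (documentation range used here)
    deny 203.0.113.0/24;

    # Honeypot: a path linked only from markup hidden from human visitors.
    # nginx simply refuses the request; banning the offending IP afterward
    # is typically handled by a log-watching tool such as fail2ban.
    location = /trap/please-ignore {
        return 444;   # close the connection without a response
    }

    location / {
        # Refuse requests from flagged AI user agents
        if ($is_ai_bot) {
            return 403;
        }

        # Rate limiting: absorb short bursts, throttle sustained crawling
        limit_req zone=per_ip burst=20 nodelay;

        proxy_pass http://127.0.0.1:8080;   # the repository application
    }
}
```

Each of these layers catches a different slice of traffic, and none is sufficient on its own, which is precisely why the defense-in-depth framing matters.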
The Path Ahead
AI crawlers, their impacts, and speculative solutions raise a number of large ethical and strategic questions for libraries to grapple with. Is this really what we meant by “open”? Can we differentiate the behavior in our systems from the motives of the operating individual or organization, and should we be in the business of judging “good” and “bad” site visitors and uses? How do we evaluate the risks of technical solutions, particularly when they put libraries and patron usage data at the mercy of the same corporate interests that are driving the AI boom in the first place? If our content is not included in AI training data, are we then contributing to AI’s misinformation problems? How do we navigate the need to manage our systems when many of our institutions insist that we innovate in this area (and sometimes directly partner with an AI company)? These are hard questions that technologists can’t answer without support and engagement from leadership.
If current trends in AI harvesting do not reverse, then libraries’ continued ability to offer open content and systems at our current investment levels will be put in jeopardy. If institutions value open content and ecosystems, then the technology teams that make it possible urgently need investment — in people, in platforms, in resilience. Those investments are crucial at the local level, where systems teams are responding at all hours to critical outages caused by bots and being diverted from other work.
Such investments are also essential across the open content technical ecosystem, so that the communities that maintain open repositories and data frameworks can develop more sophisticated features and evaluate collective strategies for bot management. These conversations are already happening organically in practitioner Slack channels and open source communities. One example among many is the Aggressive AI Harvesting of Digital Resources community conversations and its working groups, such as the Fedora AI Solutions and Metrics interest group. Devoting personnel time and funding to support community-level responses is essential for the open access ecosystem’s survival.
Letting library systems fail under the strain of AI harvesting comes with real costs to our credibility, our capacity to build new features, and our fundamental commitment to the stability and preservation of the content and platforms under our care. Our fraught technological and political moment makes trustworthy and open digital libraries more vital than ever before, offering what Wikimedia describes as “knowledge-as-a-service” to the world. It is imperative that we rise to the existential challenges our profession, communities, and nations face. However, our staff and infrastructure cannot keep up with those threats at their current investment levels, and our patrons and communities will soon grow tired of the flash mobs knocking over our virtual stacks and start looking for information elsewhere, likely from the AI services that are currently attacking our libraries and digital information organizations.
Acknowledgements: This article is the product of larger community and working group conversations about AI bots and the repository ecosystem, and I am grateful for the additional contributions and feedback of Andy Goldstein, Rebekah Kati, Rosalyn Metz, Scott Prater, Michael Weinberg, Alexander Berg-Weiß, Robin Desmeules, Tim Shearer, and Andrea Wallace.