In the quaint days of 2019, when the EU issued its Digital Single Market Copyright Directive (DSM), much attention was focused on issues such as news publishers’ rights and the obligations of platforms to take down infringing materials. It seemed that outside of STM publishing, not many people engaged in discussions around the scope of the text and data mining (TDM) exceptions contained in Articles 3 (non-commercial research) and 4 (commercial research).

Generative AI changed this dynamic. After all, text and data mining is the technological approach by which generative AI systems are trained. As noted in the current draft of the EU’s AI Act, “[t]ext and data mining techniques may be used extensively in this [training] context for the retrieval and analysis of such content, which may be protected by copyright and related rights.” The current draft of the AI Act explicitly requires compliance with the DSM to access the EU market, regardless of the country in which the copyright-relevant acts of training occur.

There are, however, many open questions about the DSM, especially about the rights reservation language in Article 4 for commercial TDM, that are likely to confound rights holders and AI companies alike.


DSM Articles 3 and 4 Revisited

Article 3 of the DSM, which is similar in scope to the exception that was (and is) in place in then-EU member the United Kingdom, allows non-commercial TDM on lawfully acquired content by research organizations. As research organizations are typically publishers’ customers or use content available under open access licenses, STM publishers were generally supportive of this exception.

Article 4, which created a commercial exception subject to rights reservation by the copyright owner, seemed more problematic given that copyright is an “opt in” regime. However, at the time — and based on conversations I had with EU officials — the law seemed to presume a distinction between professional content, placed on websites owned and controlled by publishers, and non-professional content such as Reddit comments and Facebook posts. My understanding is that the EU saw little harm in expecting that the former could reserve its rights when desired, while the latter was unlikely to care.

Recent lawsuits have increased my concern about this issue, especially now that text and data mining is being used as part of large-scale commercial AI.

Challenges of Rights Reservation

The rights reservation language of Article 4 provides:

  • The exception or limitation provided for in paragraph 1 shall apply on condition that the use of works and other subject matter referred to in that paragraph has not been expressly reserved by their rightholders in an appropriate manner, such as machine-readable means in the case of content made publicly available online. (italics added)

In explanatory text, the DSM states:

  • In the case of content that has been made publicly available online, it should only be considered appropriate to reserve those rights by the use of machine-readable means, including metadata and terms and conditions of a website or a service. Other uses should not be affected by the reservation of rights for the purposes of text and data mining. In other cases, it can be appropriate to reserve the rights by other means, such as contractual agreements or a unilateral declaration. Rightholders should be able to apply measures to ensure that their reservations in this regard are respected. (italics added)

This language leaves many questions unanswered. What does “machine readable” mean in this context? After all, the TDM exception is an exception to allow very smart machines to “read” and process information, so isn’t anything on a website “machine readable?” What level of granularity is required under DSM Article 4? Is a copyright notice sufficient? What about the words “all rights reserved?” Would it be enough to include “CC BY-NC” in metadata fields? Or does it need to state “commercial rights are expressly reserved under Article 4 of the DSM?” The ambiguity is troubling.
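One candidate answer to the “machine-readable” question is the W3C Community Group’s TDM Reservation Protocol (TDMRep), which proposes signals such as an HTML `<meta name="tdm-reservation">` tag. As a minimal sketch, assuming a crawler wanted to honor that proposed signal (TDMRep is a community proposal, not something the DSM itself mandates), a check might look like this:

```python
from html.parser import HTMLParser

class MetaCollector(HTMLParser):
    """Collects <meta> name/content pairs from an HTML page."""
    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            d = dict(attrs)
            if "name" in d:
                self.meta[d["name"].lower()] = d.get("content", "")

def tdm_rights_reserved(html: str) -> bool:
    """True if the page carries a TDMRep-style reservation signal
    (<meta name="tdm-reservation" content="1">)."""
    parser = MetaCollector()
    parser.feed(html)
    return parser.meta.get("tdm-reservation") == "1"

page = ('<html><head>'
        '<meta name="tdm-reservation" content="1">'
        '</head><body>Article text</body></html>')
print(tdm_rights_reserved(page))  # True
```

Even this tidy example begs the article’s question: absent a court ruling or EU guidance, no one knows whether such a tag, a plain copyright notice, or website terms and conditions is the signal Article 4 actually requires.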

Where is the Content?

Even the foregoing unanswered questions assume the content is in the control of the rights owner. There are many situations in which this is not true.

First, there is pirated content. It has been well documented that some AI companies have trained systems on collections of pirated content. Would an EU-based court hold that the failure to have rights reservation language on illegal content means that such rights have been waived? That is highly unlikely, so let’s move to the next category.

Content may be legally posted online over the objections of the copyright owner. For example, in the recent case Am. Soc’y for Testing and Materials v. Public.Resource.Org, Inc., 82 F.4th 1262 (D.C. Cir. 2023), the Court of Appeals for the District of Columbia Circuit ruled that the non-commercial posting of standards incorporated by reference into law is fair use. It is safe to assume that the entity posting the standards over the objection of copyright owners will not take steps to reserve the copyright owner’s commercial AI rights in the EU. Would an EU-based court hold that the failure to reserve rights on a “non-commercial” website, where the content is posted over the rights holder’s objections, constitutes a waiver? Doubtful, but murky.

Let’s take this further. What about preprint servers? Today, many journal publishers allow authors to post preprints of author manuscripts on servers, notwithstanding the fact that copyright often is subsequently transferred to publishers. Does the preprint server need to expressly reserve TDM rights, or is it enough that they are reserved on the version of record? How would an AI company know it is the same? Similar questions are raised with respect to other aggregation sites such as PubMed Central and institutional repositories.

Will this Change?

Legislative changes, like lawsuits, are often a lagging indicator of the times. In 2019, the legislators in the EU seemed focused on commercial and non-commercial research aspects of TDM. They were not likely worried that well-funded commercial entities were developing AI systems through mass infringement and ignoring Article 4 rights reservation clauses, nor did they seem focused on how copyright compliant AI companies would be able to identify reservations for content on multiple sites.

In an ideal world the EU would revisit Article 4, but that is unlikely to happen. Until such time, rights owners should reserve AI rights as explicitly as possible, as granularly as possible, using machine and human readable language, and should require licensees who republish their content online to do the same. And with the AI Act removing any ambiguity about compliance requirements, AI companies seeking to train on copyrighted content would do best to license content directly from rightsholders. Relying on the absence of rights reservation language is risky, unless the AI developer is absolutely certain that it is using an official version.
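In practice, one widely used machine-readable channel today is robots.txt, where several AI companies publish crawler user-agent tokens (OpenAI’s GPTBot, for instance). As a sketch of the belt-and-suspenders approach recommended above — and noting that it remains untested whether a robots.txt directive satisfies Article 4 — a rights owner might block a hypothetical AI crawler like this, and a crawler could check the file with Python’s standard library:

```python
import urllib.robotparser

# A hypothetical robots.txt that singles out a known AI training crawler
# while leaving ordinary search crawlers unaffected.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

print(rp.can_fetch("GPTBot", "https://example.com/article"))    # False
print(rp.can_fetch("SearchBot", "https://example.com/article"))  # True
```

As the reader comment below notes, robots.txt compliance is voluntary — which is precisely why licensing directly from rightsholders remains the safer path.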

Roy Kaufman

Roy Kaufman is Managing Director of both Business Development and Government Relations for the Copyright Clearance Center (CCC). Prior to CCC, Kaufman served as Legal Director, John Wiley and Sons, Inc. He is a member of, among other things, the Bar of the State of New York, the Author’s Guild, and the editorial board of UKSG Insights. Kaufman also advises the US Government on international trade matters through membership in International Trade Advisory Committee (ITAC) 13 – Intellectual Property and the Library of Congress’s Copyright Public Modernization Committee in addition to serving on the Board of the United States Intellectual Property Alliance (USIPA).


1 Thought on "Protecting Commercial AI Rights is Harder than You Think — EU Edition"

Thank you for this great article. Very useful. It did remind me of: “For some years, the Internet Archive did not crawl sites with robots.txt, but in April 2017, it announced that it would no longer honour directives in the robots.txt files.[21] “Over time we have observed that the robots.txt files that are geared toward search engine crawlers do not necessarily serve our archival purposes”.[22] This was in response to entire domains being tagged with robots.txt when the content became obsolete.[22]” Source:
