Ever since the passage of the EU’s Digital Single Market copyright directive in 2019, I have been especially intrigued by the rights reservation provisions relating to the commercial text and data mining (TDM) exception in Article 4. One thing that has puzzled me is the language specifying: “The exception or limitation provided for in paragraph 1 shall apply on condition that the use of works and other subject matter referred to in that paragraph has not been expressly reserved by their rightholders in an appropriate manner, such as machine-readable means in the case of content made publicly available online.” [italics added]

TDM at its essence concerns the use of content by sophisticated machines to solve complex problems through computational power. TDM finds meaning and correlation in text. As I understood it, to the machine, everything is readable.

Maybe I was missing something in my admittedly surface-level understanding of technology.

Thankfully, I have a great go-to person on technology questions. Haralambos (“Babis”) Marmanis is Copyright Clearance Center’s polymath Executive Vice President and Chief Technology Officer. He is the author of the book Algorithms of the Intelligent Web, which introduced machine learning to a wide audience of practitioners working on everyday software applications. He holds a PhD in applied mathematics from Brown University, co-authored the first book on Spend Analysis, and has published in numerous journals, conferences, and technical periodicals.

Critically, Babis was willing to answer my questions.


It always struck me as odd that the EU specified “machine readable” in the context of a TDM exception that allows, for want of a better phrase, “machine reading.” What do you think of this?

Rights reservation language, whether in plain English, included in terms, or coded into, e.g., metadata, is “machine readable.” It is a choice by an AI developer to not read “human readable” rights reservation language. It is prudent best practice to reserve rights by whatever reasonable technological means are available and in human readable language, but this does not negate the fact that human readable is machine readable. Frankly, if we can build systems that can pass the Turing test, a simple statement in plain language on the website should suffice for the crawler to detect whether the content should be processed or not.

The EU was certainly capable of specifying language or required technology to use for rights reservation. Before getting into the “why,” what technologies would have been available in 2019 if the EU wanted to specify a technology, and what technologies are available today?

There are many rights expression languages that have existed since the early 2000s, if not earlier! METSRights, Open Digital Rights Language (ODRL), and MPEG-21 are examples. In particular, MPEG REL (“REL” meaning “Rights Expression Language”), as defined by ISO/IEC 21000-5, provides flexible, interoperable mechanisms to support transparent and augmented use of digital resources throughout the value chain in a way that protects the digital resource and honors the rights, conditions, and fees specified for it. Although there are differences between these RELs, it is not hard to create extensions or crosswalks.

So why do you think they did not require any specific technology?

Well, the truth is that the various RELs go into different depths in the data they specify and they take different approaches in terms of how their instructions should be processed. Perhaps the EU did not want to commit to a single standard that is best in a particular set of circumstances but not the best for other cases. This is not a material problem, in my view, but that would be a plausible explanation for their choice.

Historically, publishers and authors reserved rights with a copyright notice or the statement “all rights reserved.” STM publishers typically attach Creative Commons licenses to content that is openly available. Are those “machine readable?”

As stated above, everything is machine readable. At this stage, we could express rules in natural language and, provided certain conventions and consistent terminology are used, it would not be hard for a machine to know what to do. In other words, there is no reason why a crawler wouldn’t stop processing and discard a page that included the statement “All rights reserved.” If the machine is smart enough to book your appointments and make decisions for you, isn’t it smart enough to know that the publisher doesn’t want you to crawl a page? I am not advocating that as a solution; I am just making the point that any unencrypted and accessible digital asset is “machine readable.”

How, from a technology perspective, could a machine “read” this human readable text?

Well, when a crawler visits a page that corresponds to one of its target URLs, it extracts the relevant information from that page, such as the title, the headers, certain keywords, image descriptions, and so on. The crawler could immediately stop processing the page if it identifies the statement “All rights reserved,” which is how a human would know that the content is not available for uses other than the ones they have been permitted. Structured data, such as REL-based expressions, could make that far more specific in terms of what is allowed and under what specific terms. But from the perspective of whether you are allowed to process the page or not, even the simplest convention would work. Incidentally, the robots.txt file is supposed to indicate exactly what the owner of a page wants you to do with it, but it operates merely as a suggestion; it is entirely up to the crawler whether it will comply with its instructions.
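
To make the answer above concrete, here is a minimal Python sketch of the kind of check a crawler could run before processing a page. It is an illustration under stated assumptions, not a description of how any real crawler behaves: the function names, the single-phrase pattern list, and the idea of combining the two checks in may_process are all hypothetical. It uses only the Python standard library.

```python
import re
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Hypothetical phrase list. A workable convention would need agreed,
# consistent terminology; a single phrase is used here for illustration.
RESERVATION_PATTERNS = [
    re.compile(r"all\s+rights\s+reserved", re.IGNORECASE),
]


class _TextExtractor(HTMLParser):
    """Collects the visible text of a page, skipping script and style."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def has_rights_reservation(html: str) -> bool:
    """Return True if the visible text contains a reservation phrase."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    return any(pattern.search(text) for pattern in RESERVATION_PATTERNS)


def robots_allows(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt content permits fetching url."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules.can_fetch(user_agent, url)


def may_process(html: str, robots_txt: str, user_agent: str, url: str) -> bool:
    """Assumed policy: honor robots.txt and any plain-language reservation."""
    return robots_allows(robots_txt, user_agent, url) and not has_rights_reservation(html)
```

Nothing in this sketch enforces anything. As with robots.txt itself, the checks only matter if the crawler's operator chooses to run them and act on the result.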

Wouldn’t that mean any rights reservation is “machine readable?”

Yes, any rights reservation today is machine readable. Of course, if one is interested in expressing the rights in great detail and in capturing complicated use cases then a rights expression language would have to be used.

Roy Kaufman


Roy Kaufman is Managing Director of both Business Development and Government Relations for the Copyright Clearance Center (CCC). Prior to CCC, Kaufman served as Legal Director, John Wiley and Sons, Inc. He is a member of, among other things, the Bar of the State of New York, the Author’s Guild, and the editorial board of UKSG Insights. Kaufman also advises the US Government on international trade matters through membership in International Trade Advisory Committee (ITAC) 13 – Intellectual Property and the Library of Congress’s Copyright Public Modernization Committee in addition to serving on the Board of the United States Intellectual Property Alliance (USIPA).

Discussion

2 Thoughts on "AI Rights Reservation: Human Readable is Machine Readable — An Interview with Haralambos (“Babis”) Marmanis"

Everything human-readable is machine-readable is a fine concept. But there are individuals who have been stretching the particular clause.

For example, one person has suggested that placing “All Rights Reserved” on their profile is sufficient to trigger this clause.

However, if a machine is to read one of their individual published materials, that “All Rights Reserved” is not visible. It is neither machine readable, nor human readable.
Their argument is that a human would seek information on who published that material, visit their profile, and from there understand that the published material is subject to the “All Rights Reserved” that is present on their profile, and therefore it should apply to ‘the machine’ as well.

Another person took this even further. Although they have a presence on many platforms, they point to the existence of an ‘official profile’ on Twitter (aka X) with “©Individual, All Rights Reserved”, which is linked to on those other platforms, as being sufficient to cover all those other platforms with the same disclaimer.

I wonder what Mr. Marmanis’s thoughts on this are, as these are many steps removed from what might traditionally be considered an attachment of such information to the published work (i.e. via metadata on the work itself, a caption for a visual work or mention in audio works).
Note that I’m not asking about the attachment of certain rights to the work, but rather the stance of whether or not this should be considered machine-readable for the purposes of the EU DSM clause discussed.

I can’t tell if the content of this post is “all rights reserved”. It says that term everywhere, but those are examples and I’m not sure if the author also intends for them to apply to this article. Maybe somewhere in the machine-readable metadata of this page it is more clear.
