Monday, October 30, was the final date for interested parties to submit comments in response to a comprehensive “Notice of inquiry and request for comments” issued by the United States Copyright Office, entitled “Artificial Intelligence and Copyright.” With 34 questions asked about both copyright and technology, some parties’ responses exceeded 100 pages. More than 9,000 responses have been filed. On the assumption that Scholarly Kitchen readers might be interested in this topic and less interested in reviewing all the responses, I have pasted below a selection of questions and answers from Copyright Clearance Center’s (CCC’s) own response.


2. Does the increasing use or distribution of AI-generated material raise any unique issues for your sector or industry as compared to other copyright stakeholders?

AI-generated materials may both advance text publishing and hinder it. In sectors such as science, news, and book publishing, poor-quality AI materials can generate bad science, promote misinformation, and lead to harmful results. This is not to say that such outcomes are the inevitable result of all AI; merely that they are a meaningful risk with respect to certain AI applications. AI can advance text publishing by providing tools for writing, checking, validating, and improving text-based works. It is also useful for primary research that may result in the creation of new content.

6.1. How or where do developers of AI models acquire the materials or datasets that their models are trained on? To what extent is training material first collected by third-party entities (such as academic researchers or private companies)?

In the text sector, developers of AI models — when acting lawfully — acquire materials and data sets from publishers, other rightsholders, websites that allow crawling, intermediaries, and aggregators (such as CCC). Significant amounts of content are available through licenses, including open licenses such as CC BY and CC BY-NC. Significant amounts of content are also available through the public domain. When acting unlawfully, AI developers receive materials from pirate sites, through downloading in violation of express terms and flags, and from so-called “shadow libraries,” among other things.

6.2. To what extent are copyrighted works licensed from copyright owners for use as training materials? To your knowledge, what licensing models are currently being offered and used?

Copyrighted materials are licensed for AI use directly by rightsholders and collectively through rights aggregators such as CCC. CCC’s collective licenses are non-exclusive, global, and fully voluntary. Our current AI-related offerings are focused on the corporate, research, academic and education markets.

Additionally, in science publishing, under “open access” business models, copyright owners employ open licensing which sometimes allows licensed reuse for AI under the terms of such licenses. According to this report, open models accounted for 31% of articles, reviews and conference papers in 2021.

6.4. Are some or all training materials retained by developers of AI models after training is complete, and for what purpose(s)? Please describe any relevant storage and retention practices.

Humans communicate in natural language by placing words in sequences; the rules governing the sequencing and specific form of words are dictated by the particular language (e.g., English). An essential part of the architecture of any software system (and therefore any AI system) that processes text is how to represent that text so that the functions of the system can be performed most efficiently.

Almost all large language models are based on the “transformer architecture,” which relies on the “attention mechanism.” The latter allows the AI technology to view entire sentences, and even paragraphs, at once rather than as a mere sequence of characters. This allows the software to capture the various contexts within which a word can occur.

Therefore, a key step in the processing of a textual input in language models is the splitting of the user input into special “words” that the AI system can understand. Those special words are called “tokens.” The component responsible for that is called a “tokenizer.” There are many types of tokenizers. For example, OpenAI and Azure OpenAI use a subword tokenization method called “Byte-Pair Encoding (BPE)” for their Generative Pre-trained Transformer (GPT)-based models. BPE merges the most frequently occurring pairs of characters or bytes into a single token until a certain number of tokens, or vocabulary size, is reached. The larger the vocabulary size, the more diverse and expressive the texts that the model can generate.
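The core BPE training loop described above can be sketched in a few lines of Python. This is a minimal illustration, not OpenAI’s actual implementation, and the toy corpus and word frequencies below are invented for the example:

```python
from collections import Counter

def bpe_merges(word_freqs, num_merges):
    """Learn BPE merge rules from a toy corpus.

    word_freqs: dict mapping word -> frequency; each word starts split
    into single characters. Returns the merge rules in the order learned.
    """
    vocab = {tuple(w): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # count every adjacent symbol pair, weighted by word frequency
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair
        merges.append(best)
        merged = best[0] + best[1]
        # rewrite every word, fusing occurrences of the best pair
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges

corpus = {"lower": 5, "lowest": 2, "newer": 6, "wider": 3}
print(bpe_merges(corpus, 3))  # frequent pairs like ('e', 'r') merge first
```

Each merge adds one entry to the vocabulary, so the `num_merges` budget directly controls the final vocabulary size the passage refers to.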

Once the AI system has mapped the input text into tokens, it encodes the tokens as numbers and represents the sequences it processes (even up to multiple paragraphs) as vectors of numbers called “word embeddings.” These are vector-space representations of the tokens that preserve the original natural-language content that was given as text. It is important to understand the role of word embeddings when it comes to copyright, because the embeddings are the representations (or encodings) of entire sentences, paragraphs, and even documents in a high-dimensional vector space. It is through the embeddings that the AI system captures and stores the meaning and the relationships of the words from the natural language.

Embeddings are used in practically every task that a generative AI system performs (e.g., text generation, text summarization, text classification, text translation, image generation, code generation, and so on).

Word embeddings are usually stored in vector databases but a detailed description of all the approaches to storage is beyond the scope of this response since there is a wide variety of vendors, processes, and practices that are in use.
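What a vector database does with stored embeddings can be illustrated with a brute-force sketch. The document names and vectors here are hypothetical; production systems use approximate indexes (such as HNSW) rather than a linear scan, but the retrieval idea is the same:

```python
import math

def nearest(query, store, k=2):
    """Return the ids of the k stored vectors most similar to the query.

    store: list of (doc_id, vector) pairs kept in memory; similarity is
    cosine similarity, computed against every entry (a linear scan).
    """
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.sqrt(sum(a * a for a in u)) *
                      math.sqrt(sum(b * b for b in v)))
    ranked = sorted(((d, cos(query, v)) for d, v in store),
                    key=lambda t: t[1], reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]

# Hypothetical three-dimensional document embeddings.
store = [
    ("doc-copyright", [0.9, 0.1, 0.0]),
    ("doc-licensing", [0.8, 0.2, 0.1]),
    ("doc-recipes",   [0.0, 0.1, 0.9]),
]
print(nearest([0.85, 0.15, 0.05], store, k=2))  # the two legal documents rank first
```

This retrieval step is how stored embeddings get reused across the tasks listed above: the same vectors serve search, summarization, classification, and so on.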

8. Under what circumstances would the unauthorized use of copyrighted works to train AI models constitute fair use? Please discuss any case law you believe relevant to this question.

U.S. law has no specific rules governing the use of copyrighted materials to train AI. Rather, such uses fall under the general copyright regime. Under U.S. law, copying copyrighted content to train AI can state a cause of action for infringement [Citing, Thomson Reuters Enters. Ctr. GmbH v. ROSS Intelligence Inc., 529 F.Supp.3d 303 (D. Del. 2021) (downloading and copying of Westlaw database for the purpose of training AI).] Thus, such activities require a license to be non-infringing unless they fall under the fair use exception.

The application of fair use to an infringement is fact dependent. Copying for purposes of training an AI will usually entail copying the complete work. Whether the copying is for commercial or non-commercial research purposes will be considered. The courts will also look very closely at market harm under the fourth factor. As stated by the Supreme Court in Campbell v. Acuff-Rose Music, Inc., 510 U.S. 569, 590 (1994) “[the fourth factor] requires courts to consider not only the extent of market harm caused by the particular actions of the alleged infringer, but also ‘whether unrestricted and widespread conduct of the sort engaged in by the defendant … would result in a substantially adverse impact on the potential market’ for the original.” And, as reinforced by the recent Supreme Court decision in Andy Warhol Foundation for the Visual Arts, Inc. v. Goldsmith, 598 U.S. (2023), the impact of the infringing use on licensing is one of the key factors in determining market harm.

Relevant instructional cases include the cases mentioned above as well as Fox News Network, LLC v. TVEyes, Inc., 883 F.3d 169 (2d Cir. 2018), where the Second Circuit Court of Appeals rejected a fair use defense in a case of allegedly transformative compiling of recorded broadcasts into text searchable databases that allowed search and viewing of short excerpts. By contrast, the Second Circuit had previously considered the text mining of scanned books for non-commercial social science research in Authors Guild v. Google, Inc. 721 F.3d 132 (2d Cir. 2015), and held that copies made and used for a specific purpose involving snippets would likely fall under fair use.

There are currently multiple pending cases in the U.S. relating to use of copyrighted content for the development of AI systems. Congress has expressed interest in the issue by including language in the SAFE Innovation Framework that the Framework will “support our creators by addressing copyright concerns, protect intellectual property, and address liability.”

9. Should copyright owners have to affirmatively consent (opt in) to the use of their works for training materials, or should they be provided with the means to object (opt out)?

Copyright is, and should remain, an opt-in regime. Placing the burden of asserting rights on copyright holders is inequitable, burdensome, and largely impractical. Only those making copies know what they are copying in the first instance, so copyright owners are not in a position to opt out.

9.2. If an ‘‘opt out’’ approach was adopted, how would that process work for a copyright owner who objected to the use of their works for training? Are there technical tools that might facilitate this process, such as a technical flag or metadata indicating that an automated service should not collect and store a work for AI training uses?

There is good reason that copyright is an “opt-in” regime. Some AI developers have gathered content by routinely ignoring flags, copyright notices, and metadata. Thus, while there are protocols and flags that can be used and are used by rightsholders and honored by ethical AI developers, they are no substitute for placing the responsibility for compliance on the user. Moreover, requiring flags and metadata assumes that the content resides on a server or website under the control of the rightsholder. This is not always true. For example, in the recent case of Am. Soc’y for Testing & Materials v. Public.Resource.Org, Inc., 82 F.4th 1262 (D.C. Cir. 2023), the Court of Appeals for the District of Columbia Circuit ruled that the non-commercial posting of technical standards incorporated by reference into law is fair use. It would be problematic to assume that an entity posting the standards over the objection of the copyright owners would take steps to reserve the copyright owners’ AI rights.

Finally, for smaller creators, any obligation to adopt technical protection measures or flags is unfair and unduly burdensome.

Technical flags and metadata are useful for AI developers who act ethically, and they have another important value: where ignored by AI developers, they can provide evidence of willfulness.

9.3. What legal, technical, or practical obstacles are there to establishing or using such a process? Given the volume of works used in training, is it feasible to get consent in advance from copyright owners?

It is feasible to acquire advance consent of copyright owners. It is not feasible to place the burden on rightsholders to police their rights without knowing who is using their works without authorization and how the works are being used.

The burden of implementing technical measures, flags, and metadata may, depending on the sector, be involved, complicated and costly to copyright owners. In the recent past, international sector-wide initiatives such as ACAP have absorbed significant time and resources on the part of rightsholders and users seeking to act ethically, only to be rejected by the tech industry. Current efforts of note include the W3C Text and Data Mining Rights Reservation Protocol.

As noted above, as a practical matter, a copyright holder may have no control over websites where its content is held. This is especially true where content is posted in violation of copyright or under a copyright exception.

There is certainly enough copyrightable material available under license to build reliable, workable, and trustworthy AI. Just because a developer wants to use “everything” does not mean it needs to do so, is entitled to do so, or has the right to do so. Nor should governments and courts twist or modify the law to accommodate them.

Roy Kaufman

Roy Kaufman is Managing Director of both Business Development and Government Relations for the Copyright Clearance Center (CCC). Prior to CCC, Kaufman served as Legal Director, John Wiley and Sons, Inc. He is a member of, among other things, the Bar of the State of New York, the Author’s Guild, and the editorial board of UKSG Insights. Kaufman also advises the US Government on international trade matters through membership in International Trade Advisory Committee (ITAC) 13 – Intellectual Property and the Library of Congress’s Copyright Public Modernization Committee in addition to serving on the Board of the United States Intellectual Property Alliance (USIPA).


7 Thoughts on "The United States Copyright Office Notice of Inquiry on AI: A Quick Take"

If the threat of harmful misinformation is increased due to the prevalent use of AI, could it be argued that it is an ethical imperative to publish scientific or scholarly work under copyright terms that allow AI crawling or reuse, to ensure more accurate data gathering?

I am not sure I would say it is the ethical obligation of the publishers, if that is what you are saying. Yes, it would be good to work to combat fake news, fake science, etc., but the ethical obligation needs to rest first with the entities who are creating the problems, namely the AI companies and the distribution platforms. The New York Times and Washington Post are not ethically/morally responsible for fake news on social media, and scientific publishers aren’t responsible for pseudo-science. Publishers must maintain controls in their systems (fact checking in news; peer review in journal publishing), and as all systems fail sometimes, they must have processes for corrections (e.g., retractions). Publishers excel at providing a platform for trusted content and many do take steps to combat false narratives. I just do not think they should be obligated to incur costs or give rights unless they want to.

Wow. Lots of questions ahead. Thanks for this info, Roy. With regard to journal articles, I assume the only practical way to train AI legally in, say, the field of chemistry research, is to work through publisher archives. Is this correct? Even though many authors hold the copyright to their papers, only the AAMs for most will be in front of a paywall (except for OA work), scattered around a thousand different repositories. It seems unlikely that a legal AI chemistry training effort can succeed if the trainers first need to approach each and every author individually to see if they are willing to opt in, and then find each paper one at a time. So if this assumption is correct, and publishers hold the key, then what does an AI training license look like? Do publishers just charge a fee per paper (and if so, how much)? Do they own a piece of the AI system they help train? And what happens when the knowledge from this system is used to publish new work, which might then be published in OA or by a different publisher?

To poorly answer some of your questions Glenn, I would say that there are a number of licensing models developing around AI applications and use of content. As someone who develops licenses for a living, I would say that “AI” is not one thing, and that pricing and models in all licenses tend to be governed by factors such as the use cases and the business models of the licensors and licensees. Licensing models that work for open generative AI applications might not be appropriate for closed internal AI use.

As to the issue of acquiring rights from authors as opposed to publishers, you point to something about which I worry. While many people do not think of it this way, one core function of publishers is rights aggregation. Historically, this aggregation has made it relatively easy for a user to at least know who to contact for rights to reuse, e.g., a journal. Authors are now under pressure by some in the community to reserve their rights. I do not wish to wade into the pros and cons here, but will note that it can create a complication for users who might need permission to use content but not have the bandwidth to contact each and every author.

Hello, Roy.
I appreciated your post and you make important points. I found a few of the statements regarding copyright as an “opt in” regime somewhat confusing and wonder if restating them might improve the clarity around the issue of AI companies training their systems on copyright-protected materials. It has been my understanding that copyright itself is afforded to all creators of works–that there is no effort required of the copyright holders unless they want to officially register their copyright. Perhaps clarifying that “AI companies should use an opt-in approach for creators to indicate that their work is available for training” is a clearer way to state this idea. Saying that copyright itself is opt-in muddies the concept a bit (in my opinion).

Thanks for this Kimberly. You are entirely correct. The “opt in” and “opt out” questions above came from the US Copyright Office and were specifically about opting in/out of use in AI, not protection. Copyright protection comes into being the moment something capable of being protected is created (“fixed in a tangible medium of expression” is how we say it in the US). Once created, the owner typically must opt in to the use by a third party (e.g., by license) or else the third party is infringing. There are limited exceptions to this rule allowed under international treaty and the Copyright Office was asking the public for its opinion on such exceptions.

Wow, the answers here are certainly one point-of-view. I don’t think I’ve ever seen a less balanced view of what is “lawful” to ingest in training these models. Of course, it makes sense that the CCC would have a strong viewpoint in this area, but it hardly passes the laugh test. Hopefully the chefs will get some alternative views that consider things like benefits to society and how the laws vary by country on this planet.
