Monday, October 30, was the final date for interested parties to submit comments in response to a comprehensive “Notice of inquiry and request for comments” issued by the United States Copyright Office, entitled “Artificial Intelligence and Copyright.” With 34 questions spanning both copyright and technology, some parties’ responses exceeded 100 pages. More than 9,000 responses have been filed. On the assumption that Scholarly Kitchen readers might be interested in this topic but less interested in reviewing all the responses, I have pasted below a selection of questions and answers from Copyright Clearance Center’s (CCC’s) own response.
2. Does the increasing use or distribution of AI-generated material raise any unique issues for your sector or industry as compared to other copyright stakeholders?
AI-generated materials may both advance and hinder text publishing. In sectors such as science, news, and book publishing, poor-quality AI materials can generate bad science, promote misinformation, and lead to harmful results. This is not to say that such outcomes are the inevitable result of all AI; merely that they are a meaningful risk with respect to certain AI applications. AI can advance text publishing by providing tools for writing, checking, validating, and improving text-based works. It is also useful for primary research that may result in the creation of new content.
6.1. How or where do developers of AI models acquire the materials or datasets that their models are trained on? To what extent is training material first collected by third-party entities (such as academic researchers or private companies)?
In the text sector, developers of AI models — when acting lawfully — acquire materials and data sets from publishers, other rightsholders, websites that allow crawling, intermediaries, and aggregators (such as CCC). Significant amounts of content are available through licenses, including open licenses such as CC BY and CC BY-NC. Significant amounts of content are also available through the public domain. When acting unlawfully, AI developers receive materials from pirate sites, through downloading in violation of express terms and flags, and from so-called “shadow libraries,” among other things.
6.2. To what extent are copyrighted works licensed from copyright owners for use as training materials? To your knowledge, what licensing models are currently being offered and used?
Copyrighted materials are licensed for AI use directly by rightsholders and collectively through rights aggregators such as CCC. CCC’s collective licenses are non-exclusive, global, and fully voluntary. Our current AI-related offerings are focused on the corporate, research, academic and education markets.
Additionally, in science publishing, under “open access” business models, copyright owners employ open licensing which sometimes allows licensed reuse for AI under the terms of such licenses. According to this report, open models accounted for 31% of articles, reviews and conference papers in 2021.
6.4. Are some or all training materials retained by developers of AI models after training is complete, and for what purpose(s)? Please describe any relevant storage and retention practices.
Humans communicate in natural language by placing words in sequences; the rules governing the sequencing and specific form of each word are dictated by the particular language (e.g., English). An essential part of the architecture of any software system that processes text (and therefore of any AI system that does so) is how it represents that text so that the functions of the system can be performed most efficiently.
Almost all large language models are based on the “transformer architecture,” which relies on the “attention mechanism.” Attention allows the AI technology to view entire sentences, and even paragraphs, at once rather than as a mere sequence of characters, enabling the software to capture the various contexts in which a word can occur.
Therefore, a key step in the processing of a textual input in language models is the splitting of the user input into special “words” that the AI system can understand. Those special words are called “tokens,” and the component responsible for producing them is called a “tokenizer.” There are many types of tokenizers. For example, OpenAI and Azure OpenAI use a subword tokenization method called “Byte-Pair Encoding (BPE)” for their Generative Pretrained Transformer (GPT)-based models. BPE is a method that merges the most frequently occurring pairs of characters or bytes into a single token, repeatedly, until a target vocabulary size is reached. The larger the vocabulary size, the more diverse and expressive the texts that the model can generate.
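To make the merge step concrete, here is a minimal, illustrative sketch of BPE in Python. This is a toy reimplementation for exposition only, not OpenAI’s actual tokenizer; the corpus, word frequencies, and number of merge steps are invented for the example.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus and return the most frequent."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol (token)."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: word -> frequency; each word starts as a tuple of characters.
corpus = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6, tuple("wider"): 3}
for _ in range(4):  # four merge steps; real vocabularies involve tens of thousands
    pair = most_frequent_pair(corpus)
    corpus = merge_pair(corpus, pair)
    print("merged:", pair)
```

On this toy corpus the first merge is `('e', 'r')`, because “er” is the most common adjacent pair; repeated merges gradually build multi-character subword tokens.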
Once the AI system has mapped the input text into tokens, it encodes the tokens as numbers and converts the sequences it processed (even up to multiple paragraphs) into vectors of numbers called “word embeddings.” These are vector-space representations of the tokens that preserve the information in the original natural-language text. It is important to understand the role of word embeddings when it comes to copyright, because embeddings are representations (or encodings) of entire sentences, paragraphs, and even documents in a high-dimensional vector space. It is through embeddings that the AI system captures and stores the meaning of, and the relationships among, the words of the natural language.
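A tiny sketch illustrates how meaning is captured geometrically: related words end up with nearby vectors, and similarity can be measured as the angle between them. The three-dimensional vectors below are hand-made for illustration; real models learn embeddings with hundreds or thousands of dimensions from training data.

```python
import math

# Toy, hand-made "embeddings" -- invented values, not learned by any model.
embeddings = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

def cosine_similarity(u, v):
    """Angle-based similarity: near 1.0 for similar directions, near 0 for unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

print(cosine_similarity(embeddings["king"], embeddings["queen"]))  # high
print(cosine_similarity(embeddings["king"], embeddings["apple"]))  # much lower
```

The point for copyright purposes is that these vectors, not the original character strings, are what the system stores and computes with, yet they encode the relationships present in the source text.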
Embeddings are used in practically every task that a generative AI system performs (e.g., text generation, text summarization, text classification, text translation, image generation, code generation, and so on).
Word embeddings are usually stored in vector databases, but a detailed description of all the approaches to storage is beyond the scope of this response, since a wide variety of vendors, processes, and practices are in use.
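At its simplest, a vector store answers the question “which stored items are most similar to this query vector?” The sketch below shows a brute-force version of that lookup; the document names and vectors are invented, and production vector databases use approximate nearest-neighbor indexes rather than an exhaustive scan like this.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def nearest(query, store, k=2):
    """Return the k stored keys whose vectors are most similar to the query.
    Brute force for clarity; real vector databases index for scale."""
    return sorted(store, key=lambda key: cosine(query, store[key]), reverse=True)[:k]

# Toy store: document label -> invented embedding vector.
store = {
    "doc-1 (contract law)": [0.9, 0.1, 0.2],
    "doc-2 (licensing)":    [0.8, 0.2, 0.3],
    "doc-3 (cooking blog)": [0.1, 0.9, 0.2],
}
print(nearest([0.85, 0.15, 0.25], store))  # the two law-related documents rank first
```

This retrieval step is what underlies tasks such as semantic search and retrieval-augmented generation, where stored embeddings of copyrighted documents are matched against an embedded query.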
8. Under what circumstances would the unauthorized use of copyrighted works to train AI models constitute fair use? Please discuss any case law you believe relevant to this question.
U.S. law has no specific rules governing the use of copyrighted materials to train AI. Rather, such uses fall under the general copyright regime. Under U.S. law, copying copyrighted content to train AI can state a cause of action for infringement [Citing Thomson Reuters Enters. Ctr. GmbH v. ROSS Intelligence Inc., 529 F. Supp. 3d 303 (D. Del. 2021) (downloading and copying of the Westlaw database for the purpose of training AI).] Thus, such activities require a license to be non-infringing unless they fall under the fair use exception.
The application of fair use to an infringement is fact dependent. Copying for purposes of training an AI will usually entail copying the complete work. Whether the copying is for commercial or non-commercial research purposes will be considered. The courts will also look very closely at market harm under the fourth factor. As stated by the Supreme Court in Campbell v. Acuff-Rose Music, Inc., 510 U.S. 569, 590 (1994), “[the fourth factor] requires courts to consider not only the extent of market harm caused by the particular actions of the alleged infringer, but also ‘whether unrestricted and widespread conduct of the sort engaged in by the defendant … would result in a substantially adverse impact on the potential market’ for the original.” And, as reinforced by the recent Supreme Court decision in Andy Warhol Foundation for the Visual Arts, Inc. v. Goldsmith, 598 U.S. ___ (2023), the impact of the infringing use on licensing is one of the key factors in determining market harm.
Instructive cases include those mentioned above as well as Fox News Network, LLC v. TVEyes, Inc., 883 F.3d 169 (2d Cir. 2018), where the Second Circuit Court of Appeals rejected a fair use defense in a case of allegedly transformative compiling of recorded broadcasts into text-searchable databases that allowed search and viewing of short excerpts. By contrast, the Second Circuit had previously considered the text mining of scanned books for non-commercial social science research in Authors Guild v. Google, Inc., 804 F.3d 202 (2d Cir. 2015), and held that copies made and used for a specific purpose involving snippets would likely fall under fair use.
There are currently multiple pending cases in the U.S. relating to use of copyrighted content for the development of AI systems. Congress has expressed interest in the issue by including language in the SAFE Innovation Framework that the Framework will “support our creators by addressing copyright concerns, protect intellectual property, and address liability.”
9. Should copyright owners have to affirmatively consent (opt in) to the use of their works for training materials, or should they be provided with the means to object (opt out)?
Copyright is, and should remain, an opt-in regime. Placing the burden of asserting rights on copyright holders is inequitable, burdensome, and largely impractical. Only those making copies know, in the first instance, what they are copying; copyright owners are thus not in a position to opt out.
9.2. If an ‘‘opt out’’ approach was adopted, how would that process work for a copyright owner who objected to the use of their works for training? Are there technical tools that might facilitate this process, such as a technical flag or metadata indicating that an automated service should not collect and store a work for AI training uses?
There is good reason that copyright is an “opt-in” regime. Some AI developers have gathered content by routinely ignoring flags, copyright notices, and metadata. Thus, while there are protocols and flags that can be and are used by rightsholders and honored by ethical AI developers, they are no substitute for placing the responsibility for compliance on the user. Moreover, requiring flags and metadata assumes that the content resides on a server or website under the control of the rightsholder. This is not always true. For example, in the recent case of Am. Soc’y for Testing & Materials v. Public.Resource.Org, Inc., 82 F.4th 1262 (D.C. Cir. 2023), the Court of Appeals for the District of Columbia Circuit ruled that the non-commercial posting of technical standards incorporated by reference into law is fair use. It would be problematic to assume that an entity posting such standards over the objection of the copyright owners would take steps to reserve those owners’ AI rights.
Finally, for smaller creators, any obligation to adopt technical protection measures or flags is unfair and unduly burdensome.
Technical flags and metadata are useful for AI developers who act ethically, and they have another great value: where they are ignored by AI developers, they can provide evidence of willfulness.
9.3. What legal, technical, or practical obstacles are there to establishing or using such a process? Given the volume of works used in training, is it feasible to get consent in advance from copyright owners?
It is feasible to acquire advance consent of copyright owners. It is not feasible to place the burden on rightsholders to police their rights without knowing who is using their works without authorization and how the works are being used.
The burden of implementing technical measures, flags, and metadata may, depending on the sector, be complicated and costly for copyright owners. In the recent past, international sector-wide initiatives such as ACAP absorbed significant time and resources on the part of rightsholders and users seeking to act ethically, only to be rejected by the tech industry. Current efforts of note include the W3C Text and Data Mining Rights Reservation Protocol.
As noted above, as a practical matter, a copyright holder may have no control over websites where its content is held. This is especially true where content is posted in violation of copyright or under a copyright exception.
There is certainly enough copyrightable material available under license to build reliable, workable, and trustworthy AI. Just because a developer wants to use “everything” does not mean it needs to do so, is entitled to do so, or has the right to do so. Nor should governments and courts twist or modify the law to accommodate them.