I have had people tell me with doctrinal certainty that Creative Commons licenses allow text and data mining, and insofar as license terms are observed, I agree. The making of copies to perform text and data mining, machine learning, and AI training (collectively “TDM”) without additional licensing is authorized for commercial and non-commercial purposes under CC BY, and for non-commercial purposes under CC BY-NC. (Full disclosure: CCC offers RightFind XML, a service that supports licensed commercial access to full-text articles for TDM with value-added capabilities.)

I have long wondered, however, about the interplay between the attribution requirement (i.e., the “BY” in CC BY) and TDM. After all, the bargain with those licenses is that the author allows reuse, typically at no cost, but requires attribution. Attribution under the CC licenses may be the author’s primary benefit and motivation, as few authors would agree to offer the licenses without credit.

In the TDM context, this raises interesting questions:

  • Does the attribution requirement mean that the author’s information may not be removed as a data element from the content, even if inclusion might frustrate the TDM exercise or introduce noise into the system?
  • Does the attribution need to be included in the data set at every stage?
  • Does the result of the mining need to include attribution, even if hundreds of thousands of CC BY works were mined and the output does not include content from individual works?

While these questions may have once seemed theoretical, that is no longer the case. An analogous situation involving open software licenses (GNU and the like) is now being litigated. On November 3rd, a class action lawsuit (Doe 1 v. GitHub Inc., N.D. Cal., No. 3:22-cv-06823, 11/3/22) was filed in the US District Court for the Northern District of California, alleging against Microsoft and GitHub (a Microsoft subsidiary), inter alia: violation of the DMCA; breach of contract; tortious interference in a contractual relationship; unjust enrichment; unfair competition; violation of the California Consumer Privacy Act; and negligence. Also sued were a confusing mishmash of related for-profit and non-profit entities, all using a variation of the name OpenAI (OpenAI, Inc., OpenAI, LLC, OpenAI Startup Fund GP I, L.L.C.; you get the picture). OpenAI received one billion dollars in funding from Microsoft, although the two seem “officially unrelated.”

As of this writing, this case is at the earliest stage and has a long way to go before there is any sort of result. But the issues it raises are significant, especially for authors who have published content under so-called “open licenses” with attribution requirements.

Let’s start with the lawsuit’s basics. GitHub is a hosting platform commonly used to share open-source code. The impulse to create and use open-source code is reasonable and has some social utility. Many of the processing tasks contemporary software engineers are called upon to create are repetitive, and relatively well-known and understood in the literature. Open-source programming is one means of addressing the burden this represents. Basically, there’s no need to keep reinventing the wheel.

Over time, licenses were developed to standardize and better manage reuse rights in code developed and deployed this way. If the code came without restrictions, that would be the easiest in terms of reuse. But people like credit for their work, even in the open world. Some open code carries relatively light requirements, for example: “Don’t use my code commercially (don’t sell it or use it in something you sell)” and, very basically, “Acknowledge my contribution (keep my name on my work).” These types of requirements are familiar in our industry given widespread use of Creative Commons licenses.

Plaintiffs allege that OpenAI and GitHub assembled and distributed a commercial product called Copilot to create generative code using publicly accessible code originally made available under various “open source”-style licenses, many of which include an attribution requirement. As GitHub states, “…[t]rained on billions of lines of code, GitHub Copilot turns natural language prompts into coding suggestions across dozens of languages.” The resulting product allegedly omitted any credit to the original creators.

Open licenses have tended to be looked upon by users as a free-for-all, without adequate attention to the very real concerns of the creators. In this case, the sheer scale of the alleged violation in terms of works used may well form the basis of the defense: “Your honor, we needed so many works that it was simply not practical to ask permission of the creators.” I don’t find this argument convincing given the ability today to license many content types at scale for TDM, including images, music, and yes, journal articles (see “Full disclosure” above), but it is an argument often offered by infringers.

Open licenses can create a very practical challenge for users who go beyond the terms. Mining is a legitimate use of content under a CC BY license, but if you need permission from authors to, e.g., omit attribution information, obtaining it may take time and effort. With journals, some publishers require authors to sign copyright agreements even for content that is then published under an open license. This practice creates a single point of contact for uses that may not fit within the CC lines. Of course, with the expansion of rights retention strategies, the problem of contacting all authors only becomes worse.

As a final note, the complaint alleges a violation under the Digital Millennium Copyright Act for removal of copyright notices, attribution, and license terms, but conspicuously does not allege copyright infringement. A material breach of a copyright license can give rise to an infringement claim, so this is an interesting move. While the plaintiffs’ attorney indicated that an infringement claim might be added later, I suspect that this was done to avoid a messy fair use dispute. The complaint includes a statement by GitHub making an expansive, almost global fair use assertion that is at odds with explicit relevant law in many countries and, frankly, at odds even with US law. Nonetheless, fair use as a defense is expensive and complicated to litigate, so perhaps the plaintiffs chose to focus on something that is beyond factual dispute and still provides the same damages.

Roy Kaufman

Roy Kaufman is Managing Director of both Business Development and Government Relations for the Copyright Clearance Center (CCC). Prior to CCC, Kaufman served as Legal Director, John Wiley and Sons, Inc. He is a member of, among other things, the Bar of the State of New York, the Authors Guild, and the editorial board of UKSG Insights. Kaufman also advises the US Government on international trade matters through membership in the International Trade Advisory Committee (ITAC) 13 – Intellectual Property and the Library of Congress’s Copyright Public Modernization Committee, in addition to serving on the Board of the United States Intellectual Property Alliance (USIPA).


14 Thoughts on "GitHub is Sued, and We May Learn Something About Creative Commons Licensing"

I have to criticize your “damning with faint praise” to say open source is “reasonable” and has “some social utility”. The vast majority of the entire Internet, that is, the servers, run on open source Linux and 85% of smartphones worldwide use Linux. Approximately 25% of top universities worldwide use open source Moodle as their course management system. Somewhere between 500K and 1.7M websites use open source Drupal. Most experts in election security call for the software on election machines to be open source because it’s more secure to have many thousands of eyes scrutinizing the code for exploitable bugs. One could argue the “social utility” is much higher for open source code than for proprietary software.

Thanks, Melissa. That is a fair point. It wasn’t my intention to minimize the social utility of open source software.


This is very interesting. If an author/creator assigns their copyright to a publishing house and the work is then made available under a CC BY license, then would the publisher be the one to decide whether they were willing to waive the attribution element?
Presumably with rights retention the author would make the same decision for an earlier version?

If you waive your right to attribution, you are changing the license for the product. And that would apply to all versions going forward.

Under Creative Commons licenses, you may waive requirements such as attribution for individual reusers.

Section 8c in the CC-BY 4.0 license says “No term or condition of this Public License will be waived and no failure to comply consented to unless expressly agreed to by the Licensor.” That means that terms and conditions may be waived if agreed to by the license holder.

Hi, Matt. That section just affirms changes can only be made by the copyright holder. It doesn’t clarify what happens when the license is diluted for one use.

Hi Hugh, section 6C of the license makes clear that the copyright holder agreeing separate terms, such as a waiver of the need for attribution, has no effect on the general applicability of the irrevocable Creative Commons license: “For the avoidance of doubt, the Licensor may also offer the Licensed Material under separate terms or conditions or stop distributing the Licensed Material at any time; however, doing so will not terminate this Public License.”

It will be interesting to see the outcome of the lawsuit against Apache. Their 2.0 license has a “poison pill” clause.
3. Grant of Patent License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their Contribution(s) with the Work to which such Contribution(s) was submitted. If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Work or a Contribution incorporated within the Work constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for that Work shall terminate as of the date such litigation is filed.

The lawsuit is assuming GitHub even needs the license to do what they did. Their terms of service should be the governing license, since the authors posted the code to GitHub after agreeing to the TOS.

Copyright still applies. You have no rights over the source or binary form of a piece of software except those explicitly granted by law or the license. The GitHub ToS merely require that you give them the right to archive and display your content, not an unlimited license grant.

Their platform. Their rights. If you’re worried about your junk code, take it somewhere else.

My internet, my rights. Take your spurious argument somewhere else.

They have no rights over the software hosted on their platform, copyright still applies.
