Sai Mamidala | Is the Use of Copyrighted Works to Train AI Models Protected Under Fair Use?

Apr 19

Background

Major AI companies around the world train their models on copyrighted material. The reason is simple: an AI model is only as good as the data used to train it, and building a useful model requires an enormous quantity of that data – far more than any company can feasibly produce on its own. So companies like Anthropic, Google, and Meta turned to what was already available: books, articles, images, music, and code created by other people. Although training involves ingesting entire original works, the model generally does not store those works as retrievable copies. Rather, the process extracts statistical patterns about language, like relational vocabulary, syntax, and structure, and compresses them into numerical parameters that power future outputs. That mechanical reality is what makes the transformative fair use question surrounding AI so interesting, and the federal courts that have tried to answer it have landed in strikingly different places. Because AI’s recent, explosive growth has outpaced the litigation surrounding it, what we have is an emerging split headed for circuit-level resolution rather than a mature circuit split.

The analytical framework governing these cases is copyright law’s fair use doctrine, codified at 17 U.S.C. §107, which provides that “the fair use of a copyrighted work… for purposes such as criticism, comment, news reporting, teaching, … scholarship, or research, is not an infringement of copyright.” Courts evaluate fair use case by case, weighing the four factors listed in §107 as interpreted in Campbell v. Acuff-Rose Music, Inc., 510 U.S. 569, 579 (1994). The first, and often most consequential, factor asks whether the purpose and character of the use is “transformative”: whether the new work adds “something new, with a further purpose or different character” rather than merely substituting for the original. The second factor considers the nature of the copyrighted work, such as whether it is creative or informational. The third factor looks at how much of the original work was used. Finally, the fourth factor, the other heavyweight, is the effect of the use on the market for the original work. Although no single factor is dispositive, every factor is contested in the AI context: Is training a model on a literary work “transformative” if the model can then generate similar, competing works? Does copying an entire work weigh against fair use when the copy is never reproduced in the output? And which market is in question: the market for the original work, or a brand new market for AI-generated work?

In 2023, the Supreme Court recalibrated the transformative use inquiry (Factor 1) in Andy Warhol Found. for the Visual Arts, Inc. v. Goldsmith, 598 U.S. 508 (2023). There, the Court held that merely adding new aesthetics or meaning does not render a use “transformative.” The inquiry focuses on whether the new use serves a fundamentally different purpose from the original or just operates as a substitute for it. Warhol thus placed new emphasis on the relationship between the original work and the allegedly infringing new use, rather than on the subjective intent of the copier. In AI training, this relationship is particularly complex, because a model’s outputs often bear no significant resemblance to the original works consumed during training.

The question dividing federal district courts today is how the fair use factors and the Warhol framework apply to copyrighted works used to train AI. The emerging split between the Northern District of California in the Ninth Circuit and the District of Delaware in the Third Circuit reveals a sharp disagreement not just over outcomes, but over what Warhol actually requires courts to ask. The Third Circuit, now considering the issue on interlocutory appeal, is poised to become the first appellate court to weigh in.

Issue

Does the use of copyrighted works to train AI models constitute fair use under 17 U.S.C. §107?

The Split

The “No Fair Use” Side

In Thomson Reuters Enterprise Centre GmbH v. ROSS Intelligence Inc., 765 F. Supp. 3d 382 (D. Del. 2025), Judge Stephanos Bibas granted partial summary judgment to Thomson Reuters, finding that ROSS Intelligence’s use of Westlaw headnotes to train its AI-powered legal research tool was not fair use.

The facts here are unusual. ROSS, founded in 2014, wasn’t building generative AI – it was building a legal research search engine that would allow lawyers to type questions in plain English and get relevant case law back. To train the system, ROSS needed legal text. It tried to license content directly from Thomson Reuters (Westlaw’s parent company), but Thomson Reuters’ policy barred the use of Westlaw to build a competing product. To get around this, ROSS hired a third-party company, LegalEase Solutions, to create “Bulk Memos” – sets of legal question-and-answer pairs written by lawyers. The catch was that the lawyers writing those questions used Westlaw headnotes as a reference. Thomson Reuters subsequently sued for copyright infringement.

On Factor 1, applying Warhol, Judge Bibas found ROSS’s use non-transformative. Because both ROSS and Thomson Reuters helped users navigate case law using headnotes, their products served very similar purposes for the same audience. ROSS argued that its copying should be permissible because the headnotes were converted to numerical data during training and never appeared in the final product, but Judge Bibas rejected this, noting that unlike the copying in earlier computer code copyright cases, this copying was not required for ROSS to compete. ROSS could have built its training data without copying Thomson Reuters’ headnotes – it just would have taken far more effort and expense. It’s worth noting that ROSS was a small, venture-backed startup that has since shut down – in part because of the cost of this very litigation. Additionally, Judge Bibas left open the question of whether generative AI would be subject to the same result, so generative AI may fare differently in an analogous situation.

Factors 2 and 3 were decided in favor of ROSS. On Factor 2, Judge Bibas acknowledged the headnotes are “not that creative,” recognizing that headnote writers are heavily constrained by the judicial opinions they summarize. This is what copyright scholars call “thin” protection: the headnotes just barely cross the originality threshold (a requirement for copyrightable material) set by Feist Publications, Inc. v. Rural Telephone Service Co., 499 U.S. 340 (1991). On Factor 3, ROSS’s final product never displayed a single Westlaw headnote to users – it surfaced only judicial opinions, which are freely available to the public and uncopyrightable. Judge Bibas reasoned that “what matters is not the amount and substantiality of the portion used in making a copy, but rather the amount and substantiality of what is thereby made accessible to a public.” Because no headnote expression ever reached ROSS’s users, this factor favored ROSS.

On Factor 4, Judge Bibas found that ROSS “meant to compete with Westlaw by developing a market substitute,” with its use threatening both the existing legal research market and a potential derivative market for AI training data. This resulted in a 2-2 split across the factors. Weighing the factors together, and treating Factors 1 and 4 (4 especially) as the most significant, Judge Bibas rejected ROSS’s fair use defense and granted summary judgment to Thomson Reuters on fair use.

This result has drawn scholarly criticism. An amicus brief filed by leading copyright scholars argues that Judge Bibas’s fair use analysis is “unsound”: that he wrongly equated business competition with expressive substitution under Warhol, that the headnotes deserve at most thin copyright protection, and that ROSS’s tool serves the public interest by expanding access to judicial opinions.

The case has now been taken up on interlocutory appeal under 28 U.S.C. §1292(b), certified by Judge Bibas himself. He certified two questions: (1) whether Westlaw’s headnotes are sufficiently original to warrant copyright protection, and (2) whether ROSS’s use of them constituted fair use. This is an issue of first impression for any federal appellate court, so we will be looking to the Third Circuit for the first governing framework on AI training and fair use.

The “Fair Use” Side

Meanwhile, the Northern District of California reached the opposite conclusion – twice.

In Bartz v. Anthropic PBC, No. 3:24-cv-05417 (N.D. Cal. June 23, 2025), Judge William Alsup found that Anthropic’s use of copyrighted books, many pirated from online repositories like Library Genesis (LibGen), to train its Claude LLM was fair use. On Factor 1, Judge Alsup held that AI training was “exceedingly transformative.” The purpose of a book, he reasoned, is to be read and experienced for its expression; the purpose of training a language model on that book is to extract statistical patterns about language to build a tool that generates new text. Those are fundamentally different functions. Judge Alsup analogized the process to a human reader aspiring to be a writer, who internalizes books not “to race ahead and replicate or supplant them – but to turn a hard corner and create something different.” In the court’s view, Anthropic’s LLM similarly learns from individual authors’ creative expression but does not output anything from “a given work’s creative elements, nor even one author’s identifiable expressive style.” Moreover, copyright protection does not extend to “method[s] of operation, concept[s], [or] principle[s]” “illustrated[] or embodied in [a] work,” which weighs in favor of fair use. 17 U.S.C. §102(b).

On Factor 4, Judge Alsup found no proven market harm: the AI is not a substitute for the books it was trained on, given that the books aren’t directly accessible through the LLM. Interestingly, the court rejected the argument that authors are entitled to exploit a licensing market specifically for AI training data (the right to charge AI companies for permission to train on their works). Even assuming such a market could develop, Judge Alsup concluded, it is “not one the Copyright Act entitles Authors to exploit” – a notable departure from Judge Bibas’s willingness in ROSS to credit a hypothetical training-data licensing market under Factor 4. Separately, Judge Alsup found that Anthropic’s downloading of millions of books from pirate libraries was not fair use. That ruling led to a $1.5 billion settlement, believed to be the largest in U.S. copyright history, and serves as a reminder that even if training is legal, proper data sourcing is still crucial.

Just two days later, in Kadrey v. Meta Platforms, Inc., No. 3:23-cv-03417 (N.D. Cal. June 25, 2025), Judge Vince Chhabria found Meta’s use of books to train its models “highly transformative” – but unlike Judge Alsup, he refused to treat that finding as determinative. While Judge Chhabria agreed with Judge Alsup that training serves a fundamentally different function from the original works, he also agreed with the spirit of Judge Bibas’s concern: that AI models trained on copyrighted books have an unprecedented capacity to generate competing content at scale, potentially “flood[ing] the market” with works that may displace human authors. Judge Chhabria found no proven market harm on the current record, but made clear that a future plaintiff who presents such evidence could obtain a different result even where training is highly transformative. His opinion thus crystallizes the core tension the Third Circuit will have to resolve: can a use be transformative under Factor 1 and still fail on Factor 4?

Looking Forward

Oral argument in ROSS has yet to be scheduled, but when the Third Circuit does rule, it will supply the first appellate framework for AI training under §107. As the California decisions suggest, though, it won’t be the last word. The AI copyright docket now includes upwards of 90 cases, and the next round of fair use rulings is already on the horizon. Concord Music Grp., Inc. v. Anthropic PBC, No. 3:23-cv-01092 (M.D. Tenn.), which asks whether AI-generated song lyrics infringe the copyrighted music the model trained on, has a summary judgment hearing set for July 2026. In re Google Generative AI Copyright Litig., No. 3:23-md-03061 (N.D. Cal.), is on a similar timeline. Some rightsholders aren’t waiting for courts at all: Universal Music Group settled with AI music generator Udio, and Warner Music Group recently settled with Suno, a similar platform, with both disputes ending in licensing deals rather than continued litigation.

What is becoming clear is that this won’t resolve into a single yes-or-no answer. As Emory Law Professor Matthew Sag argues in his forthcoming Duke Law Journal article, Copyright’s Jagged Frontier, the line between permissible and infringing AI will be “jagged” – not a clean rule, but an irregular, context-dependent boundary shaped by how much any given model memorizes, how similarity is measured across different creative media, and how fair use, substantial similarity, and secondary liability interact. It may also be far easier for an AI system to cross into infringement territory when generating music or images of recognizable characters than when generating prose, even when the underlying technology is identical. The cases in this piece are the first contour of that frontier. The rest is still being drawn.



Emory ELSSCAP
