Unpacking the US Copyright Office’s Third Report on Generative AI


By Juliette Groothaert

 

Asking DALL-E 3 to “create a scenic view of the sea in the style of Van Gogh” produced the image on the right within seconds. Compared to The Starry Night on the left, the stylistic resemblance is immediately apparent: swirling skies, radiating light forms, bold brushstrokes, and bright color contrasts.
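For readers curious about the mechanics, generating such an image takes only a few lines. The sketch below uses OpenAI’s Python client per its published API; it assumes an API key is set in the environment, and the prompt is the one quoted above.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.images.generate(
    model="dall-e-3",
    prompt="create a scenic view of the sea in the style of Van Gogh",
    n=1,
    size="1024x1024",
)
print(response.data[0].url)  # URL of the generated image
```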

Yet, as Cooper and Grimmelmann remind us, “a model is not a magical portal that pulls fresh information from some parallel universe into our own.”

This basic point provides critical context for understanding the copyright implications of generative AI. Generative AI models, as sophisticated data-driven structures, operate on mathematical constructs derived wholly from their training datasets. The expanding general usability of these models has only intensified the demand for such datasets. Industry submissions confirm that, to achieve quality, accuracy, and flexibility, these systems typically require ‘millions or billions of works for training purposes,’ including terabyte-scale datasets for foundation models. This reliance on pre-existing copyrighted materials has catalyzed numerous legal challenges.

Prominent examples include The New York Times v. Microsoft Corp, involving the unauthorised use of proprietary journalism to train language models; visual arts disputes such as Zhang v. Google LLC, Andersen v. Stability AI, and Getty Images v. Stability AI; and, most notably, the landmark ruling in Thomson Reuters v. Ross Intelligence. Although Reuters concerned the use of copyrighted legal materials to train a non-generative AI research tool, the court found that infringement had occurred through the unauthorised use of legal headnotes and their structure to train a competing product. Collectively, these cases, which now exceed forty pending lawsuits, center on a pivotal legal question: whether using copyrighted works for AI training is fair use, particularly when employed in generative systems producing output.

Against this contentious backdrop, the United States Copyright Office (‘Office’) advanced this discourse on May 9, 2025, by releasing a pre-publication draft of Part 3 of its comprehensive AI policy report. In March 2023, it issued guidance confirming that human authorship is required for copyright registration and that applicants must disclose any AI-generated content exceeding a de minimis threshold, along with a description of the human author’s contribution. The Office then issued a Notice of Inquiry soliciting public comments on AI and copyright, receiving over 10,000 submissions that informed the analysis and recommendations presented in the current report. Part 1 and Part 2 of the Office’s Initiative, addressing digital replicas and copyrightability respectively, laid essential groundwork for this third report; the Center for Art Law has published further commentary on both, which can be found here for Part 1 and here for Part 2. This latest report offers the most detailed articulation yet of how copyright law applies to the training of generative AI models. Yet its release coincides with exceptional institutional turbulence. Register Shira Perlmutter’s dismissal days after the report’s publication raises questions about what changes new management might enact. The timing may be particularly delicate for pending cases like Kadrey v. Meta and Bartz v. Anthropic, which directly echo the report’s analysis. Though the report is not legally binding, it enters a legal ecosystem in which AI copyright doctrine is actively evolving, and it may well shape interpretive norms.


Technical Primer

The Office’s pre-publication report recognizes that answers to these legal questions must rest on a technically precise account of how generative AI systems interact with protected works. Before considering fair use defenses, the Office systematically lays out how machine learning workflows inherently implicate exclusive rights under copyright law. This technical foundation identifies three essential pressure points: reproduction rights implicated when datasets are created, the possible embodiment of protected expression within model parameters through memorization, and the risks characteristic of retrieval-augmented generation systems.

Datasets

Generative AI models, including large-scale language models as well as image generators, are developed through machine learning techniques that deliberately reproduce copyrighted material. Every stage of dataset creation potentially constitutes copyright infringement under 17 U.S.C. § 106(1): the initial downloading from online sources, format conversion, cross-medium transfers, and the creation of modified subsets or filtered corpora. Such operations may concurrently implicate the derivative work right under § 106(2) when they involve recasting or transforming original expression through abridgements, condensations, or other adaptations.
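To make the Office’s point concrete, here is a minimal, hypothetical sketch of a dataset-creation pipeline; the URLs and paths are placeholders, and each stage that fixes the text in a new form is one the report treats as a potential reproduction:

```python
import pathlib
import urllib.request

def build_corpus(urls: list[str], out_dir: pathlib.Path) -> list[str]:
    """Each stage below fixes a new copy of the work -- the acts the
    Office says implicate 17 U.S.C. § 106(1). URLs are hypothetical."""
    out_dir.mkdir(parents=True, exist_ok=True)
    corpus = []
    for i, url in enumerate(urls):
        raw = urllib.request.urlopen(url).read()       # copy: initial download
        text = raw.decode("utf-8", errors="ignore")    # copy: format conversion
        (out_dir / f"doc_{i}.txt").write_text(text)    # copy: persisted dataset file
        if len(text) > 500:                            # filtering yields a modified
            corpus.append(text[:10_000])               # subset (a possible § 106(2) issue)
    return corpus
```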

Model Weights

The Office finds that model weights, numerical parameters encoding learned patterns, may represent copies of protected expression where there is substantial memorization involved, implicating reproduction and derivative rights under copyright law. As articulated on page 30 of its report:

‘…whether a model’s weights implicate the reproduction or derivative work rights turns on whether the model has retained or memorized substantial protectable expression from the works at issue.’

This determination hinges on a fact-specific inquiry: where weights enable the output of verbatim or near-identical content from training data, the Office asserts there is a strong argument that copying those weights infringes the memorized works. Judicial approaches diverge significantly on this fact-intensive standard: Kadrey v. Meta Platforms dismissed such claims as ‘nonsensical’ absent allegations of infringing outputs, while Andersen v. Stability AI permitted claims against third-party users where plaintiffs demonstrated that protected elements persisted within the weights. The Office endorses Andersen’s standard, clarifying that infringement turns on whether ‘the model has retained or memorized substantial protectable expression.’ Critically, when protectable material is embedded in weights, subsequent distribution or reuse, even by parties uninvolved in training, could constitute prima facie infringement, creating downstream liability risks that extend far beyond initial model development.
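As a rough illustration of the fact-specific inquiry the Office describes, a memorization probe might prompt a model with the opening of a training work and measure how closely the continuation tracks the original. The sketch below is hypothetical: model_generate stands in for whatever text-generation call a given system exposes, and the window sizes are arbitrary choices.

```python
import difflib

def memorization_score(model_generate, training_excerpt: str,
                       prefix_len: int = 200, window: int = 500) -> float:
    """Prompt the model with the opening of a training work and score how
    closely its continuation matches the original text (0.0 to 1.0)."""
    prefix = training_excerpt[:prefix_len]
    original_rest = training_excerpt[prefix_len:prefix_len + window]
    continuation = model_generate(prefix)[:window]
    return difflib.SequenceMatcher(None, original_rest, continuation).ratio()

# A score near 1.0 suggests the weights have 'retained or memorized
# substantial protectable expression' in the Office's phrasing.
```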

RAG

The Office’s report adopts a notably more assertive stance on retrieval-augmented generation (RAG) systems than on other AI training methods, focusing on the unique legal risks they present. Unlike conventional generative AI models built from pre-assembled training datasets, RAG systems actively retrieve and incorporate real-time data from the outside world during output generation. RAG can accordingly be understood as functioning in two steps: the system first copies source materials into a retrieval database, and then, when prompted by a user query, outputs them again. While this architecture improves factual accuracy, both the initial unauthorized reproduction and the later relaying of that material are potential copyright infringements that, in the Office’s assessment, are unlikely to qualify as fair use. This holds especially true when a system summarizes or abridges copyrighted works like news stories rather than merely linking to them.
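A toy sketch makes the two-step structure visible. Retrieval here is naive keyword overlap rather than the vector-embedding search real systems use, and the documents are placeholders:

```python
# Step 1: ingestion copies source material into a retrieval store.
documents = {
    "article_1": "Full text of a hypothetical news story about markets ...",
    "article_2": "Full text of another hypothetical story about weather ...",
}
store = dict(documents)  # the first reproduction

# Step 2: answering a query reproduces the retrieved text again.
def answer(query: str) -> str:
    words = set(query.lower().split())
    best = max(store, key=lambda d: len(words & set(store[d].lower().split())))
    excerpt = store[best][:300]  # the second reproduction, relayed to the user
    return f"According to {best}: {excerpt}"

print(answer("what happened in the markets?"))
```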

This categorical stance stems from RAG’s close connection to traditional content markets. In routine AI training, works are abstracted into patterns and statistical relationships. RAG outputs, by contrast, retain verbatim excerpts and at times compete directly with the originals, threatening core revenue streams for rights holders. For instance, systems like Perplexity AI, now facing the first US lawsuit targeting RAG technology, allegedly enable users to ‘skip the links’ to source material. This diverts traffic and ad revenue away from publishers like The Wall Street Journal, whose hyperlinks once brought readers directly to the underlying stories. And unlike the snippet displays upheld in Authors Guild v. Google, which helped readers locate sources of information, RAG outputs risk blending the original and the derivative, blurring the line between a search utility and a competing commercial service. Because feasible alternatives such as licensed APIs exist, RAG’s heavy reliance on unauthorized sources is a commercial choice rather than a technical necessity. This weakens the transformative-use defence, as RAG’s outputs frequently replicate the expressive purpose and economic value of the underlying works. In essence, the Office’s sharp condemnation of RAG signals a pivotal shift: as licensing markets for training data mature, unlicensed real-time ingestion faces existential legal threats. Courts are increasingly tasked with reconciling innovation incentives with the uncompensated exploitation that drives what some see as RAG’s double-barreled infringement.

Fair Use Factors

The Office’s report thoroughly refutes the assumption that AI training automatically enjoys broad fair use coverage, emphasising that copying copyrighted works to create training datasets constitutes prima facie infringement under 17 U.S.C. § 106(1). Against this backdrop, the Office applies the statutory four-factor test under § 107 with notable rigour, rejecting categorical exemptions for machine learning. The pre-publication guidance explores these factors in depth in its section IV, covered below.

First Factor

The Office’s first-factor analysis, centered on the purpose and character of the use, applies the Supreme Court’s framework in Warhol v. Goldsmith, rejecting absolute claims of transformativeness and instead demanding close scrutiny of the actualities of use. The Office stresses that transformativeness cannot be judged purely on how models are trained; courts must also consider what those trained models do in the field. This approach explicitly incorporates Warhol’s instruction to evaluate the ‘purpose and function’ of a use in relation to the original work, moving beyond simple textual comparison of what was incorporated or resembled to ask whether outputs serve as market substitutes.

See Adam Liptak, Supreme Court Rules Against Warhol Foundation in Prince Case, N.Y. Times (May 18, 2023), https://www.nytimes.com/2023/05/18/us/andy-warhol-prince-lynn-goldsmith.html.

Critically, the report dismantles two key industry arguments: first, that training is a purely mechanical, non-expressive computational process, and second, that it parallels human learning. The Office counters that generative models absorb not only semantic meaning but the expressive qualities of copyrighted works, studying in particular ‘how words are selected and arranged at the sentence, paragraph, and document level.’ This stands in stark contrast to human memory, where learners retain imperfect impressions filtered through unique perspectives. And while human learning sustains the creative ecosystem on which the marketplace depends, AI reproduces content at a speed and scale beyond human capacity, enabling market-disruptive reproduction.

Further, the analysis treats measures taken after deployment and data incorporation as specific evidentiary pointers. Proof that a developer installed robust guardrails to prevent verbatim output may support transformativeness by demonstrating an intent that the system serve different purposes; as Warhol cautions, however, stated intentions carry no weight where actual use contradicts them. Simultaneously, extensive use of pirated datasets weighs against fair use, especially where models generate content competing with the very works that were illegally accessed, a reality germane to ongoing litigation given many large language models’ dependence on shadow libraries.
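What such a guardrail might look like in its simplest form: a post-generation filter that refuses any output containing a long verbatim span from the training corpus. The sketch is a hypothetical illustration, and the 50-character threshold is an arbitrary choice, not a legal standard:

```python
def contains_verbatim(output: str, corpus: list[str], span: int = 50) -> bool:
    """Return True if any span-length window of the output appears
    verbatim in a training document."""
    for start in range(max(len(output) - span + 1, 1)):
        window = output[start:start + span]
        if any(window in doc for doc in corpus):
            return True
    return False

# A deployment-side guardrail would regenerate or refuse the response
# whenever this check fires.
```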

Ultimately, the Office adopts a nuanced assessment of transformativeness in generative AI. If models are trained on specific genres to produce content for identical audiences, the use is at best moderately transformative given the shared commercial and expressive purposes. This calculus weighs input-side considerations (data legality, training intent) against output consequences (market substitution, functional divergence), ensuring that transformativeness never overrides the rest of the fair use analysis. As Warhol affirmed and the Office endorses, even a transformative use can infringe an original work if it serves the same purpose and market.

Second Factor

The Office’s examination of the second fair use factor, the nature of the copyrighted work, applies the Supreme Court’s framework recognizing that creative expression resides at the core of copyright’s protective purpose, while factual or functional materials occupy a more peripheral position. Per Campbell v. Acuff-Rose Music, this factor acknowledges that ‘some works are closer to the core of intended copyright protection than others,’ establishing a graduated spectrum on which visual artworks command stronger safeguards than code, scholarly articles, or news reports. This hierarchy, articulated in Sony v. Universal, renders the use of highly creative works less likely to qualify as fair use, a principle carrying particular force in generative AI contexts, where training sets sweep in everything from highly expressive works to purely functional content.

Publication status further informs this analysis as a judicially recognised gloss on the statutory factor. Though Congress amended § 107 to clarify that unpublished status is not dispositive, Swatch Group Management v. Bloomberg LP established that unpublished works weigh against fair use, given copyright’s traditional role in protecting first-publication rights. The Office notes that most AI training datasets consist of published materials, which by general consensus ‘modestly support a fair use argument,’ while cautioning that unpublished content, whether inadvertently ingested or deliberately sourced, intensifies infringement risks.

Industry submissions reinforce this bifurcation, observing that training on novels or visual artworks fits squarely within copyright’s protective domain whereas functional code or factual compilations present weaker claims. As the Authors Guild emphasised, the second factor ‘would weigh against fair use where works are highly creative and closer to the heart of copyright,’ particularly for visual artworks whose value lies in expressive singularity. Nevertheless, the Office concurs with commenters who view this factor as rarely decisive alone, noting its doctrinal gravity is typically subordinate to commercial purpose and market harm. Ultimately, the Office concludes that where training relies on unpublished materials or highly expressive works, this factor will disfavor fair use.

Third Factor

The Copyright Office’s third-factor analysis, evaluating the amount and substantiality of copyrighted material used, confronts the reality that generative AI systems typically ingest entire works during training. Under §107, this factor examines whether the quantity copied is ‘reasonable in relation to the purpose of the copying,’ a context-sensitive inquiry that diverges sharply from precedents like Authors Guild v. Google. Where Google Books’ full-text copying enabled non-expressive search functions and limited snippet displays, the Office emphasises that AI’s wholesale ingestion lacks comparable transformative justification, observing that ‘the use of entire copyrighted works is less clearly justified in the context of AI training than it was for Google books or thumbnail image search.’

Crucially, the report rejects categorical condemnation of full-work copying, acknowledging that functional necessity may render such scale reasonable if developers demonstrate both (1) a highly transformative purpose for training and (2) robust technical safeguards preventing output of substantially similar protected expression. This calibration reflects the legacy of Sega Enterprises v. Accolade, where reverse-engineering entire software packages was deemed reasonable for interoperability, while underscoring AI’s distinct risks: absent guardrails, models risk regurgitating protected content at scale. The analysis positions output controls as pivotal mitigators; where effective constraints exist, the third factor’s weight against fair use diminishes proportionally.

Yet the Office tempers this flexibility with stark caution. Training on qualitatively significant portions, such as a photograph’s compositional essence, intensifies infringement concerns even when quantitatively minor, per Harper & Row’s ‘heart of the work’ doctrine. Unpublished materials attract particular scrutiny, as their unauthorised ingestion deprives rights holders of first-publication control. Ultimately, even where full-work copying proves functionally necessary for model optimisation, its justification remains contingent on evidence that deployment contexts avoid market substitution.

Fourth Factor

The Copyright Office’s analysis of the fourth fair use factor, the effect on the potential market for or value of the copyrighted work, arguably constitutes the report’s most consequential and controversial intervention, introducing market dilution as a novel theory of harm that expands traditional infringement paradigms. While reaffirming established harms, such as lost sales from direct displacement by AI-generated substitutes and lost licensing opportunities, and emphasising that feasible markets for training data ‘disfavor fair use where licensing options exist,’ the Office contends that generative AI’s unprecedented scale enables uniquely corrosive market effects. Specifically, the report warns that AI’s capacity for stylistic imitation, even absent verbatim copying, could flood markets with outputs that lower prices, reduce demand for original works, and harm authorship by saturating creative sectors with algorithmically generated content. While the report acknowledges that copyright traditionally targets infringement rather than competition, its dilution theory posits that the speed and scale of AI output production threaten to devalue human creativity in ways courts have never before confronted.

The Office grounds this theory in statutory language protecting a work’s ‘value’, arguing that style implicates ‘protectable elements of authorship’ and that saturation by stylistically derivative AI outputs could diminish a creator’s commercial distinctiveness. Though analogizing to Sony Corp v. Universal City Studios, where the Court considered harms from ‘widespread’ unauthorised copying, the report concedes market dilution enters ‘uncharted territory’ judicially. No court has yet adopted such a framework, and its viability hinges on whether judges accept that non-infringing stylistic competition can constitute cognizable harm under fair use’s fourth factor. The Office acknowledges this theory’s vulnerability, noting courts may demand empirical evidence beyond policy concerns or anecdotal examples and that its persuasive authority under Skidmore deference depends on the strength of its reasoning.

Importantly, the dilution theory may face several doctrinal tensions. First, copyright historically permits market competition from non-infringing works, even when it harms original creators. Objections to AI-driven dilution stem from its ease of production, distribution, and resulting scale, raising questions about whether copyright should shield markets from technological disruption. Second, critics contend that recognising dilution could paradoxically stifle creativity by enabling rights holders to suppress tools producing non-infringing works, potentially chilling the production and distribution of new works by human creators leveraging AI ethically. Finally, the Office subtly invokes creators’ ‘economic and moral interests’ in their works’ unique stylistic value, aligning with scholarly views that ‘value’ encompasses non-substitutionary harms like lost attribution or cultural decontextualisation.

Amid ongoing litigation like Kadrey v. Meta, where courts grapple with output-based market effects, the report’s dilution framework offers plaintiffs a strategic tool to argue systemic harm beyond individual infringement. Its ultimate judicial reception remains uncertain, particularly given the Office’s concurrent political upheaval and the theory’s departure from precedent. Still, the dilution framework challenges the AI industry by inviting courts to reconsider whether copyright’s purpose, protecting the ‘fruits of intellectual labor’, must evolve to address algorithmic economies of scale.

Licensing

The Office’s report champions voluntary and collective licensing as the optimal path to resolve AI training disputes, explicitly favoring market-driven solutions over regulatory intervention. This approach recognises emerging industry practices; visual media platforms like Getty Images offer structured reuse agreements. These real-world models demonstrate that scalable compensation frameworks are feasible, reducing transaction costs while enabling tailored terms for duration, exclusivity, and territorial scope.

For contexts where direct licensing remains impractical, the Office endorses extended collective licensing (ECL) as a supplementary mechanism. Modeled on Scandinavian and UK systems, ECL empowers certified collective management organizations (CMOs) to license entire repertoires (including non-members’ works) under government oversight, subject to robust opt-out rights that preserve creator autonomy. Such frameworks address the ‘copyright iceberg’ problem by covering orphan works and simplifying bulk permissions. Crucially, the Office rejects compulsory licensing as premature and incompatible with US copyright principles, noting the absence of a systemic market failure that would justify state-mandated rates. Voluntary agreements between AI developers and publishers, such as Adobe’s compensated artist partnerships for Firefly training, demonstrate functional market dynamics without government coercion. While acknowledging ECL’s potential to bridge gaps, the report cautions against premature regulatory intrusion, emphasizing that licensing markets require space to evolve organically. Instead, it advocates targeted guardrails: certification standards to ensure CMO representativeness, ironclad opt-out protections, and pilot programs in discrete sectors like academic publishing before broader implementation.
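A minimal sketch of the opt-out mechanic at the heart of ECL, with entirely hypothetical registry entries and identifiers: coverage defaults to a CMO’s whole repertoire, non-members included, unless a rights holder has affirmatively opted out.

```python
from dataclasses import dataclass

@dataclass
class Work:
    work_id: str
    rights_holder: str

# Hypothetical opt-out registry, maintained under government oversight.
OPTED_OUT: set[str] = {"rights_holder_42"}

def licensable_under_ecl(repertoire: list[Work]) -> list[Work]:
    """ECL covers the entire repertoire, including non-members' works,
    minus those whose rights holders have opted out."""
    return [w for w in repertoire if w.rights_holder not in OPTED_OUT]

repertoire = [Work("w1", "rights_holder_7"), Work("w2", "rights_holder_42")]
print([w.work_id for w in licensable_under_ecl(repertoire)])  # ['w1']
```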

Concluding Thoughts

Cooper and Grimmelmann’s incisive reminder, that AI models are not ‘magical portals’ extracting knowledge from parallel universes but data structures built from human creative labor, anchors the Office’s report. The Office methodically establishes that training generative AI implicates reproduction rights at every stage: dataset creation, weight memorization, and RAG’s real-time copying. Its rigorous fair use analysis dismantles industry claims of inherent transformativeness, instead demanding context-specific scrutiny of outputs and market harm. Most provocatively, it endorses market dilution as a cognizable injury, implying that stylistic imitation at scale devalues human artistry even absent infringement.

Yet the report’s release amid leadership upheaval and pending litigation leaves its authority in flux. While championing voluntary licensing as the optimal path, its novel doctrinal frameworks, particularly dilution, face untested judicial terrain. Ultimately, the Office charts a pragmatic course, acknowledging AI’s technical necessities while centering copyright’s mandate to protect creative labor. As Cooper and Grimmelmann caution, progress lies not in magical thinking about ‘parallel universes’, but in ethically engaging the human expression fueling these systems. The path forward demands negotiated coexistence, where innovation credits its sources, and creation retains its worth.

Suggested readings:

Why A.I. isn’t going to make art, The New Yorker, August 31 2024

Understanding artists’ perspectives on generative AI art and transparency, ownership, and fairness, AI Hub, January 14 2025

How artists are using generative AI to celebrate the natural world, UK Creative Festival, January 15 2025

Stopping the Trump Administration’s Unlawful Firing Of Copyright Office Director, Democracy Forward, May 22 2025

About the author:

Juliette Groothaert (Summer Intern 2025, Center for Art Law) is a law student at the University of Bristol, graduating in 2025. She is interested in the evolving relationship between intellectual property law and artistic expression, which she hopes to explore further through an LLM next year. As a summer legal intern, she is conducting research in this field while contributing to the Center’s Nazi-Looted Art Database.


Disclaimer: This article is for educational purposes only and is not meant to provide legal advice. Readers should not construe or rely on any comment or statement in this article as legal advice. For legal advice, readers should seek a consultation with an attorney.



