September 2025

U.S. Copyright Office’s Three-Part Report on Copyright and AI
Part III: Generative AI Training

Jane Tsai, Partner
Vivi Tseng, Associate

In May 2025, the U.S. Copyright Office released the third installment of its three-part report on “Copyright and Artificial Intelligence,” addressing the issue of Generative AI Training. This report is based on the Notice of Inquiry (NOI) issued by the U.S. Copyright Office in August 2023 and the numerous responses submitted by stakeholders. The NOI sparked a wide-ranging debate among industry stakeholders, academics, and creators. At its core, the debate highlights a delicate balance between two competing values: on one side, the urgent need for technological innovation, and on the other, the protection of copyrighted works and the interests of creators.

Our previous articles introduced the first and second reports. This article continues with the third report, outlining its main findings and recommendations, and offering reflections on their implications for our jurisdiction.

I. Technical Background: How Are Generative AI Models Trained?
Before turning to the copyright implications of AI model training, it is necessary to understand the basic technical principles of generative AI training:

1. Machine Learning and Neural Networks
Generative AI models do not operate through human-designed program rules. Instead, they rely on machine learning processes to “learn” statistical patterns and relationships from massive amounts of training data. This process is accomplished through neural networks, which are complex mathematical functions composed of billions of parameters or weights. During training, these weights are repeatedly adjusted to improve the model’s performance. Ultimately, the trained model is defined by the patterns embedded in these learned weights.
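To make the idea of "repeatedly adjusted weights" concrete, the following is a minimal sketch of our own, not taken from the report. It assumes a toy model with a single weight and a hypothetical learning rate, whereas real generative models adjust billions of weights using the same basic principle.

```python
# Toy illustration (an assumption for exposition): a "model" with one weight w,
# trained so that its prediction w * x matches the target y = 3 * x.
def train(examples, learning_rate=0.01, epochs=200):
    w = 0.0  # the single trainable weight, starting from an arbitrary value
    for _ in range(epochs):
        for x, y in examples:
            prediction = w * x
            error = prediction - y
            # The gradient of the squared error with respect to w is 2 * error * x;
            # the weight is nudged in the direction that reduces the error.
            w -= learning_rate * 2 * error * x
    return w


examples = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]
learned_w = train(examples)
print(round(learned_w, 4))
```

After training, the weight settles close to 3. Nothing in the final number is a copy of any individual example; it reflects the pattern the examples share, which is the point developers emphasize when describing trained models.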

2. Training Data
The performance of generative AI is highly dependent on the quantity, quality, and purpose of its training data. In terms of quantity, models typically require millions or even billions of works, with scale directly linked to performance. Common sources of training data include publicly available online material, licensed data, or developers’ own user data. It should be noted that “publicly available” does not necessarily mean “authorized,” which is one of the key issues in copyright disputes. Regardless of its origin, training data typically undergoes curation, including filtering, cleaning, and compiling. These steps not only affect model performance but also raise significant copyright and licensing questions.
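The curation steps mentioned above (filtering, cleaning, and compiling) can be sketched roughly as follows. This is a simplified illustration under our own assumptions: the function name, the minimum-length threshold, and the sample documents are hypothetical, and production pipelines are far more elaborate.

```python
import re


def curate(documents, min_words=5):
    """Clean, filter, and deduplicate raw text documents (illustrative only)."""
    seen = set()
    curated = []
    for doc in documents:
        # Cleaning: strip HTML-like tags and collapse runs of whitespace.
        text = re.sub(r"<[^>]+>", " ", doc)
        text = re.sub(r"\s+", " ", text).strip()
        # Filtering: drop documents that are too short to be useful.
        if len(text.split()) < min_words:
            continue
        # Deduplication: skip documents already seen, ignoring letter case.
        key = text.lower()
        if key in seen:
            continue
        seen.add(key)
        curated.append(text)
    return curated


raw = [
    "<p>The quick brown fox jumps over the lazy dog.</p>",
    "the quick brown fox   jumps over the lazy dog.",
    "Too short.",
    "A second, distinct document with enough words to keep.",
]
corpus = curate(raw)
```

Note that each step operates on a stored copy of the underlying work, which is why curation is relevant to the reproduction right discussed in Part II.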

3. Training Process
Some responses to the NOI distinguished between a pre-training phase and a post-training phase. Pre-training is critical to the core capabilities of generative AI, during which the model is exposed to vast amounts of text or other content and learns to predict the next token. By repeatedly processing billions of examples, the model gradually learns the latent patterns underlying language, images, or audio.
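Next-token prediction can be illustrated with a deliberately simple stand-in. The sketch below assumes a bigram counting model, which is our own simplification and not how modern neural networks work internally; it only shows what "learning to predict the next token" means in principle.

```python
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Count, for each token, which tokens follow it in the training text."""
    counts = defaultdict(Counter)
    for current, nxt in zip(tokens, tokens[1:]):
        counts[current][nxt] += 1
    return counts


def predict_next(counts, token):
    """Return the most frequently observed follower of a token, or None."""
    followers = counts.get(token)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


tokens = "the cat sat on the mat and the cat slept".split()
model = train_bigram(tokens)
print(predict_next(model, "the"))
```

Here "the" is followed by "cat" twice and "mat" once, so the sketch predicts "cat". A neural network replaces the explicit counts with learned weights, but the training objective is analogous.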

4. Memorization and Deployment
Generative AI models may exhibit memorization, meaning that outputs can closely resemble or even replicate portions of the training data. Developers such as OpenAI argue that models contain only statistical parameters and weights, not literal “copies” of training data. However, some NOI comments questioned this claim, arguing that when a model reproduces outputs that are highly similar to specific works, the effect is functionally equivalent to storing a copy. Scholars have taken a middle position, noting that the patterns learned by models may be abstract or specific; when they become highly specific, “memorization” occurs. While memorization may improve the model’s utility, it also raises significant copyright concerns when outputs closely mirror copyrighted works.
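One rough way to gauge whether an output tracks a source text verbatim is to measure shared word sequences. The sketch below is an illustration under our own assumptions (the function names and the five-word window are hypothetical choices); it is a technical heuristic only and is not the legal test for substantial similarity.

```python
def ngrams(text, n):
    """Return the set of n-word sequences in a text, ignoring letter case."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_ratio(output, source, n=5):
    """Fraction of the output's n-word sequences that appear verbatim in the source."""
    out = ngrams(output, n)
    if not out:
        return 0.0
    return len(out & ngrams(source, n)) / len(out)


source = "it was the best of times it was the worst of times"
verbatim = "it was the best of times"
paraphrase = "the era was both wonderful and terrible at once"

print(overlap_ratio(verbatim, source))
print(overlap_ratio(paraphrase, source))
```

The verbatim excerpt scores 1.0 and the paraphrase scores 0.0. A high ratio suggests the kind of specific, rather than abstract, pattern that scholars associate with memorization.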

II. Prima Facie Infringement
Under U.S. copyright law, copyright owners are granted exclusive rights of reproduction, distribution, public performance, public display, and preparation of derivative works. To establish prima facie infringement, two elements must be satisfied: (1) ownership of a valid copyright, and (2) copying of constituent elements of the work that are original. The development and use of generative AI may implicate prima facie infringement of protected works in the following core stages:

1. Data Collection and Curation
The first step of AI training is the acquisition and preparation of data. Acts such as downloading works or transferring them across storage media directly implicate the reproduction right. Many commentators have concluded that the copying involved at this stage of data collection and curation constitutes prima facie infringement of the reproduction right.

2. Training
The training of AI models involves multiple acts of copying and processing data, potentially resulting in model weights that embody reproductions of the training material. Thus, even individuals who later copy the weights without having participated in the training may still be engaged in prima facie infringement. As noted earlier, models may exhibit memorization of training examples. If a model is able to generate outputs that are substantially similar to training data without that expression being supplied in the prompt or other input, this indicates that protected expression has, in some form, been retained within the weights.

Courts have reached different conclusions on whether model weights themselves infringe. In Andersen v. Stability AI [1], decided at the motion-to-dismiss stage, the court held that the plaintiffs had plausibly alleged infringement even where a defendant had merely downloaded a trained model, on the theory that “the model already contained copies of protected elements.” The U.S. Copyright Office has agreed with this approach, noting that the key inquiry is whether the model retains or memorizes substantial protectable expression from the original works.

3. Retrieval-Augmented Generation (RAG)
RAG is a technique that retrieves external data in real time and incorporates it into responses. The act of reproducing or extracting such external sources may itself implicate the reproduction right.
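The retrieval step can be sketched as follows. This is a minimal illustration under our own assumptions: the word-overlap scoring, the function names, and the sample documents are hypothetical, and real systems typically use search indexes or vector embeddings instead.

```python
def retrieve(query, documents, top_k=1):
    """Rank documents by how many words they share with the query."""
    query_words = set(query.lower().split())
    scored = []
    for doc in documents:
        score = len(query_words & set(doc.lower().split()))
        scored.append((score, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Keep only the best matches that share at least one word with the query.
    return [doc for score, doc in scored[:top_k] if score > 0]


def build_prompt(query, documents):
    """Copy the retrieved text into the prompt that is sent to the model."""
    context = "\n".join(retrieve(query, documents))
    return f"Context:\n{context}\n\nQuestion: {query}"


docs = [
    "The report was released in May 2025.",
    "Fair use is assessed under four factors.",
]
prompt = build_prompt("When was the report released?", docs)
print(prompt)
```

The retrieved passage is copied verbatim into the prompt before any output is generated, which illustrates why the retrieval step can raise reproduction-right questions independently of the training stage.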

4. Outputs
When the outputs of generative AI are substantially similar to or nearly identical to original works, they may infringe the reproduction right. If the outputs modify or adapt the original, they may infringe the derivative works right, and depending on the context, could also implicate the rights of public display or public performance.

Given that acts across all these stages of AI model training may constitute prima facie infringement, the main question becomes whether such uses qualify for the fair use exception.

III. Fair Use Analysis of AI Training
Fair use is a core equitable principle in U.S. copyright law, permitting the unlicensed use of copyrighted works in certain circumstances. Under U.S. law, courts apply a four-factor test to determine whether a use qualifies as fair use:
(1) Purpose and Character of the Use
(2) Nature of the Copyrighted Work
(3) Amount and Substantiality of the Portion Used
(4) Effect of the Use upon the Potential Market

1. Factor One: Purpose and Character of the Use
In applying this factor, courts focus primarily on Transformativeness and Commerciality.

(1) Transformativeness
The central inquiry is whether the new use merely supplants the original or adds new expression, meaning, or message. The greater the degree of transformation, the more likely the use will be deemed fair use.

The U.S. Copyright Office has observed that the use of large and diverse datasets to train foundation models generally exhibits a transformative purpose, though the degree of transformativeness varies by case. Because generative AI can serve both transformative and non-transformative purposes, developers who implement safeguards (such as declining to output verbatim excerpts of copyrighted works) may strengthen the case for fair use under this factor.

(2) Commerciality
This inquiry considers whether the use unfairly exploits copyrighted works for commercial gain. The Office has emphasized that the distinction is not simply whether the use is for profit, but whether the use serves a substantive commercial purpose.

2. Factor Two: Nature of the Copyrighted Work
This factor recognizes that different categories of works receive different levels of protection. The use of works that are factual or functional (e.g., news reports) is more likely to qualify as fair use, whereas the use of highly creative works (e.g., novels, songs, or paintings) is less likely to qualify.

3. Factor Three: Amount and Substantiality of the Portion Used
Courts measure not only the quantity of material copied but also the qualitative importance of the portion used. Even small excerpts may weigh against fair use if they capture the “heart” of the work.

In cases such as Sony v. Connectix [2] and Sega v. Accolade [3], courts considered the “amount ultimately made available to the public” to be an important factor. They concluded that although defendants made complete intermediate copies to extract functional elements, the end products did not expose protectable expression to the public, and thus this factor weighed less heavily against fair use.

4. Factor Four: Effect upon the Potential Market
This factor is often considered the most important. It encompasses several types of potential harm:

(1) Lost Sales:
If AI outputs are nearly identical to or substantially similar to original works, consumers may substitute them for the original, leading directly to lost sales.

(2) Market Dilution:
“Market dilution” refers to situations where, even if AI outputs do not directly copy a particular work, the large volume of outputs in similar styles competes with the market for that work. For example, when large volumes of AI-generated romance novels or music flood the market, they compete with human-authored works, diluting sales and royalties, reducing incentives for creation, and causing significant harm to the market for works in the same genre.

(3) Lost Licensing Opportunities:
The loss of actual or potential licensing revenue also constitutes market harm. Many industries argue that licensing training data is a viable commercial model, and both the news and music sectors have entered into licensing agreements. However, the scale of AI training, high costs, and fragmented ownership often make comprehensive licensing impractical.

The U.S. Copyright Office has taken a nuanced view: when a licensing market exists or is reasonably likely to develop, unlicensed uses tend to weigh against fair use. However, where insurmountable licensing barriers prevent a licensing market from functioning—because no licensing channel exists for the works at issue—unlicensed uses may nevertheless be considered fair use.

(4) Public Benefits:
Some commenters to the NOI emphasized the public benefits of unlicensed training. For example, OpenAI argued that generative AI promotes human creativity, while Meta asserted in litigation that models built on Llama support the deployment of “life-saving services and technologies.” The U.S. Copyright Office, however, has concluded that such benefits do not decisively shift the boundaries of fair use.

IV. The Feasibility of Licensing for AI Training
If the use of copyrighted works in AI training is found not to qualify as fair use, developers would be required to obtain licenses from copyright owners. The following section examines the feasibility of licensing for AI training, the associated challenges, and potential licensing models.

Voluntary Licensing

Licensing Models

1. Direct Licensing:
A license negotiated directly between individual copyright owners and users (such as AI developers).

2. Collective Licensing:
Licenses are issued through third-party organizations, typically known as collective management organizations (CMOs). Copyright owners delegate their licensing rights to such organizations, which negotiate and license on behalf of multiple rightsholders to users such as AI developers.

Analysis

1. Feasibility of Voluntary Licensing:
Some industry representatives argued that securing licenses for the massive volume and diversity of copyrighted content required for AI training would be prohibitively costly and administratively burdensome. In contrast, commenters representing creators maintained that licensing fees are a necessary cost of doing business, and that invoking “too expensive” as a justification to avoid licensing is not reasonable.

2. Ability to Provide Meaningful Compensation:
Some industry representatives argued that because AI training requires enormous amounts of data, even if the total licensing fees were high, the amount received by individual creators would be minimal and not cost-effective. By contrast, commenters on behalf of creators contended that even small payments could incentivize new creative works, and suggested that AI companies could consider future revenue-sharing models in place of traditional lump-sum royalties.

3. Possible Legal Impediments to Collective Licensing:
Some commenters warned that collective negotiations among copyright owners could raise antitrust concerns. To address this, they proposed creating an antitrust exemption specifically for collective licensing arrangements related to AI training.

Statutory Approaches

Licensing Models and Analysis

1. Compulsory Licensing:
A system established by law that permits users to exploit copyrighted works without the rightsholder’s consent, provided that statutory requirements are met and statutory royalties are paid.

Advantages:
Eliminates the need for individual negotiations and reduces high transaction costs.

Disadvantages:
(1) Undermines the rightsholder’s ability to control the use and dissemination of their works, depriving them of the freedom to choose partners, determine usage, and negotiate compensation.
(2) Entails significant administrative costs, as establishing such a regime requires a large bureaucratic framework.
(3) Risks becoming rigid and unable to adapt to rapid advances in generative AI, ultimately disadvantaging both copyright owners and AI developers.

2. Extended Collective Licensing (ECL):
A system in which a collective management organization negotiates licenses in the open market that apply to entire categories of copyrighted works for specific uses. To obtain such authority, the CMO typically must demonstrate that it represents a substantial number of rightsholders within that category.

Advantages:
Combines the flexibility of voluntary licensing with the breadth of compulsory licensing, while lowering transaction costs.

Disadvantages:
Some commenters argue that ECL shares many of the drawbacks of compulsory licensing and, due to its large scale, poses serious implementation challenges.

The U.S. Copyright Office offered the following recommendation: the government should refrain from intervention for the time being, allowing and encouraging the voluntary licensing market to continue to develop. Although challenges remain in certain sectors, such as fragmented ownership and high transaction costs, the report observed an encouraging trend: both direct and collective voluntary licensing agreements have expanded and matured in recent years. In the Office’s view, this development suggests that market-based licensing for AI training is both feasible and promising. Accordingly, the Office concluded that the market should be given sufficient space to self-regulate, rather than resorting prematurely to government-imposed compulsory measures.

V. The Position of the U.S. Copyright Office
After a comprehensive analysis of the technical foundations of generative AI, the applicability of current law, and market dynamics, the U.S. Copyright Office adopted the following positions and recommendations on copyright and AI:

1. Flexibility of the Existing Legal Framework
The Office emphasized that the current U.S. legal framework—particularly the highly flexible doctrine of fair use—is sufficient to address the legal challenges posed by AI training. At this stage, broad legislative amendments are not warranted.

2. Legality of AI Training Should Be Assessed Case by Case
The U.S. Copyright Office emphasized that the legality of AI training must be assessed on a case-by-case basis and cannot be determined categorically. Whether such training qualifies as fair use depends on a holistic evaluation of multiple factors. Non-commercial research models with strong output controls are more likely to fall within fair use, whereas commercial models trained on pirated content that compete with the original works are more likely to constitute infringement. Key considerations include which works are used, the provenance of the data, the purpose of the training, and the degree of control over the model’s ultimate outputs.

3. Support for Market-Oriented Voluntary Licensing
The U.S. Copyright Office expressed a clear preference for market-based solutions, recommending that the voluntary licensing market be allowed to mature. It observed that the growing number of both direct and collective licensing agreements in recent years provides strong evidence that the market is capable of functioning effectively and can help reconcile the demand for training data with the protection of copyright.

This report clearly reflects the Copyright Office’s core stance: to proceed with caution, to encourage market self-regulation, and to uphold the established principle of fair use.

[1] Andersen v. Stability AI Ltd., 744 F. Supp. 3d 956, 982–84 (N.D. Cal. 2024).
[2] Sony Comput. Entm’t v. Connectix, 203 F.3d 596, 606 (9th Cir. 2000).
[3] Sega v. Accolade, 977 F.2d 1510, 1526–27 (9th Cir. 1992).

U.S. Copyright Office’s Three-Part Report on Copyright and AI
Part I: Challenges and Legal Responses to Digital Replicas
U.S. Copyright Office’s Three-Part Report on Copyright and AI
Part II: The Copyrightability of AI-Generated Content

The contents of all materials (Content) available on the website belong to and remain with Lee, Tsai & Partners. All rights are reserved by Lee, Tsai & Partners, and the Content may not be reproduced, downloaded, disseminated, published, or transferred in any form or by any means, except with the prior permission of Lee, Tsai & Partners. The Content is for informational purposes only and is not offered as legal or professional advice on any particular issue or case. The Content may not reflect the most current legal and regulatory developments.

Lee, Tsai & Partners and the editors do not guarantee the accuracy of the Content and expressly disclaim any and all liability to any person in respect of the consequences of anything done or permitted to be done or omitted to be done wholly or partly in reliance upon the whole or any part of the Content. The contributing authors’ opinions do not represent the position of Lee, Tsai & Partners. If the reader has any suggestions or questions, please do not hesitate to contact Lee, Tsai & Partners.