The U.S. Copyright Office has released a pre-publication version of Part 3 of its report on Copyright and Artificial Intelligence, a key installment in its ongoing examination of how AI intersects with copyright law. This part centers on the use of copyrighted materials to train generative AI systems, with a substantial focus on fair use analysis and potential licensing models. It outlines the technical aspects of AI model training and identifies several stages in the development and deployment of AI systems where copyright protections could be implicated.
The report details how various activities during AI development and deployment might constitute copyright infringement, whether through copying or by otherwise implicating the copyright owner's exclusive rights. These activities include the initial acquisition and curation of training datasets containing copyrighted works and the numerous reproductions made throughout the iterative training process. Significantly, the Office takes the position that AI model weights can themselves be infringing copies if they "memorize" and embody substantial protectable expression from copyrighted training data. The report notes that, “[l]ike other digital files that encode or compress content . . . the content need not be directly perceivable to constitute a copy,” as long as it is fixed and can be perceived or reproduced with machine aid. Furthermore, the outputs generated by AI systems, particularly those produced through retrieval-augmented generation (RAG) or those that closely replicate original protected works, also present infringement risks.
A significant portion of the report is dedicated to applying the fair use doctrine to AI training. The Office gives particular attention to the first factor, the purpose and character of the use, and the fourth factor, the effect of the use upon the potential market for the copyrighted work. The report states, “[i]n the Office’s view, training a generative AI foundation model on a large and diverse dataset will often be transformative.” However, transformativeness is a matter of degree: it depends on the AI model's specific function, how it is deployed, and, critically, whether its outputs serve as market substitutes for the copyrighted works used in training. The report also rejects arguments that AI training is inherently non-expressive or directly comparable to human learning for fair use purposes. When considering market impact, the Office explores the concept of "market dilution," in which a high volume of AI-generated content, including stylistic imitations, could devalue original works even without direct copying.
On the licensing of copyrighted works for AI training, the Copyright Office currently “recommends allowing the licensing market to continue to develop without government intervention.” While acknowledging the growth of voluntary licensing agreements, the report also recognizes the logistical and financial hurdles that remain. Overall, the Office concludes that determining whether a specific use of copyrighted works in AI training qualifies as fair use will require a fact-specific, case-by-case analysis.
Even in pre-publication form, the report offers important guidance for AI developers and copyright holders navigating this evolving area of law.