May 2025

Generative AI Copyright Lawsuit: RAG Technology Once Again in Focus as News Publishers Sue Cohere

Related Attorneys

Jane Tsai

Partner

In February 2025, more than a dozen prominent news, magazine, and digital publishers, including Forbes, The Guardian, and the Los Angeles Times, filed a lawsuit in the U.S. District Court for the Southern District of New York against generative AI startup Cohere, alleging copyright and trademark infringement [1]. The plaintiffs claim that Cohere, marketed as providing "trustworthy, verifiable answers," in fact uses Retrieval-Augmented Generation (RAG) technology to build its databases and generate output from the publishers' copyrighted content without authorization.

RAG technology was proposed by Patrick Lewis and others in 2020 [2] to address common problems in large language models, such as hallucination, outdated knowledge, and lack of transparency. Notably, Patrick Lewis, one of the primary inventors of RAG technology, is currently a researcher at Cohere, where he continues to work on related technology. RAG has been widely adopted since its introduction, with companies such as Microsoft, Google, Amazon, and NVIDIA all employing it [3].

The plaintiffs allege the following copyright infringement acts by Cohere:

1. AI Model Training: Cohere extensively scraped text from the internet, including the plaintiffs' works, to create datasets for training its family of large language models, the "Command Family." Furthermore, Cohere used third-party datasets, such as Common Crawl's C4, which contained substantial amounts of the plaintiffs' content, without obtaining authorization from the plaintiffs or others.

2. Real-time Use / RAG: Cohere's services (particularly its Chat interface) use RAG functionality, which allows the model to scrape content in real time from external sources (including the plaintiffs' websites) to generate responses. The plaintiffs assert that Cohere copied content even where websites used paywalls or robots.txt directives (instructions telling crawlers not to scrape content).

3. Infringing Outputs: In responding to user queries, Cohere's services provide copies, substantial excerpts, or substitutive summaries of the plaintiffs' works. The plaintiffs provided examples of Cohere Chat outputs in which the "Under the Hood" panel displays full or partial articles copied from the plaintiffs' websites. The plaintiffs argue that these outputs, whether verbatim copies or summaries, remove the need for users to visit the original articles, thereby harming the digital subscription and advertising revenue on which the plaintiffs rely.

4. Unauthorized Adaptation: In addition to displaying all or part of the plaintiffs' works in the "Under the Hood" panel, Cohere also provides summaries or abstracts of the plaintiffs' works. The plaintiffs contend that these summaries or abstracts are so detailed that they effectively replace the original works and exceed the bounds of fair use.
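
For readers unfamiliar with the robots.txt directives mentioned in item 2, the sketch below shows how a crawler that chooses to honor them might check whether a page may be fetched. It uses Python's standard urllib.robotparser module. The robots.txt content, the bot names, and the example.com URLs are all hypothetical and invented for illustration; nothing here describes how Cohere's systems or the plaintiffs' websites actually behave. Compliance with robots.txt is voluntary on the crawler's side, which is why the plaintiffs' allegation turns on whether the directives were respected.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for illustration only.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /subscribers/
"""


def may_fetch(user_agent: str, url: str) -> bool:
    """Return True if the hypothetical robots.txt above permits user_agent to fetch url."""
    parser = RobotFileParser()
    # Parse the rules from memory instead of downloading them from a site.
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


article = "https://news.example.com/2025/02/story.html"
paywalled = "https://news.example.com/subscribers/story.html"

print(may_fetch("ExampleAIBot", article))  # the named bot is barred from the whole site
print(may_fetch("OtherBot", article))      # other bots may fetch public pages
print(may_fetch("OtherBot", paywalled))    # but not the subscribers-only path
```

In this sketch, a well-behaved crawler would skip any URL for which may_fetch returns False. A paywall, by contrast, is usually enforced by the server rather than by a text file, so the two mechanisms raise somewhat different questions.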

In addition to alleging that Cohere's actions constitute direct copyright infringement, the plaintiffs claim that Cohere is secondarily liable for directly infringing acts (reproduction, display, and distribution of the plaintiffs' works) performed by its users through Cohere's services. This claim is intended to prevent Cohere from shifting responsibility for infringement entirely onto its users, given that Cohere's product generates answers only after a user inputs a prompt.

Beyond the copyright infringement claims, the plaintiffs also allege that the way Cohere attributes sources constitutes trademark infringement. According to the plaintiffs, Cohere uses their well-known trademarks without permission and associates them with erroneous AI-generated content, damaging the plaintiffs' brand reputation and diluting the distinctiveness of their marks.

This case is the second copyright lawsuit in the U.S. to focus on the use of RAG in AI services, following the first such case in October 2024. It shows that as RAG architecture becomes more prevalent in AI services, related copyright disputes are emerging more frequently and are bound to become a significant issue in AI copyright law.
 
1. Advance Local Media LLC et al. v. Cohere Inc., No. 25-cv-01305 (S.D.N.Y. Feb. 13, 2025).
2. Patrick Lewis et al., Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, ARXIV (Apr. 12, 2021), https://arxiv.org/abs/2005.11401.
3. Harry Booth, Patrick Lewis, Director of Machine Learning, Cohere, TIME (Sept. 5, 2024, 7:10 AM EDT), https://time.com/7012883/patrick-lewis/.

The contents of all materials (Content) available on the website belong to and remain with Lee, Tsai & Partners.  All rights are reserved by Lee, Tsai & Partners, and the Content may not be reproduced, downloaded, disseminated, published, or transferred in any form or by any means, except with the prior permission of Lee, Tsai & Partners.  The Content is for informational purposes only and is not offered as legal or professional advice on any particular issue or case.  The Content may not reflect the most current legal and regulatory developments.

Lee, Tsai & Partners and the editors do not guarantee the accuracy of the Content and expressly disclaim any and all liability to any person in respect of the consequences of anything done or permitted to be done or omitted to be done wholly or partly in reliance upon the whole or any part of the Content. The contributing authors’ opinions do not represent the position of Lee, Tsai & Partners. If the reader has any suggestions or questions, please do not hesitate to contact Lee, Tsai & Partners.

Author