Sarah Silverman sues OpenAI, Meta for being “industrial-strength plagiarists”

Comedian and author Sarah Silverman. (credit: Jason Kempin / Staff | Getty Images North America)

On Friday, the Joseph Saveri Law Firm filed US federal class-action lawsuits on behalf of Sarah Silverman and other authors against OpenAI and Meta, accusing the companies of illegally using copyrighted material to train AI language models such as ChatGPT and LLaMA.

Other authors represented include Christopher Golden and Richard Kadrey, and an earlier class-action lawsuit filed by the same firm on June 28 included authors Paul Tremblay and Mona Awad. Each lawsuit alleges violations of the Digital Millennium Copyright Act, unfair competition laws, and negligence.

The Joseph Saveri Law Firm is no stranger to press-friendly legal action against generative AI. In November 2022, the same firm filed suit over GitHub Copilot for alleged copyright violations. In January 2023, the same legal group repeated that formula with a class-action lawsuit against Stability AI, Midjourney, and DeviantArt over AI image generators. The GitHub lawsuit was terminated in December 2022 after plaintiffs stopped responding, a court order shows. Procedural maneuvering in the Stable Diffusion lawsuit is still underway with no clear outcome yet.

In a press release last month, the law firm described ChatGPT and LLaMA as “industrial-strength plagiarists that violate the rights of book authors.” Authors and publishers have been reaching out to the law firm since March 2023, lawyers Joseph Saveri and Matthew Butterick wrote, because authors “are concerned” about these AI tools’ “uncanny ability to generate text similar to that found in copyrighted textual materials, including thousands of books.”

The most recent lawsuits from Silverman, Golden, and Kadrey were filed in a US district court in San Francisco. Authors have demanded jury trials in each case and are seeking permanent injunctive relief that could force Meta and OpenAI to make changes to their AI tools.

Meta declined Ars’ request to comment. OpenAI did not immediately respond to Ars’ request to comment.

A spokesperson for the Saveri Law Firm sent Ars a statement, saying, “If this alleged behavior is allowed to continue, these models will eventually replace the authors whose stolen works power these AI products with whom they are competing. This novel suit represents a larger fight for preserving ownership rights for all artists and other creators.”

Accused of using “flagrantly illegal” datasets

Neither Meta nor OpenAI has fully disclosed what’s in the datasets used to train LLaMA and ChatGPT. But lawyers for authors suing say they have deduced the likely data sources from clues in statements and papers released by the companies or related researchers. Authors have accused both OpenAI and Meta of using training datasets that contained copyrighted materials distributed without authors’ or publishers’ consent, including by downloading works from some of the largest e-book pirate sites.

In the OpenAI lawsuit, authors alleged that based on OpenAI disclosures, ChatGPT appeared to have been trained on 294,000 books allegedly downloaded from “notorious ‘shadow library’ websites like Library Genesis (aka LibGen), Z-Library (aka Bok), Sci-Hub, and Bibliotik.” Meta has disclosed that LLaMA was trained on part of a dataset called ThePile, which the other lawsuit alleged includes “all of Bibliotik,” and amounts to 196,640 books.

On top of allegedly accessing copyrighted works through shadow libraries, OpenAI is also accused of using a “controversial dataset” called BookCorpus.

BookCorpus, the OpenAI lawsuit said, “was assembled in 2015 by a team of AI researchers for the purpose of training language models.” This research team allegedly “copied the books from a website called Smashwords that hosts self-published novels, that are available to readers at no cost.” These novels, however, are still under copyright and allegedly “were copied into the BookCorpus dataset without consent, credit, or compensation to the authors.”

Ars could not immediately reach the BookCorpus researchers or Smashwords for comment.

“Numerous questions of law” raised

Authors claim that by utilizing “flagrantly illegal” datasets, OpenAI allegedly infringed copyrights of Silverman’s book The Bedwetter, Golden’s Ararat, and Kadrey’s Sandman Slim. And Meta allegedly infringed copyrights of the same three books, as well as “several” other titles from Golden and Kadrey.

It seems obvious to authors that their books were used to train ChatGPT and LLaMA because the tools “can accurately summarize a certain copyrighted book.” Although sometimes ChatGPT gets some details wrong, its summaries are otherwise very accurate, and this suggests that “ChatGPT retains knowledge of particular works in the training dataset and is able to output similar textual content,” authors alleged.

It also seems obvious to authors that OpenAI and Meta knew that their models were “ingesting” copyrighted materials because all the copyright-management information (CMI) appears to have been “intentionally removed,” authors alleged. That means that ChatGPT never responds to a request for a summary by citing who has the copyright, allowing OpenAI to “unfairly profit from and take credit for developing a commercial product based on unattributed reproductions of those stolen writing and ideas.”

“OpenAI knew or had reasonable grounds to know that this removal of CMI would facilitate copyright infringement by concealing the fact that every output from the OpenAI Language Models is an infringing derivative work, synthesized entirely from expressive information found in the training data,” the OpenAI complaint said.

Among “numerous questions of law” raised in these complaints was a particularly prickly question: Is ChatGPT or LLaMA itself an infringing derivative work based on perhaps thousands of authors’ works?

Authors are already upset that companies seem to be unfairly profiting off their copyrighted materials, and the Meta lawsuit noted that any unfair profits currently gained could further balloon, as “Meta plans to make the next version of LLaMA commercially available.” In addition to other damages, authors are asking for restitution of alleged profits lost.

“Much of the material in the training datasets used by OpenAI and Meta comes from copyrighted works—including books written by plaintiffs—that were copied by OpenAI and Meta without consent, without credit, and without compensation,” Saveri and Butterick wrote in their press release.
