Case 1:23-cv-08292 Document 1 Filed 09/19/23 Page 1 of 47
Rachel Geman
LIEFF CABRASER HEIMANN & BERNSTEIN, LLP
250 Hudson Street, 8th Floor
New York, NY 10013-1413
Telephone: 212.355.9500
rgeman@lchb.com

Reilly T. Stoler (pro hac vice forthcoming)
Ian R. Bensberg (pro hac vice forthcoming)
LIEFF CABRASER HEIMANN & BERNSTEIN, LLP
275 Battery Street, 29th Floor
San Francisco, CA 94111-3339
Telephone: 415.956.1000
rstoler@lchb.com
ibensberg@lchb.com

Attorneys for Plaintiffs and the Proposed Class

Scott J. Sholder
CeCe M. Cole
COWAN DEBAETS ABRAHAMS & SHEPPARD LLP
41 Madison Avenue, 38th Floor
New York, New York 10010
Telephone: 212.974.7474
ssholder@cdas.com
ccole@cdas.com

UNITED STATES DISTRICT COURT
SOUTHERN DISTRICT OF NEW YORK

No. 1:23-cv-8292

CLASS ACTION COMPLAINT

JURY TRIAL DEMANDED

AUTHORS GUILD, DAVID BALDACCI, MARY BLY, MICHAEL CONNELLY, SYLVIA DAY, JONATHAN FRANZEN, JOHN GRISHAM, ELIN HILDERBRAND, CHRISTINA BAKER KLINE, MAYA SHANBHAG LANG, VICTOR LAVALLE, GEORGE R.R. MARTIN, JODI PICOULT, DOUGLAS PRESTON, ROXANA ROBINSON, GEORGE SAUNDERS, SCOTT TUROW, and RACHEL VAIL, individually and on behalf of others similarly situated,
Plaintiffs,
v.
OPENAI INC., OPENAI LP, OPENAI LLC, OPENAI GP LLC, OPENAI OPCO LLC, OPENAI GLOBAL LLC, OAI CORPORATION LLC, OPENAI HOLDINGS LLC, OPENAI STARTUP FUND I LP, OPENAI STARTUP FUND GP I LLC, and OPENAI STARTUP FUND MANAGEMENT LLC,
Defendants.

INTRODUCTORY STATEMENT

1. Plaintiffs, authors of a broad array of works of fiction, bring this action under the Copyright Act seeking redress for Defendants’ flagrant and harmful infringements of Plaintiffs’ registered copyrights in written works of fiction. Defendants copied Plaintiffs’ works wholesale, without permission or consideration. Defendants then fed Plaintiffs’ copyrighted works into their “large language models” or “LLMs,” algorithms designed to output human-seeming text responses to users’ prompts and queries. These algorithms are at the heart of Defendants’ massive commercial enterprise. And at the heart of these algorithms is systematic theft on a mass scale.

2. Plaintiffs seek to represent a class of professional fiction writers whose works spring from their own minds and their creative literary expression. These authors’ livelihoods derive from the works they create. But Defendants’ LLMs endanger fiction writers’ ability to make a living, in that the LLMs allow anyone to generate—automatically and freely (or very cheaply)—texts that they would otherwise pay writers to create. Moreover, Defendants’ LLMs can spit out derivative works: material that is based on, mimics, summarizes, or paraphrases Plaintiffs’ works, and harms the market for them.

3. Unfairly, and perversely, without Plaintiffs’ copyrighted works on which to “train” their LLMs, Defendants would have no commercial product with which to damage—if not usurp—the market for these professional authors’ works. Defendants’ willful copying thus makes Plaintiffs’ works into engines of their own destruction.

4. Defendants could have “trained” their LLMs on works in the public domain. They could have paid a reasonable licensing fee to use copyrighted works. What Defendants could not do was evade the Copyright Act altogether to power their lucrative commercial endeavor, taking whatever datasets of relatively recent books they could get their hands on without authorization. There is nothing fair about this. Defendants’ unauthorized use of Plaintiffs’ copyrighted works thus presents a straightforward infringement case applying well-established law to well-recognized copyright harms.

5. Defendants’ chief executive Sam Altman has told Congress that he shares Plaintiffs’ concerns. According to Altman, “Ensuring that the creator economy continues to be vibrant is an important priority for OpenAI. ... OpenAI does not want to replace creators. We want our systems to be used to empower creativity, and to support and augment the essential humanity of artists and creators.”1 Altman testified that OpenAI “think[s] that creators deserve control over how their creations are used” and that “content creators, content owners, need to benefit from this technology.”2 Altman also has represented that OpenAI has “licens[ed] content directly from content owners” for “training” purposes.3 Not so from Plaintiffs. As to them, Altman and Defendants have proved unwilling to turn these words into actions.

6. Plaintiffs thus seek damages for the lost opportunity to license their works, and for the market usurpation Defendants have enabled by making Plaintiffs unwilling accomplices in their own replacement; and a permanent injunction to prevent these harms from recurring.

7. Plaintiffs complain of Defendants, on personal knowledge as to matters relating to Plaintiffs themselves, and on information and belief based on their and their counsels’ reasonable investigation as to all other matters, as follows:

1 Sam Altman, Questions for the Record, at 9–10 (June 22, 2023), available at https://www.judiciary.senate.gov/imo/media/doc/2023-05-16_-_qfr_responses_-_altman.pdf (last accessed Sept. 19, 2023).
2 Oversight of A.I.: Rules for Artificial Intelligence: Hearing Before the S. Judiciary Comm. Subcomm. on Privacy, Tech. and the Law, 118th Cong. (2023) (testimony of OpenAI CEO Sam Altman), available at https://techpolicy.press/transcript-senate-judiciary-subcommittee-hearing-on-oversight-of-ai (last accessed Sept. 19, 2023).
3 Altman, Questions for the Record, supra, at 10.

JURISDICTION AND VENUE

8. The Court has jurisdiction over the subject matter of this action under 28 U.S.C. § 1338(a) because the action arises under the Copyright Act.

9. Venue is proper in this District under 28 U.S.C. § 1391(b)(2) because a substantial part of the events giving rise to Plaintiffs’ claim occurred here.

10. Venue is also proper in this District under 28 U.S.C. § 1400(a) because Defendants or their agents reside or may be found here.

PARTIES

I. Plaintiffs

11. Plaintiff The Authors Guild is a nonprofit 501(c)(6) organization based in New York, New York.

12. Plaintiff David Baldacci is an author and a resident of Vienna, Virginia.

13. Plaintiff Mary Bly is an author and a resident of New York, New York.

14. Plaintiff Michael Connelly is an author and a resident of Tampa, Florida.

15. Plaintiff Sylvia Day is an author and a resident of Las Vegas, Nevada.

16. Plaintiff Jonathan Franzen is an author and a resident of Santa Cruz, California.

17. Plaintiff John Grisham is an author and a resident of Charlottesville, Virginia.

18. Plaintiff Elin Hilderbrand is an author and a resident of Nantucket Island, Massachusetts.

19. Plaintiff Christina Baker Kline is an author and a resident of New York, New York.

20. Plaintiff Maya Shanbhag Lang is an author and a resident of Sleepy Hollow, New York.

21. Plaintiff Victor LaValle is an author and a resident of New York, New York.

22. Plaintiff George R.R. Martin is an author and a resident of Santa Fe, New Mexico.

23. Plaintiff Jodi Picoult is an author and a resident of Hanover, New Hampshire.

24. Plaintiff Douglas Preston is an author and a resident of Santa Fe, New Mexico.

25. Plaintiff Roxana Robinson is an author and a resident of New York, New York.

26. Plaintiff George Saunders is an author and a resident of Santa Monica, California.

27. Plaintiff Scott Turow is an author and a resident of Naples, Florida.

28. Plaintiff Rachel Vail is an author and a resident of New York, New York.

II. Defendants (Collectively, “OpenAI” or “the OpenAI Defendants”)

29. The OpenAI Defendants are a tangled thicket of interlocking entities that generally keep from the public what the precise relationships among them are and what function each entity serves within the larger corporate structure.

30. Defendant OpenAI Inc. is a Delaware corporation with its principal place of business in San Francisco, California.

31. OpenAI Inc. was founded as a nonprofit research entity in 2015.

32. Defendant OpenAI LP is a limited partnership formed under the laws of Delaware with its principal place of business in San Francisco, California.

33. OpenAI LP was founded in 2019 to be the profit-making arm of OpenAI.

34. OpenAI LP’s general partner is OpenAI Inc., via Defendant OpenAI GP LLC.

35. Defendant OpenAI GP LLC is a limited liability company formed under the laws of Delaware with its principal place of business in San Francisco, California.

36. OpenAI GP LLC is the vehicle through which OpenAI Inc. controls OpenAI LP.

37. Defendant OpenAI LLC is a limited liability company formed under the laws of Delaware with its principal place of business in San Francisco, California.

38. OpenAI LLC owns some or all of the services and products provided by OpenAI.

39. The sole member of OpenAI LLC is Defendant OpenAI OpCo LLC.

40. Defendant OpenAI OpCo LLC is a limited liability company formed under the laws of Delaware with its principal place of business in San Francisco, California.

41. The sole member of OpenAI OpCo LLC is Defendant OpenAI Global LLC.

42. Defendant OpenAI Global LLC is a limited liability company formed under the laws of Delaware with its principal place of business in San Francisco, California.

43. OpenAI Global’s members are Microsoft Corporation and Defendant OAI Corporation LLC.

44. Defendant OAI Corporation LLC is a limited liability company formed under the laws of Delaware with its principal place of business in San Francisco, California.

45. OAI Corporation’s only member is Defendant OpenAI Holdings LLC.

46. Defendant OpenAI Holdings LLC is a limited liability company formed under the laws of Delaware with its principal place of business in San Francisco, California.

47. The members of OpenAI Holdings LLC are Defendant OpenAI Inc. and Aestas LLC, an OpenAI-related limited liability company that is not a defendant here.

48. Defendant OpenAI Startup Fund I LP is a limited partnership formed under the laws of Delaware with its principal place of business in San Francisco, California.

49. Defendant OpenAI Startup Fund GP I LLC is a limited liability company formed under the laws of Delaware with its principal place of business in San Francisco, California.

50. Defendant OpenAI Startup Fund Management LLC is a limited liability company formed under the laws of Delaware with its principal place of business in San Francisco, California.

GENERAL FACTUAL ALLEGATIONS

I. Generative AI and Large Language Models

51. The terms “artificial intelligence” or “AI” refer generally to computer systems designed to imitate human cognitive functions.

52. The terms “generative artificial intelligence” or “generative AI” refer specifically to systems that are capable of generating “new” content in response to user inputs called “prompts.”

53. For example, the user of a generative AI system capable of generating images from text prompts might input the prompt, “A lawyer working at her desk.” The system would then attempt to construct the prompted image. Similarly, the user of a generative AI system capable of generating text from text prompts might input the prompt, “Tell me a story about a lawyer working at her desk.” The system would then attempt to generate the prompted text.

54. Recent generative AI systems designed to recognize input text and generate output text are built on “large language models” or “LLMs.”

55. LLMs use predictive algorithms that are designed to detect statistical patterns in the text datasets on which they are “trained” and, on the basis of these patterns, generate responses to user prompts. “Training” an LLM refers to the process by which the parameters that define an LLM’s behavior are adjusted through the LLM’s ingestion and analysis of large “training” datasets.

56. Once “trained,” the LLM analyzes the relationships among words in an input prompt and generates a response that is an approximation of similar relationships among words in the LLM’s “training” data. In this way, LLMs can be capable of generating sentences, paragraphs, and even complete texts, from cover letters to novels.

57. “Training” an LLM requires supplying the LLM with large amounts of text for the LLM to ingest—the more text, the better. That is, in part, the large in large language model.

58. As the U.S. Patent and Trademark Office has observed, LLM “training” “almost by definition involve[s] the reproduction of entire works or substantial portions thereof.”4

59. “Training” in this context is therefore a technical-sounding euphemism for “copying and ingesting.”

60. The quality of the LLM (that is, its capacity to generate human-seeming responses to prompts) is dependent on the quality of the datasets used to “train” the LLM.

61. Professionally authored, edited, and published books—such as those authored by Plaintiffs here—are an especially important source of LLM “training” data.

62. As one group of AI researchers (not affiliated with Defendants) has observed, “[b]ooks are a rich source of both fine-grained information, how a character, an object or a scene looks like, as well as high-level semantics, what someone is thinking, feeling and how these states evolve through a story.”5

63. In other words, books are the high-quality materials Defendants want, need, and have therefore outright pilfered to develop generative AI products that produce high-quality results: text that appears to have been written by a human writer.

4 U.S. Patent & Trademark Office, Public Views on Artificial Intelligence and Intellectual Property Policy 29 (2020), available at https://www.uspto.gov/sites/default/files/documents/USPTO_AI-Report_2020-10-07.pdf (last accessed Sept. 19, 2023).
5 Yukun Zhu et al., Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books 1 (2015), available at https://arxiv.org/pdf/1506.06724.pdf (last accessed Sept. 19, 2023).

64. This use is highly commercial.

II. OpenAI’s Willful Infringement of Plaintiffs’ Copyrights

A. OpenAI

65. OpenAI (specifically, Defendant OpenAI Inc.) was founded in 2015 as a non-profit organization with the self-professed goal of researching and developing AI tools “unconstrained by a need to generate financial return.”6

66. Four years later, in 2019, OpenAI relaunched itself (specifically, through Defendant OpenAI GP LLC and Defendant OpenAI LP) as a for-profit enterprise.

67. Investments began pouring in. Microsoft Corporation, one of the world’s largest technology companies, invested $1 billion in 2019, an estimated $2 billion in 2021, and a staggering $10 billion in 2023, for a total investment of $13 billion.

68. Industry observers currently value OpenAI at around $29 billion.

B. GPT-N and ChatGPT

69. OpenAI’s LLMs are collectively referred to as “GPT-N,” which stands for “Generative Pre-trained Transformer” (a specific type of LLM architecture), followed by a version number.

70. GPT-3 was released in 2020 and exclusively licensed to Microsoft the same year.

71. OpenAI further refined GPT-3 into GPT-3.5, which was released in 2022.

72. In November 2022, OpenAI released ChatGPT, a consumer-facing chatbot application built on GPT-3.5.

6 OpenAI, Introducing OpenAI (Dec. 11, 2015), https://openai.com/blog/introducing-openai (last accessed Sept. 19, 2023).

73. ChatGPT’s popularity exploded virtually overnight. By January 2023, less than three months after its release, the application had an estimated 100 million monthly active users, making it one of the fastest-growing consumer applications in history.

74. GPT-4, the successor to GPT-3.5, was released in March 2023.

75. GPT-4 underlies OpenAI’s new subscription-based chatbot, called ChatGPT Plus, which is available to consumers for $20 per month.

76. Defendants intend to earn billions of dollars from this technology.

77. When announcing the release of ChatGPT Enterprise, a subscription-based high-capability GPT-4 application targeted for corporate clients, in August 2023, Defendants claimed that teams in “over 80% of Fortune 500 companies” were using its products.7

78. GPT-4 also underlies Microsoft’s Bing Chat product, offered through its Bing Internet search engine.

C. Knowingly “Training” GPT-N on Copyrighted Books

79. OpenAI does not disclose or publicize with specificity what datasets GPT-3, GPT-3.5, or GPT-4 were “trained” on. Despite its name, OpenAI treats that information as proprietary.

80. To “train” its LLMs—including GPT-3, GPT-3.5, and GPT-4—OpenAI has reproduced copyrighted books—including copyrighted books authored by Plaintiffs here—without their authors’ consent.

81. OpenAI has admitted as much.

7 OpenAI, Introducing ChatGPT Enterprise (Aug. 28, 2023), https://openai.com/blog/introducing-chatgpt-enterprise (last accessed Sept. 19, 2023).

82. OpenAI has admitted that it has “trained” its LLMs on “large, publicly available datasets that include copyrighted works.”8

83. Again: OpenAI’s “training” data is “derived from existing publicly accessible ‘corpora’ ... of data that include copyrighted works.”9

84. OpenAI has admitted that “training” LLMs “require[s] large amounts of data,” and that “analyzing large corpora” of data “necessarily involves first making copies of the data to be analyzed.”10

85. OpenAI has admitted that, if it refrained from using copyrighted works in its LLMs’ “training,” it would “lead to significant reductions in model quality.”11

86. Accordingly, OpenAI has openly admitted to reproducing copyrighted works in the course of “training” its LLMs because such reproduction is central to the quality of its products.

87. ChatGPT itself admits as much. In response to a query submitted to it in January 2023, the chatbot responded,

It is possible that some of the books used to train me were under copyright. However, my training data was sourced from various publicly available sources on the internet, and it is likely that some of the books included in my training dataset were not authorized to be used. ... If any copyrighted material was included in my training data, it would have been used without the knowledge or consent of the copyright holder.

8 OpenAI, Comment Regarding Request for Comments on Intellectual Property Protection for Artificial Intelligence Innovation, U.S. Patent and Trademark Office Dkt. No. PTO-C-2019-0038, at 1 (2019), available at https://www.uspto.gov/sites/default/files/documents/OpenAI_RFC-84-FR-58141.pdf (last accessed Sept. 19, 2023).
9 Id. at 2.
10 Id.
11 Id. at 7 n.33.

88. Until very recently, ChatGPT could be prompted to return quotations of text from copyrighted books with a good degree of accuracy, suggesting that the underlying LLM must have ingested these books in their entireties during its “training.”

89. Now, however, ChatGPT generally responds to such prompts with the statement, “I can’t provide verbatim excerpts from copyrighted texts.” Thus, while ChatGPT previously provided such excerpts and in principle retains the capacity to do so, it has been restrained from doing so, if only temporarily, by its programmers.

90. In light of its timing, this apparent revision of ChatGPT’s output rules is likely a response to the type of activism on behalf of authors exemplified by the Open Letter addressed to OpenAI and other companies by Plaintiff The Authors Guild, which is discussed further below.

91. Instead of “verbatim excerpts,” ChatGPT now offers to produce a summary of the copyrighted book, which usually contains details not available in reviews and other publicly available material—again suggesting that the underlying LLM must have ingested the entire book during its “training.”

92. OpenAI is characteristically opaque about where and how it procured the entirety of these books, including Plaintiffs’ copyrighted works.

93. OpenAI has discussed limited details about the datasets used to “train” GPT-3.

94. OpenAI admits that among the “training” datasets it used to “train” the model were “Common Crawl,” and two “high-quality,” “internet-based books corpora” which it calls “Books1” and “Books2.”12

12 Tom B. Brown et al., Language Models Are Few-Shot Learners 8 (2020), available at https://arxiv.org/pdf/2005.14165.pdf (last accessed Sept. 19, 2023).

95. Common Crawl is a vast and growing corpus of “raw web page data, metadata extracts, and text extracts” scraped from billions of web pages. It is widely used in “training” LLMs, and has been used to “train,” in addition to GPT-N, Meta’s LLaMA and Google’s BERT. It is known to contain text from books copied from pirate sites.13

96. OpenAI refuses to discuss the source or sources of the Books2 dataset.

97. Some independent AI researchers suspect that Books2 contains or consists of ebook files downloaded from large pirate book repositories such as Library Genesis or “LibGen,” “which offers a vast repository of pirated text.”14

98. LibGen is already known to this Court as a notorious copyright infringer.15

99. Other possible candidates for Books2’s sources include Z-Library, another large pirate book repository that hosts more than 11 million books, and pirate torrent trackers like Bibliotik, which allow users to download ebooks in bulk.

100. Websites linked to Z-Library appear in the Common Crawl corpus and have been included in the “training” dataset of other LLMs.16

101. Z-Library’s Internet domains were seized by the FBI in February 2022, only months after OpenAI stopped “training” GPT-3.5 in September 2021.

13 Alex Hern, Fresh Concerns Raised Over Sources of Training Material for AI Systems, The Guardian (Apr. 20, 2023), available at https://www.theguardian.com/technology/2023/apr/20/fresh-concerns-training-material-ai-systems-facist-pirated-malicious (last accessed Sept. 19, 2023).
14 Kate Knibbs, The Battle Over Books3 Could Change AI Forever, Wired (Sept. 4, 2023), available at https://www.wired.com/story/battle-over-books3 (last accessed Sept. 19, 2023).
15 See Elsevier Inc. v. Sci-Hub, No. 1:15-cv-4282-RWS (S.D.N.Y.).
16 Kevin Schaul et al., Inside the Secret List of Websites that Make AI Like ChatGPT Sound Smart, The Washington Post (Apr. 19, 2023), available at https://www.washingtonpost.com/technology/interactive/2023/ai-chatbot-learning (last accessed Sept. 19, 2023).

102. The disclosed size of the Books2 dataset (55 billion “tokens,” the basic units of textual meaning such as words, syllables, numbers, and punctuation marks) suggests it comprises over 100,000 books.

103. “Books3,” a dataset compiled by an independent AI researcher, is comprised of nearly 200,000 books downloaded from Bibliotik, and has been used by other AI developers to “train” LLMs.

104. The similarities in the sizes of Books2 and Books3, and the fact that there are only a few pirate repositories on the Internet that allow bulk ebook downloads, strongly indicate that the books contained in Books2 were also obtained from one of the notorious repositories discussed above.

105. OpenAI has not discussed the datasets used to “train” GPT-3.5, GPT-4, or their source or sources.

106. GPT-3.5 and GPT-4 are significantly more powerful than their predecessors. GPT-3.5 contains roughly 200 billion parameters, and GPT-4 contains roughly 1.75 trillion parameters, compared to GPT-3’s roughly 175 billion parameters.

107. The growth in power and sophistication from GPT-3 to GPT-4 suggests a correlative growth in the size of the “training” datasets, raising the inference that one or more very large sources of pirated ebooks discussed above must have been used to “train” GPT-4.

108. There is no other way OpenAI could have obtained the volume of books required to “train” a powerful LLM like GPT-4.

109. In short, OpenAI admits it needs17 and uses18 “large, publicly available datasets that include copyrighted works”19—and specifically, “high-quality”20 copyrighted books—to “train” its LLMs; pirated sources of such “training” data are readily available; and one or more of these sources contain Plaintiffs’ works.

110. Defendants knew that their “training” data included texts protected by copyright but willfully proceeded without obtaining authorization.

D. GPT-N’s and ChatGPT’s Harm to Authors

111. ChatGPT and the LLMs underlying it seriously threaten the livelihood of the very authors—including Plaintiffs here, as discussed specifically below—on whose works they were “trained” without the authors’ consent.

112. Goldman Sachs estimates that generative AI could replace 300 million full-time jobs in the near future, or one-fourth of the labor currently performed in the United States and Europe.

113. Already, writers report losing income from copywriting, journalism, and online content writing—important sources of income for many book authors. The Authors Guild’s most recent authors earnings study21 shows a median writing-related income for full-time authors of just over $20,000, and that full-time traditional authors earn only half of that from their books. The rest comes from activities like content writing—work that is starting to dry up as a result of generative AI systems like ChatGPT.

17 OpenAI, Comment Regarding Request for Comments, supra, at 7 n.33.
18 Id. at 2.
19 Id. at 1.
20 Brown et al., Few-Shot Learners, supra, at 8.
21 Authors Guild, Top Takeaways from the 2023 Author Income Survey (2023), https://authorsguild.org/news/top-takeaways-from-2023-author-income-survey (last accessed Sept. 19, 2023).

114. An Authors Guild member who writes marketing and web content reported losing 75 percent of their work as a result of clients switching to AI.

115. Another content writer (unrelated to the Plaintiffs here) told the Washington Post that half of his annual income (generated by ten client contracts) was erased when the clients elected to use ChatGPT instead.22

116. Recently, the owner of popular online publications such as Gizmodo, Deadspin, The Root, Jezebel and The Onion came under fire for publishing an error-riddled, AI-generated piece, leading the Writers Guild of America to demand “an immediate end of AI-generated articles” on the company’s properties.23

117. In a survey of authors conducted by The Authors Guild in March 2023 (early in ChatGPT’s lifecycle), 69 percent of respondents said they consider generative AI a threat to their profession, and 90 percent said they believe that writers should be compensated for the use of their work in “training” AI.

118. As explained above, until recently, ChatGPT provided verbatim quotes of copyrighted text. Currently, it instead readily offers to produce summaries of such text. These summaries are themselves derivative works, the creation of which is inherently based on the original unlawfully copied work and could be—but for ChatGPT—licensed by the authors of the underlying works to willing, paying licensees.

22 Pranshu Verma & Gerrit De Vynck, ChatGPT Took Their Jobs. Now They Walk Dogs and Fix Air Conditioners, The Washington Post (June 2, 2023), available at https://www.washingtonpost.com/technology/2023/06/02/ai-taking-jobs (last accessed Sept. 19, 2023).
23 Todd Spangler, WGA Slams G/O Media’s AI-Generated Articles as ‘Existential Threat to Journalism,’ Demands Company End Practice, Variety (July 12, 2023), https://variety.com/2023/digital/news/wga-slams-go-media-ai-generated-articles-existential-threat-1235668496 (last accessed Sept. 19, 2023).

119. ChatGPT creates other outputs that are derivative of authors’ copyrighted works. Businesses are sprouting up to sell prompts that allow users to enter the world of an author’s books and create derivative stories within that world. For example, a business called Socialdraft offers long prompts that lead ChatGPT to engage in “conversations” with popular fiction authors like Plaintiff Grisham, Plaintiff Martin, Margaret Atwood, Dan Brown, and others about their works, as well as prompts that promise to help customers “Craft Bestselling Books with AI.”

120. OpenAI allows third parties to build their own applications on top of ChatGPT by making it available through an “application programming interface” or “API.” Applications integrated with the API allow users to generate works of fiction, including books and stories similar to those of Plaintiffs and other authors.24

121. ChatGPT is being used to generate low-quality ebooks, impersonating authors, and displacing human-authored books.25 For example, author Jane Friedman discovered “a cache of garbage books” written under her name for sale on Amazon.26

122. Plaintiffs and other professional writers are thus reasonably concerned about the risks OpenAI’s conduct poses to their livelihoods specifically and the literary arts generally.

24 Adi Robertson, I Tried the AI Novel-Writing Tool Everyone Hates, and It’s Better than I Expected, The Verge (May 24, 2023), https://www.theverge.com/2023/5/24/23732252/sudowrite-story-engine-ai-generated-cyberpunk-novella (last accessed Sept. 19, 2023).
25 Jules Roscoe, AI-Generated Books of Nonsense Are All Over Amazon’s Bestseller Lists, Vice (June 28, 2023), https://www.vice.com/en/article/v7b774/ai-generated-books-of-nonsense-are-all-over-amazons-bestseller-lists (last accessed Sept. 19, 2023).
26 Pilar Melendez, Famous Author Jane Friedman Finds AI Fakes Being Sold Under Her Name on Amazon, The Daily Beast (Aug. 8, 2023), https://www.thedailybeast.com/author-jane-friedman-finds-ai-fakes-being-sold-under-her-name-on-amazon (last accessed Sept. 19, 2023).

123. Plaintiff The Authors Guild, among others, has given voice to these concerns on behalf of working American authors.

124. The Authors Guild is the nation’s oldest and largest professional writers’ organization. It “exists to support working writers and their ability to earn a living from authorship.”27

125. Among other principles, The Authors Guild holds that “authors should not be required to write or speak without compensation. Writers, like all professionals, should receive fair payment for thei
