`
`Rachel Geman
`LIEFF CABRASER HEIMANN
` & BERNSTEIN, LLP
`250 Hudson Street, 8th Floor
`New York, NY 10013-1413
`Telephone: 212.355.9500
`rgeman@lchb.com
`
`Reilly T. Stoler (pro hac vice forthcoming)
`Ian R. Bensberg (pro hac vice forthcoming)
`LIEFF CABRASER HEIMANN
` & BERNSTEIN, LLP
`275 Battery Street, 29th Floor
`San Francisco, CA 94111-3339
`Telephone: 415.956.1000
`rstoler@lchb.com
`ibensberg@lchb.com
`
`Attorneys for Plaintiffs and the Proposed Class
`
`Scott J. Sholder
`CeCe M. Cole
`COWAN DEBAETS ABRAHAMS
` & SHEPPARD LLP
`41 Madison Avenue, 38th Floor
`New York, New York 10010
`Telephone: 212.974.7474
`ssholder@cdas.com
`ccole@cdas.com
`
`
`
`
`UNITED STATES DISTRICT COURT
`SOUTHERN DISTRICT OF NEW YORK
`
`No. 1:23-cv-8292
`
`
`CLASS ACTION COMPLAINT
`
`
`JURY TRIAL DEMANDED
`
`
`
`
`AUTHORS GUILD, DAVID BALDACCI,
`MARY BLY, MICHAEL CONNELLY, SYLVIA
`DAY, JONATHAN FRANZEN, JOHN
`GRISHAM, ELIN HILDERBRAND,
`CHRISTINA BAKER KLINE, MAYA
`SHANBHAG LANG, VICTOR LAVALLE,
`GEORGE R.R. MARTIN, JODI PICOULT,
`DOUGLAS PRESTON, ROXANA ROBINSON,
`GEORGE SAUNDERS, SCOTT TUROW, and
`RACHEL VAIL, individually and on behalf of
`others similarly situated,
`Plaintiffs,
`v.
`OPENAI INC., OPENAI LP, OPENAI LLC,
`OPENAI GP LLC, OPENAI OPCO LLC,
`OPENAI GLOBAL LLC, OAI CORPORATION
`LLC, OPENAI HOLDINGS LLC, OPENAI
`STARTUP FUND I LP, OPENAI STARTUP
`FUND GP I LLC, and OPENAI STARTUP
`FUND MANAGEMENT LLC,
`Defendants.
`
`
`
`
`
`Case 1:23-cv-08292 Document 1 Filed 09/19/23 Page 2 of 47
`
`INTRODUCTORY STATEMENT
`
`1.
`
`Plaintiffs, authors of a broad array of works of fiction, bring this action under the
`
`Copyright Act seeking redress for Defendants’ flagrant and harmful infringements of Plaintiffs’
`
`registered copyrights in written works of fiction. Defendants copied Plaintiffs’ works wholesale,
`
`without permission or consideration. Defendants then fed Plaintiffs’ copyrighted works into their
`
`“large language models” or “LLMs,” algorithms designed to output human-seeming text
`
`responses to users’ prompts and queries. These algorithms are at the heart of Defendants’
`
`massive commercial enterprise. And at the heart of these algorithms is systematic theft on a mass
`
`scale.
`
`2.
`
`Plaintiffs seek to represent a class of professional fiction writers whose works
`
`spring from their own minds and their creative literary expression. These authors’ livelihoods
`
`derive from the works they create. But Defendants’ LLMs endanger fiction writers’ ability to
`
`make a living, in that the LLMs allow anyone to generate—automatically and freely (or very
`
`cheaply)—texts that they would otherwise pay writers to create. Moreover, Defendants’ LLMs
`
`can spit out derivative works: material that is based on, mimics, summarizes, or paraphrases
`
`Plaintiffs’ works, and harms the market for them.
`
`3.
`
`Unfairly, and perversely, without Plaintiffs’ copyrighted works on which to
`
`“train” their LLMs, Defendants would have no commercial product with which to damage—if
`
`not usurp—the market for these professional authors’ works. Defendants’ willful copying thus
`
`makes Plaintiffs’ works into engines of their own destruction.
`
`4.
`
`Defendants could have “trained” their LLMs on works in the public domain. They
`
`could have paid a reasonable licensing fee to use copyrighted works. What Defendants could not
`
`do was evade the Copyright Act altogether to power their lucrative commercial endeavor, taking
`
`whatever datasets of relatively recent books they could get their hands on without authorization.
`
`
`
`There is nothing fair about this. Defendants’ unauthorized use of Plaintiffs’ copyrighted works
`
`thus presents a straightforward infringement case applying well-established law to well-
`
`recognized copyright harms.
`
`5.
`
`Defendants’ chief executive Sam Altman has told Congress that he shares
`
`Plaintiffs’ concerns. According to Altman, “Ensuring that the creator economy continues to be
`
`vibrant is an important priority for OpenAI. ... OpenAI does not want to replace creators. We
`
`want our systems to be used to empower creativity, and to support and augment the essential
`
`humanity of artists and creators.”1 Altman testified that OpenAI “think[s] that creators deserve
`
`control over how their creations are used” and that “content creators, content owners, need to
`
`benefit from this technology.”2 Altman also has represented that OpenAI has “licens[ed] content
`
`directly from content owners” for “training” purposes.3 Not so from Plaintiffs. As to them,
`
`Altman and Defendants have proved unwilling to turn these words into actions.
`
`6.
`
`Plaintiffs thus seek damages for the lost opportunity to license their works, and
`
`for the market usurpation Defendants have enabled by making Plaintiffs unwilling accomplices
`
`in their own replacement; and a permanent injunction to prevent these harms from recurring.
`
`7.
`
`Plaintiffs complain of Defendants, on personal knowledge as to matters relating to
`
`Plaintiffs themselves, and on information and belief based on their and their counsel’s reasonable
`
`investigation as to all other matters, as follows:
`
`
`1 Sam Altman, Questions for the Record, at 9–10 (June 22, 2023), available at
`https://www.judiciary.senate.gov/imo/media/doc/2023-05-16_-_qfr_responses_-_altman.pdf (last
`accessed Sept. 19, 2023).
`2 Oversight of A.I.: Rules for Artificial Intelligence: Hearing Before the S. Judiciary Comm.
`Subcomm. on Privacy, Tech. and the Law, 118th Cong. (2023) (testimony of OpenAI CEO Sam
`Altman), available at https://techpolicy.press/transcript-senate-judiciary-subcommittee-hearing-
`on-oversight-of-ai (last accessed Sept. 19, 2023).
`3 Altman, Questions for the Record, supra, at 10.
`
`
`
`JURISDICTION AND VENUE
`
`8.
`
`The Court has jurisdiction over the subject matter of this action under 28 U.S.C.
`
`§ 1338(a) because the action arises under the Copyright Act.
`
`9.
`
`Venue is proper in this District under 28 U.S.C. § 1391(b)(2) because a
`
`substantial part of the events giving rise to Plaintiffs’ claim occurred here.
`
`10.
`
`Venue is also proper in this District under 28 U.S.C. § 1400(a) because
`
`Defendants or their agents reside or may be found here.
`
`PARTIES
`
`I.
`
`Plaintiffs
`
`11.
`
`Plaintiff The Authors Guild is a nonprofit 501(c)(6) organization based in New
`
`York, New York.
`
`12.
`
`Plaintiff David Baldacci is an author and a resident of Vienna, Virginia.
`
`13.
`
`Plaintiff Mary Bly is an author and a resident of New York, New York.
`
`14.
`
`Plaintiff Michael Connelly is an author and a resident of Tampa, Florida.
`
`15.
`
`Plaintiff Sylvia Day is an author and a resident of Las Vegas, Nevada.
`
`16.
`
`Plaintiff Jonathan Franzen is an author and a resident of Santa Cruz, California.
`
`17.
`
`Plaintiff John Grisham is an author and a resident of Charlottesville, Virginia.
`
`18.
`
`Plaintiff Elin Hilderbrand is an author and a resident of Nantucket Island,
`
`Massachusetts.
`
`19.
`
`Plaintiff Christina Baker Kline is an author and a resident of New York, New
`
`York.
`
`20.
`
`Plaintiff Maya Shanbhag Lang is an author and a resident of Sleepy Hollow,
`
`New York.
`
`21.
`
`Plaintiff Victor LaValle is an author and a resident of New York, New York.
`
`
`
`22.
`
`Plaintiff George R.R. Martin is an author and a resident of Santa Fe, New
`
`Mexico.
`
`23.
`
`Plaintiff Jodi Picoult is an author and a resident of Hanover, New Hampshire.
`
`24.
`
`Plaintiff Douglas Preston is an author and a resident of Santa Fe, New Mexico.
`
`25.
`
`Plaintiff Roxana Robinson is an author and a resident of New York, New York.
`
`26.
`
`Plaintiff George Saunders is an author and a resident of Santa Monica,
`
`California.
`
`27.
`
`Plaintiff Scott Turow is an author and a resident of Naples, Florida.
`
`28.
`
`Plaintiff Rachel Vail is an author and a resident of New York, New York.
`
`II.
`
`Defendants (Collectively, “OpenAI” or “the OpenAI Defendants”)
`
`29.
`
`The OpenAI Defendants are a tangled thicket of interlocking entities that
`
`generally keep from the public what the precise relationships among them are and what function
`
`each entity serves within the larger corporate structure.
`
`30.
`
`Defendant OpenAI Inc. is a Delaware corporation with its principal place of
`
`business in San Francisco, California.
`
`31.
`
`OpenAI Inc. was founded as a nonprofit research entity in 2015.
`
`32.
`
`Defendant OpenAI LP is a limited partnership formed under the laws of
`
`Delaware with its principal place of business in San Francisco, California.
`
`33.
`
`OpenAI LP was founded in 2019 to be the profit-making arm of OpenAI.
`
`34.
`
`OpenAI LP’s general partner is OpenAI Inc., via Defendant OpenAI GP LLC.
`
`35.
`
`Defendant OpenAI GP LLC is a limited liability company formed under the laws
`
`of Delaware with its principal place of business in San Francisco, California.
`
`36.
`
`OpenAI GP LLC is the vehicle through which OpenAI Inc. controls OpenAI LP.
`
`
`
`37.
`
`Defendant OpenAI LLC is a limited liability company formed under the laws of
`
`Delaware with its principal place of business in San Francisco, California.
`
`38.
`
`OpenAI LLC owns some or all of the services and products provided by OpenAI.
`
`39.
`
`The sole member of OpenAI LLC is Defendant OpenAI OpCo LLC.
`
`40.
`
`Defendant OpenAI OpCo LLC is a limited liability company formed under the
`
`laws of Delaware with its principal place of business in San Francisco, California.
`
`41.
`
`The sole member of OpenAI OpCo LLC is Defendant OpenAI Global LLC.
`
`42.
`
`Defendant OpenAI Global LLC is a limited liability company formed under the
`
`laws of Delaware with its principal place of business in San Francisco, California.
`
`43.
`
`OpenAI Global’s members are Microsoft Corporation and Defendant OAI
`
`Corporation LLC.
`
`44.
`
`Defendant OAI Corporation LLC is a limited liability company formed under
`
`the laws of Delaware with its principal place of business in San Francisco, California.
`
`45.
`
`OAI Corporation’s only member is Defendant OpenAI Holdings LLC.
`
`46.
`
`Defendant OpenAI Holdings LLC is a limited liability company formed under
`
`the laws of Delaware with its principal place of business in San Francisco, California.
`
`47.
`
`The members of OpenAI Holdings LLC are Defendant OpenAI Inc. and Aestas
`
`LLC, an OpenAI-related limited liability company that is not a defendant here.
`
`48.
`
`Defendant OpenAI Startup Fund I LP is a limited partnership formed under the
`
`laws of Delaware with its principal place of business in San Francisco, California.
`
`49.
`
`Defendant OpenAI Startup Fund GP I LLC is a limited liability company
`
`formed under the laws of Delaware with its principal place of business in San Francisco,
`
`California.
`
`
`
`50.
`
`Defendant OpenAI Startup Fund Management LLC is a limited liability
`
`company formed under the laws of Delaware with its principal place of business in San
`
`Francisco, California.
`
`GENERAL FACTUAL ALLEGATIONS
`
`I.
`
`Generative AI and Large Language Models
`
`51.
`
`The terms “artificial intelligence” or “AI” refer generally to computer systems
`
`designed to imitate human cognitive functions.
`
`52.
`
`The terms “generative artificial intelligence” or “generative AI” refer specifically
`
`to systems that are capable of generating “new” content in response to user inputs called
`
`“prompts.”
`
`53.
`
`For example, the user of a generative AI system capable of generating images
`
`from text prompts might input the prompt, “A lawyer working at her desk.” The system would
`
`then attempt to construct the prompted image. Similarly, the user of a generative AI system
`
`capable of generating text from text prompts might input the prompt, “Tell me a story about a
`
`lawyer working at her desk.” The system would then attempt to generate the prompted text.
`
`54.
`
`Recent generative AI systems designed to recognize input text and generate
`
`output text are built on “large language models” or “LLMs.”
`
`55.
`
`LLMs use predictive algorithms that are designed to detect statistical patterns in
`
`the text datasets on which they are “trained” and, on the basis of these patterns, generate
`
`responses to user prompts. “Training” an LLM refers to the process by which the parameters that
`
`define an LLM’s behavior are adjusted through the LLM’s ingestion and analysis of large
`
`“training” datasets.
`
`56.
`
`Once “trained,” the LLM analyzes the relationships among words in an input
`
`prompt and generates a response that is an approximation of similar relationships among words
`
`
`
`in the LLM’s “training” data. In this way, LLMs can be capable of generating sentences,
`
`paragraphs, and even complete texts, from cover letters to novels.
`
`57.
`
`“Training” an LLM requires supplying the LLM with large amounts of text for
`
`the LLM to ingest—the more text, the better. That is, in part, the large in large language model.
`
`58.
`
`As the U.S. Patent and Trademark Office has observed, LLM “training” “almost
`
`by definition involve[s] the reproduction of entire works or substantial portions thereof.”4
`
`59.
`
`“Training” in this context is therefore a technical-sounding euphemism for
`
`“copying and ingesting.”
`
`60.
`
`The quality of the LLM (that is, its capacity to generate human-seeming responses
`
`to prompts) is dependent on the quality of the datasets used to “train” the LLM.
`
`61.
`
`Professionally authored, edited, and published books—such as those authored by
`
`Plaintiffs here—are an especially important source of LLM “training” data.
`
`62.
`
`As one group of AI researchers (not affiliated with Defendants) has observed,
`
`“[b]ooks are a rich source of both fine-grained information, how a character, an object or a scene
`
`looks like, as well as high-level semantics, what someone is thinking, feeling and how these
`
`states evolve through a story.”5
`
`63.
`
`In other words, books are the high-quality materials Defendants want, need, and
`
`have therefore outright pilfered to develop generative AI products that produce high-quality
`
`results: text that appears to have been written by a human writer.
`
`
`4 U.S. Patent & Trademark Office, Public Views on Artificial Intelligence and Intellectual
`Property Policy 29 (2020), available at
`https://www.uspto.gov/sites/default/files/documents/USPTO_AI-Report_2020-10-07.pdf (last
`accessed Sept. 19, 2023).
`5 Yukun Zhu et al., Aligning Books and Movies: Towards Story-like Visual Explanations by
`Watching Movies and Reading Books 1 (2015), available at https://arxiv.org/pdf/1506.06724.pdf
`(last accessed Sept. 19, 2023).
`
`
`
`64.
`
`This use is highly commercial.
`
`II.
`
`OpenAI’s Willful Infringement of Plaintiffs’ Copyrights
`
`A.
`
`OpenAI
`
`65.
`
`OpenAI (specifically, Defendant OpenAI Inc.) was founded in 2015 as a non-
`
`profit organization with the self-professed goal of researching and developing AI tools
`
`“unconstrained by a need to generate financial return.”6
`
`66.
`
`Four years later, in 2019, OpenAI relaunched itself (specifically, through
`
`Defendant OpenAI GP LLC and Defendant OpenAI LP) as a for-profit enterprise.
`
`67.
`
`Investments began pouring in. Microsoft Corporation, one of the world’s largest
`
`technology companies, invested $1 billion in 2019, an estimated $2 billion in 2021, and a
`
`staggering $10 billion in 2023, for a total investment of $13 billion.
`
`68.
`
`Industry observers currently value OpenAI at around $29 billion.
`
`B.
`
`GPT-N and ChatGPT
`
`69.
`
`OpenAI’s LLMs are collectively referred to as “GPT-N,” which stands for
`
`“Generative Pre-trained Transformer” (a specific type of LLM architecture), followed by a
`
`version number.
`
`70.
`
`GPT-3 was released in 2020 and exclusively licensed to Microsoft the same year.
`
`71.
`
`OpenAI further refined GPT-3 into GPT-3.5, which was released in 2022.
`
`72.
`
`In November 2022, OpenAI released ChatGPT, a consumer-facing chatbot
`
`application built on GPT-3.5.
`
`
`6 OpenAI, Introducing OpenAI (Dec. 11, 2015), https://openai.com/blog/introducing-openai (last
`accessed Sept. 19, 2023).
`
`
`
`73.
`
`ChatGPT’s popularity exploded virtually overnight. By January 2023, less than
`
`three months after its release, the application had an estimated 100 million monthly active users,
`
`making it one of the fastest-growing consumer applications in history.
`
`74.
`
`GPT-4, the successor to GPT-3.5, was released in March 2023.
`
`75.
`
`GPT-4 underlies OpenAI’s new subscription-based chatbot, called ChatGPT Plus,
`
`which is available to consumers for $20 per month.
`
`76.
`
`Defendants intend to earn billions of dollars from this technology.
`
`77. When announcing the release of ChatGPT Enterprise, a subscription-based high-
`
`capability GPT-4 application targeted for corporate clients, in August 2023, Defendants claimed
`
`that teams in “over 80% of Fortune 500 companies” were using its products.7
`
`78.
`
`GPT-4 also underlies Microsoft’s Bing Chat product, offered through its Bing
`
`Internet search engine.
`
`C.
`
`Knowingly “Training” GPT-N on Copyrighted Books
`
`79.
`
`OpenAI does not disclose or publicize with specificity what datasets GPT-3,
`
`GPT-3.5, or GPT-4 were “trained” on. Despite its name, OpenAI treats that information as
`
`proprietary.
`
`80.
`
`To “train” its LLMs—including GPT-3, GPT-3.5, and GPT-4—OpenAI has
`
`reproduced copyrighted books—including copyrighted books authored by Plaintiffs here—
`
`without their authors’ consent.
`
`81.
`
`OpenAI has admitted as much.
`
`
`7 OpenAI, Introducing ChatGPT Enterprise (Aug. 28, 2023),
`https://openai.com/blog/introducing-chatgpt-enterprise (last accessed Sept. 19, 2023).
`
`
`
`82.
`
`OpenAI has admitted that it has “trained” its LLMs on “large, publicly available
`
`datasets that include copyrighted works.”8
`
`83.
`
`Again: OpenAI’s “training” data is “derived from existing publicly accessible
`
`‘corpora’ ... of data that include copyrighted works.”9
`
`84.
`
`OpenAI has admitted that “training” LLMs “require[s] large amounts of data,”
`
`and that “analyzing large corpora” of data “necessarily involves first making copies of the data to
`
`be analyzed.”10
`
`85.
`
`OpenAI has admitted that, if it refrained from using copyrighted works in its
`
`LLMs’ “training,” it would “lead to significant reductions in model quality.”11
`
`86.
`
`Accordingly, OpenAI has openly admitted to reproducing copyrighted works in
`
`the course of “training” its LLMs because such reproduction is central to the quality of its
`
`products.
`
`87.
`
`ChatGPT itself admits as much. In response to a query submitted to it in January
`
`2023, the chatbot responded,
`
`It is possible that some of the books used to train me were under
`copyright. However, my training data was sourced from various
`publicly available sources on the internet, and it is likely that some
`of the books included in my training dataset were not authorized to
`be used. ... If any copyrighted material was included in my training
`data, it would have been used without the knowledge or consent of
`the copyright holder.
`
`
`8 OpenAI, Comment Regarding Request for Comments on Intellectual Property Protection for
`Artificial Intelligence Innovation, U.S. Patent and Trademark Office Dkt. No. PTO-C-2019-
`0038, at 1 (2019), available at
`https://www.uspto.gov/sites/default/files/documents/OpenAI_RFC-84-FR-58141.pdf (last
`accessed Sept. 19, 2023).
`9 Id. at 2.
`10 Id.
`11 Id. at 7 n.33.
`
`
`
`88.
`
`Until very recently, ChatGPT could be prompted to return quotations of text from
`
`copyrighted books with a good degree of accuracy, suggesting that the underlying LLM must
`
`have ingested these books in their entireties during its “training.”
`
`89.
`
`Now, however, ChatGPT generally responds to such prompts with the statement,
`
`“I can’t provide verbatim excerpts from copyrighted texts.” Thus, while ChatGPT previously
`
`provided such excerpts and in principle retains the capacity to do so, it has been restrained from
`
`doing so, if only temporarily, by its programmers.
`
`90.
`
`In light of its timing, this apparent revision of ChatGPT’s output rules is likely a
`
`response to the type of activism on behalf of authors exemplified by the Open Letter addressed to
`
`OpenAI and other companies by Plaintiff The Authors Guild, which is discussed further below.
`
`91.
`
`Instead of “verbatim excerpts,” ChatGPT now offers to produce a summary of the
`
`copyrighted book, which usually contains details not available in reviews and other publicly
`
`available material—again suggesting that the underlying LLM must have ingested the entire
`
`book during its “training.”
`
`92.
`
`OpenAI is characteristically opaque about where and how it procured the entirety
`
`of these books, including Plaintiffs’ copyrighted works.
`
`93.
`
`OpenAI has discussed limited details about the datasets used to “train” GPT-3.
`
`94.
`
`OpenAI admits that among the “training” datasets it used to “train” the model
`
`were “Common Crawl,” and two “high-quality,” “internet-based books corpora” which it calls
`
`“Books1” and “Books2.”12
`
`
`12 Tom B. Brown et al., Language Models Are Few-Shot Learners 8 (2020), available at
`https://arxiv.org/pdf/2005.14165.pdf (last accessed Sept. 19, 2023).
`
`
`
`95.
`
`Common Crawl is a vast and growing corpus of “raw web page data, metadata
`
`extracts, and text extracts” scraped from billions of web pages. It is widely used in “training”
`
`LLMs, and has been used to “train,” in addition to GPT-N, Meta’s LLaMA, and Google’s BERT.
`
`It is known to contain text from books copied from pirate sites.13
`
`96.
`
`OpenAI refuses to discuss the source or sources of the Books2 dataset.
`
`97.
`
`Some independent AI researchers suspect that Books2 contains or consists of
`
`ebook files downloaded from large pirate book repositories such as Library Genesis or
`
`“LibGen,” “which offers a vast repository of pirated text.”14
`
`98.
`
`LibGen is already known to this Court as a notorious copyright infringer.15
`
`99.
`
`Other possible candidates for Books2’s sources include Z-Library, another large
`
`pirate book repository that hosts more than 11 million books, and pirate torrent trackers like
`
`Bibliotik, which allow users to download ebooks in bulk.
`
`100. Websites linked to Z-Library appear in the Common Crawl corpus and have been
`
`included in the “training” dataset of other LLMs.16
`
`101. Z-Library’s Internet domains were seized by the FBI in February 2022, only
`
`months after OpenAI stopped “training” GPT-3.5 in September 2021.
`
`
`13 Alex Hern, Fresh Concerns Raised Over Sources of Training Material for AI Systems, The
`Guardian (Apr. 20, 2023), available at
`https://www.theguardian.com/technology/2023/apr/20/fresh-concerns-training-material-ai-
`systems-facist-pirated-malicious (last accessed Sept. 19, 2023).
`14 Kate Knibbs, The Battle Over Books3 Could Change AI Forever, Wired (Sept. 4, 2023),
`available at https://www.wired.com/story/battle-over-books3 (last accessed Sept. 19, 2023).
`15 See Elsevier Inc. v. Sci-Hub, No. 1:15-cv-4282-RWS (S.D.N.Y.).
`16 Kevin Schaul et al., Inside the Secret List of Websites that Make AI Like ChatGPT Sound
`Smart, The Washington Post (Apr. 19, 2023), available at
`https://www.washingtonpost.com/technology/interactive/2023/ai-chatbot-learning (last accessed
`Sept. 19, 2023).
`
`
`
`102. The disclosed size of the Books2 dataset (55 billion “tokens,” the basic units of
`
`textual meaning such as words, syllables, numbers, and punctuation marks) suggests it comprises
`
`over 100,000 books.
`
`103.
`
`“Books3,” a dataset compiled by an independent AI researcher, is comprised of
`
`nearly 200,000 books downloaded from Bibliotik, and has been used by other AI developers to
`
`“train” LLMs.
`
`104. The similarities in the sizes of Books2 and Books3, and the fact that there are only
`
`a few pirate repositories on the Internet that allow bulk ebook downloads, strongly indicate that
`
`the books contained in Books2 were also obtained from one of the notorious repositories
`
`discussed above.
`
`105. OpenAI has not discussed the datasets used to “train” GPT-3.5, GPT-4, or their
`
`source or sources.
`
`106. GPT-3.5 and GPT-4 are significantly more powerful than their predecessors.
`
`GPT-3.5 contains roughly 200 billion parameters, and GPT-4 contains roughly 1.75 trillion
`
`parameters, compared to GPT-3’s roughly 175 billion parameters.
`
`107. The growth in power and sophistication from GPT-3 to GPT-4 suggests a
`
`correlative growth in the size of the “training” datasets, raising the inference that one or more
`
`very large sources of pirated ebooks discussed above must have been used to “train” GPT-4.
`
`108. There is no other way OpenAI could have obtained the volume of books required
`
`to “train” a powerful LLM like GPT-4.
`
`
`
`109.
`
`In short, OpenAI admits it needs17 and uses18 “large, publicly available datasets
`
`that include copyrighted works”19—and specifically, “high-quality”20 copyrighted books—to
`
`“train” its LLMs; pirated sources of such “training” data are readily available; and one or more
`
`of these sources contain Plaintiffs’ works.
`
`110. Defendants knew that their “training” data included texts protected by copyright
`
`but willfully proceeded without obtaining authorization.
`
`D.
`
`GPT-N’s and ChatGPT’s Harm to Authors
`
`111. ChatGPT and the LLMs underlying it seriously threaten the livelihood of the very
`
`authors—including Plaintiffs here, as discussed specifically below—on whose works they were
`
`“trained” without the authors’ consent.
`
`112. Goldman Sachs estimates that generative AI could replace 300 million full-time
`
`jobs in the near future, or one-fourth of the labor currently performed in the United States and
`
`Europe.
`
`113. Already, writers report losing income from copywriting, journalism, and online
`
`content writing—important sources of income for many book authors. The Authors Guild’s most
`
`recent authors earnings study21 shows a median writing-related income for full-time authors of
`
`just over $20,000, and that full-time traditional authors earn only half of that from their books.
`
`
`17 OpenAI, Comment Regarding Request for Comments, supra, at 7 n.33.
`18 Id. at 2.
`19 Id. at 1.
`20 Brown et al., Few-Shot Learners, supra, at 8.
`21 Authors Guild, Top Takeaways from the 2023 Author Income Survey (2023),
`https://authorsguild.org/news/top-takeaways-from-2023-author-income-survey (last accessed
`Sept. 19, 2023).
`
`
`
`The rest comes from activities like content writing—work that is starting to dry up as a result of
`
`generative AI systems like ChatGPT.
`
`114. An Authors Guild member who writes marketing and web content reported losing
`
`75 percent of their work as a result of clients switching to AI.
`
`115. Another content writer (unrelated to the Plaintiffs here) told the Washington Post
`
`that half of his annual income (generated by ten client contracts) was erased when the clients
`
`elected to use ChatGPT instead.22
`
`116. Recently, the owner of popular online publications such as Gizmodo, Deadspin,
`
`The Root, Jezebel and The Onion came under fire for publishing an error-riddled, AI-generated
`
`piece, leading the Writers Guild of America to demand “an immediate end of AI-generated
`
`articles” on the company’s properties.23
`
`117.
`
`In a survey of authors conducted by The Authors Guild in March 2023 (early in
`
`ChatGPT’s lifecycle), 69 percent of respondents said they consider generative AI a threat to their
`
`profession, and 90 percent said they believe that writers should be compensated for the use of
`
`their work in “training” AI.
`
`118. As explained above, until recently, ChatGPT provided verbatim quotes of
`
`copyrighted text. Currently, it instead readily offers to produce summaries of such text. These
`
`summaries are themselves derivative works, the creation of which is inherently based on the
`
`
`22 Pranshu Verma & Gerrit De Vynck, ChatGPT Took Their Jobs. Now They Walk Dogs and Fix
`Air Conditioners, The Washington Post (June 2, 2023), available at
`https://www.washingtonpost.com/technology/2023/06/02/ai-taking-jobs (last accessed Sept. 19,
`2023).
`23 Todd Spangler, WGA Slams G/O Media’s AI-Generated Articles as ‘Existential Threat to
`Journalism,’ Demands Company End Practice, Variety (July 12, 2023),
`https://variety.com/2023/digital/news/wga-slams-go-media-ai-generated-articles-existential-
`threat-1235668496 (last accessed Sept. 19, 2023).
`
`
`
`original unlawfully copied work and could be—but for ChatGPT—licensed by the authors of the
`
`underlying works to willing, paying licensees.
`
`119. ChatGPT creates other outputs that are derivative of authors’ copyrighted works.
`
`Businesses are sprouting up to sell prompts that allow users to enter the world of an author’s
`
`books and create derivative stories within that world. For example, a business called Socialdraft
`
`offers long prompts that lead ChatGPT to engage in “conversations” with popular fiction authors
`
`like Plaintiff Grisham, Plaintiff Martin, Margaret Atwood, Dan Brown, and others about their
`
`works, as well as prompts that promise to help customers “Craft Bestselling Books with AI.”
`
`120. OpenAI allows third parties to build their own applications on top of ChatGPT by
`
`making it available through an “application programming interface” or “API.” Applications
`
`integrated with the API allow users to generate works of fiction, including books and stories
`
`similar to those of Plaintiffs and other authors.24
`
`121. ChatGPT is being used to generate low-quality ebooks, impersonating authors,
`
`and displacing human-authored books.25 For example, author Jane Friedman discovered “a cache
`
`of garbage books” written under her name for sale on Amazon.26
`
`122. Plaintiffs and other professional writers are thus reasonably concerned about the
`
`risks OpenAI’s conduct poses to their livelihoods specifically and the literary arts generally.
`
`
`24 Adi Robertson, I Tried the AI Novel-Writing Tool Everyone Hates, and It’s Better than I
`Expected, The Verge (May 24, 2023),
`https://www.theverge.com/2023/5/24/23732252/sudowrite-story-engine-ai-generated-cyberpunk-
`novella (last accessed Sept. 19, 2023).
`25 Jules Roscoe, AI-Generated Books of Nonsense Are All Over Amazon’s Bestseller Lists, Vice
`(June 28, 2023), https://www.vice.com/en/article/v7b774/ai-generated-books-of-nonsense-are-
`all-over-amazons-bestseller-lists (last accessed Sept. 19, 2023).
`26 Pilar Melendez, Famous Author Jane Friedman Finds AI Fakes Being Sold Under Her Name
`on Amazon, The Daily Beast (Aug. 8, 2023), https://www.thedailybeast.com/author-jane-
`friedman-finds-ai-fakes-being-sold-under-her-name-on-amazon (last accessed Sept. 19, 2023).
`
`
`
`123. Plaintiff The Authors Guild, among others, has given voice to these concerns on
`
`behalf of working American authors.
`
`124. The Authors Guild is the nation’s oldest and largest professional writers’
`
`organization. It “exists to support working writers and their ability to earn a living from
`
`authorship.”27
`
`125. Among other principles, The Authors Guild holds that “authors should not be
`
`required to write or speak without compensation. Writers, like all professionals, should receive
`
`fair payment for thei