Authors Accuse OpenAI of Using Pirate Sites to Train ChatGPT

DarksideIH · July 2, 2023


Generative AI is a revolutionary technology that's expected to change society as we know it but, in parallel, it raises many copyright infringement concerns. This week, book authors Paul Tremblay and Mona Awad filed a lawsuit against OpenAI, accusing the company of using pirated books to train its ChatGPT models.

Generative AI models such as ChatGPT have captured the imagination of millions of people, offering a glimpse of what an AI-assisted future might look like.

The new technology also brings up novel copyright questions. Several rightsholders are worried, for example, that their work is being used to train AI without any form of compensation.

How these and other copyright questions will be dealt with is not entirely clear. Governments around the world are taking different approaches, with the U.S. Congress recently stating that it doesn’t plan to overreact. Meanwhile, rightsholders don’t intend to stand idly by.

Authors Sue OpenAI for Copyright Infringement

This week, authors Paul Tremblay and Mona Awad filed a class action lawsuit against OpenAI, accusing ChatGPT’s parent company of copyright infringement and violating the DMCA, among other things. According to the authors, ChatGPT was partly trained on their copyrighted works, without permission.

The proof for this claim is seemingly simple. The authors never gave OpenAI permission to use their works, yet ChatGPT can provide accurate summaries of their writings. This information must have come from somewhere.

“Indeed, when ChatGPT is prompted, ChatGPT generates summaries of Plaintiffs’ copyrighted works—something only possible if ChatGPT was trained on Plaintiffs’ copyrighted works,” the complaint reads.

Pirate Training

While these types of claims are not new, this week’s lawsuit alleges that OpenAI used pirate websites as training input. This potentially includes Z-Library, a shadow library of millions of pirated books that’s at the center of a criminal prosecution by the U.S. Department of Justice.

OpenAI hasn’t disclosed the datasets that ChatGPT is trained on, but an older paper references two databases: “Books1” and “Books2”. The first contains roughly 63,000 titles and the second around 294,000 titles.

These numbers are meaningless in isolation. However, the authors note that OpenAI must have used pirated resources, as legitimate databases with that many books don’t exist.

“The only ‘internet-based books corpora’ that have ever offered that much material are notorious ‘shadow library’ websites like Library Genesis (aka LibGen), Z-Library (aka Bok), Sci-Hub, and Bibliotik. The books aggregated by these websites have also been available in bulk via torrent systems.”

Based on these data points, the complaint concludes that OpenAI committed copyright infringement. As compensation, the plaintiffs demand statutory damages, which can reach $150,000 per work. Additional damages for the alleged removal of copyright management information, in violation of the DMCA, are also on the table.

AI, Piracy and Copyright

There is no direct evidence that OpenAI used pirate sites to train ChatGPT. That said, it is no secret that some AI projects have trained on pirated material in the past, as an excellent summary from Search Engine Journal highlights.

The mainstream media has picked up this issue too. The Washington Post previously reported that the “C4 data set,” which Google and Facebook used to train their AI models, included Z-Library and various other pirate sites.

“At least 27 other sites identified by the U.S. government as markets for piracy and counterfeits were present in the data set,” the article added.

The present lawsuit will be closely watched by AI enthusiasts and rightsholders. It may result in OpenAI having to disclose some of its training data, which would be interesting in its own right.

Even if it transpires that ChatGPT was trained with pirated books, the court would still have to decide whether that amounted to copyright infringement. Some experts believe that this type of AI training can be considered fair use.

Fair use protects transformative uses of copyrighted works that don’t compete with the original content. According to several experts, that defense could likely apply to AI training cases.

Silent Watcher · July 2, 2023

Avoid unnecessary posts such as 'Thank you', 'Welcome', etc. Such posts will be deleted and the user will be warned if it happens again. If caught spamming, the following actions apply:

First time - Warning
Second time - 5000 Points will be deducted
Third time - Ban for 7 days
Fourth time - Permanent Ban

If the post helped you, reward the user by reacting to the post.
