▲It's Time for Generative AI Companies to Pay the Price

蘇思鴻, Attorney
Published: 2024/07/03 11:03

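Generative AI models need large amounts of data, graphics and images for training, so AI and copyright are closely intertwined. In the United States, many companies have already sued AI companies to defend their copyrights. No such case has yet arisen in Taiwan, and we will be watching to see how things develop.
Authors, artists and others are filing lawsuits against generative AI companies for using their data in bulk to train AI systems without permission.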


Many people feel it's time AI companies paid for the free data lunches that have made their generative systems big and strong.

Recently, a bevy of legal actions demanding compensation from AI companies have been filed in the U.S. and Europe. The plaintiffs include authors, artists and major media organizations who have consistently expressed concern about AI stealing their work and producing mediocre derivatives.

An open letter from the Authors Guild -- signed by more than 8,500 authors, including Margaret Atwood, Dan Brown (author of The Da Vinci Code) and Jodi Picoult -- urges tech companies responsible for generative AI applications, such as ChatGPT and Bard, to cease using their works without proper authorization or compensation. The authors want companies to pay for the data they scraped for training -- the "food" for AI systems, endless meals for which there has been no bill, an arrangement worse than an all-you-can-eat buffet.

Authors also express concern that generative AI threatens their profession by flooding the market with machine-written content based on their work. This became a problem in recent months, when Amazon took action against AI authors spamming the bestseller list with generated works.

Prior to the release of the Authors Guild letter, two North American authors -- Mona Awad and Paul Tremblay -- filed a lawsuit against OpenAI, claiming the organization breached copyright law. The suit argued that because ChatGPT generated accurate summaries of the authors' works, it must have been trained on those works. They aren't the only ones. Author and comedian Sarah Silverman is also suing OpenAI and Meta for illegally reproducing her memoir, The Bedwetter, without permission. But that argument may not hold up in court because of the way generative AI works.

Individual authors and artists aren’t the only plaintiffs. In December 2023, The New York Times became the first major American news publication to sue OpenAI for using copyrighted works in AI development.

What is generative AI?

Generative AI is the technology that powers ChatGPT and Bard. Text-based generative AI uses algorithms to predict the likely next words in text and generates that text based on a prompt from the user. ChatGPT knows what to generate because it was trained on a large corpus of publicly available data from the internet. It learned patterns from the training and matches those patterns to prompts from the user.
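The idea of predicting likely next words can be illustrated with a deliberately tiny sketch. It is not how ChatGPT or Bard is built: those systems use large neural networks, whereas this toy simply counts which word follows which in a made-up corpus. The corpus and the function names `train_bigram_model` and `generate` are assumptions introduced only for illustration.

```python
from collections import Counter, defaultdict


def train_bigram_model(corpus):
    """Count, for every word, how often each other word follows it."""
    words = corpus.lower().split()
    followers = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        followers[current][nxt] += 1
    return followers


def generate(model, prompt, max_words=5):
    """Extend the prompt by repeatedly appending the most frequent follower."""
    words = prompt.lower().split()
    if not words:
        return ""
    for _ in range(max_words):
        candidates = model.get(words[-1])
        if not candidates:
            break  # the last word never appeared in training, so stop
        words.append(candidates.most_common(1)[0][0])
    return " ".join(words)


# Hypothetical training text standing in for "data from the internet".
TOY_CORPUS = (
    "the cat sat on the mat . "
    "the cat ate the fish . "
    "the cat sat on the sofa ."
)

model = train_bigram_model(TOY_CORPUS)
print(generate(model, "the cat", max_words=3))
```

Even in this toy, every generated word is drawn from patterns in the training text, which is why the source of that text sits at the center of the lawsuits.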

Generative AIs are usually black box AI systems, meaning nobody -- not even the programmers -- understands the exact steps the machine takes to go from input to output. Input goes in, the magic happens and output comes out.

All machine learning and generative AI tools use preexisting works of some kind.

Why are people suing?

People are suing AI companies over copyright. ChatGPT was trained on data from the internet, but that training took place without permission from the data's creators. For example, GPT-3 was trained on Wikipedia and Reddit, among other sources. At the same time, conversations about copyrighted works, as well as segments of them, could exist in that training material and give large language models enough context to accurately summarize those works.

On a larger scale, people are suing because AI is a black box, and it's impossible to know how it works on a granular level. The fear is that people will use AI to avoid taking responsibility for their decisions or the things it produces.

"If AI companies are allowed to market AI systems that are essentially black boxes, they could become the ultimate ends-justify-the-means devices," Matthew Butterick, one of the lawyers behind several of the lawsuits, wrote in his blog. "Before too long, we will not delegate decisions to AI systems because they perform better. Rather, we will delegate decisions to AI systems because they can get away with everything that we can't."

What AI lawsuits have been filed?

Numerous cases have been brought against generative AI companies regarding copyright and misuse. Here are some of the companies being sued.

GitHub, Microsoft and OpenAI

A class-action suit was filed against these companies involving GitHub's Copilot tool. The tool predictively generates code based on what the programmer has already written. The plaintiffs allege that Copilot copies and republishes code from GitHub without abiding by the requirements of the open source licenses that govern it, such as providing attribution. The complaint also includes claims related to GitHub's alleged mishandling of personal data and information, as well as claims of fraud. The complaint was filed in November 2022. Microsoft and GitHub have repeatedly tried to get the case dismissed.

Stability AI, Midjourney and DeviantArt

A complaint against these AI image generator providers was filed in January 2023. The plaintiffs alleged the systems directly infringe on plaintiffs' copyrights by training on works created by the plaintiffs and creating unauthorized derivative works. The complaint also takes issue with the fact that the tools can be used to generate work in the style of artists. The judge on the case, William Orrick, said he was inclined to dismiss the lawsuit.

Stability AI

In January 2023, Getty Images issued a complaint against Stability AI for allegedly copying and processing millions of images and associated metadata.

OpenAI

Authors Paul Tremblay and Mona Awad are suing OpenAI for allegedly infringing on authors' copyrights. Butterick is one of the attorneys representing the authors. The complaint estimated that more than 300,000 books were copied in OpenAI's training data. The suit seeks an unspecified amount of money. The case was filed in June 2023.

OpenAI and Microsoft

The New York Times is suing OpenAI and Microsoft for copyright infringement. The case, filed in December 2023, alleges that millions of New York Times articles were used to train and develop OpenAI’s chatbot and other technology, which now competes with the news organization as a source of reliable information. The case also alleges that OpenAI’s language models mimic the Times's style and recite its content verbatim. The Times is the first major American news outlet to sue OpenAI and Microsoft for copyright infringement. The Times approached the companies earlier in the year to discuss the copyright issue but never reached an agreement.

Eight other newspapers filed a lawsuit against OpenAI and Microsoft on April 30, 2024, alleging they've purloined millions of copyrighted news articles to train their AI. Newspapers included in the suit are The New York Daily News, Chicago Tribune, Denver Post, Mercury News, Orange County Register, St. Paul Pioneer-Press, Orlando Sentinel and South Florida Sun Sentinel.

Meta and OpenAI

Sarah Silverman's lawsuit against Meta and OpenAI alleged copyright infringement and said ChatGPT and Large Language Model Meta AI (Llama) were trained on illegally acquired data sets that contained her work. The suit alleges the books were acquired from shadow libraries, such as Library Genesis, Z-Library and Bibliotik, where the books can be torrented. Torrenting is a common method of downloading files without proper legal permission. Specifically, Meta's language model, Llama, was trained on a data set called the Pile, which uses data from Bibliotik, according to a paper from EleutherAI, the organization that assembled the Pile. The suit was filed in July 2023.

Google

A class-action lawsuit is being brought against Google for alleged misuse of personal information and copyright infringement. Some of the data specified in the lawsuit includes photos from dating websites, Spotify playlists, TikTok videos and books used to train Bard. The lawsuit, filed in July 2023, said Google could owe at least $5 billion. The plaintiffs have elected to remain anonymous.

These copyright cases against big tech companies aren't the first of their kind. The Authors Guild sued Google in 2005 for making digital copies of millions of books and providing snippets of them to the public. In 2015, an appeals court ultimately favored Google, saying the use was transformative and did not provide a market substitute for the books.

Suno and Udio

Sony Music Entertainment, Universal Music Group and Warner Records filed lawsuits against AI song-generator start-ups Suno and Udio in June 2024 for alleged copyright infringement. One lawsuit describes how Suno-generated songs sound very similar to Chuck Berry’s “Johnny B. Goode,” using prompts such as “1950s rock and roll,” “12-bar blues” and “energetic male vocalist.” The Udio lawsuit alleges something similar, saying many outputs sounded like Mariah Carey’s “All I Want for Christmas is You.” The record labels are seeking up to $150,000 for each work that was copied without permission.

What questions do these cases address?

The above lawsuits will be important in answering the following questions:

  • Does training a model on copyrighted material require a license? Generative AI systems make copies of the training materials as part of the training process. Does that interim copying require a license, or is it fair use?
  • Does generative AI output infringe on copyright for the materials on which the model was trained? If generative output constitutes a derivative work or infringes the training data's reproduction right, then it infringes on copyright. Courts will need to rule whether similarities in output and training data are derived from protected materials or unprotected materials. Who is liable for copyright infringement when AI infringes?
  • Does generative AI violate restrictions on removing, altering or falsifying copyright management information? The Digital Millennium Copyright Act provides restrictions on removal or alteration of copyright management information, such as watermarks. This is exemplified in the Stability AI case, where the plaintiff alleges that the watermark reproduced by Stable Diffusion on generated works constituted false copyright management information.
  • Does generating work in the style of someone violate that person's rights? This is known as the right of publicity, which varies from state to state. It prohibits the use of someone's likeness, name, image, voice or signature for commercial gain.
  • How do open source licenses apply to training AI models and distributing the resulting output? The plaintiffs in the Copilot case argued that republishing Copilot training materials without attribution -- and not making Copilot itself open source -- violates open source license terms.

As the cases continue to take shape and answers emerge, companies involved with generative AI tools should watch for guidance around the intersection of AI and intellectual property and check to see if they need risk mitigation strategies.

蘇思鴻, Attorney

  • Phone: 0920235793
  • Years in practice: more than 5
  • 蘇律師事務所