Close to a dozen copyright infringement cases were filed against several AI companies in 2023, but one lawsuit in particular could have the widest implications.
Since OpenAI’s late-2022 release of ChatGPT, a chatbot built on its GPT-3.5 model that responds to nearly any question with coherent answers, its core technology, generative AI, has taken the tech world by storm. While the technology still makes errors in its responses, it has improved dramatically, and, as the New York Times alleges in its suit, OpenAI has trained its large language models on content from much of the internet.
“The most highly weighted dataset in GPT-3, Common Crawl, is a ‘copy of the Internet,’” of which the New York Times is the “most highly represented proprietary source (and the third overall, behind only Wikipedia and a database of U.S. patent documents),” the Times said in its initial complaint, which was filed in federal court in the Southern District of New York.
The complaint also contends that the defendants’ GenAI tools can “generate output that recites Times content verbatim, closely summarizes it, and mimics its expressive style, as demonstrated by scores of examples.”
“The New York Times case has a bunch of advantages,” said Robert Brauneis, a George Washington University Law School professor and co-director of the school’s Intellectual Property Program. “The attorneys have learned from some of the previous lawsuits.” One strategic move on the part of the Times, he said, was adding Microsoft as a defendant. “The early suits against OpenAI do not include Microsoft,” Brauneis said.
Another element that makes this a potentially stronger test case for copyright in the AI era is the vast scope of copyrighted material that the Times alleges has been infringed.
OpenAI has not yet filed a response to the suit, but in a blog post earlier this month, the company said it disagrees with claims in the New York Times’s suit, and that the media giant is not telling the whole story. OpenAI said that training its large language models with “publicly available internet materials” is considered to be fair use but that it also provides an opt-out process for publishers, which it said the Times adopted in August 2023.
John Hasnas, professor and director of Georgetown University’s Institute for the Study of Markets and Ethics, said that the opt-out feature for training data “is likely to have no effect because there will be so many reproductions of Times articles posted online by others that the AI will still encounter them.”
Another element that sets the Times’s case apart from some of the other suits is the ChatGPT output cited in the complaint.
“The New York Times has done a very thorough job of identifying not only a large volume of their content used to train the AI systems…but some good examples of the output from these AI systems that closely resembles New York Times’s content,” said Jason Bloom, a partner and chair of intellectual property litigation at the Haynes and Boone law firm.
The Times is also willing to spend on attorneys in its lawsuit. Unlike some of the class-action suits filed against OpenAI by authors and artists, the New York Times’s case doesn’t need to depend on the hope of a reward of legal fees or some portion of the damages in the end, said Brauneis, the George Washington University professor. “That makes it a little cleaner,” he added.
Of course, being in journalism, this columnist and likely other journalists and creators of content hope the media giant wins a big victory.
Yet the issue is not so clear-cut. A somewhat similar case filed against Alphabet Inc.’s Google by the Authors Guild in 2005 centered on the scanning of books for Google Books, and it did not fare well in the courts. An appeals court ruled that Google’s unauthorized scanning of copyrighted books was transformative, with only a limited public display of text, a key indicator of legal fair use, which allows limited use of copyrighted materials without permission for news reporting, criticism, educational purposes and research. In 2016, the U.S. Supreme Court let the appellate court ruling in the Google Books case stand.
In another instance, a case filed on behalf of a class by three artists against Stability AI and two other companies, Midjourney, Inc. and DeviantArt, Inc., has had many of its allegations knocked out. The presiding judge has significantly narrowed the lawsuit, granting most of the motions to dismiss, including a finding that two of the plaintiffs had not registered copyrights in the works they were seeking to defend.
“These are really interesting lawsuits,” said Hasnas, the Georgetown professor. “With a case this new, everything is speculative.” He said an important factor in the Times case will be how the technology actually works: whether the models are genuinely absorbing and learning from the content, or simply spitting out near-verbatim reproductions of stories, which he believes would be a copyright violation. “But I still do not see how the Times can make its case unless OpenAI is reproducing the content of the Times articles,” he added.
OpenAI recently told Bloomberg it was in talks with other media organizations, reportedly including CNN, Fox and Time, to license their content, even after its negotiations with the New York Times fell apart. The Times said in its suit that it had held discussions with OpenAI for months in an effort to reach a negotiated agreement permitting the use of its content in new digital products.
In Davos this week, OpenAI Chief Executive Sam Altman told Bloomberg that OpenAI does not want to train its systems on New York Times data. But that is unlikely to quash the ongoing legal efforts.
Brauneis, of George Washington University, said Altman’s comments should be taken with a grain of salt. “There are some terms on which OpenAI would be very happy to license it [the New York Times content], even given the current legal uncertainty of whether a license is necessary,” Brauneis said.
Added Bloom, of Haynes and Boone: “While an agreement by OpenAI not to use NYT data on a going-forward basis may resolve some issues, that would not necessarily resolve NYT’s claims for past use and whatever impact that may have on the current and future operations of the model.”