UK inertia on LLMs and copyright is ‘de facto endorsement’

A committee of UK legislators has slammed the government’s response to alleged copyright theft as a “de facto endorsement” of the way tech companies build large language models.

Since OpenAI’s ChatGPT launched in late 2022, industry pundits, the media, and governments have insisted that the LLMs powering the technology are the next leap forward in computing.

However, holders of copyrighted material – text, image, and audio data – have complained that their intellectual property has been stolen on an industrial scale to create them. Model builders argue their use of the material is fair.

In a letter to science and technology minister Michelle Donelan, the House of Lords Select Committee on Communications and Digital said the government’s record on copyright is “inadequate and deteriorating.”

“The issues with copyright are manifesting right now and problematic business models are fast becoming entrenched and normalised. It is worth exploring whether these trends suggest a few larger publishers will obtain some licensing deals while a longer tail of smaller outlets lose out,” the letter said.

“Government’s reticence to take meaningful action amounts to a de facto endorsement of tech firms’ practices. That reflects poorly on this Government’s commitment to British businesses, fair play and the equal application of the law,” it said.

The committee contrasted the government’s approach to copyright and AI with its work on AI safety: it has backed a new AI Safety Institute with high-level attention from the prime minister and £400 million in funding.

“On copyright, the Government has set up and subsequently disbanded a failed series of roundtables led by the Intellectual Property Office. The commitment to ministerial engagement is helpful but the next steps have been left unclear. While well-intentioned, this is simply not enough,” the letter said.

The missive follows a committee report [PDF] that said tech firms were using copyrighted material without permission and “reaping vast financial rewards.”

“The current legal framework is failing to ensure these outcomes occur and the Government has a duty to act. It cannot sit on its hands for the next decade and hope the courts will provide an answer,” the report said.

In its response, the government said the UK had “world-leading protections for copyright and intellectual property.”

In the AI Regulation White Paper consultation, the government said the Intellectual Property Office would engage stakeholders on copyright and AI as part of a working group aiming to agree a voluntary code.

“However, many of the issues relating to copyright and AI are challenging and the working group was not able to reach a consensus,” the response said.

Developers of LLMs are in the crosshairs of legal action for alleged breaches of copyright. Earlier this week, eight American newspapers sued Microsoft and OpenAI, claiming the tech duo unlawfully used the publishers’ copyrighted articles to train AI models. Microsoft and OpenAI were sued last year by the New York Times on similar grounds.

Appearing before a Lords committee last year, Dan Conway, CEO of the UK’s Publishers Association, said LLMs were infringing copyrighted content on an “absolutely massive scale.”

“We know this in the publishing industry because of the Books3 database, which lists 120,000 pirated book titles, which we know have been ingested by large language models,” he said. “We know that the content is being ingested on an absolutely massive scale by large language models. LLMs do infringe copyright at multiple parts of the process in terms of when they collect this information, how they store this information, and how they handle it. The copyright law is being broken on a massive scale.”

Owen Larter, director of public policy at Microsoft’s Office of Responsible AI, said: “It’s really important to understand that you need to train these large language models on large data sets if you’re going to get them to perform effectively, if you’re going to allow them to be safe and secure … There are also some competition issues [in making sure] that training of large models is available to everyone. If you go too far down a path where it’s very hard to obtain data to train models, then all of a sudden, the ability to do so will only be the preserve of very large companies.” ®

Lindsay Clark