As AI creative tools become widespread, the question of copyright in AI creations has also taken centre-stage. But while copyright nerds obsess over the authorship question, the issue getting more attention from artists is that of copyright infringement.
AI is trained on data. In the case of graphic tools such as Imagen, Stable Diffusion, DALL·E, and MidJourney, the training sets consist of terabytes of images: photographs, paintings, drawings, logos, and anything else with a graphical representation. The complaint by some artists is that these models (and the accompanying commercialisation) are being built on the backs of human artists, photographers, and designers, who are not seeing any benefit from these business models. The language gets very animated in some forums and chat rooms, often using terms such as “theft” and “exploitation”. So is this copyright infringement? Are OpenAI and Google about to get sued by artists and photographers from around the world?
This is a question with two parts: the input phase and the output phase.
The explosion in the sophistication of AI tools has come from two important developments: first, the improvement and variety of training models, and, most importantly, the availability of large training datasets. The first source of works is open access or public domain material: works licensed under permissive licences such as Creative Commons (example here), or works that are in the public domain (example here). But of course the amount of such data is limited, so researchers also have access to many other datasets, some of them free (lists here and here).
But researchers may also want to try to scrape images from the largest image repository in the world: the Internet. Can they do that? There’s growing recognition that mining data (in this case in the shape of images) is allowed under copyright as fair use or fair dealing. The earliest source of an exception for training an AI can be found in the United States in the shape of the Google Books case. This was a long-running dispute between the Authors Guild and Google over scanning books for a service called Google Print (later renamed Google Book Search). After a lengthy battle involving settlements and appeals, the court decided that Google’s scanning was fair use. The transformative nature of the scanning played a big part in the decision, as did the fact that the copying would not affect the market for book sales online; the purpose of the Google database was to make the works available to libraries, and to provide snippets in search results.
While Google Books does not deal specifically with machine learning, it is similar in many ways to what happens in most machine learning training: large amounts of works are copied to produce something different.
In the EU, the Digital Single Market Directive has also opened the door for wider adoption of text and data mining. In Art 3 the Directive sets out a new copyright exception for “reproductions and extractions made by research organisations and cultural heritage institutions in order to carry out, for the purposes of scientific research, text and data mining of works or other subject matter to which they have lawful access.” Art 4 extends this permission to commercial organisations for any purpose, as long as they have lawful access to the work, and also gives rightsholders the opportunity to opt out of this exception.
The end result of the above is that a large number of commercial entities operating both in the US and Europe are able to scrape images from the Internet for the purpose of data mining, and they can make reproductions and extractions of such materials. Furthermore, other countries such as the UK and Japan have similar exceptions.
Between open data, public domain images, and the data mining exceptions, we can assume that the vast majority of training for machine learning is lawful. While it is possible to imagine some data being gathered and used unlawfully, I cannot imagine that the biggest organisations involved in AI are infringing the law in this respect.
Assuming a lot of the inputs that go into training AI are lawful, then what about the outputs? Could a work that has been generated by an AI trained on existing works infringe copyright?
This is trickier to answer, and it may very well depend on what happens during and after the training, and how the outputs are generated, so we have to look in more detail under the hood at machine learning methods. A big warning first: obviously I’m no ML expert, and while I have been reading a lot of the basic literature for a few years now, my understanding is that of a hobbyist. If I misrepresent the technology, the fault is mine, and I will be delighted to correct any mistakes. I will of course be over-simplifying some stuff.
The main idea behind creative AI is to train a system so that it can generate outputs that statistically resemble their training data. In other words, to generate poetry, you train the AI on poetry; if you want it to generate faces, you train it on faces. There are various models for generative AI, but the two main ones are generative adversarial networks (GANs) and diffusion models.
A GAN is a model that sets two agents against each other (hence the adversarial) in order to generate better outputs. There is a generator, which produces candidate outputs, and there is a discriminator, which tries to tell those outputs apart from real examples in the training data. The generator is trained to fool the discriminator, so over time its outputs become harder and harder to distinguish from the training set.
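The adversarial loop can be sketched in a heavily simplified, one-dimensional toy (not a real image GAN): here the “training data” is just numbers drawn around 4.0, the generator is a learnable shift of random noise, and the discriminator is a logistic classifier. All names and numbers are illustrative, including the small weight decay that keeps the adversarial game from oscillating.

```python
import numpy as np

rng = np.random.default_rng(42)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# "Training data": samples from the distribution the generator must imitate.
def sample_real(n):
    return rng.normal(4.0, 1.0, n)

b = 0.0          # generator parameter: shifts unit Gaussian noise by b
w, c = 0.0, 0.0  # discriminator: d(x) = sigmoid(w*x + c)

lr, batch = 0.05, 64
for step in range(2000):
    real = sample_real(batch)
    fake = b + rng.normal(0.0, 1.0, batch)

    # Discriminator step: learn to tell real samples from generated ones
    # (ascend log d(real) + log(1 - d(fake)), with a little weight decay).
    d_real, d_fake = sigmoid(w * real + c), sigmoid(w * fake + c)
    w += lr * (np.mean((1 - d_real) * real - d_fake * fake) - 0.1 * w)
    c += lr * (np.mean((1 - d_real) - d_fake) - 0.1 * c)

    # Generator step: shift the output so the discriminator can no longer
    # tell it apart from the real data (ascend log d(fake)).
    d_fake = sigmoid(w * fake + c)
    b += lr * np.mean(1 - d_fake) * w

print(f"learned offset b = {b:.2f}")  # should end up near the real mean of 4.0
```

Outputs that fool the discriminator are kept, in the sense that the generator’s parameters drift towards whatever the discriminator cannot distinguish from the training data; outputs that do not resemble it generate a strong learning signal to correct them.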
For a relatively long time GANs were the king of machine learning, as they managed to produce some passable output (see all of these cats that don’t exist). But GANs have limitations: the discriminator can become too good, so that no output makes the grade, or the generator can learn to produce only a narrow range of outputs that reliably pass the discriminator.
The most successful recent examples of AI, such as Imagen, DALL·E, Stable Diffusion, and MidJourney, use the diffusion model, which reportedly produces superior results. Diffusion works by taking an input, for example an image, and corrupting it by progressively adding noise; training then teaches a neural network to put the image back together by reversing the corruption process.
The most important takeaway from the perspective of a legal analysis is that a generative AI does not reproduce the inputs exactly, even if you ask for a specific one. For example, I asked MidJourney to generate “Starry Night by Vincent Van Gogh”. The result was this:
It looks like it, but it’s not the exact same thing. It’s almost as if the AI is drawing it from memory, which in some way it is: it is reconstructing what Starry Night looks like.
Moreover, the developers of these tools are aware of the potential pitfalls of producing exact replicas of art in their training datasets. OpenAI admitted that this was a problem in some of the earlier iterations of the program, and they now filter out specific instances of this happening. According to OpenAI, this was mostly taking place with low-quality images, which were easier for the neural network to memorise, as well as with images that were heavily repeated in the datasets. They mitigated this by training the system to recognise duplicates, and DALL·E no longer does image regurgitation.
So, if there is no direct infringement, and the systems are not reproducing works in their entirety, is there still a possibility of copyright infringement? Most people have been writing prompts naming artists who are long dead, and whose works are in the public domain. So the AIs will easily produce works in the style of Van Gogh, Rembrandt, Henri Rousseau, Gauguin, Matisse, etc. Just put the name of the artist in your prompt, and even the specific artwork that you want reproduced, and the AI will do it. But these works are in the public domain, so nobody cares. What about artists who are still alive, with their works under copyright?
Here things get trickier. It is clear that one can produce art in the style of a living artist. For a test I used Simon Stålenhag, a young artist who produces very iconic and easily-recognisable art. I prompted DALL·E for “Crumbling ruined city, sci-fi digital art by Simon Stålenhag”, and it produced a few images that were very much in the style of Stålenhag.
The problem is that style and “look and feel” are not copyrightable. Sure, the image is clearly inspired by his work, but it would be a stretch to say that it infringes his copyright. Evidence of this is that if you go to any digital art repository and search for Stålenhag, you will find hundreds of images from human artists referencing his work (see for example Behance and ArtStation).
Copyright protects the expression of an idea, not the idea itself (the famous idea/expression dichotomy). It will be difficult in my opinion for an artist to successfully sue for copyright infringement as their style is not protected, and as mentioned above, it is unlikely that an AI tool will reproduce a work verbatim (can you use verbatim for images? I digress).
The best case against an AI tool may be when it reproduces a well-known character, say, Darth Vader, Mario, or Pikachu, or a picture of Groot and Baby Yoda. But while I could easily see this as potential infringement of an existing character, it is unlikely that this would be pursued by the copyright owner unless there is a good reason to do so. It is unlikely that a person or a company would make these things available commercially, and in that sense, it would be no different from all of the infringement that already exists on the Internet made by humans.
This blog post is just scratching the surface of the conflicts that are to come with regard to AI and copyright. I am sure that at some point an artist will try to sue one of the companies working in this area for copyright infringement. Assuming that the input phase is fine and the datasets used are legitimate, most infringement lawsuits may end up taking place in the output phase. And it is here that I do not think there will be substantive reproduction to warrant copyright infringement. On the contrary, the technology itself is designed to avoid such direct infringement from happening.
So what we will see is people trying to argue for styles, and here a decision may rest entirely on the specifics of the case. I am not convinced that a court would find infringement, but it’s still early days.
In the meantime, I leave you with a picture of llamas in the style of Klimt’s The Kiss.