Revisiting copyright infringement in AI inputs and outputs – TechnoLlama


Do androids dream of electric copyright?

I’ve been busy in recent weeks writing my last project for the summer, a book chapter on AI and copyright for the next edition of Law, Policy and the Internet, edited by Lilian Edwards. This is a very popular textbook series because it presents chapters on a variety of Internet Law subjects in an approachable, yet thorough, manner. Textbook chapters are their own thing; in my opinion, a good chapter deals less with the minutiae of current legal affairs and more with the larger picture. So my original plan was to base some of the chapter on existing writing, particularly blog posts and a couple of other articles, bringing it all together into an engaging and approachable chapter. The first part is finished, but when I started looking for inspiration for the copyright infringement section, I realised that quite a lot of what I had written is already out of date. I also realised that this is perhaps a perfect time to write something that brings us up to speed with the core issues, and where I see things going forward. This is because I did not want to touch too much on the ongoing cases; instead, I intend to look at the bigger picture that is emerging after three years of generative AI.

So much has happened since the last time I wrote specifically about this subject that I wanted to write a blog post bringing my main ideas together, and this will eventually be used as part of the new chapter. Yes, I do use blog posts as seeds for my articles; that is one way I can justify all of the time I spend writing stuff for this blog.

Before we begin, it’s useful to quickly refresh what we mean when we’re talking about inputs and outputs in the context of AI and copyright. As the name implies, the input phase is the stage of training an AI model using data; this encompasses a complex array of operations that extract information from the data in order to train a model. “Input” in this context is also a term used for the training data itself, so it is useful to distinguish whether we are talking about the entire training process or just the data. For the most part, I will use the term here to mean the training phase itself, and not just the input data.

The result of the input phase is a trained model that can be prompted, and this is what produces outputs. An output is whatever result the model gives in response to those instructions; this can be text, music, sound, voice, video, or images, or even a combination of those.

Inputs and copyright

Training an AI model requires vast amounts of data, mindbogglingly huge datasets comprising practically everything you can think of that can be digitised. In order for there to be training, a copy is always going to have to be made at some stage of the input process. Some datasets fall within the public domain, or are not protected by copyright because they are not protected works, such as raw data, but a large percentage of data is indeed protected by copyright, and making a copy is an exclusive right of the author.

It is this fact that has prompted dozens of lawsuits, and most of them concern the input phase. The legal question is relatively simple: AI companies trawl the Web for content, which goes into a dataset that is then used to train a model; this is, at face value, an act that infringes copyright, so the owner of a work used in training can sue for copyright infringement. The end.

The problem is that this idealised version does not truly describe how the training process works. For starters, when we think of someone making a copy, we immediately imagine a copy made directly onto a storage medium, and those copies then being kept. This is very far from the truth.

Training an AI model begins with the collection of large datasets, and yes, there may be copies of works there. These datasets can come from a variety of sources depending on the task: large amounts of text from websites and books, images from public repositories and the Web, audio from speech recordings, or a mix of modalities. So for the purposes of copyright infringement we really have to consider the fact that a single work has, on its own, negligible value to the overall dataset; what matters is the collective, the large amount of data, the billions of words. This blog is in the training data, and the individual value of even the entire corpus of articles here is minuscule compared to the overall training data. The goal of gathering inputs is to compile a representative sample of the kind of information the model will need to understand or generate, and this is why each individual work tends to be less important to the whole. Once collected, the data often goes through a cleaning and preprocessing stage, where duplicates, errors, or irrelevant entries are removed, and the remaining content is converted into a format suitable for computational use. It is this extensive cleaning process that also makes the copyright argument for individual works more difficult: works are no longer recognisable in their original format, they’re data that is put together with other similar data.
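To give a flavour of what this cleaning stage involves, here is a minimal sketch in Python of the kind of deduplication and filtering pass described above. The corpus, thresholds, and function name are invented for illustration; real pipelines are vastly more sophisticated, adding near-duplicate detection, language filtering, and tokenisation:

```python
import hashlib

def clean_corpus(documents):
    """Toy preprocessing pass: drop exact duplicates and obviously
    broken fragments, after normalising whitespace."""
    seen = set()
    cleaned = []
    for doc in documents:
        text = " ".join(doc.split())           # normalise whitespace
        if len(text) < 20:                     # discard fragments/errors
            continue
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:                     # exact-duplicate removal
            continue
        seen.add(digest)
        cleaned.append(text)
    return cleaned

corpus = [
    "A long enough sample sentence about llamas and copyright law.",
    "A long  enough sample sentence about llamas and copyright law.",  # duplicate once normalised
    "too short",
]
print(clean_corpus(corpus))  # only one document survives
```

Even this toy version shows why individual works become hard to trace: what survives is normalised, deduplicated text stripped of its original presentation.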

Once the data is ready, the model is trained through a process of optimisation. This usually involves feeding the data into a neural network and adjusting the internal parameters (referred to as weights), so that the model can correctly predict outputs, such as the next word in a sentence, or what a llama looks like. I’m horribly simplifying a complex process, but you get the idea. During training, the model makes predictions and compares them to the correct answers using a “loss function,” which quantifies how wrong it is. The model then adjusts itself using algorithms like gradient descent to minimise that error. This is a process that requires some human oversight, but it is mostly automated and repeated millions of times over the training data. The result is a trained model that captures patterns, associations, and statistical relationships in the data, and can apply this knowledge to new, unseen inputs. The final model can then be fine-tuned for specific tasks. This is also something that is often completely misunderstood in copyright debates about AI: models do not keep copies of the training data once trained.
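The predict–loss–update loop described above can be sketched in a few lines of Python. This is a toy single-weight model with made-up numbers, not a real neural network, but it shows gradient descent doing its thing: predict, measure the error with a squared loss, and nudge the weight in the direction that reduces it.

```python
# Training data: inputs x and targets y, where the "true" rule is y = 2x.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

w = 0.0                 # the model's single weight, starting from scratch
learning_rate = 0.05

for step in range(200):                  # repeated many times over the data
    for x, y in data:
        prediction = w * x
        error = prediction - y           # loss = error**2 (squared loss)
        gradient = 2 * error * x         # derivative of the loss w.r.t. w
        w -= learning_rate * gradient    # gradient descent update

print(round(w, 3))  # converges towards 2.0
```

Note what the trained “model” ends up holding: a single number close to 2.0, not the training pairs themselves. That is the intuition, scaled up by many orders of magnitude, behind the point that models do not keep copies of their training data.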

So what has been happening with the input debate is that we are seeing a disparity between the public perception of what AI training is, and the practice. But it cannot be denied that at some point in the input phase there is going to be a copy made of a work that could be protected by copyright, and this is why there are so many ongoing cases: that copy acts as the original sin of all AI. The legal question is whether that copy can be legally justified. In the US, the answer so far appears to be mixed; the courts seem to be leaning towards fair use in some cases, but there is a serious argument to be made that the source of the data really does matter. By using shadow libraries and other pirated content, the companies have made their fair use argument more difficult to sustain.

A similar thing has been taking place in the EU, where the regulatory seas have been choppy, but the waters are starting to calm down. The EU has published the General-Purpose AI Code of Practice, a complementary document to the AI Act in which signatory parties commit to comply with the rules set forth. Not every AI developer has to do this, however; firstly, it only involves the providers of GPAI models, which are defined in the Act as models that present a systemic risk, these being models trained using more than 10²⁵ FLOPs of computing power. Also, this is not part of the legislation itself, so it is a voluntary code of practice that signals compliance with the Act. So far Google, Anthropic, and OpenAI have signalled their willingness to sign, while Meta has said that it will not.

The Code of Practice has been criticised by rightsholders, but it is an interesting step forward in the AI Wars because it signals two main things: firstly, it cements the EU’s opt-out regime, at least for the foreseeable future; secondly, it forbids signatories from circumventing technological protection measures. This is important because, together with early rulings from the US, we are starting to see a trans-Atlantic agreement on a few things, particularly that some training will be fine (respecting fair use in the US and opt-outs in the EU), but using torrents and content that hasn’t been made available by the rightsholder will not be deemed acceptable.

It is still early days for the input debate, with lots of lawsuits to go, and I am expecting many different decisions around the world, but for now we may be witnessing hints of a way forward. The reproductions at the input phase may prove to be less problematic than previously thought. We may also see regulators jump in to try to impose levies on AI training, which is the latest proposal from Axel Voss in the EU.

Outputs and copyright

As mentioned already, most of the litigation has been taking place in the input phase. But I think that going forward we are going to see more action with outputs, particularly as models become better in the future.

When it comes to outputs, we get a potential copyright infringement when a trained model substantially reproduces something in its training data without authorisation. While this means that there is definitely an intrinsic link between inputs and outputs, they can be dealt with as separate legal issues. So, for example, a work could be in the training data but the model may not be capable of reproducing it as an output, be it because it hasn’t memorised it (a technical term), or because the AI interface has built-in safeguards to prevent such reproduction, for example, filtering out character names in an image model.
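A minimal sketch of what such a safeguard could look like, assuming the crudest possible approach: a prompt filter that blocks requests naming protected characters before they ever reach the model. The block list and function name here are invented for illustration; real systems use far more elaborate classifiers, and also filter on the output side.

```python
# Hypothetical block list of protected character names (illustrative only).
BLOCKED_TERMS = {"mario", "pikachu"}

def is_prompt_allowed(prompt: str) -> bool:
    """Return False if the prompt mentions any blocked character name."""
    words = prompt.lower().split()
    return not any(term in words for term in BLOCKED_TERMS)

print(is_prompt_allowed("draw a friendly llama"))         # True
print(is_prompt_allowed("draw Pikachu on a skateboard"))  # False
```

The point for the legal analysis is that such safeguards sit in the interface, not in the model itself: the model may well have memorised a character, yet the system as deployed never produces it.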

During the early months of the generative AI revolution, output reproduction was still quite difficult. While it was possible for text models to sometimes reproduce parts of famous texts, this was uncommon, and it remains very difficult unless there is some prompting and feeding of a work to the model. The end result was that most of the early copyright infringement cases could not show any reproduction of their works as outputs, and in the few instances where this was presented, the outputs turned out to have been heavily prompted by the copyright holders. This led to the aforementioned reliance on input copyright claims.

But the output issue is becoming a more important legal question because models, particularly image models, have become extremely good at reproducing works from their training data, making the copyright infringement question more pertinent. With a little bit of prompting it is possible to easily produce an output featuring popular culture characters, something I have discussed before. Initially this was not litigated, but that all changed with the recent lawsuit from Disney against Midjourney, which is almost entirely based on infringing outputs. It turns out that the image generator Midjourney can easily make outputs that are substantial reproductions of their characters. I am on record saying that this is a very difficult lawsuit for Midjourney to defend.

But what this lawsuit signals is a shift in the legal arguments, because if models are increasingly able to reproduce training data, then more output lawsuits will follow. What is likely to happen is that we may then encounter more arguments based on secondary infringement; this is because the person committing the infringement, say by generating an image of Mario or Pikachu, will be the user. This will make most of the debate hinge on whether AI models are merely tools, and whether any of the defences used by intermediaries and platform providers would apply.

This raises the question of how courts will approach liability in this new phase of generative AI litigation. If a model can be shown to reliably reproduce protected content at the request of a user, then courts will have to weigh whether the model provider has taken sufficient steps to prevent foreseeable infringement. That might involve evaluating the effectiveness of filtering systems, the role of fine-tuning, or the extent to which the provider has encouraged or facilitated infringing use. In that regard, comparisons with existing doctrines on intermediary liability, such as those developed for hosting platforms or search engines, may be useful, but only to a point. Generative AI sits in a more ambiguous position, since the model does not merely index or store content but synthesises new material based on underlying data, some of which may be protected by copyright.

Ultimately, the output cases may prove more decisive in shaping the future AI landscape than the input litigation. They speak directly to what many people experience as the most tangible and immediate issue: the fact that AI models can produce things that look, sound, or read very much like existing copyrighted works. If those outputs are ruled to infringe, and providers are found liable for enabling that, we could see significant changes in how models are trained, released, and made available to the public. That might include tighter controls on user prompts, more aggressive filtering, or a move toward more closed, licensed training sets. In short, outputs may be the battlefield where the broader social legitimacy of generative AI is ultimately tested.

Concluding

I’ve really enjoyed writing about AI and copyright over the past few years. It brings together nearly all of my nerdy obsessions: science fiction, technology, and the ever-fascinating problem of copyright, which has been central to much of my academic life. But we’re only at the beginning. The more this area evolves, the more I realise how complex and multifaceted it is, and how much more there is to unpack. I can’t promise to cover every twist and turn, but I’ll certainly keep thinking it through and sharing where I land.

On the bright side, there’s no shortage of things to write about. I remember when months would go by without a single topic jumping out at me, but now I’m juggling five different ideas at once.

So little time, so much to do! If only there were a technology that could help bring all these ideas into the world… oh wait!

