Communication to the public is about to shape the future of AI copyright

We are probably squarely in the middle of the AI copyright regulatory cycle, judging by previous technological inflexion points. The first half of this cycle has been rightly dominated by copyright input cases, that is, cases that try to discern whether training an AI with content found online, without authorisation from the owners, amounts to copyright infringement.

After dozens of lawsuits, we’re starting to see some hints of what the future may bring. All technological revolutions inevitably go through a period of legal uncertainty before settling into some form of equilibrium, and in the last week we’ve seen a couple of settlements between the parties, particularly in Bartz v Anthropic. This could open the floodgates to other similar settlements, but whatever happens, I am going to stick my neck out and argue that the input question will eventually be largely settled through a combination of exceptions and licensing deals. This is the point at which maximalists start screaming, but hear me out. Throughout the history of technology and its interaction with copyright, the law has tended to err on the side of technological advance, often to the detriment of legacy media. During the early days of the Internet, it seemed that the biggest threat faced by the fledgling communication technology was copyright, with intermediaries at risk of being sued practically out of existence. But business sense prevailed, and accommodations were made to allow the technology to exist, providing mechanisms for the take-down of infringing content. Similarly, during the P2P wars the providers of pirated copies lost all of their cases, but downloads and streaming were allowed to continue, setting the ground for the present media landscape. Piracy lost, but it was always going to lose. Streaming won.

So I strongly believe that the level of AI adoption is such that the technology will inevitably be allowed to continue practically unabated, with a few caveats and assurances to rights holders. Opt-out regimes will continue to gain ground, and AI developers will have to give up training on explicitly pirated content such as shadow libraries. Large media conglomerates will start training on their own catalogues, or allowing such training by others through licensing schemes. Most training will then become a matter of course, much like the current Internet has been allowed to exist. Revenue streams will likely emerge afterwards.

But that won’t be the end of AI litigation; things will inevitably shift towards specific infringement, so instead of inputs we will get output litigation. We are starting to see that with the cases brought against Midjourney by Disney and Warner Bros, which are specifically about infringing outputs. These cases are likely to be decided on the particulars, specifically on whether there are outputs that are indeed reproducing works owned by the claimants. I also think the issue of intermediary liability will become an important part of these cases, namely whether AI developers are to be treated more like providers of tools, and therefore immune from liability for infringements committed by their users.

But the other big element going forward is likely to be that of communication to the public. I have dealt with the subject in previous blog posts, but not in much depth, as I did not think that it was that relevant, even though other commentators gave it more prominence. In particular, some people who analysed the LAION case thought that there may have been a communication to the public there. LAION is a non-profit research organisation, and it put together a dataset called LAION-5B, a massive collection of image–text pairs scraped from the open web, designed to be freely accessible for training large AI models. Importantly, LAION does not host the images themselves but rather stores URLs pointing to their online locations, alongside associated metadata and embeddings. This detail was crucial in the LAION case, as the distinction between making copies and providing links lay at the heart of some of the legal arguments. There was an argument that amassing a large dataset of links to publicly-available images amounted to a communication to the public, but I personally never considered that to be a viable argument: if I told you to go and retrieve a specific image from the LAION-5B dataset, you would probably find it quite hard to do.
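To make the links-versus-copies point concrete, here is a minimal sketch of what a LAION-5B-style entry looks like. The field names are illustrative approximations rather than the actual schema, but the key point holds: the dataset stores a URL, a caption and some metadata, and anyone who wants the actual image has to fetch it from the original host.

```python
import urllib.request

# Illustrative approximation of a LAION-5B-style record (not the actual schema):
# the dataset holds a link and metadata, not the image itself.
record = {
    "url": "https://example.com/llama.jpg",        # pointer to the image's online location
    "text": "a llama standing in a green field",   # the caption paired with the image
    "width": 1024,
    "height": 768,
    "similarity": 0.31,                            # image-text similarity score
}

def fetch_image(rec: dict) -> bytes:
    """Retrieving the actual pixels means going back to the original host;
    the dataset itself only points to where the image lives."""
    with urllib.request.urlopen(rec["url"]) as response:
        return response.read()
```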

The state of the law on communication to the public in Europe remains, to put it mildly, unsettled. The CJEU has spent the last decade producing a sprawling body of case law on the subject, from Svensson to GS Media and VG Bild-Kunst, trying to delineate when linking, framing, streaming, or other acts of making content available amount to a new “communication to the public.” The Court has generally adopted a broad reading, often hinging liability on whether the communication was aimed at a “new public” or circumvented technological restrictions. This has created a patchwork of factors that must be assessed case by case, leaving intermediaries and innovators in a constant state of uncertainty. While the doctrine has given rights holders a powerful tool, its application to machine learning training datasets and AI outputs is still largely unexplored territory.

All of this is about to change with the case of Like Company v Google, which I have covered before, so I won’t go into detail here; I just want to remark that it is a sign of things to come, particularly if we assume that I am right and that the input cases will eventually be settled. Moreover, changes in the way in which AI tools are used are also likely to lead us towards further analysis of the issue of communication to the public. Allow me to elaborate.

Something that is being discussed more and more in policy circles is what happens with trained models. Data is used to train a model, and once trained the model does not change; all of this is the input phase, and, as discussed, I believe that this will eventually become settled legal matter. But that is not the only interaction that happens with a trained model: there are three other actions that can be taken with it, namely inference, fine-tuning, and retrieval-augmented generation (RAG). Inference is the process by which a trained model applies its learned parameters to new data in order to produce an output; in short, training is learning, inference is using what has been learned, and nothing about the model changes. Fine-tuning is a form of additional training: instead of freezing the model, you keep adjusting its parameters with new, often smaller and more specialised datasets. The goal is to adapt a general model to perform better on a specific task (say, fine-tuning an image model on the work of a specific artist). Finally, RAG is a technique that enables an LLM to access and incorporate external, up-to-date information into its responses. When a user asks a question that requires knowledge beyond the model’s initial training data, the model can perform a search (the “retrieval” step) to find relevant documents, web pages, or other data sources. It then uses this newly retrieved information to formulate a more accurate and comprehensive answer (the “generation” step).
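For readers who prefer to see the distinction in code, here is a minimal sketch of the three post-training interactions. It assumes a purely hypothetical `model` object and `search` function; the names are illustrative and do not correspond to any particular library’s API.

```python
def inference(model, prompt: str) -> str:
    # The frozen model applies its learned parameters to new input;
    # nothing about the model changes.
    return model.generate(prompt)

def fine_tune(model, specialised_dataset):
    # Additional training: the parameters keep being adjusted on a smaller,
    # more specialised dataset, so the model itself does change.
    model.train(specialised_dataset)
    return model

def rag_answer(model, question: str, search) -> str:
    # Retrieval step: fetch up-to-date documents the model was never trained on.
    documents = search(question, top_k=3)
    context = "\n\n".join(doc["text"] for doc in documents)
    # Generation step: the frozen model formulates an answer grounded in the
    # retrieved sources, typically citing or linking back to them.
    prompt = f"Answer the question using the sources below.\n\nSources:\n{context}\n\nQuestion: {question}"
    return model.generate(prompt)
```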

You may be wondering why all of this is relevant to the communication to the public issue. Well, assuming again that the trained model has passed all legal hurdles, either because it complies with a TDM exception, has been trained on public domain data, is fully licensed, or some combination of the above, the actions that can be taken with a model after it has been trained may also come under legal scrutiny. Could any of the above actions open developers to further liability claims?

This is not an idle question. The draft Voss report in the European Parliament’s JURI committee has already proposed a new transparency requirement for data used in inference, RAG, and fine-tuning. I think that there should be no such requirement, but that is beside the point for now. What interests me more is whether we will start seeing legal action directed not against the model trainers, but against those performing actions after training.

With regard to inference, I don’t think that there will be any issue: after all, as I have described, the model remains the same. With regard to fine-tuning, there may be issues with the source of the data used; it would be possible to have a perfectly legal model and still fall foul of copyright by using data for fine-tuning without permission. In my opinion, this will follow a similar analysis to that of training a model in the input phase, and therefore it could be solved in the same way: it’s either infringing or legal. Take your pick.

But the really interesting question going forward, and one where I think maximalists may try to attack next, is RAG. As mentioned, most LLMs nowadays use some form of information retrieval. This enhances accuracy, and it can also help reduce things like hallucinations. By accessing information in real time, a model is accessing data that it was not trained on, particularly more up-to-date information. Why is this legally relevant? You guessed it: that could be considered by some to be a communication to the public. So let’s say I go to Perplexity and ask for information about the LAION case in Germany. It will produce text that includes links (one of them is this blog, well done Perplexity!), which you can find here. Those links could potentially be considered a communication to the public of the articles it cites. I don’t think that this is the case, as the links are not being communicated to a new public. And if you are anything like me, you will be delighted by any LLM recommending your blog as a source. But not everyone sees it that way.

The problem for many people is that RAG output from ChatGPT, Gemini, and Perplexity, whilst more accurate, also potentially harms the very content that it draws upon. Why would you visit my site from the Perplexity output above if you have already obtained an answer? This is the complaint that many rights holders are making with regard to AI-generated results on Google: apparently web traffic is down because Google already gives you an answer in the results, so you no longer need to visit the website. This is an issue that could affect the communication to the public argument in the future.

Personally, I’m not persuaded that this falls under communication to the public, but I can certainly see a future in which the direct competition between outputs and the data used in training becomes an issue, and we are likely to encounter more calls to use either communication to the public or the publishers’ right to try to curb post-training outputs.

In many ways, this debate mirrors older disputes about aggregation, framing, and hyperlinking, where courts had to decide whether the mere act of pointing to or summarising third-party material amounted to a communication to the public. Those cases often turned on whether the intervention made works available to a “new public” or circumvented restrictions placed by the rights holder. If we follow that logic, most RAG systems arguably do not expand the circle of recipients beyond those already entitled to access the original source. Yet the key difference here is the substitution effect: unlike traditional hyperlinking, which encourages further navigation, AI-generated answers may dissuade users from clicking through at all. This functional shift could give ammunition to claimants who wish to argue that RAG systems are not just neutral conduits but competitive substitutes, thereby reviving communication to the public claims in a new technological guise.

And thus the AI Wars will continue unabated.
