This post also appears on Medium.
Copyright Infringement in AI Training
To think clearly about the copyright issues in AI training, we should focus on output similarity, not training methodology. The information humans ingest is transformed by biological neural networks: categorized, compared, rearranged, and in some cases combined into new, original thoughts. The end product is judged for its similarity to, or difference from, the source materials humans learned from. The humans’ knowledge, preferences, and learning are not judged. This is how we should be thinking about AI: less focus on the training, more focus on the output.
Restricting AI from training on copyrighted material would have no precedent in how other forms of intelligence that came before AI train. There are no authors of copyrighted material who did not learn from copyrighted works, be it in manuscripts, art, or music. We routinely talk about the influence of a painter or writer on subsequent painters or writers. They have all learned from their predecessors, yet copyrights can still be maintained. Many if not most authors and artists have named others who served as inspiration, influence, or, in effect, training material for them.
All humans train on cumulative learning from many past works by other humans. AI trains on a larger set of past works and should be subject to similar rules and constraints, but no more and no different. The data are reworked: they are not stored internally as parts of a work but as abstractions across multiple works. When multimodal learning is used, learning from the textual domain influences image generation, involving significant transformation of any concept or idea. The human brain does the same. The model may then interpolate among these learnings and most likely also extrapolate beyond them to create its own learning. Giving AI the same instructions multiple times may result in very different outputs, which disproves the notion of “copying”.

It is worth noting that very large language models show significant “emergent behavior”, gaining skills like reasoning that were never “programmed” into the AI or even expected. It is becoming increasingly evident that AI extends beyond the platitude of next-word prediction. It is telling that multimodal models significantly improve AI’s performance in areas beyond the training “mode” (like video, MRI images, graphics, etc.). This is far from parrot-like regurgitation; it is indistinguishable from original human creation. These capabilities are only getting better, and very rapidly so; any law should anticipate the evolution we will see in the next few decades.
We can borrow from the precedent established by the legality of summarization under copyright law. Summarizing a copyrighted work may be considered fair use if the use is transformative. Clearly, the focus is on the final result, not the process. Fair use law does not need to change in the face of AI either.
The purpose of protection for copyrighted materials (and patents) is to increase investment in their production by giving authors an incentive to dedicate time and resources to producing them, so society can benefit from more and better works. If society can benefit from more AI-produced works, then we should encourage them while being fair to human producers. Societal benefit should be the overriding goal. Indeed, this has its roots in the first copyright law enacted under the new US Constitution, the Copyright Act of 1790, whose purpose was to promote the progress of science and useful arts by securing for limited times to authors and inventors the exclusive right to their respective writings and discoveries. Fair use, the first sale doctrine, exceptions for libraries and other specific types of uses, and statutory licensing, among others, allow works to be used in ways that achieve the overall goals of the copyright system. Training of AI systems should similarly be allowed, as it is in line with the goal of maximizing social benefit. As it considers this paradigm shift from AI, the US Copyright Office should continue to leave room for computers to study, analyze, and learn in the same manner that humans do, to further technological advancements that massively improve society.
Allowing AI to train on copyrighted material would have broad public benefit, for which there is precedent not just at the beginnings of our legal history but also in recent legal history. In Sony Corp. of America v. Universal City Studios, Inc. (the “Betamax case”), owners of copyrights to movies and shows broadcast on television wanted to prevent Sony from releasing its Betamax VCR, which consumers would use to copy (i.e., record) copyrighted shows. The Court ruled for Sony, holding specifically that consumers’ use of the VCR to copy shows to watch later was a fair use. Importantly, underpinning the case was the public policy consideration of the strong benefit the technology offered the public. That is, when copyright owners tried to use copyright law to block an emerging technology deemed to offer strong public benefit, the technology won. Clearly, certain transformations of material are considered fair use. Large models will unlock opportunities in every sector of the economy, adding trillions of dollars in value to our economy and society. Holding back these models would be deleterious.
Lastly, owners of copyrighted material have the right to prevent AI from training on their data by disallowing crawlers with simple commands. Opting out is an option. Indeed, many years ago ck12.org changed its open-source license to disallow “derivative works”, including AI models. Such restrictions should be explicitly allowed and are a better way to protect original authors.
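As a sketch of what those simple commands look like, the Robots Exclusion Protocol (a robots.txt file at the site root) lets a site owner tell crawlers what not to fetch. The user-agent names below are illustrative examples; the exact names depend on which AI providers’ crawlers a site wants to exclude.

```
# robots.txt placed at the site root
# The bot names below are examples; check each AI provider's
# documentation for the crawler user-agent it actually uses.

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

# Regular search crawlers can keep indexing the site.
User-agent: *
Allow: /
```

Note that robots.txt is advisory: it signals the owner’s intent to well-behaved crawlers rather than technically enforcing the restriction.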
AI as an Author and User Liability in AI-Copyright Infringement
In line with the above, and for purposes of legal liability and protection in cases of copyright violation, AI should be seen as a tool, not an author. All responsibility should lie with the human who uses the tool and commercializes the output. Humans will often edit AI outputs. The end product, not the process, should be judged, in line with the logic of the argument above. Liability and authorship should lie with the person or entity that chooses to commercialize or use the output of the AI.