AI and Copyright: A New Type of User?
The Pervasively Disputable and Persistently Tedious Subject
After four months of being gone, I am back! This actually marks one year of the CollaborAItion blog, and it is a milestone that I am happy to celebrate! I am glad that so many of you have had opportunities to examine information literacy, AI, and copyright with me (and archives, for some of the posts!).
About six months into this blog, I wrote my first article about copyright and AI. It was an adaptation of a chapter of an OER book I collaborated on with several colleagues around the world, which is linked in the original article. The blog post, however, is the most up-to-date iteration, with new links and arguments.
Navigating Artificial Intelligence, Copyright, and Fair Use
NOTE: I recently had the privilege of editing and writing several chapters in a textbook entitled “Intro to AI and Ethics in Higher Education”. The following is a reproduction and adaptation of my chapter “Implications of Copyright Law on the Use of GenAI Tools in Education and Workplace”.
Other Articles
I talk about copyright more tangentially in other articles for the blog, and a video or two!
In this post, I discuss copyright in the context of the semantic nature of the models, and make the argument that:
“If statistical and probabilistic analysis is just bringing about ‘stochastic parrots,’ then the copyright holders have very little to worry about unless users maliciously prompt the AI.

ON THE OTHER HAND...

If the machines are making contextual connections on their own and using user data to create things based on these new connections (out of data values and NOT the original works used in training), then while the output may not be eligible for copyright protection of its own, it is probably NOT infringing.”
Parrots and Octopi vs. Domain Models
This article is the second in the “AI Is Not Smart” conversation. The first post briefly discussed, from the perspective of Chalmers, the idea that text generators/large language models/natural language processors process text by categorizing and associating data and information. This contradicts the “language-not-context,” “statistical text processing” e…
I discuss my argument for citations and/or acknowledgment, and this style, in a presentation for the Library 2.0 AI and Libraries II Mini-Conference. The video starts at 15 minutes, and the section on Copyright is three minutes long.
I had the excellent opportunity of participating in and moderating a panel on copyright and AI with David, Michael Ahlf, and Valerie Hartman. I had a first-hand experience of seeing the “non-consumptive” vs. “parrot” debate play out between David and Michael, while Valerie and I took the position of “how do we apply this tool while the debate is occurring?” It was insightful and inspiring. I am glad that the proposal was approved, and that it was a beneficial panel. It was definitely the highlight of my convention experience.

I have already deconstructed the parrot argument and proposed support for the world domain model, as I stated above. Now, I want to take us again through the fair use argument and its two supporting sub-arguments: non-consumptive use and non-expressive use. Then, I will use these arguments to create my own argument: genAI is a “new type of user” that, even if it does break copyright, would not be ultimately responsible for that infringement.
The Fair Use Argument
One of the loudest arguments regarding AI training with copyrighted works is the “fair use” argument. OpenAI, along with the ACRL and ALA, claims that the use of copyrighted works in training data does not violate copyright law because it constitutes fair use. The “fair use doctrine” states that an individual or group may use a copyrighted work to create transformative works if their use, weighed on balance, satisfies a four-factor test:
the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes;
the nature of the original copyrighted work;
the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and
the effect of the use upon the potential market for or value of the copyrighted work.
While the “fair use” argument certainly seems compelling, it is important to note that a fair use argument is an argument, not an allowance. In other words, the validity of a fair use claim is determined by the court, not by the plaintiff or the defendant. Therefore, if you are considering using a generative AI tool, be sure that you can document and justify any fair use argument you wish to make.
The two main arguments regarding the “fair use doctrine” center around factors 1 and 4. They are the “non-consumptive use” argument and the “non-expressive use” argument.
Non-Consumptive Use
Supporters of using copyrighted works in training data emphasize that the tool does not retain or transmit the main points of the copyrighted work unless explicitly asked. Even if it is asked to do so, it only transmits these in summary as a commentary on the original work. It will not use the central elements of the copyrighted work. Instead, it learns about syntax and communication from these works and stores the metadata about the information and data in the works. This type of use is what supporters call “non-consumptive.”
Non-Expressive Use
Other supporters refer to a related concept called “non-expressive use,” which is an argument frequently used by creators of search engines, databases, and other systems and products that use context and metadata to provide resources to users. Essentially, they argue that since the purpose of using the copyrighted material is NOT to express the ideas of that material, their use of copyrighted works is transformative.
GenAI As a “New Type of User”
For much of the history of copyright enforcement, derivative infringement has relied on the presence of an identifiable human agent—a person who knowingly used, altered, or repurposed protected works without proper authorization. In such cases, the liability was clear: the human user committed the act, and the law applied its weight accordingly. However, the introduction of generative AI has disrupted this linear model by creating an intermediary between source material and user-generated content. The user no longer copies a work directly but rather prompts an algorithm trained on a vast corpus—some of it copyrighted—to generate something "new." The question arises: does the infringing act (if there is one) belong to the human who typed the prompt, or the algorithm that interpreted and produced the response?
Generative AI tools occupy a liminal role, functioning neither solely as instruments nor fully autonomous agents. In this sense, they constitute a new type of user—a probabilistic interpreter that mediates the creation of derivative works through what might be termed "data calculus-informed probabilitization." Unlike direct human derivation, which may simply involve copying and pasting, AI outputs are the result of probabilistic modeling, where no single source is directly reproduced. This distinction is crucial. The AI tool does not infringe in a conventional sense because it does not copy; it synthesizes, drawing on correlations and likelihoods rather than discrete appropriation. Still, it is the user who provides the impetus, the prompt, and often the intended outcome. Courts and copyright offices, such as in the cases of Zarya of the Dawn and Ellen Rae’s novel, have repeatedly signaled that outputs generated via AI cannot be copyrighted if they are not meaningfully altered by a human, but also that the AI itself is not considered a creator under U.S. law.
Furthermore, it is essential to differentiate between two related but legally distinct acts: training an AI model on copyrighted materials, and using that model to generate outputs via prompting. Training involves ingestion of data, often including copyrighted works, to refine the model’s internal parameters. Proponents argue this is a non-consumptive use, falling under fair use because the original works are not reproduced, but instead inform pattern recognition and prediction. Conversely, prompting is an active process initiated by the user, which may result in outputs that mimic, reference, or reconfigure protected works. Here, derivation hinges on intent and specificity—if a prompt seeks to replicate a particular style, structure, or expression, it edges closer to infringement.
Thus, as AI tools increasingly act as co-creators or mediators, they defy traditional models of authorship and accountability. We must reconsider not just the legal infrastructure, but the philosophical categories that define what it means to create, to derive, and to infringe. To do so, we must navigate the ambiguous space between human agency and machine computation with nuance, integrity, and a commitment to preserving the rights of original creators while enabling new forms of creativity.
I talk about this idea in the webinar I did with Steve Hargadon in December 2024, Research and AI. Feel free to order a recording of the previous version, and be aware that we may do an updated version in a few months, which means you will get another free recording.
Side note: If you like this subject, or would like a webinar on any other subject, let me know! I plan to do a webinar on a monthly basis.
PDF Prompt Attachments Are Not Training
I have to take a moment to argue against one of the most misunderstood practices that users claim is covered by “fair use.” It actually isn’t.
The conflation of prompting with training, particularly in reference to user-uploaded materials such as PDF attachments, reflects a significant misunderstanding of both prompting’s technological function and the legal implications of importing copyrighted material directly into a prompt.
Training, in the context of artificial intelligence, refers to the process of feeding vast corpora of data into a machine learning model to adjust its internal parameters through iterative, statistical refinement. It is a comprehensive, systematic process in which a model ingests enormous datasets (often billions of examples) to develop statistical representations of patterns, relationships, and structures. Executed over extensive computational cycles, this process allows the model to build probabilistic representations of text (or whatever format it generates) that it can later recombine to produce outputs. Once the training phase concludes, the model’s behavior remains relatively fixed until it is retrained or fine-tuned with additional data. The training data is abstracted into mathematical weights and parameters, with no verbatim storage of the original materials.
By contrast, attaching a PDF file to a prompt is an act of querying, not ingestion. The file is read by the AI model in the temporary context of a single session; it is not incorporated into the model’s underlying architecture. When a user uploads a document, such as a scholarly article, a grant proposal, or a novel manuscript, the generative AI tool interprets it in the moment—much as a human assistant might skim and summarize it—without altering its internal statistical weights or drawing on that document in future interactions with other users.
In other words, training involves broad exposure to diverse materials to develop general capabilities, while a PDF attachment provides specific, targeted information for a particular query. The user explicitly directs the AI to consider that document and, if the prompt says so, to extract or copy content directly from it.

Finally, the processing mechanism is entirely different. During training, data is processed through complex optimization algorithms that adjust millions or billions of parameters. With PDF attachments, the model simply accesses the document content through its context window.
Therefore, suggesting that uploading a PDF into an AI session constitutes "training the model" misrepresents the nature of both the interaction and the resulting product. To do so would be akin to claiming that one trains a calculator each time one enters a new equation. Prompts with attachments are ephemeral, user-specific, and computationally isolated. They serve as inputs for immediate interpretation, not as building blocks for future generative behavior. There is no “fair use” argument, except maybe for educators (and even then, it would not be easily argued because the copyrighted work would be used with the tool, not with an educator’s students).
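The training/prompting distinction above can be sketched in a few lines of toy code. This is purely illustrative: `ToyModel` and its methods are hypothetical stand-ins, not any real LLM API, but they capture the key point that an attachment lives only in the context of a single call, while training permanently changes the model's internal parameters.

```python
# Toy illustration of the training vs. prompting distinction.
# "ToyModel" is a hypothetical stand-in, not a real LLM interface.

class ToyModel:
    def __init__(self):
        # Internal parameters: persistent and shared across all users.
        self.weights = {"token_stats": {}}

    def train(self, corpus):
        """Training: ingests documents and permanently updates weights."""
        for doc in corpus:
            for token in doc.split():
                stats = self.weights["token_stats"]
                stats[token] = stats.get(token, 0) + 1  # weights change here

    def answer(self, prompt, attachment=None):
        """Prompting: an attachment exists only in this call's context."""
        context = prompt + ("\n" + attachment if attachment else "")
        # ...generation would read `context` using the *frozen* weights...
        return f"(answer based on {len(context.split())} context tokens)"

model = ToyModel()
model.train(["public domain text used for training"])

before = dict(model.weights["token_stats"])
model.answer("Summarize this PDF:", attachment="uploaded copyrighted article")
after = model.weights["token_stats"]

assert before == after  # the attachment changed nothing persistent
```

The final assertion is the whole argument in miniature: the uploaded "document" influenced one answer, but left the model's parameters untouched, which is why attaching a PDF is querying, not training.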
Intentionally Malicious Prompts are Copyright Violations
Related to the “fair use” argument, one of the most public lawsuits against OpenAI, the creator of ChatGPT, was supported by malicious and adversarial prompts that “forced” the AI tool to reproduce copyrighted content verbatim. As OpenAI responded:
“It seems they intentionally manipulated prompts, often including lengthy excerpts of articles, in order to get our model to regurgitate. Even when using such prompts, our models don’t typically behave the way The New York Times insinuates, which suggests they either instructed the model to regurgitate or cherry-picked their examples from many attempts.”
This is clear misuse of the tool, as stated in OpenAI’s Terms and Conditions. Yet these forced reproductions were then presented as proof that ChatGPT explicitly violated copyright without guardrails.
Similarly, if a user prompts a tool to reproduce a copyrighted work verbatim, then that does not qualify as fair use. However, it is not the fault of the AI tool, but of the user who prompted it.
USCO Releases: Copyrightability of AI Materials
In January and May, I read parts 2 and 3 of the USCO’s Report on genAI tools. I assume that most everyone in the library or education field did. Rather than reproduce all of my insights in my posts (I refuse to call them “skeets”) on BlueSky, I will let them guide our discussion, and I’ll talk about my replies below.
One of the most frequent misinterpretations of the report, which I predicted as soon as I read it, was that everyone would say, “Oh, creators can do whatever they want with AI and can copyright it!” What really was written was (paraphrased):
“artists have to prove that they did not use AI to make major decisions regarding their product,
and then they only get copyright for the arrangement, selection, curation, and manipulation of the output.”
In May 2025, part 3 of the report was released, which incidentally resulted in the Librarian of Congress and the United States Copyright Office Director both being terminated. This happened, ostensibly, because the report stated that “yes, in general generative AI tool training and use qualifies as fair use, but the majority of people are not using these tools that way, so we need to establish some rules.”
OpenAI vs. Deepseek
One of the most notable lawsuits in which OpenAI was the plaintiff was versus “DeepSeek,” a Chinese LLM that, OpenAI alleges, stole their RLHF (reinforcement learning with human feedback) data so they would not have to engage in that lengthy process. Instead, they used “distillation,” in which a large model’s outputs and training data are applied to newer, smaller, and sleeker models.
Everyone noted the irony of OpenAI accusing another AI company of “copyright infringement,” but their case was not even a copyright case: it was closer to a “patent” infringement or trade secret case.
On the surface, this is not that different. However, the distinction is important when you are considering tool training. The models created by OpenAI are considered products, tools that have been made through a specific process (which includes specific processes like RLHF). DeepSeek took that data and used a similar process (at least for the RLHF parts of their training) to create their own tool.
OpenAI was not worried that DeepSeek would reproduce their output. They were outraged because, essentially, DeepSeek stole their proprietary process, or “recipe,” for their model.
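Distillation, as described above, can be sketched in toy form. The names here (`teacher`, `build_distillation_set`) are hypothetical, and both "models" are trivial stand-ins; the point is only the shape of the technique: the student never touches the teacher's weights or human feedback, only harvested (prompt, output) pairs.

```python
# Toy sketch of model distillation: a small "student" model is trained on
# a large "teacher" model's outputs instead of on raw human-labeled data.
# Both models are trivial stand-ins, not real LLMs.

def teacher(prompt):
    """Stand-in for a large, expensively aligned model (e.g. via RLHF)."""
    return prompt.upper()  # pretend this is a high-quality response

def build_distillation_set(prompts):
    """Harvest (prompt, teacher_output) pairs at scale. The student sees
    only these pairs, never the teacher's weights or feedback data."""
    return [(p, teacher(p)) for p in prompts]

# The student would then be fit on these pairs, inheriting behavior the
# teacher's creators paid for, without repeating the RLHF process.
distill_set = build_distillation_set(["hello", "explain copyright"])
```

This is why the dispute is about the process rather than the output: the alleged harm is not that DeepSeek reproduced any copyrighted text, but that it shortcut the costly alignment "recipe" by sampling it secondhand.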
Conclusion: The User is the Agent
I would encourage anyone who enjoyed this post to read an excellent paper by Carys Craig, professor at York University, entitled “The AI-Copyright Trap.” An expert in intellectual property law, Craig states that we should “sidestep the copyright trap, resisting the lure of its proprietary logic in favor of more appropriate routes towards addressing the risks and harms of generative AI.”
One of my favorite quotes from her paper is: “Authorship is a fundamentally communicative act[;]... just as... AI is categorically incapable of authoring original works of expression, it is incapable of receiving, reading, or enjoying them as such.”
If an AI cannot be an author, and is incapable of communicating of its own volition, then the human is responsible for the output, and should
therefore be able to copyright products created using AI, and
be held responsible for any copyright infringement that results, especially if they directed the AI tool to infringe.