Recent Discussions and Presentations
In the past few weeks, on LinkedIn and BlueSky, I have been discussing multiple ethical issues about AI, many of which are impacted by 1) my co-discussants’ understandings of how AI tools are created, and 2) their perspectives on how AI creates outputs. Mostly, we are talking about copyright and attribution.
Earlier today, I had an excellent conversation with Steve Hargadon and around 25 participants regarding Copyright and AI! We talked about the two main perspectives on AI processes, stochastic parrots and world/domain models, and how those perspectives intersect with fair use arguments. We also talked about the non-consumptive and non-expressive fair use arguments and connected them to the stochastic parrot and world model perspectives on AI, respectively. Finally, we covered the USCO reports on copyright and AI and some misunderstood practices that violate the recommendations of those reports. You can buy a recording here at Library 2.0. I really think that this session, since it connects the different perspectives of generative AI with how LLMs are made and their copyright implications, is one of the most important webinars I have ever done. Thank you to everyone who participated!
We blended and expanded on the content covered in these blog posts:
AI and Copyright: A New Type of User?
Parrots and Octopi vs. Domain Models
I suppose that the second blog post could be more aptly named “Animals vs. Maps,” because those terms capture the differences between the abilities and constructs that each perspective assigns to AI.
Why a New Description?
The previous explanations I gave in my first webinars were centered on open-access, public domain black-and-white drawings; while the explanations and drawings were technically correct, they were filled with jargon and detail (“neural networks,” “parameters,” “vectors”) that would only be necessary in a Master’s-level course, and they were much too confusing. With some feedback from attendees, and from Steve, I created this diagram to present the most salient points for practitioners integrating AI into education and information professionalism.
You’ll note that in this explanation I still use “vectors” (we can’t get away from all jargon), but hopefully this will be much easier to understand.
The Foundation: How AI Tools are Constructed, From Text to Math to Text Again
As I said before, we need to have a correct understanding of how AI tools are constructed before we talk about the ethical concerns regarding their sources and their operations. I created this high-level diagram to demonstrate how the process works and to serve as a basis for a discussion about it. It does not explain everything in and of itself (that would be one complicated diagram), but it contains elements that serve as cues for a written or spoken explanation.
So, here is the verbal description that goes with the image and serves as the foundation for many ethical concerns related to AI use, integration, and safety guidelines.
Raw Data, Information
We start with the collection or amalgamation of raw data to create AI tools. This includes open access materials, such as arXiv, Project Gutenberg, and Wikipedia. Copyrighted materials were also harvested for many of these tools.
Data Conversion and Analysis
If copyrighted materials are used to train AI models, then why do so many institutions, including the USCO and ALA, say that using copyrighted materials to train AI tools is fair use?
Text data is transformed into “tokens,” or values (numbers) that are associated with other concepts, ideas, and text. “Vectors” indicate the strength of the relationships between tokens, and thousands of tokens and vectors create a thousand-dimensional graphed web.
Once the data is converted into these tokens, and vectors are assigned to indicate each token’s relationship with other words or phrases, the text data is removed from the model. All that remains is the mathematics. Copyrighted material is not directly used to construct outputs, because it is nowhere in the model by the time end users interact with it.
AI models do not rely on the data. They rely on the values assigned to data pieces AFTER the original data has been deleted. In other words, value-formatted METADATA (you know this word had to come up, right? I am a librarian, after all).
All of this metadata is structured (i.e., the vectors are calibrated and the values are assigned) in accordance with the syntax and communication patterns of the original text, and that is how large language models communicate so well.
This transformation process is critical to understanding how AI systems work. The original text is essentially "digested" into mathematical representations that capture patterns, relationships, and linguistic structures. This abstraction process means that while the model learns from copyrighted works, it doesn't store or reproduce those works directly—it only retains the patterns and relationships between concepts, which is why this is considered a transformative use.
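To make this text-to-math conversion concrete, here is a minimal sketch in Python. The words, the three-number vectors, and the similarity scores are all toy assumptions of mine (real models learn vectors with hundreds or thousands of dimensions from billions of examples), but it shows how relationships between tokens can survive as pure mathematics after the text itself is gone.

```python
# A toy illustration of the text-to-math conversion described above.
# These three-dimensional vectors are made-up values; real models
# learn vectors with hundreds or thousands of dimensions.
import math

# Each token is reduced to a list of numbers. The original text is
# gone; only these values remain in the model.
vectors = {
    "library": [0.9, 0.1, 0.3],
    "archive": [0.8, 0.2, 0.4],
    "banana":  [0.1, 0.9, 0.2],
}

def similarity(a: str, b: str) -> float:
    """Cosine similarity: how strongly two tokens are related."""
    va, vb = vectors[a], vectors[b]
    dot = sum(x * y for x, y in zip(va, vb))
    norms = math.sqrt(sum(x * x for x in va)) * math.sqrt(sum(x * x for x in vb))
    return dot / norms

print(similarity("library", "archive"))  # high: closely related concepts
print(similarity("library", "banana"))   # low: unrelated concepts
```

Nothing in that little model is text anymore; only the relationships remain, which is the crux of the fair use arguments discussed above.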
Large Language Model Development
The massive webs of mathematical data created through data conversion are used to create Large Language Models, which generate text according to the patterns indicated in the vectors and values. Different libraries of texts create different value-and-vector webs, which is why Claude, ChatGPT, and Perplexity all have similar foundations (LLMs) but communicate differently and have different strengths.
During this phase, the mathematical representations undergo intensive computational processing to create models that can predict and generate text. These models are trained through complex algorithms that adjust billions of parameters to optimize performance. The training process involves showing the model examples and having it predict what comes next, then adjusting its parameters based on how accurate its predictions were.
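Here is a minimal sketch of that predict-then-adjust loop in Python. The tiny corpus, the learning rate, and the one-parameter-per-word-pair table are my own toy assumptions, not any vendor’s actual training code, but the cycle is the same one described above: predict the next token, compare against the real one, and nudge the parameters.

```python
# A toy sketch of next-token training: predict, compare, adjust.
# Real models adjust billions of parameters over trillions of tokens.
import numpy as np

corpus = "the cat sat on the mat the cat ate".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}

# "Parameters": one value per (current word, next word) pair.
logits = np.zeros((len(vocab), len(vocab)))
lr = 0.5

for epoch in range(200):
    for cur, nxt in zip(corpus, corpus[1:]):
        i, j = idx[cur], idx[nxt]
        # Predict a probability distribution over the next token...
        probs = np.exp(logits[i]) / np.exp(logits[i]).sum()
        # ...then nudge the parameters toward the observed next token
        # (the gradient of cross-entropy loss for a softmax).
        grad = probs.copy()
        grad[j] -= 1.0
        logits[i] -= lr * grad

# After training, the model has learned that "cat" usually follows "the".
probs = np.exp(logits[idx["the"]]) / np.exp(logits[idx["the"]]).sum()
print(vocab[int(probs.argmax())])  # -> "cat"
```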
Each model developer makes unique choices about training data, model architecture, and optimization techniques, resulting in distinct AI personalities and capabilities despite sharing similar foundational principles.
Front-End Use (and Trainers that Mimic Front-End Use)
After the models are created, humans use various processes and roles to further calibrate and refine the text generation. In some cases, the models train by ingesting text conversations in which one human models the AI outputs and another models a prompter. In other training scenarios, called Reinforcement Learning from Human Feedback (RLHF), human trainers interact with the model to assign scores and give text feedback in ways that mimic real user interactions. This, hopefully, trains the model to align its communications with general human values, ethics, etc. It is an effort to build some type of human communication standard into the model.
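Here is a minimal sketch of that feedback idea. The response “styles,” the 1-to-5 scores, and the update rule are toy assumptions of mine rather than a real RLHF pipeline, but they illustrate how repeated human ratings can shift a model toward preferred kinds of communication.

```python
# A toy sketch of the RLHF idea: trainers score outputs, and the
# model drifts toward higher-scored styles. Not a real pipeline.
import random

styles = ["terse", "helpful", "evasive"]
weights = {s: 1.0 for s in styles}

def trainer_score(style: str) -> float:
    # A stand-in for a human trainer's rating on a 1-5 scale.
    return {"terse": 3.0, "helpful": 5.0, "evasive": 1.0}[style]

for _ in range(1000):
    # The model samples a response style in proportion to its weights...
    style = random.choices(styles, [weights[s] for s in styles])[0]
    # ...a trainer scores it, and the weight is nudged by the feedback.
    weights[style] += 0.01 * (trainer_score(style) - 3.0)
    weights[style] = max(weights[style], 0.05)

print(max(weights, key=weights.get))  # -> "helpful"
```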
Actual front-end use is the next step, and LLMs interact with, and “internalize” feedback from, actual users much like they did with their RLHF trainers. Through actual conversations with external users, LLMs continuously learn how non-technology specialists view AI; what aspects, keywords, and communication styles are expected by subject matter experts in various fields; and how to promote the fallacy that they are infallible by always providing an answer, even if they have to fabricate one.
As it communicates and learns the keywords and expectations of subject matter experts and others, the LLM also creates a web of its own, which I have talked about multiple times, including in the “Parrots and Octopi vs. Domain Models” article.
Yes, probabilities and calculus are included in the first phase of the training, and they do play a role in creating outputs, but somewhere in the operations of an LLM there is also an element that provides some context and logical structure. This structure and its connections can change based on data acquired from the web, or from feedback from human users.
This concept of creating maps of various ideas is analogous to “schemata” in the education and psychology worlds. Every person has their own idea of a restaurant, for example. That idea is similar to others’ but unique to each of us, since only we have had the particular set of restaurant experiences we have had. Previous associations with a particular restaurant may be disregarded or emphasized based on future interactions with that restaurant or with another restaurant.
All of these preparations for specialization, introducing concepts to the “domain model” and fine-tuning communications through RLHF, prepare the LLM for its final stage: being applied outside of its original purpose.
Field/Domain Specific Application
Below are only some of the specific applications for which LLMs are used. Some tools are used for only one of these purposes, but I would suspect that most of the tools we use (ChatGPT, Claude, BoodleBox, etc.) can perform multiple, if not all, of these functions.
RAG
Retrieval-Augmented Generation (RAG) combines the generative capabilities of large language models with the ability to retrieve specific information from external knowledge sources. By grounding AI responses in verified information, RAG systems can provide more accurate, up-to-date, and trustworthy outputs for specialized knowledge domains.
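Here is a minimal sketch of the RAG pattern, under my own toy assumptions: real systems retrieve by vector similarity over embeddings, not by word overlap, but the grounding step is the same.

```python
# A toy sketch of Retrieval-Augmented Generation: retrieve the most
# relevant passage, then ground the model's prompt in that source.
docs = [
    "Project Gutenberg hosts over 70,000 public domain ebooks.",
    "arXiv is an open-access repository of scientific preprints.",
    "Wikipedia is a collaboratively edited online encyclopedia.",
]

def retrieve(query: str) -> str:
    # Toy relevance score: shared words (real systems use embeddings).
    words = {w.strip("?.,!") for w in query.lower().split()}
    return max(docs, key=lambda d: len(words & {w.strip("?.,!") for w in d.lower().split()}))

query = "What is arXiv?"
context = retrieve(query)
# The retrieved passage is injected into the prompt, so the model
# answers from verified text instead of only its trained values.
prompt = f"Answer using this source:\n{context}\n\nQuestion: {query}"
print(prompt)
```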
Chatbot
Conversational AI applications transform large language models into interactive assistants that can engage in natural dialogue. These systems incorporate additional layers of processing to maintain context across conversations, understand user intent, and generate helpful, contextually appropriate responses.
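Here is a minimal sketch of that context-keeping layer. The generate() function is a hypothetical stand-in for a real model call; the point is that the whole conversation history is resent with every turn, which is how a chatbot’s “memory” actually works.

```python
# A toy sketch of how a chatbot layer maintains conversational context.
history = []

def generate(messages: list[dict]) -> str:
    # Hypothetical stand-in for an LLM call; a real one would read
    # every message below and generate the next reply.
    return f"(reply informed by {len(messages)} prior messages)"

def chat(user_text: str) -> str:
    history.append({"role": "user", "content": user_text})
    reply = generate(history)  # the model sees the whole conversation
    history.append({"role": "assistant", "content": reply})
    return reply

print(chat("What is fair use?"))
print(chat("Does it cover AI training?"))  # turn 1 is still in context
```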
Web Search
AI-enhanced search tools (search engines, chatbots, and others) represent a significant shift in how we access information. Unlike traditional keyword-based search, these systems understand the semantic meaning behind queries and can synthesize information from multiple sources to provide comprehensive answers. They can interpret ambiguous questions, understand context, and present information in ways that directly address user needs.
Data Analysis
AI systems excel at finding patterns and insights in complex datasets. When applied to data analysis, language models can interpret trends, generate reports, and even create visualizations that help humans make sense of information. These tools democratize data science by allowing non-specialists to ask questions about their data in natural language and receive meaningful insights in return.
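Here is a minimal sketch of that natural-language pattern. In a real tool, the LLM itself would translate the question into code or a query; here that translation step is a hardcoded toy assumption, and the circulation numbers are invented.

```python
# A toy sketch of natural-language data analysis. A real tool would
# have the LLM translate the question into the computation itself.
circulation = {"Jan": 1200, "Feb": 1350, "Mar": 1500, "Apr": 1650}

def answer(question: str) -> str:
    # Hardcoded stand-in for the LLM's question-to-computation step.
    if "trend" in question.lower():
        months = list(circulation)
        change = circulation[months[-1]] - circulation[months[0]]
        return f"Circulation rose by {change} checkouts from {months[0]} to {months[-1]}."
    return "Try asking about the trend in this dataset."

print(answer("What is the trend in circulation?"))
# -> Circulation rose by 450 checkouts from Jan to Apr.
```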
Conclusion
I hope that the above discussion was not too long and drawn out, but just in case I was as long-winded in my writing as I am in my talking, here is what I am trying to say:
Large language models use mathematics and probabilities derived from a massive library of texts to generate text;
these models are produced according to specific guidelines and goals set forth by model creators;
their subsequent outputs are refined through text communications with humans to satisfy ethical, benchmark, and communicative goals;
and users apply these models for various applications in a variety of fields.
Hopefully, this can help you understand how raw data is processed and discarded, how mathematics and probabilities play a huge role in the initial creation of the model, and how text communications shape the later test phases of the tool. Those later phases give the model the opportunity to create linguistic and logical “webs” or “models” or “maps,” which are multiple steps removed from the copyrighted material and raw data of the first phase of creation.
A Future Session
One session ends, another begins! I am going to combine all of the “Ethics and AI” webinars’ implications and help librarians (and anyone else interested) create an Ethical AI Framework for their institution.
As artificial intelligence tools increasingly shape how libraries serve their patrons and support academic missions, librarians must play a pivotal role in establishing ethically sound practices. This workshop offers participants a structured, accessible approach to identifying and addressing the ethical concerns surrounding AI technologies in library environments. Participants will explore key issues such as data privacy, misinformation, algorithmic bias, academic integrity, and authorship—grounding these topics within the frameworks of information literacy, labor ethics, and professional responsibility.
The 90-minute session begins by examining foundational concepts in ethics and how they apply to emergent technologies, followed by an investigation of AI’s impact on student privacy, faculty autonomy, and community equity. Drawing on scholarly literature, institutional guides, and media resources—including work by Reed Hepler, Torrey Trust, and international ethics bodies—attendees will engage in critical analysis and collaborative reflection. Participants will also explore open access tools and frameworks that support equitable, transparent AI adoption in educational and public-facing contexts.
By the end of the session, each attendee will create the foundation of a localized AI ethical framework suitable for their institution. This includes a set of guiding principles, actionable practices, and evaluative processes. These resources will be tailored to align with institutional values, professional best practices, and pedagogical goals, and they will be adaptable for both student-facing materials and internal staff training initiatives.
Participants will come away with a customizable template for creating institutional and personal ethical frameworks.
LEARNING OBJECTIVES:
Identify and articulate key ethical concerns related to AI use in libraries and education.
Analyze how AI tools intersect with academic integrity, data privacy, misinformation, and labor.
Evaluate and apply best practices for AI use in student-facing and institutional contexts.
Design an initial framework for AI ethics aligned with educational and professional values.
LEARNING OUTCOMES:
Demonstrate a foundational understanding of AI ethics in relation to information services.
Critically assess AI tools and workflows for compliance with privacy, equity, and integrity standards.
Produce a draft version of an AI ethical framework specific to their library or institutional setting.
Gain confidence in leading discussions and trainings on ethical AI use within their organization.
This 90-minute online hands-on workshop is part of our Library 2.0 "Ethics of AI" Series. The recording and presentation slides will be available to all who register.
DATE: Tuesday, July 8th, 2025, 2:00 - 3:30 pm US - Eastern Time
COST:
$129/person - includes live attendance, any-time access to the recording and the presentation slides, and a participation certificate. To arrange group discounts (see below), to submit a purchase order, or for any registration difficulties or questions, email admin@library20.com.