Explaining What’s up with AI (Part 4 – GPT to ChatGPT)

Part 1 of the series is here.

In our last post we saw how an LLM, given some text to start with, can generate more text. This idea forms the basis of a tool like ChatGPT, but there are still a few steps to get there. GPT-3 by OpenAI was arguably the most powerful model before ChatGPT. It was interesting to researchers but never became very popular with the general public because it didn’t really try to solve your problems or interact with you. It could produce sensible text, but not necessarily what you wanted, and it couldn’t use the natural conversational style that is easy for people to adopt.

Being Conversational

If we want to build a system we can converse with, then we need to have it understand something about a conversation. The human user will supply some text or a question. We use the term prompt for this input because the human is effectively prompting the LLM to produce some output or reply.


The first problem to consider is how to make the LLM know when to stop producing output. There’s always some next word it could possibly generate. Once it gets a turn, it might drone on forever and never give us a chance to follow up. It would also keep consuming expensive computing resources.

One simple way to address this is to put a limit on how many words the system will generate at a time. These systems do have such a limit as a kind of safety valve, but you don’t usually reach it. Instead, the creators of these systems add a special stop word to the training data. You can think of it as a made-up word the designers invented: if the system ever generates it, it stops producing new words. How might we teach the model to output this word? That’s part of the next step.
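To make this concrete, here is a minimal sketch of the generation loop, assuming a hypothetical next_word function that stands in for the model’s prediction step. The stop word and word limit below are invented for illustration:

```python
# A minimal sketch of a generation loop with a stop word.
# next_word is a hypothetical stand-in for the model's prediction step.
STOP_WORD = "<|stop|>"   # invented marker; real systems use a special token
MAX_WORDS = 500          # safety valve so a reply can't run on forever

def generate_reply(prompt, next_word):
    words = []
    for _ in range(MAX_WORDS):
        word = next_word(prompt + " " + " ".join(words))
        if word == STOP_WORD:
            break        # the model has decided its reply is complete
        words.append(word)
    return " ".join(words)
```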

Fine Tuning

GPT-3 was good at generating text, but it didn’t have any real training on how to give a satisfying response when someone prompted it with a specific request. We saw before that these really large models need a lot of data to train on in order to make good predictions. It turns out, though, that once the model has gone through a large foundational training, we can change how it behaves by training it on a much smaller dataset which demonstrates the behavior we want. We refer to this process as fine-tuning the model.
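As a rough sketch, that smaller dataset might look like a collection of prompt/response pairs. The format and examples below are invented for illustration; real systems each have their own formats:

```python
# A hypothetical fine-tuning dataset: prompt/response pairs demonstrating
# the behavior we want. All content here is invented for illustration.
examples = [
    {"prompt": "What is the capital of France?",
     "response": "The capital of France is Paris. <|stop|>"},
    {"prompt": "Summarize: The quick brown fox jumps over the lazy dog.",
     "response": "A fox jumps over a dog. <|stop|>"},
]
# Fine-tuning continues training the already-trained model on pairs like
# these, nudging it to answer requests (and to emit the stop word when done).
```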

You can think about it this way: it took a lot of work to teach GPT-3 enough about how language works to generate sequences of words that fit well together, and that learning still carries forward when you want it to do something more specific, like answering a question. Or, in an even looser and more human analogy: once you learn to read, it’s easier to learn a lot of new things.

For fine-tuning we might train the model on a lot of good examples of prompts plus responses that humans have rated for quality. The details involve something called reinforcement learning from human feedback (RLHF). We won’t go into all of them here, but a key thing to know about this fine-tuning is that it can’t train the model to give correct answers all the time. What it can do is tune the model to give answers which sound like good answers. There’s a difference between “correct” and “sounds good”. In our daily lives we have a precise term for this: bullshit.

ChatGPT is capable of a lot of amazing things, and it gives the correct response a whole lot of the time, but at some level it’s also just very good at bullshit. We see this in various ways when we interact with it, but the most common form is often called hallucination or confabulation. ChatGPT wants to give us an answer so badly that it will make something up just to impress us. Remind you of anyone you know?

Context Window and Memory

During a long conversation, ChatGPT is able to remember what was said before, and that helps it make good replies. The amount of text it can remember is referred to as its context window. We can think of the context window as ChatGPT’s memory.
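As a rough sketch of what this means in practice, a chat interface has to fit the conversation into that fixed window, and anything older falls out of memory. Real systems count tokens rather than words, and the budget below is invented:

```python
# A sketch of fitting a conversation into a fixed context window.
# Real systems count tokens, not words; the limit here is invented.
CONTEXT_LIMIT = 3000  # hypothetical budget, in words

def build_context(messages):
    """Keep the most recent messages that fit within the limit."""
    kept, used = [], 0
    for message in reversed(messages):      # walk from newest to oldest
        length = len(message.split())
        if used + length > CONTEXT_LIMIT:
            break                           # older messages fall out of memory
        kept.append(message)
        used += length
    return "\n".join(reversed(kept))        # restore chronological order
```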

When we ask ChatGPT a question we expect an answer, but sometimes we can get better responses by including some examples of the kinds of answers we want in the original prompt. 

If we want ChatGPT to answer multiple-choice questions from an AP Art History exam, we can give it the question and see what it responds with, but we can do a lot better if part of the prompt includes examples of other exam questions with answers. This is a big part of how ChatGPT is “passing” these exams. See Page 26 here for an example. The prompt isn’t telling ChatGPT the answer to this specific question, but it is showing it something about how these tests work in general. Tricks like this are an example of prompt engineering, where crafting the best prompt we can makes the results more accurate. Another way to think of this is that we are coaching the system at the same time we are asking it for help.
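Here is a sketch of what such a prompt might look like. The questions and answers are invented for illustration, not taken from a real exam:

```python
# A hypothetical few-shot prompt: worked examples followed by the question
# we actually want answered. All questions here are invented.
prompt = """Answer the following multiple-choice questions.

Q: Which movement is Claude Monet associated with?
(A) Cubism (B) Impressionism (C) Baroque (D) Dada
A: (B)

Q: Who painted the ceiling of the Sistine Chapel?
(A) Raphael (B) Donatello (C) Michelangelo (D) Titian
A: (C)

Q: Which material is most associated with ancient Greek kouros statues?
(A) Bronze (B) Marble (C) Terracotta (D) Basalt
A:"""
# The examples don't reveal the final answer, but they show the model the
# format and the kind of response the test expects.
```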

Grounding and RAG

One way we can improve the prompt or context for an LLM is to add more information from other data sources to give it more to work with. For example, if we tell the LLM more about the user, such as their age, location, or hobbies, then it might be able to tailor its answers to that person. We refer to this as grounding the prompt.
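A tiny sketch of grounding: known facts about the user are simply prepended to the prompt. The profile fields and values here are invented:

```python
# A hypothetical example of grounding: user details are added to the
# prompt so the model can tailor its answer. All values are invented.
user_profile = {"age": 34, "location": "Denver", "hobbies": "hiking and photography"}

question = "What should I do this weekend?"
prompt = (
    f"The user is {user_profile['age']} years old, lives in "
    f"{user_profile['location']}, and enjoys {user_profile['hobbies']}.\n"
    f"User question: {question}"
)
```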

There are many systems now which allow you to use an LLM to query your own documents. You upload the documents to a server the LLM can access, and then when you make queries, information is extracted from those documents and included in the LLM’s context so it can use that information to construct its answer. This is often called RAG, for retrieval-augmented generation, since we are helping the LLM generate a better result by retrieving the right information from the documents. Generally the LLM’s context window will be too small for us to include all the information, so how is it selected?

These systems can get pretty complicated but the basic idea is that when the documents are uploaded they are diced up into smaller chunks, perhaps one paragraph long each. Then when a query is made, the first step is to look at all the chunks and try to determine which ones might relate to the query. Only those chunks are then added into the context.

Finding the related chunks generally relies on something called embeddings. An embedding is a list of numbers that specifies where a piece of text fits into a space of all possible embeddings, and similar text ends up in similar locations. The details are interesting, but the main takeaway is that embeddings can be compared very quickly to find good candidates for the right chunks of text to include. They don’t rely on simple text matches; instead they match items which are similar in concept.
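To sketch the idea, here is how retrieval might compare embeddings, assuming a hypothetical embed function that turns text into a list of numbers (real systems call an embedding model for this):

```python
import math

# A sketch of embedding-based retrieval. embed() is a hypothetical function
# mapping text to a list of numbers; real systems call an embedding model.
def cosine_similarity(a, b):
    """Score how closely two embeddings point in the same direction (1.0 = identical)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_chunks(query, chunks, embed, k=3):
    """Return the k document chunks whose embeddings best match the query."""
    query_vec = embed(query)
    scored = [(cosine_similarity(query_vec, embed(chunk)), chunk) for chunk in chunks]
    scored.sort(reverse=True)              # highest similarity first
    return [chunk for _, chunk in scored[:k]]

# Only the winning chunks get pasted into the LLM's context, which keeps
# the prompt within the context window.
```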

Other tricks

Another way to coach an LLM is to ask it to essentially show its work. If you ask it to solve a complex problem and include in your prompt a request to “think step by step”, it is more likely to get the right answer. Asking it to think step by step causes it to generate a list of steps for the solution. These steps are based on other texts in its training data which had lists of steps for solving this or a similar problem. The generated steps now become part of the context along with your request, and it can use all of this together to choose the next word. It can’t really do this thinking “in its head” because it doesn’t have a head. Writing it all out means it has more to work with as it computes the probabilities for the words in its response.
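In practice this can be as simple as appending the magic phrase. A sketch, with invented content:

```python
# A hypothetical "think step by step" prompt. The instruction nudges the
# model to write out intermediate steps, which then become part of its
# own context as it continues generating.
question = "A train leaves at 2:15 pm and arrives at 5:40 pm. How long is the trip?"
prompt = question + "\nLet's think step by step."
# A typical response might first write out steps like
# "From 2:15 to 5:15 is 3 hours; from 5:15 to 5:40 is 25 more minutes..."
# before concluding "The trip is 3 hours and 25 minutes."
```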

These models got better at generating results involving steps or procedures after a lot of computer code was added to their training sets. It seems this didn’t just make them able to create computer code; it also increased something about their ability to work through a process.

How intelligent are LLMs?

LLMs can do a lot of things, but since we call this Artificial Intelligence, it’s time to ask: “How intelligent are LLMs?” The short answer is we don’t know. That is also the long answer.

LLMs can perform well on some kinds of tests but we don’t really have a good way to know how well that performance generalizes. In a real sense we don’t have a good way to define intelligence and perhaps more research with these and other AI systems will help us make these definitions sharper. 

LLMs are actually really good at general knowledge queries. Good enough to have companies like Google concerned about losing search revenue. They are also able to perform complex manipulations of text, rewriting it in different styles or lengths. They even show a lot of promise in creating computer code.

Inside its billions of neurons, the LLM has had to learn a lot of concepts in order to get good at these predictions. The thing to understand here is that while we can see what it does, we have almost zero understanding of how it does it. The system is too large for us to poke around in individual neurons and work out what each one is doing. This is why there’s a lot of debate about how smart these systems really are.

What I hope this series of posts can do is help people form their own opinions and intuition about these systems. I really encourage you to spend significant time trying them out for yourself. Kick the tires with ChatGPT or Google’s Gemini. I can guarantee you will experience at least a few Wow! moments when the LLM does something impressive.

I also recommend paying close attention to your own potential biases. We may want to dismiss these tools as only capable of producing mid-level work that isn’t up to the standards of experts. That’s also true of most of us.

On the other hand, we may overestimate how smart they are just because they seem to speak our language. We don’t know of any other animals or systems with the same facility for language that we have. We consider language to be a key part of what makes humans special. This means we tend to believe that if it can speak like us, it must be as smart as us: the ability to talk about something implies knowing about it. However, we have all met people who talk well, but it’s not clear they understand anything.

The other thing to pay attention to is how much of what the LLM produces is a result of you coaching it to do better. As we chat with an LLM, it’s easy to find ourselves leading it to the right answer. It’s subtle, but if you look for it you might notice that while an LLM is a good partner for figuring something out or creating an essay, the human in the loop is providing an extra push to the process that the LLM can’t provide on its own.

In the next post we return to looking at AI and images.

Key Points

  • Large neural net models like LLMs are expensive to train, but once they are well trained they can be fine-tuned for more specific tasks with far less data and far fewer computing resources.
  • ChatGPT required fine-tuning to function well as a conversational partner. This fine-tuning taught it to create answers that look like good answers, but they aren’t necessarily correct answers.
  • LLMs can “remember” previous parts of a conversation, but only up to the limit of their context window.
  • The quality of answers from any LLM depends a lot on the human user, specifically in how prompts are constructed to help coach the system to give good responses.
  • We can use grounding to add additional information into the context so that the LLM will be able to produce better answers.
  • We can’t really say how intelligent these models actually are but their ease of use makes it possible for anyone to try them out and form their own opinion.
