Explaining What’s up with AI (Part 5 – Text to Image)

Part 1 of the series is here.

After spending time talking about LLMs and what they can do with text, let’s go back to looking at images. Text to image generators allow you to just type something like:

“create an image of a cat wearing a blue baseball cap”

And you get:

A cat with a hat

There are several tools out there that can do this kind of thing now. The image above was created with DALL-E from OpenAI. These systems aren’t finding an image out there on the internet; they create the image from the prompt given to them.

Here are other examples from the same prompt:

Based on what we have seen so far, how can this possibly work? How would we train a model to take a text description of an image and generate a new unique image each time? These systems work in different ways, and fully understanding them is a tall order. What I’d like to focus on in this post is using systems like this as an example of how machine learning scientists have been able to build powerful systems by combining different models that work in concert.

Let’s talk about some models which might be building blocks towards being able to create a text to image system.


The first model to consider is called CLIP. The builders of the model were able to collect 400 million images along with text about those images. They did this using a lot of images from the internet along with their captions and the alt text that is added to images for visually impaired users. This allowed them to train the model to be very good at predicting whether or not some text matches an image. If we take an image along with a piece of text, CLIP is able to produce a score for how well the image and text match. It does this well even for images which were not part of its training set.

It won’t directly generate an image from text, but we could use this model if we were building a tool that searches for images which best match a given text. It can also be used to see which of a set of texts best matches an image.

An image of a television studio and some ratings for different descriptions
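The idea of “scoring a match” can be made concrete with a small sketch. This is not the real CLIP: the embedding vectors below are hand-picked toy values standing in for what CLIP’s image and text encoders would produce, but the scoring step (cosine similarity between the two embeddings) is the same basic idea.

```python
import math

# Toy stand-ins for CLIP's two encoders. Real CLIP maps images and
# text into a shared embedding space using neural networks; here we
# hand-pick tiny vectors so the scoring idea is easy to follow.
def embed_image(name):
    fake_image_embeddings = {
        "cat_in_blue_cap.png": [0.9, 0.8, 0.1],
        "empty_tv_studio.png": [0.1, 0.2, 0.95],
    }
    return fake_image_embeddings[name]

def embed_text(text):
    fake_text_embeddings = {
        "a cat wearing a blue baseball cap": [0.85, 0.75, 0.05],
        "a television studio": [0.05, 0.1, 0.9],
    }
    return fake_text_embeddings[text]

def clip_score(image_name, text):
    """Cosine similarity between the two embeddings:
    close to 1.0 means a strong match, near 0 means unrelated."""
    a, b = embed_image(image_name), embed_text(text)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# The cat image should score higher against the cat caption
# than against an unrelated description.
print(clip_score("cat_in_blue_cap.png", "a cat wearing a blue baseball cap"))
print(clip_score("cat_in_blue_cap.png", "a television studio"))
```

With real CLIP the encoders are learned from those 400 million image–text pairs, so the scores generalize to images and captions the model has never seen.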


Sometimes images have noise in them. You might see this when you take a photo with a camera in very low light.  We might want to remove this noise. How could we remove this noise using machine learning? How would we get data to train the model?  We could start with normal images and make them noisier and noisier.

By the end we just have some noise.

We can run a lot of images through this process and then use all those noisy images to train a model to do the opposite. Given an image that contains a known amount of noise, the model is trained to produce the best image it can with less noise. We have a lot of noisy inputs and the corresponding less noisy desired results, so we can do a lot of training. We don’t know exactly what the model is learning, but perhaps it’s learning to detect shapes, lines, or other features so it can accurately reconstruct what is in the current image with less noise. It might have to make things up in order to do this, but you will get a result that seems nicer. There’s a good chance your phone’s camera is already doing something like this.
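The noising process can be sketched in a few lines. Everything here is a toy: the “image” is just a short list of pixel values and the noise amount is made up, but it shows how one clean image yields many (noisier input, cleaner target) pairs for training a denoiser.

```python
import random

# Add Gaussian noise to a flat list of pixel values.
def add_noise(pixels, amount, rng):
    return [p + rng.gauss(0, amount) for p in pixels]

# Build progressively noisier versions of one clean image, then pair
# each noisier version (model input) with the slightly cleaner one
# just before it (training target).
def make_training_pairs(clean_pixels, steps=5, rng=None):
    rng = rng or random.Random(0)
    versions = [clean_pixels]
    for _ in range(steps):
        versions.append(add_noise(versions[-1], amount=10.0, rng=rng))
    return [(versions[i + 1], versions[i]) for i in range(steps)]

clean = [128.0, 130.0, 127.0, 129.0]   # a tiny 4-pixel "image"
pairs = make_training_pairs(clean)
print(len(pairs))   # one (noisy input, cleaner target) pair per step
```

Real systems do this with millions of images and carefully scheduled noise levels, but the shape of the data is the same: noisy in, less noisy out.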


Sometimes we would like images to be larger. If we have a small image of a cat that is something like 128×128 pixels it won’t have very many details. We might want a larger version of it and we can put this image into any kind of image software to change its size to 1024×1024 pixels, but the result will just be a pretty blurry larger image.

If we had a lot of images to train on we might be able to train a machine learning model to do better. We could get the training data for this by starting with a bunch of 1024×1024 images and then scaling them down to 128×128. Now we have a large training set to teach the machine learning model to go in the other direction. The model won’t be able to see detail that isn’t there but it will learn to scale things up in ways which look reasonable.
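Making this kind of training data is simple enough to sketch. The example below downscales a “large” image by averaging blocks of pixels, which is one common way to shrink an image; the (small, large) pair is then a training example for a model learning to go the other way. It is a pure-Python toy using nested lists as a grayscale image, not a real data pipeline.

```python
# Shrink a grayscale image (a list of rows) by averaging each
# factor-by-factor block of pixels into a single pixel.
def downscale(image, factor):
    h, w = len(image), len(image[0])
    small = []
    for y in range(0, h, factor):
        row = []
        for x in range(0, w, factor):
            block = [image[y + dy][x + dx]
                     for dy in range(factor) for dx in range(factor)]
            row.append(sum(block) / len(block))
        small.append(row)
    return small

# An 8x8 "large" image becomes the 2x2 "small" training input;
# the original 8x8 image is the training target.
large = [[float(x + y) for x in range(8)] for y in range(8)]
small = downscale(large, factor=4)
print(len(small), len(small[0]))   # 2 2
```

Scaling this up to 1024×1024 photos gives you as many training pairs as you have photos, with no hand labeling required.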

Putting it together

Now that we have seen these 3 kinds of image models, we will try to put them together in order to render an image based on a prompt like “cat wearing a blue baseball cap”.

First let’s consider just the model for removing noise. Suppose we start with a completely random noise image and ask it to remove the noise. It will do its best and make something that has a little less noise in it. We can think about it like the model is trying to find patterns, and if it cannot it will try to make some up, just like the LLMs we have seen before will. Machine learning engineers sometimes refer to this as dreaming. Anyway, if we keep applying the model to remove noise, eventually we will get an image of something. We just have no idea what it will be.

That’s where a model like CLIP comes in. Since CLIP is able to relate text descriptions and images, we can use it to control what the noise-removing model dreams up from the random noise we started with. The math becomes very complicated here, but essentially we can align the data for the two models and evaluate them together so that as we try to find patterns in the noise we find patterns which align with the text we were originally given. We are conditioning the model to dream up a cat wearing a blue baseball cap.

This will get us what we wanted, but unfortunately the computation here can be very expensive. Many of these systems are only able to produce pretty small images in a reasonable amount of time. That’s where the zoom model above comes in. We can take a small image (maybe only 64×64 pixels) and use another model to increase the size so that it is big enough to meet the needs of the user.
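The control flow of the whole pipeline can be sketched even if the real components can’t. Everything below is a hypothetical stand-in, not a real diffusion model: the “denoising” step just nudges pixels toward a toy target value looked up from the prompt, which plays the role of CLIP guidance steering what gets dreamed up, and the “upscale” step is a simple nearest-neighbour zoom.

```python
import random

rng = random.Random(0)

# Pretend guidance table: a target pixel value per known prompt.
TOY_CONCEPTS = {
    "cat wearing a blue baseball cap": 200.0,
    "a television studio": 50.0,
}

def denoise_step(pixels, prompt):
    """Nudge every pixel a little toward the prompt's target,
    standing in for one guided noise-removal step."""
    target = TOY_CONCEPTS[prompt]
    return [p + 0.2 * (target - p) for p in pixels]

def upscale(pixels, factor):
    """Nearest-neighbour zoom: repeat each pixel `factor` times."""
    return [p for p in pixels for _ in range(factor)]

# 1. Start from pure random noise (a tiny 4-pixel "image").
image = [rng.uniform(0, 255) for _ in range(4)]

# 2. Repeatedly denoise, guided by the prompt.
for _ in range(30):
    image = denoise_step(image, "cat wearing a blue baseball cap")

# 3. Upscale the small result for the user.
big_image = upscale(image, factor=4)
print(len(big_image))   # 16
```

The point is the structure: random noise in, many guided denoising steps, then a final upscale. Real systems replace each toy function with a large trained model.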

Newer Enhancements

These systems have become pretty fancy lately and also include a lot of ways the users can control how the image is made. Some examples are:

  • You can select a part of the image to update while leaving the rest alone.
  • You can also ask the system to extend the image beyond the bounds of the current image.
  • You can start with your own images or crude drawings to control the result.
  • Systems such as Pika Labs can create full videos from text prompts.

When we first saw these text to image generation systems, there was (and still is) a lot of worry about how this might put artists out of work. As the systems become more controllable we are seeing lots of ways that artists can use them to create new things which would be much harder or impossible without these tools. If this kind of stuff interests you I recommend following Bilawal Sidhu. He posts a lot of things like this.

LLMs and Multimodality

Some of the most recent LLM models such as GPT-4 and Google Bard are trained on both image and text data. We say these models are multimodal. This opens up a lot of new possibilities for using these models because they can look at an image and then describe what is there using text. That text can then be used as input for further questions or other text-to-image generations. We are just beginning to see what the possibilities are here.

In the next post, we’ll cover a bit more about the dark side of this new AI technology.

Key Points

  • Scientists developed a model called CLIP which can be used to score how similar an image is to a text description
  • Machine learning models can also be used to remove noise from images or increase their resolution
  • We can combine several different models to create more powerful AI systems
  • Systems that can create images from text are evolving and can be used as part of a powerful art creation pipeline and can even create video.
  • LLMs are now also trained on images so they can interpret them as well as text. These multimodal models will likely be standard in the future.





