CLIP Prompt Engineering for Generative Art

A primer on what might be the most valuable skillset of the coming years, beyond just generative art.

UPDATE 09/29/2021: This post gives high-level tips for prompt engineering, but if you want to see more of this in practice, look to the One small step for GAN Instagram page. I cannot recommend this highly enough.

UPDATE 10/16/2021: Listed a claimed fix for one of CLIP’s bugs.

UPDATE 11/07/2021: Added new details about CLIP fine-tuned on datasets beyond ImageNet.

What is CLIP?

CLIP is a transformer model from OpenAI that is used to match text embeddings with image embeddings. The motivation behind CLIP is simple enough: We can get transformers to make representations of text. We can get transformers to make representations of images. If we can get the image and text representations describing the same concepts close to each other, we can easily map images to text (or vice versa).
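To make this concrete, here is a minimal inference sketch using OpenAI's open-source `clip` package and PyTorch; the image filename and candidate captions are placeholders:

```python
# Minimal sketch: score how well a few captions match an image using OpenAI's CLIP.
# Assumes `torch`, `Pillow`, and the `clip` package (github.com/openai/CLIP) are installed.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)

image = preprocess(Image.open("mushroom.png")).unsqueeze(0).to(device)  # placeholder filename
texts = clip.tokenize(["a mushroom", "a dragon", "a castle on a hill"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image).float()
    text_features = model.encode_text(texts).float()

# Cosine similarity between the image embedding and each text embedding.
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)
print((image_features @ text_features.T).squeeze(0).tolist())  # higher = closer match
```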

CLIP, neural search, generative models

For anyone used to one-hot encoding categorical variables and manually mapping them to human-intelligible text, CLIP almost seems like cheating.

One feature of this image-to-text mapping is that it makes neural search applications much easier to build for images and image databases. At the image level, it means we can combine CLIP with tools like YOLOv5 and search for objects in images using only descriptions. As an example, here’s the CROP-CLIP project, which uses CLIP to crop out subsets of images.

Crop-CLIP selects the clock from the image when presented with the text prompt “What’s the time?”

Unlike traditional object detection, CLIP is not restricted to a predefined set of classes or to a top-down pipeline that requires bounding boxes or instance segmentations. Try it out yourself with @kevin_zakka’s CLIP Google Colab.

At the level of image databases, it means we can search for content even with vague descriptors. Take this example CLIP Search project that allows one to search for images based on the content of the image, rather than the image’s connection to high-traffic websites.

CLIP search presents images of women holding flowers when presented with the text prompt “Girl Holding Flowers”
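Under the hood, this kind of search is the same embedding trick applied to a whole folder of images: embed every image once, embed the query text, and sort by cosine similarity. A rough sketch (the folder path and query string are placeholders):

```python
# Sketch: rank a local folder of images against a free-text query with CLIP.
import glob
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)

paths = sorted(glob.glob("photos/*.jpg"))  # placeholder image folder
images = torch.stack([preprocess(Image.open(p)) for p in paths]).to(device)

with torch.no_grad():
    image_emb = model.encode_image(images).float()
    text_emb = model.encode_text(clip.tokenize(["girl holding flowers"]).to(device)).float()

image_emb /= image_emb.norm(dim=-1, keepdim=True)
text_emb /= text_emb.norm(dim=-1, keepdim=True)

# Print the top five matches, best first.
scores = (image_emb @ text_emb.T).squeeze(1)
for score, path in sorted(zip(scores.tolist(), paths), reverse=True)[:5]:
    print(f"{score:.3f}  {path}")
```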

Conceptually, we can extend this further. We can go beyond searching individual images or finite sets of pre-defined images. We can use CLIP to search the seemingly infinite latent space in which one could find every possible image that can be rendered with a given number of pixels. In short, CLIP makes it possible to guide a generative model to produce new images based on an input string corresponding to almost any possible subject.

For anyone who’s read Jorge Luis Borges’ The Library of Babel, you might be rightly suspicious of claims of easy navigation of infinite spaces. After all, it might seem like navigating infinite possible images has just been traded for navigating infinite possible text strings. I have two pieces of reassuring news for you:

  1. The latent space is technically not infinite. With 16,777,216 possible hexadecimal RGB color codes and a $256 \times 256$ pixel image (65,536 pixels in total), we could draw $\mathrm{C}_{16777216}^{65536} \approx 7.393 \times 10^{186229}$ possible images (a quick sanity check of this figure is sketched just after this list). The space is just very large, and many parts of it are practically indistinguishable to human eyes. It also helps that generative models can navigate this space via manifold interpolation (i.e., which shapes are closest to each other?) instead of relying on linear interpolation (i.e., what is the average between these two points?).
  2. There are plenty of useful and easily teachable tricks for choosing the best prompts. If you understand the language CLIP used when learning its embeddings (e.g., OpenAI’s CLIP was trained on English text), you can easily tweak your prompts to find what you want.
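If you want to double-check the figure from point 1, here is a quick back-of-the-envelope calculation that stays in log space (the helper function name is just illustrative):

```python
# Quick sanity check of the count from point 1: log10 of C(16,777,216 choose 65,536),
# computed with log-gamma so we never materialize a ~186,000-digit integer.
from math import lgamma, log

def log10_binomial(n: int, k: int) -> float:
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)) / log(10)

exponent = log10_binomial(16_777_216, 65_536)
mantissa = 10 ** (exponent % 1)
print(f"~{mantissa:.3f} x 10^{int(exponent)}")  # should land near 7.393 x 10^186229
```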

Choosing your generative model to use with CLIP

CLIP is distinct from the generative model itself, and is not limited to a single model.

For example, consider The PixelDraw-guided CLIP implementation. It’s great for generating the kinds of images you’d expect to be used as assets in 16-bit or faux-retro games. Now consider the Loot NFT project, which is built around providing prompts for things like AI-generated art:

On the Left, the Loot project announcement tweet. On the Right, CLIP-guided pixel art generated from the lines of text in the tweet. Starting at the top right and going clockwise, “Grim Shout” Grave Wand of Skill +1, Hard Leather Armor, Divine Hood, Hard Leather Belt, “Death Root” Ornate Greaves of Skill, Studded Leather Gloves, Necklace of Enlightenment, and Gold Ring.

As another example, consider what happens if we use a purpose-built generator like StyleGAN3, one that’s geared towards generating anything as long as it resembles a human face.

Prompt: “Upper-class twit, drawn by The New Yorker”. Multiple seed values were planned, but THIS was the first result. It seems the generator couldn’t decide between Jeremy Clarkson and James May from Top Gear/The Grand Tour…so it went with both of them at the same time.

This space becomes even more complex when we try to combine CLIP guidance with visual effects like zooming.

However, for the purposes of this post I will be focusing on the outputs of a model that combines a conditional ImageNet classifier with OpenAI’s unconditional generative diffusion model (courtesy of @RiversHaveWings). If you are interested in the other approaches out there, here is a growing list of some of the latest methods:

| Technique | Description | Author(s) | Notes |
|---|---|---|---|
| Attn-GAN | The OG text-to-image generator | notebook by Derrick Schultz | The original text-to-image Colab |
| Big Sleep | BigGAN controlled by CLIP | Ryan Murdock | Demo video |
| Quick CLIP Guided Diffusion | Fast CLIP/Guided Diffusion image generation | Katherine Crowson, Daniel Russell, et al. | This is the technique we use throughout the rest of this post |
| S2ML Art Generator | | Justin Bennington | |
| Zoetrope 5.5 | CLIP-VQGAN tool | bearsharktopus | |

If you want to see the full extent of generators that can be combined with CLIP, you can check out this awesome GitHub repo.

Stylistic Choices

Let’s start off by stressing the largest benefit of using CLIP to create generative art: You can create almost any conceivable style with it. CLIP was trained by pairing images with descriptive text taken from around the internet. A lot of that text had descriptors such as what artistic medium the image was drawn/rendered/created in, who the creator was, what company/organization created/released/published the image, etc.

Below is a demonstration of how CLIP modifies images of the same subject (a "Mushroom", "Dragon", "Castle on a hill", or "Spaceship", all things recognizable across many different cultures) when combined with more than 200 different style descriptors (prompts in the format "{subject} | {style}").

NOTE: A similar version of this exists for CLIP+VQGAN prompts, but I’ve chosen to use CLIP+Diffusion for this demo.
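For reference, the grids below were produced by crossing each subject with each style modifier. A tiny sketch of that prompt enumeration (listing only a handful of the 200+ modifiers) might look like this:

```python
# Sketch: enumerate "{subject} | {style}" prompts for a style-comparison grid.
# Only a handful of the 200+ style modifiers are listed here.
subjects = ["Mushroom", "Dragon", "Spaceship", "Castle on a hill"]
styles = ["anime", "psychedelic", "Soviet propaganda", "Ukiyo-e", "Flemish Baroque",
          "pencil sketch", "PS1 graphics", "made of liquid metal",
          "by James Gurney", "trending on artstation", "8k resolution"]

prompts = [f"{subject} | {style}" for subject in subjects for style in styles]
print(len(prompts), "prompts, e.g.:", prompts[0])  # e.g. "Mushroom | anime"
```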

CLIP-based generative models can emulate an enormous variety of styles from many different periods and cultures.

[Image grid: "Mushroom", "Dragon", "Spaceship", and "Castle on a hill" rendered with the style modifiers "anime", "psychedelic", "Soviet propaganda", "Ukiyo-e", and "Flemish Baroque"]

If your medium of choice doesn’t belong to any specific time period, that’s doable as well.

[Image grid: the same four subjects rendered as a "pencil sketch"]

Have a specific videogame in mind? CLIP has embeddings for specific consoles and rendering engines.

[Image grid: the same four subjects rendered with "PS1 graphics"]

You can even describe different materials and textures for the subject of your image.

[Image grid: the same four subjects rendered as "made of liquid metal"]

One might not even know the specific name of an artistic school or movement. Just describing a time period by year or decade will do as well.

[Image grid: the same four subjects rendered with the time-period descriptors "1990s, 1995", "(1962) directed by", and "cinematography by"]

When it comes to emulating styles of specific artists, you can go beyond famous ones like Van Gogh or Picasso.

[Image grid: the same four subjects rendered "by James Gurney"]

Because it was trained on image-text pairs scraped from the internet, CLIP can create styles based on stereotypes of the content of websites or publications.

[Image grid: the same four subjects rendered as "trending on artstation"]

This strategy of stereotyping even extends to photography techniques. While it doesn’t produce the resolution of the described camera equipment, it does still try to create a “style” based on images commonly associated with the descriptor.

[Image grid: the same four subjects rendered with "8k resolution"]

Personally, one of my favorite styles is “by Greg Rutkowski” (Example)

Being mindful of seed values

I should note that all of the examples above have been using a pseudorandom seed value of 0. This might seem like an insignificant detail, but as I’ll demonstrate, this might be one of the most important choices in determining the image’s final form.

Consider the following examples made with the prompt "the apotheosis of the lunatic, by William Blake". This was inspired by @RiversHaveWings’s prompt “the apotheosis of the lunatic, by James Gurney”. I switched out James Gurney for William Blake, because if you’re going to use a word like “apotheosis” in your prompt, it might be interesting to try it in the style of an artist who made some of the most famous post-Renaissance artwork out there.

This is what our diffusion-based model outputs with seed values of 0, 1337, 80085, 42069, and 113553:

Prompt “the apotheosis of the lunatic, by William Blake”, with seed values of 0, 1337, 80085, 42069, and 113553, respectively

This is a pretty striking result. If you look at William Blake’s works, you’ll see plenty of visual similarities. By changing just one number, we can get an enormous variety of different yet stylistically similar outputs.

Recording seed values is also how you reproduce certain interesting outputs. For example, one person created an output for the prompt “Burning Abyss” that seemed to resemble “Burning abbeys” instead. Sadly, I have not been able to reproduce this with VQGAN or any of the various other Diffusion-based models.
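If you want your own outputs to be reproducible, pin the random number generators before each run and write the seed down next to the prompt. The exact knobs differ between notebooks, but a generic sketch (with an illustrative helper name) looks like this:

```python
# Sketch: pin the pseudorandom seeds so a CLIP-guided run can be reproduced later.
# Variable names are illustrative; each notebook exposes its own `seed` setting.
import random
import numpy as np
import torch

def set_seed(seed: int = 0) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

set_seed(1337)  # record this alongside the prompt text
```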

Even misspelled words can be turned into prompts

Since CLIP is guided by distances between word embeddings, it’s often influenced by the literal meanings of those words (even if that’s not the intent of the user).

CLIP also still works even if you misspell the name of the artist whose style you want to emulate. Below is a sample of CLIP Diffusion outputs from the correctly spelled "art by rene magritte", along with outputs resulting from the incorrectly spelled "art by rene magirrite".

"art by rene magritte"
(CORRECT spelling)
"art by rene magirrite"
(INCORRECT spelling)

The images on the left resemble the more familiar style of René Magritte, while the images on the right are more colorful and often resemble photographs.

At best, you may find a new style that you like. At worst, let this be a lesson to use a spell-checker.

CLIP can take in emojis

We’ve seen that CLIP is versatile. After all, it was trained on a corpus of text scraped from the internet, and there are plenty of misspelled words on the internet.

“But”, you may be asking, “If CLIP is working with text it found on the internet, does that include emojis?”

Yes, yes it does. CLIP can absolutely take one or more emojis, even combined with other text.

Prompt: ”🦩”

Prompt: “🥖🌩️”

Prompt: “The planet of 😈”
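This works because CLIP’s byte-pair-encoding tokenizer treats emoji as ordinary Unicode text, so emoji prompts embed with no special handling. A quick check, reusing the `clip` package from earlier:

```python
# Emoji prompts tokenize and embed like any other text.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/16", device=device)

tokens = clip.tokenize(["🦩", "🥖🌩️", "The planet of 😈"]).to(device)
with torch.no_grad():
    emoji_embeddings = model.encode_text(tokens)
print(emoji_embeddings.shape)  # one text embedding per prompt
```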

Be mindful of biases associated with the input text

As with any AI application involving embeddings, one needs to be mindful of the societal biases that may be exposed by the input text. One of the benefits of using these embeddings is that we can use cosine distances to place approximate numerical values on those biases. Here are the cosine similarities between various CLIP text embeddings and the text embedding of "a human being" (for the ViT-B/16 CLIP as of August 29th, 2021); a sketch for computing these yourself follows the table. Based on these numbers, it seems this version of CLIP considers "a man", and among the race-specified descriptions "a white man", to be closest to a ‘typical’ human being.

| Text | Cosine similarity to "a human being" |
|---|---|
| "a man" | 0.927799 |
| "a white man" | 0.904146 |
| "a woman" | 0.903923 |
| "a black man" | 0.893208 |
| "a white woman" | 0.889220 |
| "a black woman" | 0.873916 |
| "an Asian man" | 0.860326 |
| "an Asian woman" | 0.830546 |

On the less obviously harmful side, there’s also the phenomenon of metonymy: for example, if you ask for a “clockwork bird,” CLIP thinks “ah, steampunk,” and starts putting top hats on things, including, at times, the bird.

Consider the classic example in AI fairness studies: Gender biases. Using the same subjects as the style chart above, here’s how CLIP+Diffusion renders them when given gender descriptions as an unweighted input style.

[Image grid: "Mushroom", "Dragon", "Castle on a hill", and "Spaceship" rendered with the gender descriptors "androgynous", "masculine", "genderless", "feminine", and "extremely gendered, masculine and feminine"]

Aside from the “genderless” and “androgynous” dragons (which look more like blue sea dragons, the sea slugs that are hermaphrodites), there seem to be consistent stylistic patterns for each of the gendered words. The “feminine” spaceship even looks like it’s being held by a hand with manicured fingernails.

As another example: if you mention a rainbow in a prompt and then fuzz the embedding, you may see pride flags, flamboyant makeup, etc. in the output. So if such a prompt also mentions a male subject, CLIP might render him as trans or femme.

There’s also the issue that understanding numbers is a task that current implementations of CLIP can’t seem to get right. Consider the outputs of the prompt "{N} goldfish swimming in a glass bowl on ArtStation."

| N | Prompt | Output image |
|---|---|---|
| One | "One goldfish swimming in a glass bowl on ArtStation." | (image) |
| Two | "Two goldfish swimming in a glass bowl on ArtStation." | (image) |
| Three | "Three goldfish swimming in a glass bowl on ArtStation." | (image) |
| A Thousand | "A Thousand goldfish swimming in a glass bowl on ArtStation." | (image) |

“One” and “Two” clearly show more goldfish than requested, and “Three” and “A Thousand” seem to be lacking.

UPDATE 10/16/2021: dribnet has proposed a solution to the counting problem. For diffusion-based models, it comes down to reducing the output size, rather than modifying the input prompt. Let this be a lesson that modifying the input prompt isn’t everything.

Weighting different parts of the prompt

We’ve demonstrated many different style modifiers above, but there’s no reason you should pick just one. For example, inspired by Ariel Ekgren, here’s "portrait of 60s San Francisco Worker | painting by Gurney | 70s antipsychotic painting | portrait | matte painting | trending on artstation" fed into the CLIP Diffusion model.

CLIP-guided Diffusion prompt: “portrait of 60s San Francisco Worker | painting by Gurney | 70s antipsychotic painting | portrait | matte painting | trending on artstation”

The result is a stylistic mix of many of the aforementioned style modifiers.

Still, we can go even further than this, because CLIP-guided notebooks can also take numerical weights into account. By adding weights to your prompt (e.g., 0.5, +1, -1, +5, -5, etc.), you can get a variety of different outputs with the same descriptors. Consider the following example, where we take the prompt “a terrifying monster | painting by Greg Rutkowski” and change the relative weights of the subject and the style.

| Input prompt (seed value = 1337) | Output image |
|---|---|
| "a terrifying monster 2.0 \| painting by Greg Rutkowski 0.0" | (image) |
| "a terrifying monster 1.5 \| painting by Greg Rutkowski 0.5" | (image) |
| "a terrifying monster 1.0 \| painting by Greg Rutkowski 1.0" | (image) |
| "a terrifying monster 0.5 \| painting by Greg Rutkowski 1.5" | (image) |
| "a terrifying monster 0.0 \| painting by Greg Rutkowski 2.0" | (image) |

Since we use the same seed value (and thus the same noise pattern for the initialization), the images use a similar color palette. However, as we change the weighting to favor the artist over the subject, the output looks less like a monster and more like a generic fantasy painting.

Note that these weights are relative to each other. Applying these weights to two styles allows one to get a variety of different mixes of those styles (consider the continuum of cyberpunk and Impressionism).

Oh, and we can go in the direction of negative weights as well.
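Under the hood, the various notebooks handle weighting differently, but the common idea is to turn each "{text} {weight}" segment into its own CLIP text embedding and combine the per-segment losses using those weights. Below is a rough sketch under that assumption; the trailing-decimal parsing convention and the function names are illustrative, not the exact code of any particular notebook.

```python
# Sketch: weighted prompts as a weighted sum of per-segment CLIP losses.
# Parsing a trailing decimal weight (e.g. "... 1.5") is illustrative; real notebooks
# each use their own syntax. Segments without a weight default to 1.0.
import re
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/16", device=device)

def parse_weighted_prompt(prompt: str):
    """'a terrifying monster 1.5 | painting by Greg Rutkowski 0.5' -> [(text, weight), ...]"""
    pairs = []
    for segment in prompt.split("|"):
        segment = segment.strip()
        match = re.search(r"(-?\d+\.\d+)\s*$", segment)
        if match:
            pairs.append((segment[: match.start()].strip(), float(match.group(1))))
        else:
            pairs.append((segment, 1.0))
    return pairs

def weighted_clip_loss(image_embedding: torch.Tensor, prompt: str) -> torch.Tensor:
    """Weighted sum of cosine distances between an L2-normalized (1, dim) image embedding
    and the text embedding of each prompt segment."""
    loss = torch.zeros((), device=device)
    for text, weight in parse_weighted_prompt(prompt):
        with torch.no_grad():
            text_embedding = model.encode_text(clip.tokenize([text]).to(device)).float()
        text_embedding = text_embedding / text_embedding.norm(dim=-1, keepdim=True)
        distance = 1 - (image_embedding @ text_embedding.T).squeeze()
        # Negative weights push the image *away* from that text instead of towards it.
        loss = loss + weight * distance
    return loss

print(parse_weighted_prompt("a terrifying monster 1.5 | painting by Greg Rutkowski 0.5"))
# [('a terrifying monster', 1.5), ('painting by Greg Rutkowski', 0.5)]
```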

Limitations of using just ImageNet-based CLIP for creating art

Ultimately, CLIP is used for matching images to text, and here it guides a diffusion model that was trained on ImageNet. CLIP is great for creating images containing collections of relevant features that match the words. Sadly, there’s no existing tool for creating a loss function that asks “how aesthetically pleasing is this image?”.

One way this has been addressed is to actively re-train these models on large datasets of images beyond ImageNet. Katherine Crowson is leading one such effort to create a fine-tuned 512x512 diffusion model that has also been trained on a corpus of artwork-text pairs.

Augmenting prompt engineering

If you’ve found particularly useful prompts but don’t know where to go from there, you can feed them to a language model. For example, if you’ve found a prompt theme or a set of styles that reliably produce images you like, you can use GPT-2 or GPT-3 to generate new prompts from these successful ones.

I’ve included an example here of what comes out of OpenAI’s GPT-3 when it is primed with past successful prompts. It’s recommended that you turn the temperature up to 1.0 or higher, as this is a creativity-intensive task by definition (when turned lower, it just seems to repeat "a landscape by Hieronymus Bosch").

import os
import openai

# Use your own API key (e.g., exported as the OPENAI_API_KEY environment variable).
openai.api_key = os.getenv("OPENAI_API_KEY")

# Each few-shot example in the prompt below starts with "- " and ends with ".".
start_sequence = "- "
restart_sequence = "."

# Few-shot prompt: a handful of past successful CLIP prompts, one per line.
response = openai.Completion.create(
  engine="davinci",
  prompt="- utopia, a world at peace.\n- the US Capitol Building by Wassily Kandinsky.\n- Van Gogh's Starry Night in the style of ukiyo-e.\n- apotheosis of the lunatic by William Blake.\n- cosmic love and affection, video game screenshot.\n- flames of The Lovers tarot card in the style of ukiyo-e.",
  temperature=1,  # high temperature = more varied, more creative completions
  max_tokens=1024,
  top_p=1,
  frequency_penalty=0,
  presence_penalty=0
)

Now, per the OpenAI developer agreement, I cannot show you the full outputs, but here is an interesting sample:

- cockroaches theme park by Sula Luna.
- SQUIRREL!!! Ninja Warfare!!!.
- overgrown ants by Wayne Barlowe.
- zombies playing poker.
- architecture by a child.
- clowns by Konstantin Maslov.
- portraits by Anna Podedworna.
- crime and punishment by Kate Spedding.
- the inventor Filippo Tommaso Marinetti.
- a Roman gladiator by Omar Rayyan.
- a blue tiger by Roman Furman.
- Grains of sand by Andreas Vesalius.
- the artist and thinker Ala Stoyanov.
- the Starry Night by Marc Chagall.
- the romantic painter John William Waterhouse.
- Submarine by Fernando Botero.
- Miro's Guernica.

Feel free to use any of these above prompts in your own CLIP-guided image generation projects.

Where is this useful outside of generative art?

This technique gained notoriety in the Generative Art community last year with the announcement of DALL-E from OpenAI, but before that it had been used in getting the most out of language models like GPT-2 and GPT-3.

Already, OpenAI has been working on a code-specific version of GPT-3 known as OpenAI Codex. It has been rolled out in product form as GitHub Copilot (though OpenAI also has a beta for using the Codex API directly). Codex is still a long way from replacing software engineers, but it’s worth paying attention to how much GPT-3 models have advanced in just a few years. OpenAI is already working on GPT-4; while it will not be too different in size from GPT-3, it will incorporate newer improvements in model training, specifically with code generation in mind.

Ultimately, exercising your prompt engineering skills in generative art might be a precursor to becoming a skilled prompt engineer for AI-assisted coding.

References

Cited as:

@article{mcateer2021clippe,
    title = "CLIP Prompt Engineering for Generative Art",
    author = "McAteer, Matthew",
    journal = "matthewmcateer.me",
    year = "2021",
    url = "https://matthewmcateer.me/blog/clip-prompt-engineering/"
}

If you notice mistakes or errors in this post, don’t hesitate to contact me at [contact at matthewmcateer dot me] and I will be very happy to correct them right away! Alternatively, you can follow me on Twitter and reach out to me there.

See you in the next post 😄
