CLIP Prompt Engineering for Generative Art
A CLIP primer with 3500+ prompt output examples.
| UPDATED
UPDATE 09/29/2021: This post gives high-level tips for propmpt engineering, but if you want to see more of this in practice, look to the One small step for GAN Instagram page. I cannot recommend this higly enough.
UPDATE 10/16/2021: Listed a claimed fix for one of CLIP’s bugs.
UPDATE 11/07/2021: Added new details about CLIP fine-tuned on datasets beyond ImageNet.
UPDATE 12/15/2021: If you’re just interested in the colab notebooks that allow you to create these images, I’ve added a section at the end with all the links (I still recommend reading through this full post).
What is CLIP?
CLIP is a transformer model from OpenAI that is used to match text embeddings with image embeddings. The motivation behind CLIP is simple enough: We can get transformers to make representations of text. We can get transformers to make representations of images. If we can get the image and text representations describing the same concepts close to each other, we can easily map images to text (or vice versa).
CLIP, neural search, generative models
For people who are previously used to one-hot encoding categorical variables and manually mapping those to human-intelligible text, CLIP almost seems like cheating.
One featur of this image-to-text mapping is that it makes neural search applications much easier to create for images and image databases. At the image level, it means we can combine CLIP with tools like YOLOv5 and search for objects in images using only descriptions. As an example, here’s the CROP-CLIP project, which uses CLIP to crop out subsets of images.
-side.png)
Unlike traditional ’Object Detection’, CLIP is not restricted to a predefined set of classes, or a top-down pipeline that requires bounding boxes or instance segmentations. Try it out yourself with @kevin_zakka ’s CLIP Google Colab
At the level of image databases, it means we can search for content even with vague descriptors. Take this example CLIP Search project that allows one to search for images based on the content of the image, rather than the image’s connection to high-traffic websites.

Conceptually, we can extend this further. We can go beyond searching individual images or finite sets of pre-defined images. We can use CLIP to search the seemingly infinite latent space in which one could find every possible image that can be rendered with a given number of pixels. In short, CLIP makes it possible to guide a generative model to produce new images based on an input string corresponding to almost any possible subject.
For anyone that’s read Jorge Luis Borges’ The Library of Babel, you might be rightly suspicious of claims of easy navigation of infinite spaces. After all, this might seem like navigating infinite possible images has just been traded for navigating infinite possible text strings”. I have two pieces of reassuring news for you:
- The latent space is technically not infinite. For 16,777,216 possible hexadecimal RGB color codes, and and pixel image (65,536 pixels in total), we could draw possible images. It’s just very large, and many parts are practically indistinguishable in the eyes of humans. It also helps that generative models can navigate this space via manifold interpolation (i.e., what shapes are closest to each other?) instead of relying on linear interpolation (i.e., what is the average between these two points?)
-
There are plenty of useful and easily-teachable tricks for choosing the best prompts. If you understand the language that CLIP used for learning embeddings (e.g., the OpenAI CLIP was based on English text), you can tweak your prompts easily to find what you want.
Choosing your generative model to use with CLIP.
CLIP is distinct from the generative model itself, and is not limited to a single model.
For example, consider The PixelDraw-guided CLIP implementation. It’s great for generating the kinds of images you’d expect to be used as assets in 16-bit or faux-retro games. Now consider the Loot NFT project, which is built around providing prompts for things like AI-generated art:

Another example, consider if we use a purpose-built generator like StyleGAN3, one that’s geared towards generating anything as long as it resembles a human face.

This space becomes even more complex when we try to combine with visual effects like zooming.
Sound 7#flumesounds #AIart #nftcollector pic.twitter.com/ihTqIEegLP
— mitri ❣️🔥❣️ (@squall222222) October 3, 2021
However, for the purposes of this post I will be focusing on the outputs of a model that combines a conditional ImageNet classifier with OpenAI’s unconditional generative diffusion model (courtesy of @RiversHaveWings). If you are interested in the other approaches out there, here is a growing list of some of the latest methods:
Technique | Colab link | Description | Authors | Notes |
---|---|---|---|---|
Attn-GAN | The OG text-to-image generator | notebook by Derrick Schultz | The original text-to-image colab | |
Big Sleep | BigGAN controlled by CLIP | Ryan Murdock | Demo Video | |
Quick CLIP Guided Diffusion | Fast CLIP/Guided Diffusion image generation | by Katherine Crowson, Daniel Russell, et al. | This is the technique we use throughout the rest of this post | |
S2ML Art Generator | Justin Bennington | |||
Zoetrope 5.5 | CLIP-VQGAN tool | bearsharktopus |
If you want to see the full extent of generators that can be combined with CLIP, you can check out this awesome Github repo
Stylistic Choices
Let’s start off by stressing the largest benefit of using CLIP to create generative art: You can create almost any conceivable style with it. CLIP was trained by pairing images with descriptive text taken from around the internet. A lot of that text had descriptors such as what artistic medium the image was drawn/rendered/created in, who the creator was, what company/organization created/released/published the image, etc.
Below is a demonstration of how CLIP modifies images of the same subject (either a "Mushroom"
, "Dragon"
, "Castle on a hill"
, or "Spaceship"
, things recognizable across many different cultures), but with more than 200+ different descriptors (prompts in the format "{subject} | {style}"
).
NOTE: A similar version of this exists for CLIP+VQGAN prompts, but I’ve chosen to use CLIP+Diffusion for this demo.
You can also click on all the thumbnail images below to see a higher resolution.
First, let’s consider how CLIP guides the diffusion process for 4 different subjects, without any style modifiers.
Mushroom | Dragon | Spaceship | Castle on a Hill | |
---|---|---|---|---|
(no style modifiers) |
![]() |
![]() |
![]() |
![]() |
CLIP-based generative models can emulate an enormous variety of styles from many different periods and cultures.
If your medium of choice doesn’t belong to any specific time period, that’s doable as well.
Have a specific videogame in mind? CLIP has embeddings for specific consoles and rendering engines.
You can even describe different materials and textures for the subject of your image.
One might not even know the specific name of an artistic school or movement. Just describing a time period by year or decade will do as well.
Though one needn’t be as specific as a definite year or decade. Another option would be to describe the style of a specific country, political faction, organization, or movement.
When it comes to emulating styles of specific artists, you can go beyond famous ones like Van Gogh or Picasso.
Because it was trained on image-text pairs scraped from the internet, CLIP can create styles based on stereotypes of the content of websites or publications.
This strategy of stereotyping even extends to photography techniques. While it doesn’t produce the resolution of the described camera equipment, it does still try and create a “style” based on images commonly assoicated with the descriptor.
And then there are plenty of styles that defy classification. CLIP can handle those, too.
Personally, one of my favorite styles is “by Greg Rutkowski” (Example)
Being mindful of seed values
I should note that all of the examples above have been using a pseudorandom seed value of 0
. This might seem like an insignificant detail, but as I’ll demonstrate, this might be one of the most important choices in determining the image’s final form.
Consider the following examples made with the prompt "the apotheosis of the lunatic, by William Blake"
. This was inspired by @RiversHaveWings’s prompt “the apotheosis of the lunatic, by James Gurney”. I switched out James Gurney for William Blake, because if you’re going to use a word like “apotheosis” in your prompt, it might be interesting to try it in the style of an artist that made some of the most famous post-renaissance artwork out there.
This is what our diffusion-based model outputs based on a seed values of 0
, 1337
, 80085
, 42069
, and 113553
:

This is a pretty striking result. If you look at William Blake’s works, you’ll see plenty of visual similarities. By changing just one number, we can get an enormous variety of different yet stylistically similar outputs.
Recording seed values is also how you reproduce certain interesting outputs. For example, one person created an output for the prompt “Burning Abyss” that seemed to resemble “Burning abbeys” instead.
Burning Abyss https://t.co/xdhNF3BVYg pic.twitter.com/jqaIaUmOMx
— images_ai (@images_ai) September 21, 2021
This result was achieved with a VQGAN-based generator. Nonetheless, I assumed that if the CLIP embeddings were close enough then this behavior could be reproduced with a diffusion model. After sampling the diffusion model a bunch of times, I got…well, close enough.

The burning abyss/abbeys confusion seems to be slightly more pronounced in some cases than others, but none of them are as pronounced in the example from Twitter. It is certainly possible that a certain seed value could result in the Diffusion model producing an image just as exaggerated, but that would likely involve searching a theoretically infinite space of seed values.
Let this be a reminder: If you want to continue your work on a particular prompt output, make sure to save the seed value.
Even misspelled words can be turned into prompts
Since CLIP is guided by distances between word embeddings, it’s often influenced by the literal meanings of those words (even if that’s not the intent of the user).
CLIP also still works even if you were to mis-spell the name of the artist whose style you wanted to emulate. Below is a sample of CLIP Diffusion outputs from the correctly spelled "art by rene magritte"
, along with outputs resulting from the incorrectly spelled art by rene magirrite
.
"art by rene magritte" (CORRECT spelling) | "art by rene magirrite" (INCORRECT spelling) |
---|---|
![]() | ![]() |
The images on the left resemble the more familiar style of René Magritte. The images on the right are often more colorful and often resemble photographs.
At best, you may find a new style that you like. At worst, let this be a lesson to use a spell-checker.
CLIP can take in emojis
We’ve seen that CLIP is versatile. After all, it was trained on a corpus of text scraped from the internet, and there are plenty of misspelled words on the internet.
“But”, you may be asking, “If CLIP is working with text it found on the internet, does that include emojis?”
Yes, yes it does. CLIP can absolutely take one or more emojis, even combined with other text.




Be mindful of Biases associated with the input text
As with any AI application involving embeddings, one needs to be mindful of any societal biases that may be exposed by the input text. One of the benefits of using these embeddings is that we can use cosine distances to place approximate numerical values on those biases. Here are the cosine similarities between various CLIP text embeddings to the text embedding of ""a human being"
(for the ViT-B/16 CLIP as of August 29th, 2021). Based these, it seems this version of CLIP considers "a white man"
to be closest to a ‘typical’ human being.
text | cosine similarity to “a human being” |
---|---|
"a man" | 0.927799 |
"a white man" | 0.904146 |
"a woman" | 0.903923 |
"a black man" | 0.893208 |
"a white woman" | 0.889220 |
"a black woman" | 0.873916 |
"an Asian man" | 0.860326 |
"an Asian woman" | 0.830546 |
On the less obviously harmful side, there’s also the phenomenon of metonymy: For example, if you ask for a “clockwork bird,” CLIP thinks “ah, steampunk,” and starts putting top hats on things, incl at times the bird
Consider the classic example in AI fairness studies: Gender biases. Using the same subjects as the style chart above, here’s how CLIP+Diffusion renders them when given gender descriptions as an unweighted input style.
Mushroom | Dragon | Castle on a hill | Spaceship | |
---|---|---|---|---|
androgynous | ![]() |
![]() |
![]() |
![]() |
masculine | ![]() |
![]() |
![]() |
![]() |
genderless | ![]() |
![]() |
![]() |
![]() |
feminine | ![]() |
![]() |
![]() |
![]() |
extremely gendered, masculine and feminine |
![]() |
![]() |
![]() |
![]() |
There seems to be consistent stylistic patterns for each of the gendered words. For example, the “feminine” spaceships look a lot like the various “X for women” products you’d see in a store. Across all the different subjects, a “masculine” style seems to manifest as lacking any pink or magenta.
As another example: if you mention rainbow
in a prompt, then fuzz the embedding, you may see pride flags, flamboyant makeup, etc. in the output. So, if you mention a male subject, CLIP might generate them as trans or femme.
While not exactly a bias, there’s also the issue that understanding numbers is a task that current implementations of CLIP can’t seem to get right.
Rather than direct counting, CLIP seems more to rely on biases towards descriptors of quantity in image captions as a heuristic.
Consider the outputs of the prompt "{N} goldfish swimming in a glass bowl trending on ArtStation."
N | Prompt | Output Image |
---|---|---|
One | “One goldfish swimming in a glass bowl trending on ArtStation.” | ![]() |
Two | “Two goldfish swimming in a glass bowl trending on ArtStation.” | ![]() |
Three | “Three goldfish swimming in a glass bowl trending on ArtStation.” | ![]() |
A Thousand | “A Thousand goldfish swimming in a glass bowl trending on ArtStation.” | ![]() |
“One” and “Two” are clearly showing more than that, and “Three” and “A Thousand” seem to be lacking.
UPDATE 10/16/2021: dribnet has proposed a solution to the counting problem. For diffusion-based models, it comes down to reducing the output size, rather than modifying the input prompt. Let this be a lesson that modifying the input prompt isn’t everything.
Even with solutions to the counting problem, there’s still the issue of CLIP-based generative models not being able to handle precisely parametrized prompts.
For example, there’s this “accidental renaissance” meme that’s been going around that simply involves overlaying the golden ratio or golden spiral over an image, and claiming that this makes it as well-composed as a Renaissance painting. These lines are often precise to the pixel (often created with software like PhiMatrix Golden Ratio Design and Analysis for accuracy), but the contexts are ridiculous. Despite the ridiculous premise, this level of exact parametrizaton may be hard for CLIP-based generative models to handle.
It’s possible that this specific edge case might be due to clip being trained on image captions falsely describing memes as obeying the golden ratio. That being said, such generative models also struggle with creating things like a perfect RGB grid or an exact trigonometric function plot.
Weighting different parts of the prompt
We’ve demonstrated many different style modifiers above, but there’s no reason you should just pick one. For example, inpsired by Ariel Ekgren, here’s "portrait of 60s San Francisco Worker | painting by Gurney | 70s antipsychotic painting | portrait | matte painting | trending on artstation"
fed into the CLIP Diffusion model.

The result is a stylistic mix of many of the aforementioned style modifiers.
Still, we can go even further than this, for the simple reason that CLIP can take into account numerical weights. By adding weights to your prompt (e.g., 0.5, +1, -1, +5, -5, etc.), you can get a variety of different outputs with the same descriptors. Consider the following example of changing the relative weights of the subject and the style. We can take a prompt like “a terrifying monster | painting by Greg Rutkowski”, and change the relative weights.
Input prompt (seed value=1337) |
Output Image |
---|---|
"a terrifying monster 2.0 | painting by Greg Rutkowski 0.0" |
![]() |
"a terrifying monster 1.5 | painting by Greg Rutkowski 0.5 " |
![]() |
"a terrifying monster 1.0 | painting by Greg Rutkowski 1.0 " |
![]() |
"a terrifying monster 0.5 | painting by Greg Rutkowski 1.5 " |
![]() |
"a terrifying monster 0.0 | painting by Greg Rutkowski 2.0 " |
![]() |
Since we use the same seed value (and thus the same noise pattern for the initialization), the images use a similar color pallete. However, as we change the weighting to favor the artist over the subject, the output looks less like a monster and more like a non-specific fantasy painting (abeit heavier on the oil in the painting, since the "painting"
part of the prompt is being over-weighted as well as the "by Greg Rutkowski"
style).
Note that these weights are relative to each other. Applying these weights to two styles allows one to get a variety of different mixes of those styles (consider the continuum of cyberpunk and Impressionism).
Oh, and we can also go in the direction of negative weights as well.
Limitations of using just ImageNet-based CLIP for creating art
Ultimately, CLIP is used for matching images to text, based on the results of training CLIP on ImageNet. CLIP is great for creating images containing collections of relevant features that match the words. Sadly, there’s no existing tool for creating a loss function that asks “how aesthetically pleasing is this image?“.
One way this has been addressed is to actively re-train CLIP on a large dataset of images beyond ImageNet. Katherine Crowson is leading one such effort to create a fine-tuned 512x512 diffusion model that has also been trained on a corpus of atwork-text pairs.
Augmenting prompt engineering
If you’ve found particularly useful prompts, but you don’t know where to go from there, you can try using these successful prompts with a language model. For example, if you’ve found a prompt theme that produces a lot of images in a style you like, or you’ve found styles you like, you can use these to generate similar prompts. You can even use GPT-2 or GPT-3 to generate new prompts from these successful ones.
I’ve included an example here of what comes out of OpenAI’s GPT-3 when using past successful prompts. It’s recommended that you turn the temperature
up to 1.0
or higher, as this is creativity-intensive by definition (and when turned lower it just seems to repeat "a landscape by Heironymous Bosch"
)
import os
import openai
openai.api_key = os.getenv("USE_YOUR_OWN_OPENAI_API_KEY")
start_sequence = "- "
restart_sequence = "."
response = openai.Completion.create(
engine="davinci",
prompt="- utopia, a world at peace.\n- the US Capitol Building by Wassily Kandinsky.\n- Van Gogh's Starry Night in the style of ukiyo-e.\n- apotheosis of the lunatic by William Blake.\n- cosmic love and affection, video game screenshot.\n- flames of The Lovers tarot card in the style of ukiyo-e.",
temperature=1,
max_tokens=1024,
top_p=1,
frequency_penalty=0,
presence_penalty=0
)
Now, per the OpenAI developer agreement I cannot show you the full outputs, but here is an interesting sample:
- cockroaches theme park by Sula Luna.
- SQUIRREL!!! Ninja Warfare!!!.
- overgrown ants by Wayne Barlowe.
- zombies playing poker.
- architecture by a child.
- clowns by Konstantin Maslov.
- portraits by Anna Podedworna.
- crime and punishment by Kate Spedding.
- the inventor Filippo Tommaso Marinetti.
- a Roman gladiator by Omar Rayyan.
- a blue tiger by Roman Furman.
- Grains of sand by Andreas Vesalius.
- the artist and thinker Ala Stoyanov.
- the Starry Night by Marc Chagall.
- the romantic painter John William Waterhouse.
- Submarine by Fernando Botero.
- Miro's Guernica.
Feel free to use any of these above prompts in your own CLIP-guided image generation projects.
Where is this useful outside of generative art?
This technique gained notoriety in the Generative Art community last year with the announcement of DALL-E from OpenAI, but before that it had been used in getting the most out of language models like GPT-2 and GPT-3.
Already, OpenAI has been working on a code-specific version of GPT-3 known as OpenAI Codex. This has already been rolled out in product form as Github Copilot (though OpenAI also has a beta for using the codex API directly). Codex is still a long way from replacing software engineers, but it’s worth paying attentiont to how much GPT-3 models have advanced in just a few years. OpenAI is already working on GPT-4. While GPT-4 will not be too different in size from GPT-3, it will incorporate newer improvements in model-training specifically with code-generation in mind.
Ultimately, exercising your prompt engineering skills in generative art might be a precursor to becoming a skilled prompt engineer for AI-assisted coding.
Colab Notebooks
As mentioned previously, here is the table of useful colab notebooks:
-
Disco Diffusion v5.2 (w/ VR Mode), developed by
- In case of confusion, Disco is the name of this notebook edit. The diffusion model in use is Katherine Crowson’s fine-tuned 512x512 model
- For issues, join the Disco Diffusion Discord or message us on twitter at @somnai_dreams or @gandamu
- See this twitter thread by Stephen Young @KyrickYoung on the various parameters and their results.
-
CompVis Latent Diffusion (LDM-TXT2IM.ipynb)
- https://arxiv.org/abs/2112.10752
- Original repo: https://github.com/CompVis/latent-diffusion
- Enhanced repo by @RiversHaveWings: https://github.com/crowsonkb/latent-diffusion
- Colab optimizations taken from @multimodalart: https://github.com/multimodalart/latent-diffusion-notebook
- Shortcut to this notebook: bit.ly/txt2im
- Notebook by: Eyal Gruss (@eyaler)
- A curated list of online generative tools: j.mp/generativetools
- pyttitools beta notebook
-
- By John David Pressman (https://twitter.com/jd_pressman, https://github.com/JD-P) and Katherine Crowson (https://github.com/crowsonkb, https://twitter.com/RiversHaveWings).
- It uses a 602M parameter diffusion model trained on Conceptual 12M. See the GitHub repo for more information: https://github.com/crowsonkb/v-diffusion-pytorch.
-
colab to enhance upscale/cleanup images with RealESRGAN, by vadim epstein @eps696
- The value of this one is that it uses cool model by Peter Baylies, finetuned on the paintings: https://archive.org/details/hr-painting-upscaling
- Besides nice painterly effect, which this model produces in general, it very well smooths out the noisy imagery of my Aphantasia/Illustrip generators.
- For higher resolution inputs it may make sense to scale down the input first (
'pre_downscale'
checkmark), that should make result cleaner.
-
Prompt Parrot v1.2 for generating new prompts from GPT-2 by Stephen Young @KyrickYoung
- Now the parrot will set the initial text from starting words in your prompts. This means more prompt variety for less effort.
Now the parrot will set the initial text from starting words in your prompts! This means more prompt variety for less effort! 🦜🤖
Acknowledgements
All of the above discoveries and techniques and code have been made possible by the tireless efforts of many AI artists. This includes, but is not limited to:
- Midjourney (@midjourney) - New research lab. Exploring new mediums of thought. Expanding the imaginative powers of the human species. Apply to our beta: http://bit.ly/3J2NNVs
- ya @sportsracer48 - AI artist / alchemist. Looking to reduce art to personal expression. http://patreon.com/sportsracer48
- crunchie @0xCrung - stuff i made with neural networks. she/her
- aydao @AydaoAI - Creator of “This Anime Does Not Exist” model.
- Or Patashnik @OPatashnik
- nshepperd @nshepperd1 - no
- Adverb @advadnoun - Cushbrooch Authored the BigSleep notebook and originated the approach of combining VQGAN & CLIP he/him
- deKxi @deKxi - AI / ML | (Technical) Artist | Game & Soft Dev | Musician | 🇦🇺 Accumulating hobbies one hyperfocus at a time. http://youtube.com/c/deKxiAU http://foundation.app/@deKxi
- Joel Simon (in nyc) @_joelsimon - Computational art, tools & research. Founder / director
- Studio Morphogen @studiomorphogen - Evolving floor-plans, corals, @artbreeder http://simonlamps.com
- Rivers Have Wings @RiversHaveWings - AI/generative artist. Writes her own code. #EleutherAI
- Xander Steenbrugge @xsteenbrugge - Independent AI researcher, digital artist, public speaker, online educator and founder of the http://wzrd.ai digital media platform.
- Nerdy Rodent (he/him) 💻🌌 @NerdyRodent - 🐀 Likes rodents, makes images with AI :) HIC, OpenSea, Foundation, etc links - http://linktr.ee/NerdyRodent
- vadim epstein @eps696 - multimedia vandal, numeric taxidermist :: ML CG VJ :: new media, old school http://foundation.app/eps http://teia.art/eps696 http://patreon.com/eps696
- Peter Baylies @pbaylies - Learning deeply; quietly curating my gallery: https://teia.art/pbaylies
- Doron Adler @Norod78 - Fooling around with Generative machine learning models. he/him
- Justin Pinkney @Buntworthy - Playing with deep learning, computer vision and generative art. http://hicetnunc.art/justinpinkney/…
- Daniel Russell @danielrussruss - Obsessed with AI-assisted media and creating new tools, to compensate for my aphantasia | Sr. Software Engineer at midjourney
- derrick has started yet another project @dvsch - 1300yo cat IRL. BLM, ACAB. He/Him.
- Eyal Gruss @eyaler
References
- Prompt Engineering: The Magic Words to using OpenAI’s CLIP
- OpenAI’s CLIP Prompt Engineering Jupyter Notebook
- Reddit r/MachineLearning - List of Sites/Programs using OpenAI’s CLIP
- Reddit r/GPT3 - The problem with prompt engineering
- GirlGeek - Prompt Design engineering for GPT3
- VQGAN CLIP
- UC Berkeley - CLIP Art
- 2 new workflows for VQGAN CLIP Art on NightCafe
- Roboflow - AI Generated art
torch.manual seed(3407)
is all you need: On the influence of random seeds in deep learning architectures for computer vision- CLIPDraw: Exploring Text-to-Drawing Synthesis
Cited as:
@article{mcateer2021clippe,
title = "CLIP Prompt Engineering for Generative Art",
author = "McAteer, Matthew",
journal = "matthewmcateer.me",
year = "2021",
url = "https://matthewmcateer.me/blog/clip-prompt-engineering/"
}
If you notice mistakes and errors in this post, don’t hesitate to contact me at [contact at matthewmcateer dot me
] and I will be very happy to correct them right away! Alternatily, you can follow me on Twitter and reach out to me there.
See you in the next post 😄