✅ Automation: AI Generated Art Using Stable Diffusion
Phillip · Visual Art · DK30 Fall 2022
Description
🤖 RISE OF THE MACHINES 🤖
I want to explore generating artwork with AI, primarily Stable Diffusion. I want to play around with and learn more about Stable Diffusion, Midjourney, and Craiyon (formerly DALL-E mini), and use them as conceptual art generators.
The main goal is to use my own hardware and Stable Diffusion to create a comic book story. The updates will show images and comparisons between them.
This project will run in parallel with the automated music project here https://day9.tv/dk30/project/63446b35499cfd1e8e37fe35. They’re both automated projects. #RiseOfTheMachines
! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! Project Completed ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! A Stable Diffusion retelling of…
Mike Mulligan and his steam shovel by Virginia Lee Burton
Recent Updates
ADDENDUM SPOILERS
Here is a valiant test on upscaling a video using a standard ESRGAN AI upscaler.
The original source is a section of the ending cinematic of The Legend of Dragoon (a PlayStation video game).
I do not think ESRGAN upscaling is intended for this particular use.
I was going to do a longer cinematic section but decided the short video shows off the result well enough.
The original source was already fairly large, and it is possible that posed a challenge for the ESRGAN model, which was presumably trained and meant to be used on smaller, blurrier images.
My conclusion is to try an ESRGAN-type model specifically meant for upscaling and filling in detail on frames from video.
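For anyone wanting to try the same experiment, the basic frame-by-frame flow looks roughly like this. This is only a sketch: upscale_frame stands in for whatever ESRGAN/Real-ESRGAN implementation you actually run (the part I would swap for a video-oriented model), and the file names are placeholders.

import cv2

def upscale_frame(frame):
    # Placeholder: run your ESRGAN / Real-ESRGAN model on the frame here.
    # For this sketch it just does a plain 2x resize so the loop runs end to end.
    return cv2.resize(frame, None, fx=2, fy=2, interpolation=cv2.INTER_CUBIC)

reader = cv2.VideoCapture("dragoon_clip.mp4")      # hypothetical input clip
fps = reader.get(cv2.CAP_PROP_FPS)
writer = None
while True:
    ok, frame = reader.read()
    if not ok:
        break
    up = upscale_frame(frame)
    if writer is None:                             # create the writer once the output size is known
        writer = cv2.VideoWriter("dragoon_upscaled.mp4",
                                 cv2.VideoWriter_fourcc(*"mp4v"),
                                 fps, (up.shape[1], up.shape[0]))
    writer.write(up)
reader.release()
writer.release()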
More resources:
- GIMP cumulative Layers: https://graphicdesign.stackexchange.com/questions/55311/how-can-i-modify-a-gif-that-uses-cumulative-layers
- G’Mic (Moving images using vector-point arts in GIMP): https://www.youtube.com/watch?v=iXG5x7ilDlU
Not entirely what I was expecting. A simple retelling with a few omissions…
I present . . . . . . . . .
Mike Mulligan and his steam shovel by Virginia Lee Burton
(Original Source: https://www.youtube.com/watch?v=6QbeDVo7wx4)
Terrible work on my part, but in context not bad. This is without dialing Stable Diffusion in at all; I just guessed at the art using prompts. I did add the prompts as an array so it would make 50 pictures of each prompt unattended (a sketch of that loop is below, after the time breakdown).
**In totality it may have taken 12-14 hours to make.**
- A mostly unattended 9-11 hours of Stable Diffusion generation: 50 images x 35 prompts = 1,750 images on an AMD graphics card.
- And then a mundane 3 hours of video editing.
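That unattended loop looked roughly like this - a minimal sketch, assuming the same DirectML pipe object from the batch script further down this log; the prompt wording and file names here are made up, not the actual prompts I used.

import torch

prompts = [
    "mike mulligan and his steam shovel digging, storybook illustration",
    "a small town watching a steam shovel dig a cellar, storybook illustration",
    # ...one entry per scene, 35 in total
]

for p_index, prompt in enumerate(prompts):
    for i_count in range(50):                      # 50 pictures of each prompt
        torch.manual_seed(i_count)                 # reproducible seed per image
        image = pipe(prompt, height=512, width=512, num_inference_steps=50,
                     guidance_scale=7.5, eta=0.0,
                     execution_provider="DmlExecutionProvider")["sample"][0]
        image.save(str(p_index) + "_" + str(i_count) + ".png")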
It looks like “steam shovel” may not be a term it learned well in its text encoder (i.e. the part that deciphers the prompt).
…in retrospect I should have just made 10 images and put them in chronological order. Editing video takes time, and energy, when you’re not as invested and are using free products like Shotcut or Kdenlive. There are much better ways I could have made the video, which would have left more time for polishing its content.
This wasn’t my original goal but it certainly was a milestone.
I can very well see Stable Diffusion being used to tell children’s tales. And I like the symbolism of this AI technology still being in its “infancy”, even with all the collaborative hard work from so many individuals.
Very much looking forward to where this automation and the position of “artists” end up. “Artist” may become more of a marketing term for someone who brings your ideas together into a wholesome, connected experience, rather than a label for the art or talent itself. I hope this software matures along with our creative minds.
🤖RISE OF THE MACHINES🤖
! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! Project completed. ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! !
[I am running an additional test over the course of two to three days: testing ESRGAN “upscale” use on a cinematic from the video game The Legend of Dragoon.]
Let’s try re-imagining and bringing a children’s book to life.
Mike Mulligan and His Steam Shovel by Virginia Lee Burton
https://www.youtube.com/watch?v=6QbeDVo7wx4
… … … … … …
BACK TO the regularly scheduled program…
Here I wanted to show off GFP-GAN below, but didn’t want the simple two-frame before/after comparison I’ve used before to show things here. I thought to myself, there has got to be a tool for that too! Introducing frame interpolation!
Voila!
But let’s not stop at showing GFP-GAN results against the originals. Let’s try some Stable Diffusion prompts that had addendums, so the prompt didn’t change significantly, just enough.
We’re getting somewhere now! Let’s try the Rorschach painting with two outputs that are very different and don’t really depict anything concrete.
The frame interpolation I’m using appears to move existing pixels while fading in new content along the shifting pixels, or something like unfolding a piece of origami paper.
Let’s use the Google Colab animation tool a little. I’ve played around with it to understand it more since the apple gif posted earlier, and came up with this animation. The difference from the previous one is that this is done entirely by interpolating inside Stable Diffusion, so the results are much more creative than frame interpolation between existing pixels.
I told it to start at the Rorschach prompt, make a bird, then a fish, and return to the original prompt.
I don’t see the bird much, but the fish jumps out of the water very well!
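For reference, my rough understanding of what that notebook is doing - interpolating between prompts in embedding space rather than between pixels - looks something like this. A sketch only, not the Colab’s actual code: it assumes a recent diffusers StableDiffusionPipeline that accepts prompt_embeds, and the model id and prompts are placeholders.

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

def embed(prompt):
    # Encode a prompt into the text-encoder embedding that conditions the UNet.
    tokens = pipe.tokenizer(prompt, padding="max_length",
                            max_length=pipe.tokenizer.model_max_length,
                            return_tensors="pt")
    with torch.no_grad():
        return pipe.text_encoder(tokens.input_ids)[0]

start = embed("rorschach ink blot painting")
end = embed("a fish jumping out of the water, ink painting")

for i, t in enumerate(torch.linspace(0.0, 1.0, 30)):
    mixed = (1 - t) * start + t * end              # linear blend of the two embeddings
    generator = torch.Generator().manual_seed(88)  # same starting noise for every frame
    frame = pipe(prompt_embeds=mixed, generator=generator,
                 num_inference_steps=30).images[0]
    frame.save(f"interp_{i:03d}.png")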
I also attempted, and will continue to reattempt, building something similar to this Google Colab using only the CPU for image-to-image, but as the DK30 is coming to a close I’m going to press onward.
Time to select a children’s story and start generating images for it to build the composition and narrative of my “comic book”.
I have yet to play more with ebsynth to generate moving artwork with keyframes. That may fall outside this DK30 project at this stage.
Resources:
- Frame Interpolation: https://huggingface.co/spaces/akhaliq/frame-interpolation
- Diffuser Type Comparison: https://www.youtube.com/watch?v=N5ZAMa3BUxc
- Build CPU-ONLY Stable Diffusion with GUI & Img2Img: https://github.com/darkhemic/stable-diffusion-cpuonly
- How-to do the above (incomplete) https://www.youtube.com/watch?v=mTrA4uDiKzU&t=644s
- Ebsynth - Making your artwork move: https://ebsynth.com/faq.html
HOLD UP! We’ll get back to the animation real soon.
From diving into this, not only is an upscaler like ESRGAN viable, we also have face restoration!
Introducing GFP-GAN
So I made some samurai portraits to try it out on.
Before (left - from Stable Diffusion) ::: After (right - GFP-GAN & upscaling outside of face)
Less stylized:
Trying it out on a face tilted to the side:
And again on existing art we’ve seen before in this project:
The model is trained on real-world faces, and it creates real-world faces. So I’m not sure it will make for the best artwork matching a given style, as some results have artifacts and noise. Overall a nice tool, and once again available on Google Colab, so it’s easily accessible.
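If you would rather run it locally than on Colab, the gfpgan Python package exposes roughly this interface. A sketch based on my reading of the project’s own inference script: the weights path and image names are placeholders, and you need the pretrained GFPGAN weights downloaded first.

import cv2
from gfpgan import GFPGANer

# Pretrained weights downloaded separately from the GFP-GAN releases page (placeholder path).
restorer = GFPGANer(model_path="GFPGANv1.3.pth", upscale=2,
                    arch="clean", channel_multiplier=2, bg_upsampler=None)

img = cv2.imread("samurai_portrait.png", cv2.IMREAD_COLOR)   # BGR, as OpenCV loads it
_, _, restored = restorer.enhance(img, has_aligned=False,
                                  only_center_face=False, paste_back=True)
cv2.imwrite("samurai_portrait_restored.png", restored)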
Resources:
- Fix faces GFP-GAN (Google Colab) : https://www.youtube.com/watch?v=RXWpz7UBYM4
- GFP-GAN: https://xinntao.github.io/projects/gfpgan#materials
- Face tilt (PHOTOSHOP): https://www.youtube.com/watch?v=JsThZZkdBsg
- GIMP 2.10 Warp & Cage Transform: https://www.youtube.com/watch?v=hSTEeQkVFMg
- Background Removal Helper: https://www.youtube.com/watch?v=RVDsVe3nJ1U
- Upscaler - like ESRGAN: https://www.topazlabs.com/gigapixel-ai
- Photo Editing Tool (iPad): https://affinity.serif.com/en-us/photo/ipad/
- Face Restoration (Chinese Tool) : https://arc.tencent.com/en/ai-demos/faceRestoration
This time I’m providing the resources up front.
Incredible resources:
- Stable Diffusion using Google Cloud servers [limited time every 24hours]: https://colab.research.google.com/github/deforum/stable-diffusion/blob/main/Deforum_Stable_Diffusion.ipynb
- Learning to make better faces with the prompt: https://www.youtube.com/watch?v=QhxG55A2y0U
- Attempting Reinstall locally with Anaconda: https://www.youtube.com/watch?v=iwHfsDTD8U0
While looking to reinstall, I came across the first link for Google Colab.
Google Colab allows some time to run Stable Diffusion online using Google Cloud servers - I’ve tested it out on a Chromebook. It’s easy and has better options than my local copy. I’m still pursuing a local CPU-only solution, since I am leery of online-provided materials and always want full access to things without them changing. Those efforts are underway, but weren’t as straightforward and accessible as the Google one.
Let’s see some of the benefits. Check this out:
I can run this from a chromebook.
Not restricted to my VRAM limit of 8GB, I can generate 1024x1024 images:
I can do image-to-image and change an image with a prompt:
Here’s a drawing of a red bridge; I told the program to give it a “by van gogh” appearance after adjusting some strength values.
Original:
Generated:
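Locally (once I get a working install with image-to-image), the same operation in diffusers would look roughly like this - a sketch, where the model id, strength value, and file names are assumptions, and older diffusers versions name the image argument init_image instead of image:

from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

init = Image.open("red_bridge_drawing.png").convert("RGB").resize((512, 512))
result = pipe(prompt="a red bridge over a river, by van gogh",
              image=init,            # the starting drawing
              strength=0.6,          # how far from the original it is allowed to wander (0-1)
              guidance_scale=7.5).images[0]
result.save("red_bridge_van_gogh.png")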
Then I tried out whatever “animation” is. It appears to build additively on the previous image with scaling/movement, which is sick. I wonder what happens if I slide it instead of zooming it.
I can try this with a Rorschach inkblot picture and get some creatively designed animations, I think. I can also try translating (sliding) to the left with a landscape/panorama and see what happens. Animation!
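My best guess at what the “animation” mode is doing, in sketch form: take the previous frame, nudge it (zoom in or slide sideways), and run it back through image-to-image so new detail gets painted into the shifted areas. This reuses the img2img pipe from the sketch above; the crop amounts, strength, and frame count are made up.

from PIL import Image

frame = Image.open("rorschach_seed_frame.png").convert("RGB")   # hypothetical starting frame
prompt = "rorschach ink blot painting, high detail"

for i in range(24):
    w, h = frame.size
    # “Zoom”: crop a slightly smaller centered box, then scale it back up.
    # Shifting the crop box left or right instead gives the sliding/panorama effect.
    nudged = frame.crop((10, 5, w - 10, h - 5)).resize((w, h))
    frame = pipe(prompt=prompt, image=nudged, strength=0.35,
                 guidance_scale=7.5).images[0]
    frame.save(f"anim_{i:03d}.png")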
I will start to “create” a story… or possibly take an existing story and try to retell it using AI-generated artwork. That seems smarter, so I don’t have to reinvent the writing. Bring the project to life!
Here I wanted to make a purple blur passing across the screen, without doing further edits, in preparation for using Stable Diffusion in some part of the narrative process.
I prompted the system for a few hundred “purple blur” images, selected a few from the results that felt like they would go together, removed all colors except purple, and then put them into a gif.
This is how some of the individual frames looked… This one has less movement and more zoom. I decided on using generated images that had more sideways motion.
Something like this (next image) is what I was looking for. Then I removed the colors outside of purple.
I put them all together, and that’s how I got this:
I like it, though I’d remove some frames from it, and it may not work in the greater context. I had an afterthought to maybe just tweak the prompt with punctuation for small changes. That had a different effect than I expected.
I thought it would make different lines and artifacts, but instead the picture started interpreting something else entirely from the images and blurs. Huh and hmmm.
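For the curious, the color-isolation and gif-assembly step can be scripted roughly like this - a sketch where the HSV range I am calling “purple” is a guess and the file names are placeholders:

import glob
import cv2
from PIL import Image

frames = []
for path in sorted(glob.glob("purple_blur_*.png")):           # the hand-picked outputs
    bgr = cv2.imread(path)
    hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, (125, 40, 40), (155, 255, 255))   # rough purple hue band
    keep = cv2.bitwise_and(bgr, bgr, mask=mask)               # everything else goes black
    frames.append(Image.fromarray(cv2.cvtColor(keep, cv2.COLOR_BGR2RGB)))

frames[0].save("purple_blur.gif", save_all=True,
               append_images=frames[1:], duration=80, loop=0)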
I told it to make some artwork. Then I selected a few where I thought I saw an image in them. Then I described a bit of the image I was trying to get. Here are the results.
The Art:
I told it to make an elk face out of it…
I think it turned out well, although it didn’t keep the exact position I expected for the ears. Next…
The Art:
I asked for an angel from the piece.
It grabbed the human form and wings but not exactly in the same crouched position. Next…
The Art:
I added “red panda” to the prompt.
Cute. Similar strokes, I think. But not sure much carried over. Next…
The Art:
I tried to get it to make it more human, but it spat out a very similar image to the first - likely because it was human to begin with - so I described the piece more, asking for glasses, blonde hair, a woman, …
…and later a cigarette and blue eyes.
This one was quite different, so I’m not confident that this method of taking something and making “what-you-see” out of it will work, but I’m planning to use it.
A value I had yet to play with was the “guidance factor”. This is meant to be how strongly the network adheres to the prompt, but I’m not sure what that “strength” means in practice.
Here is how an image of a samurai looks in practice. The guidance value is the number at the top-left. [These animated images are all at 10 iterations, with the only change being the guidance value.]
The recommended values are between 7 and 8.5. To me, though, I like 8.5 and 16.5. Values above 30 appear very distorted and abstract. I felt that at higher guidance factors it doesn’t adhere well to the prompt, so maybe this is a bell curve situation.
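The sweep itself is just the earlier batch loop with guidance_scale as the moving part instead of the step count - a sketch using the same pipe call as the script further down this log, with a fixed seed so only the guidance changes and a placeholder samurai prompt:

import torch

prompt = "samurai portrait, artwork, artstation"   # placeholder wording
for guidance in [1, 4, 7.5, 8.5, 12, 16.5, 20, 30]:
    torch.manual_seed(88)                          # same seed for every image
    image = pipe(prompt, height=512, width=512, num_inference_steps=10,
                 guidance_scale=guidance, eta=0.0,
                 execution_provider="DmlExecutionProvider")["sample"][0]
    image.save("guidance_" + str(guidance) + ".png")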
My next enquiry was whether the guidance factor has much to do with injecting “noise” back into the image to alter it. So I ran iterations from 50 to 100.
I didn’t notice anything out of the ordinary. The process still appears to aim at the same goal. Huh. Here are those findings in picture form.
Okay there is that little difference. Let’s just look at it as it progresses through 50 iterations.
This one has a guidance factor of 8.5
This one has a guidance factor of 16.5
Those are my findings. I still want to “re-install” and try to get image-to-image and infill operating, but let’s play around with this for a little while, setting our own goals. After all, I’m aiming to make some semblance of a narrative in a “comic book” style. …
Sharing some more content tested.
I mentioned, I think, that creating 768-height images causes a (known) problem at the moment, and that upscaling is meant to use another sampling program, ESRGAN.
Here is how a 768x768 image returns on my AMD graphics.
It’s beautiful, modern art.
And here is a test on 512x1080 to avoid the height complication:
It looks kind of like a collage of a bunch of small images, though I’m not certain that is what is happening. This may simply be induced “noise” to start the seed. It’s possible these just need more than the 50 iterations they were at, but I doubt it.
Moving on I decided to try a simple subject prompt on a 512x768 canvas.
prompt = “castle, artstation by Dominik Mayer, epic battle, artwork, artstation”
(These are four different picture seeds)
WOW, instead of the tiling, the artwork made a grand castle very daringly. Perhaps castles appear in the landscape shots it learned on. Or it is tiling and a castle is just a subject that tiles well. I want to put that in the background behind a battle subject. Let’s add “((samurai))” to the prompt after “epic battle”.
Not bad. Not great on the castle. More to be desired. Now with more focus on the battle…
Sick style but… …we lost the castle completely.
Let’s apply a little of that hypothesis…
Hypothesis & resolution: to tweak, we find an output somewhat like what we want and then adapt the prompt iteratively.
Here I made a prompt with “tango” as the first token and then added a word to it. The top-left image is the original. Then I made a “charcoal and sketch” (top-middle) and “by van gogh” (top-right). Afterwards I did variations on “<< with/has/includes/holding/etc >> a cat” to see:
- How the picture changes
- Which wording about a cat has the intended effect
Here are the results for two SEEDS, with each image at 10 ITERATIONS.
Neat. That seemed to work, and in different ways for each. Some even came out more artistic than realistic.
Moving on from cats, and attempting to narrow down focus using parentheses “(” and “)” to tell the program to focus on a token keyword, I created this
prompt = “epic battle , samurai”
with various levels of parentheses around “samurai” to see how the focus of the generated images adapts. In the picture, frame #2 introduces the “samurai” token, frame #3 is “(samurai)”, and frame #4 is “((samurai))”.
That actually improved things. Lastly, I tried using brackets “[” and “]” for a negative prompt, but that did not work well in my version. To test, I tried to remove colors using color keywords like “red”, “green”, and “blue”.
I also tried ratios - another way to apply hybrid weights in the prompt, like "subject:0.25, subject2:0.85, subject3" - but this didn’t change the image beyond normal variation.
Let’s see what more we can do yet!
Resources:
- https://github.com/invoke-ai/InvokeAI/blob/development/docs/features/PROMPTS.md#negative-and-unconditioned-prompts
- https://youtu.be/WwX_7HIJBTM?t=224
- https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Negative-prompt
From these resources it should work, but again this is a new space, designed first for NVIDIA, whereas I’m using an AMD graphics card. Hopefully the reinstallation will make for a better GUI presentation.
To round off the horse section of our Stable Diffusion journey, I tried a different diffusion model.
Using the resources, the same tutorial extends to using “Waifu Diffusion”, targeted and trained on descriptions of cartoon-like women… but we’re going to use it for our beautiful horse prompt.
As a reminder the prompt is
prompt = “horse race, stable, horse rider, hyper realistic, unreal engine 5 render, masterpiece, shadow, artwork, artstation”
Let’s see if the same seeds return a similar result:
Picture #373
Picture #379
Picture #391
The answer is “No”. Not at all. They don’t match the results from the images created with these seeds prior. Here are a few I enjoyed:
I really enjoyed the perspective, framing, and picture composition of the images from this diffusion. Here is a collection of ones I enjoyed, linked here: https://drive.google.com/drive/folders/1cdfum6j5omDU_xkbqrknG2PhHYuoqEZ_?usp=sharing (This link will not stay active forever)
In summary: just as changing the final resolution from 512x512 to 512x768 creates a different diffusion and image altogether, so too is an entirely different image formed by a different diffusion model/diffuser. Checks out.
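Swapping diffusers/models while keeping the seed is mostly a matter of pointing the pipeline at a different checkpoint. A sketch using the standard diffusers pipeline rather than my DirectML setup, assuming “hakurei/waifu-diffusion” is the Hugging Face id the tutorial uses and that the picture numbers above are the seeds:

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("hakurei/waifu-diffusion")
prompt = ("horse race, stable, horse rider, hyper realistic, unreal engine 5 render, "
          "masterpiece, shadow, artwork, artstation")

for seed in [373, 379, 391]:                       # the picture numbers above
    generator = torch.Generator().manual_seed(seed)
    image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5,
                 generator=generator).images[0]
    image.save("waifu_" + str(seed) + ".png")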
Hypothesis & resolution: to tweak, we find an output somewhat like what we want and then adapt the prompt iteratively.
I’m getting tired of the pretty horses. Before I move on from horses as the prompt, let’s do two things.
First, a comparison of the prompt using the free online version of CrAIyon (formerly DALL-E mini). Perhaps we can use CrAIyon to do a general check on whether a prompt will generate something akin to what we’re looking for, without spending the processing time myself.
Here are two that CrAIyon provided - overall similar, to me, though maybe with less racing.
And more horse stables. It’s not a bad idea to try things out on CrAIyon first, though, if you don’t have access to Stable Diffusion yet. But I think nothing substitutes for actually doing it in your AI generative artwork program of choice, because tweaking the prompt is important.
I’m not sure why I cannot see the images in the previous post. I’ll look into that a little later.
This update has to do with visually seeing the change in the prompt and the strength of the token words.
For this I added “cyberpunk” to the prompt, placing it at either the first token position or the last token position - the start and the end of the prompt.
So these prompts read:
prompt = “cyberpunk, horse race, stable, horse rider, hyper realistic, unreal engine 5 render, masterpiece, shadow, artwork, artstation”
prompt = “horse race, stable, horse rider, hyper realistic, unreal engine 5 render, masterpiece, shadow, artwork, artstation”
prompt = “horse race, stable, horse rider, hyper realistic, unreal engine 5 render, masterpiece, shadow, artwork, artstation, cyberpunk”
Notice that the middle prompt is the one we’ve seen before - it’s the comparison without the “cyberpunk” token.
Picture #373
Picture #379
Picture #391
Here I think we’re seeing that the “cyberpunk” keyword token has a strong effect simply by existing in the prompt. It also has a different effect based on its position. My guess is that each token affects the tokens closer to it more than the ones further away; hence some horses are cyberpunk, while in other images it changes the setting more. All in all, however, it seems to be a tinkering thing, with no clear “saying-a-keyword-this-way-will-tell-the-model-to-draw-it-that-way” rule. Or maybe there is, if I say “cyberpunk background”. I’ll try that another time.
Possibly a good prompt (and one that avoids tiling) has the subject first, followed by the background, or uses clear terms like “in the background”.
I’ll try seeing what some verb-type words add another time. For now I may move forward with attempting to re-install this to use a locally hosted web UI. I’m learning that my current installation on an AMD graphics card is limited: it doesn’t yet have models for image in-filling or masking, and image-to-image may not be as easy either - all cool concepts I’d like to explore, and the next leap in the project. I’m not sure if that is AMD-related or simply down to the guide I followed for installation.
Let’s find out.
[While I do that I have some other images from my attempts with minor interesting points of comparison I may post to update with.]
In this post we’re comparing the prompt and how adding to it affects the image overall.
A prompt uses a “,” (comma) as the delimiter between tokens. The overall prompt below is the same across three different “seed” versions.
prompt = “horse race, stable, horse rider, hyper realistic, unreal engine 5 render, masterpiece, shadow, artwork, artstation”
For each image I’m appending another part of the prompt. These parts are called “tokens” in some circles. Said another way, each image adds a new token for the generator to use as a connection when making the image.
So the first image is
prompt = “horse race”
The next image is
prompt = “horse race, stable”
The next next image is
prompt = “horse race, stable, horse rider”
And so on and so forth until the prompt has no more tokens to add. Here were the results, each using the same number of iterations (i.e. the image quality / when to stop iterating):
Picture #373
Picture #379
Picture #391
I’ve noticed that certain shapes remain as the prompt adapts. I’ve also noted that the material didn’t change significantly with the tokens I used after a certain point. This may be a result of weak token names and/or the tokens being positioned further from the first.
We’ll play around with that a little as I try to “prompt-engineer” and create better requests for the AI model. Knowing how to craft the prompt is important for getting my intended results.
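The cumulative-prompt comparison itself is easy to script: split the full prompt on commas and re-generate with one more piece each time. A sketch assuming the same pipe as the batch script below and that the picture numbers are the seeds:

import torch

full_prompt = ("horse race, stable, horse rider, hyper realistic, unreal engine 5 render, "
               "masterpiece, shadow, artwork, artstation")
pieces = [p.strip() for p in full_prompt.split(",")]

for seed in [373, 379, 391]:
    for n in range(1, len(pieces) + 1):
        prompt = ", ".join(pieces[:n])             # "horse race", then "horse race, stable", ...
        torch.manual_seed(seed)
        image = pipe(prompt, height=512, width=512, num_inference_steps=50,
                     guidance_scale=7.5, eta=0.0,
                     execution_provider="DmlExecutionProvider")["sample"][0]
        image.save(str(seed) + "_" + str(n) + "_tokens.png")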
Let’s play around with these dials and parameters in a visual way, with a focus on learning how to construct a prompt and better understand the program’s biases.
To start, I am testing on 512x512 images for a couple of reasons.
- It is faster to iterate
- The models were trained on 512x512
This does mean that I may not see issues the prompt causes at 512x768-style sizes: namely “tiling”, or adding the subject multiple times to fill “perceived empty space”. ??? Let’s find out.
Easiest is examining the number of iterations. I tried to use my judgment - whether correct or not - to select images with slightly different angles/subjects/landscape matter, to see if that mattered across the iterations. Let’s see…
I don’t think the choices I selected changed too drastically over the iterations. I did notice some added tiling in the early iteration stages - adding a subject. I’m not sure yet how to add a “negative prompt” to remove certain “seen” things in the image - at least not with an AMD graphics card.
If you slow down the GIFs above… to me, 10 iterations yields an image similar enough to the final result to be good enough. Even some lower values get close to the final result, which will save on iteration time.
This is also possible, though, because the program is working toward improving the image through the “guidance factor”, I believe - or put another way, it is not adding as much NOISE / CHAOS each iteration, or enough NOISE / CHAOS to change the image drastically beyond the first few iterations, which start with high NOISE / CHAOS. I will look into playing around with that. Other tools call this the CFG (classifier-free guidance) value, but I do not see “CFG” worded anywhere in mine yet.
… … Next I wanted to do something similar to what this person has done in the YouTube link below for understanding the prompt. I’ll do something simple to try it out myself.
How changing the prompt in major and minor ways affects the result: https://youtu.be/c5dHIz0RyMU?t=97
Currently I want to examine and compare prompts, and have a visible “what-happens-when-I-touch-this-value” style of discourse.
So I’ve updated the script to do some batch operations. [Real quick note that this is on AMD. NVIDIA graphics cards seemingly have better support at the moment; I get the sense that it was simply developed on NVIDIA cards first.]
As I had hoped, I’m able to run this automatically, in batch operations, on multiple prompts or the same prompt.
Here is the prompt I used for the following cases:
prompt = "horse race, stable, horse rider, hyper realistic, unreal engine 5 render, masterpiece, shadow, artwork, artstation"
I updated the script to read as follows so that it loops and renames the files.
import random
import torch

# "pipe" is the Stable Diffusion pipeline set up earlier per the install guide,
# and "prompt" is the prompt string shown above.
for i_count in range(1, 51):  # 1 to 50 inference steps (0 steps would just return noise)
    # randnumber = random.random() * 10000   # random seed for every image
    # randnumber = i_count                   # seed follows the loop counter
    randnumber = 88                          # fixed seed so only the step count changes
    torch.manual_seed(randnumber)
    image = pipe(prompt, height=512, width=768, num_inference_steps=i_count,
                 guidance_scale=7.5, eta=0.0,
                 execution_provider="DmlExecutionProvider")["sample"][0]
    # image.save("10s" + str(randnumber) + "_" + prompt[0:100] + ".png")
    # image.save("10s" + str(randnumber) + ".png")
    image.save(str(randnumber) + "_" + str(i_count) + ".png")
Some of those lines are commented out, depending on how I want the seed randomized and the image files named. Comments start with “#”.
…Great so what does it do?
I’ve given it the same prompt but adjusted the width. From some quick tests and reading, it appears that…
- Extending the height leads to garble
- The graphics card I have, with 8GB VRAM, cannot go up to 1080 width
- You have to use dimensions in multiples of 8
- Speed depends on what else you’re using your graphics card for
- The model for the AI was trained on 512x512 images. To upscale we will use another AI model called ESRGAN. (I did some preliminary exploration on that and will show it off later)
Not sure I’ll keep these links active forever but here’s what the automation made with that above prompt:
512x512: https://drive.google.com/drive/folders/1RjEznXbukpvtEKCK_Odh8VAlF-vuaOgj?usp=sharing
512x768: https://drive.google.com/drive/folders/1ykMyYxc3_TuuWaPzK8NLSZZpbePoaX-C?usp=sharing
Horses are beautiful. Here are a few of my selections:
Hmm, I’ve read that the subject being drawn “twice” is a by-product of the prompt not describing the background in great detail, and/or the width not having enough material to fill, so the tool puts the subject in again. I’m not sure this example is great, since horses race each other and it is common for there to be multiple subjects.
Although horses are pretty, I may change my prompt. No “promptmises”. Hihihihilol
Looks much better at a distance than closer. We’ll see what we can do about that in a future update. To be clear - I don’t know. My preliminary exploration only went so far.
And we’re off. The horses have left the stable.
I spent some time in advance setting up and running Stable Diffusion locally on an AMD graphics card. I’m linking a week’s worth of trial and preparation for this DK30 below. We’ll see where this goes. Every day I’ve found some other outlet of automated or artificial intelligence.
It’s been an amazing grab-and-go experience. I’m going to have to find a way to post images here using a URL. I’ll ask the community.
I’ve discovered and used the following resources:
- https://www.youtube.com/watch?v=Lk2syPsVMQM
- https://youtu.be/Z9PFh2Us3vk
- https://youtu.be/Th-0oZjpDtk
- Hugging Face (for the AI trained models): https://huggingface.co/
How it works / Learning Resources:
- https://www.youtube.com/watch?v=1CIpzeNxIhU&t=668s
- https://stability.ai/blog/stable-diffusion-public-release
- Stable Diffusion Discord: https://discord.gg/stablediffusion
And have sights to use it in projects similar to:
- Stable Diffusion across images in a movie: https://youtu.be/t9Qim_xKT_I
- Using a video to drive a 2D artwork keyframed animation: https://ebsynth.com/
- Image-to-Image
- I would like to train my own model. I expect failure so I may be surprised if it is easy enough.
I want to get some comparison images. I’m constantly running tests, learning how to better create prompts, and attempting to replace parts of the system with others I understand better.
Estimated Timeframe
Oct 19th - Nov 19th
Week 1 Goal
- Setting Stable Diffusion up
- Getting a baseline
- Testing prompts
- Looking into more
Week 2 Goal
- Increasing resolution
- Increasing iterations
- Increasing automation
- Running automated
Week 3 Goal
- Create “story”
- Generate pictures
- Use GIMP 2.10 or otherwise to bring the story to life
Week 4 Goal
Polish & Finish