To quote Wikipedia (here): “Stable Diffusion is a deep learning, text-to-image model released in 2022. It is primarily used to generate detailed images conditioned on text descriptions, though it can also be applied to other tasks such as inpainting, outpainting, and generating image-to-image translations guided by a text prompt.” This of course sounds nice, but what makes it special and how can you use it?
What makes Stable Diffusion special?
The dataset
The Stable Diffusion model was trained on three subsets of LAION-5B: laion2B-en, laion-high-resolution, and laion-aesthetics v2 5+. These datasets have been scraped from the web and are available for download (here). This sets Stable Diffusion apart from, for example, DALL-E and Midjourney, where the datasets are not publicly available.
The model
The model is publicly available here and here, again in contrast to most of the competition. This means you can add additional material for training (such as with Dreambooth) to, for example, change the context of an object. You can also alter the code yourself, for example to disable the NSFW check or to disable the hidden watermark that the Stable Diffusion software adds.
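To give an idea of what this openness allows in practice, below is a minimal sketch using the Hugging Face diffusers library (my choice for illustration; it is not the tool used elsewhere in this post) that loads the public weights and disables the safety checker. The model id and prompt are just examples.

```python
# Minimal sketch, assuming the Hugging Face diffusers library is installed
# (pip install diffusers transformers torch) and an NVIDIA GPU is available.
import torch
from diffusers import StableDiffusionPipeline

# Load the publicly released Stable Diffusion 1.5 weights from the Hugging Face hub.
# Because code and weights are open, you can change the behaviour yourself,
# e.g. pass safety_checker=None to remove the NSFW check that otherwise
# blacks out flagged images.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
    safety_checker=None,
).to("cuda")

image = pipe("a watercolor painting of a lighthouse").images[0]
image.save("lighthouse.png")
```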
Running locally
There is also a large community around Stable Diffusion which creates tools on top of the model, such as the Stable Diffusion WebUI or the previously mentioned Dreambooth. Since the model is publicly available, you can also run it yourself on your laptop for free, and you don’t need to depend on the services of a third party offering this as a SaaS solution.
Features
Text to image generation
You can use a prompt to indicate what you want to have created. You can give weights to specific words in the prompt; the weight can also be negative for things you don’t want to see in your output. The prompt usually contains things like the object or creature you want to see and the style. For example, the image below, which I generated for my 5-year-old daughter;
The prompt used for the above image was;
“a beautiful cute fluffy baby animal with a fantasy background, style of kieran yanner, barret frymire, 8k resolution, dark fantasy concept art, by Greg Rutkowski, dynamic lighting, hyperdetailed, intricately detailed, trending on Artstation, deep color, volumetric lighting, Alphonse Mucha, Jordan Grimmer”
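For readers who prefer code over a UI, the same idea can be expressed with the Hugging Face diffusers library. This is a hedged sketch, not the exact way the image above was made (that was done in a WebUI); the model id and prompts are illustrative. Per-word weights are a WebUI prompt-syntax feature; in diffusers the closest built-in equivalent is a separate negative prompt.

```python
# Hedged sketch: text-to-image with a positive and a negative prompt.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# The prompt describes the subject plus style hints (artists, lighting, detail).
prompt = (
    "a beautiful cute fluffy baby animal with a fantasy background, "
    "dark fantasy concept art, dynamic lighting, hyperdetailed, deep color"
)
# The negative prompt lists things you do NOT want to see in the output.
negative_prompt = "blurry, draft, low resolution, deformed"

image = pipe(prompt, negative_prompt=negative_prompt).images[0]
image.save("baby_animal.png")
```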
Image to text / CLIP interrogation
You can ask the model what it sees in a picture so you can use this text to generate similar images. This is called CLIP interrogation and can be done for example here.
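The sketch below illustrates the idea behind CLIP interrogation rather than the exact tool linked above: CLIP scores candidate descriptions against an image, and an interrogator keeps the best-matching subject, artist and style terms (real interrogators also use a captioning model and much larger candidate lists). The model id, file name and candidate texts are placeholders.

```python
# Hedged sketch of the CLIP-scoring part of CLIP interrogation.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

image = Image.open("some_image.png")  # hypothetical input image
candidates = [
    "a photo of a dog on a bench",
    "a dark fantasy painting",
    "a watercolor landscape",
    "concept art, volumetric lighting",
]

# Score every candidate description against the image and rank them.
inputs = processor(text=candidates, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=1)[0]

for text, p in sorted(zip(candidates, probs.tolist()), key=lambda t: -t[1]):
    print(f"{p:.2f}  {text}")
```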
Inpainting
You can replace a part of an image with something else. For example, in the image below I’ve replaced the dog on the bench with a cat (I prefer cats).
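In code, inpainting roughly looks like the sketch below, here with the diffusers inpainting pipeline (I did the cat replacement in a WebUI; the model id and file names are placeholders). The mask is white where the image should be regenerated and black where it should stay untouched.

```python
# Hedged sketch: replace the masked region of an image using an inpainting model.
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("dog_on_bench.png").convert("RGB").resize((512, 512))   # hypothetical file
mask_image = Image.open("mask_over_dog.png").convert("RGB").resize((512, 512))  # white = replace

result = pipe(
    prompt="a cat sitting on a bench",
    image=init_image,
    mask_image=mask_image,
).images[0]
result.save("cat_on_bench.png")
```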
Outpainting
You can ask the model to generate additional areas around an existing image. For example below is a picture of me. I asked Stable Diffusion to generate a body below my head.
There is even a complete web interface, Stable Diffusion Infinity, to help you do this on a canvas;
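Under the hood, outpainting can be seen as inpainting outside the original picture: paste the photo onto a larger canvas, mask the empty area, and let the model fill it in. The sketch below shows that idea with the diffusers inpainting pipeline; it is a simplification of what tools like Stable Diffusion Infinity do for you on a canvas, and the file name and prompt are placeholders.

```python
# Hedged sketch: outpainting by inpainting the empty part of a larger canvas.
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

portrait = Image.open("portrait.png").convert("RGB").resize((512, 256))  # hypothetical head shot

# New 512x512 canvas: the portrait on top, an empty area below for the body.
canvas = Image.new("RGB", (512, 512), "white")
canvas.paste(portrait, (0, 0))

# Mask: black = keep as-is, white = let the model generate.
mask = Image.new("RGB", (512, 512), "white")
mask.paste(Image.new("RGB", (512, 256), "black"), (0, 0))

result = pipe(
    prompt="a man in a suit, full body portrait",
    image=canvas,
    mask_image=mask,
).images[0]
result.save("outpainted.png")
```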
Upscaling
You can upscale images to add detail. This allows you to create infinite zoom effects.
This is not an actual infinite zoom; the model invents the detail it adds. If, for example, I upscale a low-resolution image of myself, the end result will not be me but something which kinda looks like me.
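As a sketch of how this looks in code (I used a WebUI; the model id and file name below are placeholders), the Stable Diffusion x4 upscaler in diffusers takes a small image plus a prompt and returns a larger one with invented detail:

```python
# Hedged sketch: 4x upscaling with the Stable Diffusion upscaler.
import torch
from PIL import Image
from diffusers import StableDiffusionUpscalePipeline

pipe = StableDiffusionUpscalePipeline.from_pretrained(
    "stabilityai/stable-diffusion-x4-upscaler", torch_dtype=torch.float16
).to("cuda")

# Keep the input small (e.g. 128x128); the output will be 4x larger.
low_res = Image.open("small_photo.png").convert("RGB").resize((128, 128))  # hypothetical input

upscaled = pipe(prompt="a detailed portrait photo", image=low_res).images[0]
upscaled.save("upscaled.png")
```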
How to use?
Night Cafe Studio
The easiest way to start is by playing around in Night Café Studio. For this you don’t need to set up anything locally, and you can get a bit of a feel for what Stable Diffusion is and how it works. When you start to use it more often, they require you to pay, but you can get some free credits daily and by participating in the community.
Running locally
If you want to run Stable Diffusion locally, you can use the WebUI found here. How to get it running is described for Google Colab, local Windows and Mac (untested).
When you’ve started the UI, you can use the various settings to generate images;
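If you are curious what those settings map to when calling the model directly, the sketch below shows the rough diffusers equivalents of the main knobs (sampling steps, CFG scale, resolution and seed); the values and model id are just examples.

```python
# Hedged sketch: the main generation settings expressed as pipeline arguments.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    "a castle on a cliff at sunset",
    num_inference_steps=30,   # sampling steps: more steps, more refinement, more time
    guidance_scale=7.5,       # CFG scale: how strictly to follow the prompt
    height=512, width=512,    # output resolution (multiples of 64 for SD 1.5)
    generator=torch.Generator("cuda").manual_seed(42),  # seed, for reproducible results
).images[0]
image.save("castle.png")
```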
There is also Stable Diffusion Infinity which is specialized in outpainting. You can download it here or try it online here.
You do require a suitable graphics card; an NVIDIA card with 4 GB of VRAM is about the minimum. With 6 GB, I was required to use the following switches in webui-user.bat to generate and outpaint larger images: --medvram --opt-split-attention
Challenges and limitations
Ethics
“With great power comes great responsibility” (probably by Voltaire, 1793). When the power to generate images becomes available to a large audience, issues such as abuse of this technology are bound to arise. Some examples;
- You can alter copyrighted material, remove watermarks, upscale thumbnails or low-resolution photos, and make variations which are hard to trace back to the original. This allows a person to circumvent certain online protections of images.
- You can create fake news, for example a photo of a large audience at Trump’s inauguration.
- You can use the style of artists and their names without permission to create works of art and then compete with these same artists using the generated works. It is also currently not easy for an artist to opt out of AI models in order to protect their work and style. You can imagine artists are not happy about this.
- It becomes easy to generate NSFW material (Google, for example, Unstable Diffusion). This can be abused, for example by using someone’s Facebook pictures as base material without their permission.
Currently (03-01-2023) not many limitations have been fixed in legislation yet (as far as I know). In the future, the freedom to create or use AI models might be limited or only allowed when it conforms to certain conditions. Currently the AI world is like the early days of the internet: a Wild West with few bounds.
Model limitations
- Common things are easy, uncommon things are not: less common poses (e.g. hands) and less common or highly detailed objects (e.g. a crossbow) often come out wrong.
- Resolution: 512 x 512 is the default and the resolution the model (SD 1.5) works best at; it can also handle multiples of 64, e.g. 576, 640, 704. Stable Diffusion 2.1 works at a 768 x 768 resolution.
- Requires a good graphics card: an NVIDIA card with 4 GB of VRAM is the absolute minimum, 8 GB is preferable (or use the cloud, e.g. Google Colab).
- Generation takes time and requires patience: it can take hours to generate (multiple variants of) images when running locally.
Learning curve
- Setting up your environment requires some knowledge.
- Tweaking your generation configuration is not straightforward and requires you to understand a bit of what is actually happening.
- Generating prompts which create nice images is not as straightforward as you might expect. For example, you need to know which artists create the style you want your images to have. There are also words which help, such as ‘high resolution’, and negative prompts, such as ‘draft’. Knowing which words to use plays a major part in generating good images.
- Establishing a workflow is important. Generation first, then inpainting, then upscaling is a general way to go about it. Especially the inpainting phase takes a lot of time.
All of this together makes for quite a steep learning curve before you can start creating works which are actually aesthetically pleasing.