drschwabe
The ControlNet model, specifically the scribble ControlNet (and ComfyUI), was a major game changer for me.
I was getting good results with just SD and occasional masking, but it would take hours and hours to home in on and composite a complex scene with specific requirements and shapes (with most of the work spent curating the best outputs and then blending them into a scene with Gimp/Inkscape).
Masking is unintuitive compared to scribble, which achieves a similar effect: no need to paint masks (which is disruptive to the natural process of 'drawing', IMO); instead you just make a rough black-and-white outline of your scene. Simply dial the conditioning strength up or down to have it follow that outline more tightly or more loosely.
You can also use Gimp's Threshold or Inkscape's Trace Bitmap tool to get a decent black-and-white outline from an existing bitmap to expedite the scribble process.
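For anyone curious what that "conditioning strength" dial looks like outside the ComfyUI node graph, here is a minimal sketch using the diffusers library and the public lllyasviel/sd-controlnet-scribble weights; the outline file name and prompt are hypothetical placeholders, and ComfyUI wires the same thing up through nodes rather than code.

```python
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Scribble ControlNet: a rough black-and-white outline steers the composition.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-scribble", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

scribble = Image.open("my_outline.png").convert("RGB")  # hypothetical hand-drawn outline

# controlnet_conditioning_scale is the conditioning-strength dial:
# higher values follow the outline more tightly, lower values more loosely.
image = pipe(
    "a cozy reading nook with a large window, warm light, detailed illustration",
    image=scribble,
    controlnet_conditioning_scale=0.7,
    num_inference_steps=30,
).images[0]
image.save("result.png")
```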
fsloth
ComfyUI is really nice. The fact that the node graph is saved as PNG metadata actually makes node-based workflows super fluent and reproducible, since all you need to do to get the graph for your image is drag and drop the result PNG onto the GUI. This feels like a huge quality-of-life improvement compared to any other lightweight node tools I’ve used.
drschwabe
Yeah the PNG embedded 'drag and drop to restore your config' is brilliant.
Reminds me of Fireworks, which Adobe killed off (after putting out a decent update or two, to be fair), and which used PNGs for layers and metadata, à la the PSD format.
But it's more analogous to a 3D modelling suite like Blender or Maya having a hypothetical feature where you could take a rendered output image, drag and drop it back into the 3D viewport, and have it instantly restore the exact render settings you used. That would be handy!
l33tman
You don't need to go through Gimp or Inkscape; this is built into the auto1111 ControlNet UI. You just dump the existing photo there and select from a bunch of pre-processors, like edge detection or 3D depth extraction, which is then fed into ControlNet to generate a new image.
This is super powerful for, say, visualizing the renovation of an apartment room or a house exterior.
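As a rough illustration of what those pre-processors produce (independent of any particular UI), here is a sketch using OpenCV for Canny edges plus a simple threshold; the file names are hypothetical, depth extraction would additionally need a monocular depth model (e.g. MiDaS) and is not shown, and auto1111 bundles its own versions of these under the ControlNet extension.

```python
import cv2
from PIL import Image

photo = cv2.imread("room_photo.jpg")            # hypothetical input photo
gray = cv2.cvtColor(photo, cv2.COLOR_BGR2GRAY)

# Edge detection: the kind of map a Canny ControlNet expects.
edges = cv2.Canny(gray, 100, 200)
Image.fromarray(edges).save("control_edges.png")

# Plain black-and-white threshold, similar in spirit to Gimp's Threshold tool,
# usable as a quick-and-dirty scribble/outline input.
_, outline = cv2.threshold(gray, 128, 255, cv2.THRESH_BINARY)
Image.fromarray(outline).save("control_outline.png")
```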
drschwabe
Will have to play with those more, thanks for the heads-up. I do find, however, that for scribble outlines I often like to draw my own lines by hand instead of using an auto-generated one, to emphasize the absolute key areas that would not otherwise be auto-identified. Logo and 2D design, for example, where you may have a very specific text shape that needs to be preserved regardless of contrast or perceivable depth.
gkeechin
That's for sure - I think we have seen other kinds of edge detectors or filters work better for different use cases, especially around foreground images where you want to retain more information (i.e. images with small, nitty-gritty details).
In this post, we just seek to showcase the fastest way to do it - and how augmentation may potentially help vary the position!
moelf
any tutorial you would recommend? I found https://comfyanonymous.github.io/ComfyUI_examples/controlnet...
drschwabe
Yeah, that tutorial is decent; it's what I used to get going.
Note that all of the images in those Comfy tutorials (except for images of the UI itself) can be drag-and-dropped into ComfyUI and you'll get the entire node layout, which you can use to understand how it works.
Another good resource is civit.ai; specifically, look for images that have ComfyUI metadata embedded. I made a feature request that they create a tag for uploaders to flag ComfyUI PNGs, but I'm not sure if they've added that yet. Or peruse Reddit or Discord for people sharing PNGs with Comfy embeds.
Trying out different models (also available from Civitai) is a good way to get an understanding of how swapping out models affects performance and results. I've been abusing AbsoluteReality (v1.81) + the More Details LoRA because it's just so damn fast and the results are great for almost any requirement I throw at it. AI moves so fast, but I don't even bother updating the models anymore; there is just so much potential in the models we already have. There's more payoff in mastering other techniques like the depth-map ControlNet.
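For those working in diffusers rather than ComfyUI, swapping in a Civitai checkpoint and layering a LoRA on top looks roughly like the sketch below; the file names and prompt are hypothetical placeholders for whatever you downloaded, and ComfyUI does the equivalent with checkpoint-loader and LoRA-loader nodes.

```python
import torch
from diffusers import StableDiffusionPipeline

# Load a community checkpoint downloaded from Civitai as a single .safetensors file.
# File names here are placeholders for whatever you actually downloaded.
pipe = StableDiffusionPipeline.from_single_file(
    "absolutereality_v181.safetensors", torch_dtype=torch.float16
).to("cuda")

# Layer a detail-enhancing LoRA on top of the base checkpoint.
pipe.load_lora_weights("more_details_lora.safetensors")

image = pipe(
    "close-up photo of a vintage camera on a desk, natural light",
    num_inference_steps=25,
).images[0]
image.save("camera.png")
```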
I would say that, above all, extensive familiarity with an image editor like Photoshop, Gimp, or Krita will get you the most mileage, particularly if you have specific needs beyond just fun and concepting. AI art makes artists better; people who struggle with image editing will struggle to maximize this new tech, just as people who struggle with code will have issues maintaining the code Copilot or ChatGPT spits out (versus a coder who will refactor and fine-tune before integrating it into the rest of their application).
shostack
Is there any solution for consistency yet that goes beyond form and structure and keeps things like outfits, color, and facial features consistent, in a way that makes it easy to compose scenes with multiple consistent characters?
dragonwriter
LoRA for specific items/characters + regional prompting covers a lot of that area.
rvion
I'm building CushyStudio https://github.com/rvion/cushystudio#readme to make Stable Diffusion practical and fun to play with.
It's still a bit rough around the edges, and I haven't properly launched it yet, but if you want to play with ControlNets, pre-processors, IP-Adapters, and all those various SD technologies, it's a pretty fun tool! I personally use it for real-time scribble-to-image, things like that :)
(I'll post it properly on HN in a few days or a week, I think, once early feedback has been properly addressed.)
bavell
Looking forward to your launch. I found CushyStudio a while back (maybe from HN?) and cannibalized some of the type-generation code to make my own API wrapper for personal use. Thanks!
I barely got it working in that early alpha, but it was super helpful for me as a reference. I'll give it another go now that it's further along; it seemed very promising and I liked your workflow approach.
imranhou
The versatility of Stable Diffusion, especially when combined with tools like ControlNet, highlights the advantages of a more controlled image generation process. While DALL-E and others provide ease and speed, the depth of customization and local processing capabilities of SD models cater to those seeking deeper creative control and independence.
ChatGTP
It is interesting isn't it? Because we have "AI" generating the image, but we still seem to want to "paint" or have control over the creative process.
Prompts seem to be a new type of camera, lens or paintbrush.
barrkel
There are at least three "levels" you can consider with image generation: composition, facial likeness, and style. Prompts are pretty weak at composition, and composition is the strongest point of ControlNets - they do a great deal to make up for that weakness. But there are some compositions SD can't find even when given detailed ControlNets.
Style generality is frequently lost in fine-tuned models. The original DreamBooth tried to get around this by generating lots of images of the class to retain generality, but it's time-intensive to generate all the extra images (and ideally do some QC on them) and train on them too, so it's not often done.
Magi604
SD outputs have an "uncanny valley" type of quality to them. You just KNOW when an image is from SD. I have looked at getting started with SD, but the requirements, the setup, and the positive/negative prompting "language" just kind of turned me off the whole thing.
Whereas with DALL-E you can get some hyper-realistic images out of it with very little effort, using plain human language.
I guess my point is to ask whether SD is worth bothering with at this time, when DALL-E and Imagen and possibly others are on the brink of becoming mainstream and are just going to get better and better. Clunking together something with SD seems unnecessary when you can generate more results, better results, faster, with fewer requirements, and without the steep learning curve, by using other methods.
jyap
One major benefit, and the reason I use the Stable Diffusion tools and models, is that I can run them at home on my relatively old NVIDIA 2080 GPU with 8GB of VRAM. It costs me nothing (besides electricity).
Depends if you value this kind of freedom in life.
You can do some things such as colorizing black and white images with the Recolor model.
NBJack
I have to agree about how convenient and (long-term) inexpensive this can be. I may not always get the greatest results right away, but it is fun to come up with some ideas, put them into a prompt iterator (or matrix), and run it overnight. I can tweak it to my heart's content.
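A prompt matrix doesn't need anything fancy; here's a minimal sketch in plain Python, where the subjects, styles, and the spot where you'd call your backend are all hypothetical stand-ins for whatever pipeline you actually run.

```python
from itertools import product

# Hypothetical axes to sweep overnight; swap in your own.
subjects = ["a lighthouse at dusk", "a mountain cabin in winter"]
styles = ["oil painting", "35mm film photo", "isometric pixel art"]
seeds = [1, 2, 3]

for subject, style, seed in product(subjects, styles, seeds):
    prompt = f"{subject}, {style}, highly detailed"
    # Feed each (prompt, seed) pair to whatever backend you use:
    # a diffusers pipeline call, an auto1111 API request, a ComfyUI workflow, etc.
    print(seed, prompt)
```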
gkeechin
Very interesting - thank you for sharing this. Would love to explore this as a team and perhaps put out a blog on helping others get started with control-lora
Magi604
I mean, I'm running DALLE 3 on a browser from an old laptop and I've generated probably over 15k images in 2 weeks, spanning the gamut from memes to art to lewds (with jailbreaks). The ability to completely scrap what you're building and start totally fresh at the drop of a hat with a new line of ideas and get instant results seems pretty freeing to me.
pavlov
That’s fine, but it’s like asking: “Why would anyone want to have a personal website when you can just write stuff on Facebook and Twitter and it’s so much easier?”
Stable Diffusion is an open model that you can run locally on your own computer without anyone’s permission. Dall-E is a closed model that runs on OpenAI’s very expensive server farm, and they can change how it works and what it costs whenever they please.
Right now AI is in the Uber-style expansion phase where the service is practically given away to conquer market share. Once the hypergrowth is over, OpenAI will start raising their prices just like Uber did.
Karuma
With SD I can generate at least 15k images daily on my old laptop; I can train it with new styles, characters, real people, etc.; download thousands of new styles, characters, real people, etc. from Civitai; and, best of all, never worry about losing access to it, being censored, having to jailbreak it, being snooped on, etc.
Plus a million other tools that the community has made for it, like ControlNet or things like AnimateDiff to create videos. I can also easily create all kinds of scripts and workflows.
chankstein38
I'm using DALL-E 3 through ChatGPT, but it seems to limit the number of images I can generate per half hour. I haven't figured out the actual limit, but sometimes I go to generate an image and it just says "You've reached your image generation cap, please wait _n_ minutes before trying again".
Are you getting around that somehow? Even if it'll let me generate 36 images per half hour (the cap seems like it's probably lower than that), I can only generate 6k in 2 weeks prompting 24/7. I'm not scrutinizing your numbers; I'm more hoping I'm missing some way to avoid being capped. I already pay for GPT+.
DrSiemer
> You just KNOW when an image is from SD
No, you know when a beginner generated an image in Stable Diffusion. With enough skill and attention, you will not.
Sure, there is a learning curve and it takes more time to get to a good result. But in turn, it gives you control far beyond what the competition can offer.
smcleod
I’m assuming you haven’t used SDXL?
Give it a go with InvokeAI - you can create images that I guarantee you wouldn’t know were generated. Like anything (photography included), it’s a skill.
Examples:
- https://civitai.com/images/2862100
- https://civitai.com/images/2339666
- https://civitai.com/images/2846876
ben_w
I can see at least three finger issues with the couple in the cinema.
More than that though: I use SDXL quite a bit for fun, and while I like it and it can be very good, it's still prone to getting stuck in a David Cronenberg mode for reasons I can't solve.
smcleod
Oh yeah it’s not one-shot perfect but it gets you 90% of the way there for a lot of things. I’m super impressed with it.
moritonal
At a glance I get uncanny valley from two of them. After looking closer, it's likely because in the photo of the couple at the cinema, the woman's arm around him is wearing the wrong clothes. Then in the photo of the guy with a hat, his neckpiece is asymmetric.
TeMPOraL
That's still a 1 in 3 success rate, at the cost of writing a prompt and waiting a minute.
pmx
The eyes are messed up in the second one too, which instantly gives it away.
vinckr
I can run Stable Diffusion on my local machine. It is open source and the weights are public, giving me, in theory, the ability to modify anything I want.
I can't change anything about DALL-E; I can only change the input, i.e., the prompt.
Also it is a centralized service that can be shut down, modified, censored or become very expensive at any time.
NBJack
Try SDXL. Find a good negative prompt, then just put a short sentence (starting with the kind of image, such as photograph, render, etc.) describing what you want in the positive prompt. It is much simpler and has fantastic results. Tweak to your heart's desire from there.
If you see a part of the scene that looks weird (and you know what it should be), add it to your prompt. For example, if you want "photo of a jungle in South America" and the foliage looks weird, add something like "with lush trees and ferns".
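In diffusers terms, that positive/negative prompt workflow looks roughly like the sketch below; the prompts are just illustrative, and the same idea applies in auto1111 or InvokeAI.

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")

# Positive prompt: lead with the kind of image, then a short description.
# Negative prompt: a reusable list of things you never want to see.
image = pipe(
    prompt="photo of a jungle in South America, with lush trees and ferns",
    negative_prompt="blurry, low quality, deformed, watermark, text",
    num_inference_steps=30,
    guidance_scale=7.0,
).images[0]
image.save("jungle.png")
```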
brucethemoose2
Try: https://github.com/lllyasviel/Fooocus
I also recommend a good photorealistic base model, like RealVis XL.
In my experience it's like DALL-E but straight-up better, more customizable, and local. And that's before you start trying fine-tunes and LoRAs.
Other UIs will do SDXL, but every one I tried is terrible without all those default Fooocus augmentations.
AuryGlenz
SDXL is great, but as far as straight text-to-image goes it's in no way better than DALL-E, apart from the lack of censorship.
It has plenty of other advantages, but you can't tell it "make me a cute illustration of a 2 year old girl with Blaze from Blaze and the Monster Machines on a birthday cake with a large 2 candle on it."
DALL E will nail that, more or less. SDXL very much won't.
Zababa
Here's what I got, pasting your prompt in DALL-E 3:
- https://ibb.co/k0NCWG7 - https://ibb.co/Vm3GZcR - https://ibb.co/bvSC4w3 - https://ibb.co/VqSdYbZ
I'm surprised that it didn't complain about copyrighted characters, it tends to do that a lot for me.
zirgs
SDXL understands prompts much better than 1.5, so the next version of SD might be comparable to DALL-E, without the censorship.
brucethemoose2
Heh, yeah, that is true
More cherry-picking and messing with styles gets closer, but it's nothing like DALL-E's first try, I'm sure.
raincole
> You just KNOW when an image is from SD.
You don't. People think they do, but they don't.
fassssst
DALL-E within ChatGPT uses GPT-4 to rewrite what you ask for into a good text-to-image prompt. You could probably do something similar with Stable Diffusion with just a little upfront effort tuning that system prompt.
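A rough sketch of that idea: ask an LLM to expand a plain-language request into an SD-style prompt, then hand the result to your pipeline. The system prompt, model name, and example request here are illustrative assumptions, not anything OpenAI or Stability ship for this purpose.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM = (
    "Rewrite the user's request as a Stable Diffusion prompt: a comma-separated "
    "list of subject, setting, style, lighting, and quality tags. Reply with the "
    "prompt only."
)

def expand_prompt(request: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # any capable chat model works here
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": request},
        ],
    )
    return resp.choices[0].message.content.strip()

sd_prompt = expand_prompt("a cute robot watering plants on a balcony at sunset")
print(sd_prompt)  # feed this into whatever SD pipeline or UI you use
```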
IanCal
Somewhat, but DALL-E 3 is hugely better at understanding a description and the relationships within it.
dragonwriter
LLMs in general are, and that can be leveraged by using an LLM to set up layout for Stable Diffusion.
dragonwriter
> You could probably do something similar with Stable Diffusion with just a little upfront effort tuning that system prompt.
And, indeed, someone has:
ChildOfChaos
Love all this AI stuff and would love to play with it more, but sadly I'm on a 2015 iMac: great for everything else I do, but it can't do this stuff.
It's pricey to get a Windows machine + GPU, and the cloud options seem a bit more limited and add up quickly too, but it is amazing tech.
newswasboring
I have done a bunch of Stable Diffusion stuff on Colab. The free version works if you are lucky enough to get a GPU, which used to happen more often than it does now. But the premium Colab isn't badly priced either.
Here is a Colab link to open ComfyUI:
https://github.com/FurkanGozukara/Stable-Diffusion/blob/main...
ChildOfChaos
They've blocked this on the free version of Colab now, sadly.
imranq
While SD is pretty interesting, I'm curious what people use it for. Outside of custom greeting cards and backgrounds, it's not really precise enough for conceptual art, nor is it consistent enough for animation.
Filligree
Illustrating my fanfiction.
tayo42
With the luggage example it seems to only generate backgrounds where the lighting makes sense? That's kind of interesting. I was wondering how it would handle the highlight on the right.
minimaxir
Giving Stable Diffusion constraints forces it to get creative.
It’s the best argument against “AI generated images are just collages”.
Der_Einzige
This is a general result. For example, ChatGPT struggles hard with following lexical, syntactic, or phonetic constraints in prompts due to the tokenization scheme - see https://paperswithcode.com/paper/most-language-models-can-be...
LLMs + diffusion models are supercharged by techniques like constrained generation, ControlNet, and regional prompting.
zamalek
In ComfyUI you could run the image through a style-to-style model (SDXL refinement might even pull it off) to change the lighting without changing the content. Or use another ControlNet. Your workflow can get arbitrarily complex.
Alifatisk
If I have a large dataset of photos of my face, can I generate images of myself in different places and environments using this?
gbrits
Yep. LoRAs are the easiest way to go. Loads of tutorials on YouTube. This is a good one: https://www.youtube.com/watch?v=70H03cv57-o
Alifatisk
Looks like it requires good hardware to run this? My GPU is too old for this.
loudmax
You typically want an Nvidia GPU with at least 8GB of VRAM to get started with this stuff. You can get away with less, but it will be slow going. More VRAM is better.
If you're serious about learning how to use these tools, it's far more affordable to rent a GPU in the cloud. Google Colab even has some free tiers with limited access to GPUs that are significantly more powerful than what you would normally put in a desktop.
jaggs
Artroom.ai is a great option. It has a free image generation feature, plus a ton of editing features like layers, zoom-out, etc.
ComputerGuru
The word ControlNet doesn’t appear even once in the article?
DarthNebo
They did use the Canny ControlNet Pipeline
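For reference, a Canny ControlNet pipeline in diffusers looks roughly like the sketch below; this is a generic example (not the article's snippet), and the reference image and prompt are hypothetical placeholders.

```python
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Extract Canny edges from a reference image to use as the control signal.
source = np.array(Image.open("reference.jpg").convert("RGB"))
edges = cv2.Canny(source, 100, 200)
control = Image.fromarray(np.stack([edges] * 3, axis=-1))  # 3-channel edge map

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

image = pipe(
    "the same product on a marble countertop, studio lighting",
    image=control,
    num_inference_steps=30,
).images[0]
image.save("controlled.png")
```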
ComputerGuru
I think the article was updated with that in response to my comment:
> P.S. As pointed out by a fellow HackerNews reader, we clearly forgot to include our code snippet for ControlNet in the article.
No other code snippet besides the one added in response uses Canny, at least so far as I can see.
DarthNebo
Oh I see, my bad