Stable Diffusion XL 1.0 - Hacker News

blagie

It's often said porn drives technology.

I clicked through the links in the article, since they sounded technically interesting. They led to AI-generated porn. Those, in turn, led to pages about training SD to generate porn. Now, two disclaimers:

1) I am not interested in AI-generating porn

2) I haven't followed SD in maybe 6-9 months

With those out of the way, the out-of-the-box tools for fine-tuning SD are impressive, well beyond anything I've seen in the non-porn space, and the progress seems to be entirely driven by the anime porn community:

https://aituts.com/stable-diffusion-lora

10 images is enough to fine-tune. 30-150 is preferred. This takes 15-240 minutes, depending on GPU. I do occasionally use SD for work. If this works for images other than naked and cartoon women, and for normal business graphics, this may dramatically increase the utility of SD in my workflows (at least if I get around to setting it up).
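For context on why so few images suffice: a LoRA doesn't retrain the model's full weight matrices, it learns small low-rank correction factors alongside them. A back-of-the-envelope sketch (the 768x768 projection size and rank 8 are illustrative values, not taken from the linked guide):

```python
# Rough illustration of why LoRA fine-tuning is cheap: instead of
# updating a full (d_in x d_out) weight matrix, you train two low-rank
# factors A (d_in x r) and B (r x d_out) whose product is added as a delta.
def full_finetune_params(d_in: int, d_out: int) -> int:
    return d_in * d_out

def lora_params(d_in: int, d_out: int, rank: int) -> int:
    return rank * (d_in + d_out)

# Example: one 768x768 attention projection (a typical size in SD's text encoder)
full = full_finetune_params(768, 768)   # 589,824 trainable weights
lora = lora_params(768, 768, rank=8)    # 12,288 trainable weights
print(f"LoRA trains {lora / full:.1%} of the full matrix")  # ~2.1%
```

With so few trainable weights, a small image set is enough to fit them without overfitting the whole model, which is why the training runs are short.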

I want my images to have a consistent style. If I'm making icons, I'd like to fine-tune on my baseline icon set. If I'm making slides for a deck, I'd like those to have a consistent color scheme and visual language. Now I can.

Thanks creepy porn dudes!

The other piece: Anyone trying to keep the cat in the bag? It's too late.

dragonwriter

> the progress seems to be entirely driven by the anime porn community:

It's not entirely driven by porn communities, and the porn communities driving it aren't entirely anime porn communities (and the anime communities driving it aren't entirely porn communities.)

But, yeah, the anime + porn/fetish art + furry + rpg art + scifi/fantasy art communities, and particularly the niches in the overlap of two or more of those, are pretty significant.

> If this works for images other than naked and cartoon women

It does, and while it may not be large proportionally compared to the anime-porn stuff, there’s a lot of publicly distributed fine tuned checkpoints, LoRas, etc., demonstrating that it does.

AuryGlenz

It absolutely works for things other than naked and cartoon women. Here are some generations of my daughter and dog (together!). I believe most of these are from a fine tuned model of them and not an extracted LoRA, though I use that sometimes too: https://imgur.com/a/naHgnel

squeaky-clean

The space one without headphones is particularly cool.

I use it for D&D art generation. I can have a piece of art that somewhat matches every location/scene I have planned. If things don't match my plans I can generate 8 images and pick the best in about 2 minutes. I talk to a lot of other DMs who use it in a similar way.

It's not great with specific details, I plan to commission someone to draw the party when the campaign is over. But for things like a fantasy magic shop with potions, or a fantasy dungeon exterior, or a forest of mushroom trees, it's more than good enough for concept art to throw into Roll20. I couldn't afford 5-10 pieces of custom concept art per game, nor could I come up with the ideas for them 2 hours beforehand and have them ready for the session.

ilaksh

Have you tried XL as a test for handling specific details?

blagie

> Here are some generations of my daughter and dog (together!).

I will choose to intentionally misread that :)

mkaic

The official blog post from Stability is finally up and would probably be a better URL to link to than the TechCrunch coverage: https://stability.ai/blog/stable-diffusion-sdxl-1-announceme...

MasterScrat

It’ll be "released" once the model weights show up on the repo or in HuggingFace… for now it’s "announced"

It should appear here at some point, currently only the VAE was added:

https://huggingface.co/stabilityai

MasterScrat

Yes, it's now been released

fernly

It only replies, "module 'diffusers' has no attribute 'StableDiffusionXLPipeline'"

ftufek

The release event is in like ~30 minutes on their discord, probably the announcement went out a bit early.

nickthegreek

It does appear to be live on Clipdrop.

https://clipdrop.co/stable-diffusion


naillo

You get access to the weights instantly if you apply for them. It's basically not a hurdle.

(I've been having fun with this for a few days. https://huggingface.co/stabilityai/stable-diffusion-xl-base-... Not sure there's much of a difference with the 1.0 version.)

MasterScrat

For 1.0? Where do you apply? Or are you talking about 0.9?

Ukv

The ones you can apply for access to are the 0.9 weights, which have been available for a couple of weeks. Unless the SDXL 1.0 weights are also available by application somewhere that I'm unaware of.

accrual

It sounds like after the previous 0.9 version there was some refining done:

> The refining process has produced a model that generates more vibrant and accurate colors, with better contrast, lighting, and shadows than its predecessor. The imaging process is also streamlined to deliver quicker results, yielding full 1-megapixel (1024x1024) resolution images in seconds in multiple aspect ratios.

Sounds pretty impressive, and the sample results at the bottom of the page are visually excellent.

Tenoke

They have bots in their discord for generating images based on user prompts. Those randomize some settings, compare candidate models, and are used for RLHF fine-tuning; that's the main source of refining, which will continue even after release.

dragonwriter

There were, IIRC, three different post-0.9 candidate models in parallel testing to become 1.0 recently.

amilios

I always wondered why the vision models don't seem to be following the whole "scale up as much as possible" mantra that has defined the language models of the past few years (to the same extent). Even 3.5 billion parameters is absolutely nothing compared to the likes of GPT-3, 3.5, 4, or even the larger open-source language models (e.g. LLaMA-65B). Is it just an engineering challenge that no one has stepped up for yet? Is it a matter of finding enough training data for the scaling up to make sense?

airgapstopgap

Diffusion is more parameter-efficient and you quickly saturate the target fidelity, especially with some refiner cascade. It's a solved problem. You do not need more than maybe 4B total. Images are far more redundant than text.

In fact, most interesting papers since Imagen show that you get more mileage out of scaling the text encoder part, which is, of course, a Transformer. This is what drives accuracy, text rendering, compositionality, parsing edge cases. In SD 1.5 the text encoder part (CLIP ViT-L/14) takes a measly 123M parameters.[1] In Imagen, it was T5-XXL with 4.6B [2]. I am interested in someone trying to use a really strong encoder baseline – maybe from a UL2-20B – to push this tactic further.
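As a sanity check on that 123M figure, here is a rough transformer parameter count (ignoring biases and layer norms) using the usual ~12·d² weights per block:

```python
# Back-of-the-envelope parameter count for a text-encoder Transformer:
# embeddings plus ~12*d^2 per layer (4*d^2 for attention, 8*d^2 for the MLP).
def encoder_params(vocab: int, d_model: int, layers: int, context: int) -> int:
    embeddings = vocab * d_model + context * d_model
    blocks = layers * 12 * d_model * d_model
    return embeddings + blocks

# CLIP ViT-L/14 text encoder: 49,408 vocab, d=768, 12 layers, 77-token context
clip_text = encoder_params(49_408, 768, 12, 77)
print(f"{clip_text / 1e6:.0f}M")  # ~123M, matching the figure quoted above
```

Most of that is the token embedding table plus twelve small blocks, which shows how little of SD 1.5's parameter budget goes to language understanding.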

Seeing as you can throw out diffusion altogether and synthesize images with transformers [3], there is no reason to prioritize the diffusion part as such.

1. https://forums.fast.ai/t/stable-diffusion-parameter-budget-a...

2. https://arxiv.org/abs/2205.11487

3. https://arxiv.org/abs/2301.00704

ShamelessC

> Seeing as you can throw out diffusion altogether and synthesize images with transformers [3]

That’s actually how this whole party got started. DALL-E (the first one) was a transformer model trained on image tokens from an early VAE (and text tokens ofc). Researchers from CompVis developed VQGAN in response. OpenAI showed improved fidelity with guided diffusion over ImageNet (classes) and subsequently DALL-E 2 using pixel-space diffusion and cascaded upsampling. CompVis responded with Latent Diffusion, which used diffusion in the latent space of some new VQGANs.

The paper you mention is interesting! They go back to the DALL-E 1 method but train two VQGANs for upsampling and increase the parameter count. This is faster, but only faster than originally reported benchmarks using inferior sampling methods for their diffusion. I would be curious if they can beat some of the more recent ones which require as few as 10-20 steps.

They also improve on FID/CLIP scores likely by using more parameters. This might be a memory/time trade off though. I would be curious how much more VRAM their model requires compared to SD, MJ, Kandinsky.

The same goes for using T5-XXL. You’ll win FID score contests but no one will be able to run it without an A100 or TPU pod.

airgapstopgap

> The same goes for using T5-XXL

Is this still true in 2023? Sure, back in the dark ages it seemed like an 860M model was just about the limit for a regular consumer, but I don't see why we wouldn't be able to use quantized encoders; and even 30B LLMs run okay on MacBooks now.

Etherlord87

> Images are far more redundant than text.

"A picture is worth a thousand words" - I wonder how (in)accurate this popular saying turned out to be? :D

elpocko

I'm gonna go ahead and say in 2023, one detailed picture (512x512) is worth about 30 words.

naillo

They often reference this paper as the motivation for that: https://arxiv.org/pdf/2203.15556.pdf I.e., training with 10x the data for 10x longer can yield models as good as GPT-3 but with fewer weights (according to the paper), and the same principle applies in vision.
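The paper's headline heuristic is often summarized as roughly 20 training tokens per parameter (that factor is a common approximation of the result, not an exact quote from the paper):

```python
# Chinchilla rule of thumb: compute-optimal training uses roughly
# 20 tokens per model parameter, so a smaller model trained on more
# data can match a larger, under-trained one.
def chinchilla_optimal_tokens(params: float) -> float:
    return 20.0 * params

# GPT-3 (175B params) was trained on ~300B tokens, far below this heuristic:
print(f"{chinchilla_optimal_tokens(175e9) / 1e12:.1f}T tokens")  # 3.5T
```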

brucethemoose2

Diffusion is relatively compute intensive compared to transformer LLMs, and (in current implementations) doesn't quantize as well.

A 70B parameter model would be very slow and vram hungry, hence very expensive to run.

Also, image generation is more reliant on tooling surrounding the models than pure text prompting. I don't think even a 300B model would get things quite right through text prompting alone.

amilios

Hmm this is a good point, diffusion requires several (many?) inference passes as you refine the noise into an image, right? Makes sense that this is more expensive to scale up. Thanks for the explanation!

brucethemoose2

Technically the LLMs require a pass for each token, but the passes are cheaper and benefit more from batching.
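A toy cost model makes the trade-off concrete: diffusion pays one full-model pass per denoising step, while an LLM pays one (cheaper, KV-cached) pass per generated token. The parameter counts and step/token counts below are illustrative assumptions, not measured figures:

```python
# Toy cost model in "parameter-forwards" needed to produce one output.
def diffusion_cost(model_params: float, steps: int) -> float:
    # every denoising step runs the whole UNet over the full latent
    return model_params * steps

def llm_cost(model_params: float, tokens: int) -> float:
    # one pass per generated token; KV caching makes each pass cheaper
    # in practice, so this overstates the LLM side
    return model_params * tokens

sd_xl = diffusion_cost(2.6e9, steps=25)  # ~2.6B-param UNet, 25 steps
llama = llm_cost(7e9, tokens=200)        # 7B model, 200-token completion
print(f"image: {sd_xl:.2e} vs completion: {llama:.2e}")
```

The key asymmetry the comment points at isn't total work but per-pass cost: each diffusion step touches every latent position at full width, while LLM decoding amortizes context via the KV cache and batches well.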

vitorgrs

Do we know how many parameters DALL-E, Firefly, Midjourney, etc. have these days?

If we are talking about Stable Diffusion, the reality is that more parameters mean it will be harder to run locally. And let me tell you something: the community around Stable Diffusion only cares about NSFW... and wants local for that...

Stable Diffusion 2 was totally boycotted by the community because they banned NSFW from it. They've now had to allow it again in SDXL.

Also, more parameters mean it will be more expensive for community fine-tuners to train as well.

lacker

I'm out of date on the image-generating side of AI, but I'd like to check things out. What's the best tool for image generation that's available on a website right now? Ie, not a model that I have to run locally.

a5huynh

If you want to play around with Stable Diffusion XL: https://clipdrop.co

esperent

Since Clipdrop has an API, is there any way to use it with ComfyUI or Automatic1111 (or whatever that's called)?

dash2

I just tried this and the UI is very nice (better than dreamstudio), with nice tool integration, and image quality is definitely going up with each new release. You can see a few results at fb.com/onlyrolydog (along with a lot of other canine nonsense).

the_lonely_road

https://playgroundai.com/create

Not affiliated in any way and not very involved in the space. I just wanted to generate some images a few weeks ago and was looking for somewhere I could do that for free. The link above lets you do that, but I suggest you look up prompts because it's a lot more involved than I expected.

aaarrm

Any particularly useful resources for looking into prompts?

brucethemoose2

This AI Horde UI has, IMO, some really good templates and suggestions:

https://tinybots.net/artbot

the_lonely_road

I used this: https://learnwithnaseem.com/best-playground-ai-prompts-for-a...

I just took the ones I liked and then deleted the words that were specific to that image and left the ones that were providing the style of the image. So for example on the first one I would delete "an cute kitsune in florest" but would keep "colorfully fantast concept art". Then I just added a comma-separated list of the features I wanted in my picture. It took a lot more trial and error than I thought, and adding sentences seemed to be worse than just individual words. I am sure I barely scratched the surface of interfacing with the tool correctly, but the space is moving so fast it's not the kind of thing I want to spend my time learning right now just to have that knowledge deprecate in 6 months.

PUSH_AX

Midjourney right? Although, discord isn't a website I guess.

knicholes

I've found https://firefly.adobe.com/ pretty good at composing images with multiple subjects. [disclaimer - I work at Adobe, but not in the Creative Cloud]

But I wouldn't say it's the "best." Just trained on images that weren't taken from unconsenting artists.

elishah

I was quite disappointed that the Photoshop generative fill stuff insists on running on Adobe's servers rather than locally. So however good it is, there are many of us who will never use it.

knicholes

Yeah-- I can only assume it's to ensure a consistent experience and to not disperse the model openly. If you have the model running locally on people's computers, it limits who can use the generative AI and opens up a ton of headache around customer support. Again, I don't work on this, but I'm familiar with generative AI and what it takes to run.

adzm

I'm actually a big fan of firefly. It has a different kind of style from the others, presumably due to its training dataset?

esperent

What models does dreamstudio use? I couldn't see how to view them without logging in.

dragonwriter

Dreamstudio (and ClipDrop, also) uses Stable Diffusion, getting new SD models generally before public release (both are owned by StabilityAI.)

hospitalJail

There are toy AI things, but there is nothing quite like Stable Diffusion running on Colab. Lots of people recommended Midjourney, but that is like playing with MS Paint. If you can get Stable Diffusion going with Automatic1111, it's AAA tier. Especially with ControlNet and Dreambooth, but that is part 2.

Google: The Last Ben Stable Diffusion Colab

for a way to not run it locally, but get all the features.

iambateman

Probably Midjourney, but I like Dreamstudio better.


mt3ck

Is there anything like this for the vector landscape?

This may just be due to the iterative denoising approach a lot of these models take but they only seem to work well when creating raster style images.

In my experience when you ask them to create logos, shirt designs, illustrations, they tend to not work as well and introduce a lot of artifacts, distortions, incorrect spellings etc.

orbital-decay

If you mean raster images that look like vector and contain arbitrary text and shapes, ControlNets/T2I adapters do work for this. You could train your custom ControlNet for this, too (it requires some understanding of how they work).

As for directly generating vector images, there's nothing yet. Your best bet is generating vector-looking raster and tracing it.

ilaksh

There are SD models tuned for vector-like raster output. And XL has specifically focused on this use case as one of the improvements. Try SDXL 1.0 on Clipdrop or Dreamstudio.

cheald

A lot of people are having success by adding extra networks (lora is the most common) which are trained on the type of image you're looking for. It's still a raster image, of course, but you can produce images which look very much like rasterizations of vector images, which you can then translate back into SVGs in Inkscape or similar.

thepaulthomson

Midjourney is still going to be hard to beat imo. Comparing SD to MJ is a little unfair considering their applications and flexibility, but I do really enjoy the "out of the box" experience that comes with MJ.

jyap

Different use case.

I can run SDXL 1.0 offline from my home. I can’t do this with Midjourney.

A closed source model that doesn’t have the limitation of running on consumer level GPUs will have certain advantages.

starik36

What type of setup do you have at home? What type of GPU? MJ completes a pretty high quality photo in about a minute. Does SD compare?

squeaky-clean

I use both, but Stable Diffusion has better control over the workflow. With Automatic1111 I can generate a matrix of output based on prompt variations or parameter changes. I can also do bigger batches. And I can open multiple tabs and queue up several prompt variation matrices at once, then leave for an hour. I have a laptop RTX 2070 and a 512x768 takes about 20 seconds[0] or so. Automatic1111 also includes some upscaling AI once you've found the base image you want.

StableDiffusion needs you to be way more specific than Midjourney. MJ will fill in the gaps of your prompt to get a better image. SD usually won't.

MJ photos are higher quality with easier prompting IMO, but with a distinctive style. Even if you ask it to mimic some other style, it has that midjourney feel.

I mainly use it for generating setting or character images for a D&D game. I use Midjourney more for characters.

[0] This is at ~25 iterations.
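The prompt-matrix workflow described above is essentially a Cartesian product over variations; a minimal stand-alone sketch (the prompts and settings are made up, not from any particular UI):

```python
import itertools

# Cross every prompt variation with every parameter value,
# then queue all combinations as one batch of generation jobs.
subjects = ["fantasy magic shop, potions", "dungeon exterior, moss"]
styles = ["concept art", "oil painting"]
cfg_scales = [5.0, 7.5]

jobs = [
    {"prompt": f"{subject}, {style}", "cfg_scale": cfg}
    for subject, style, cfg in itertools.product(subjects, styles, cfg_scales)
]
print(len(jobs))  # 8 queued generations
```

Each job dict would then be handed to the image generator; picking the best of the resulting grid is the manual curation step the comment describes.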

BrentOzar

With an RTX 4090, you can crank out several images per minute, even at high resolutions.

brnaftr361

I haven't done it in a while but I was cranking images out at 11s/output on a 3080. But it depends on your workflow, too. I started low res/low samples (32-64) and scaled up or used recursion until I got a desirable result or found a nice seed. I think I was doing 512x916 or something close to that.

elishah

With SD you have a lot of control over not just basics like image size and prompt complexity, but also things like how many iterations of which different sampler(s) get used.

So speed can vary wildly depending on how you're choosing to use it. And that's without even getting into the wide variance of hardware.

But generally speaking, it will usually be significantly faster than one image per minute.

Der_Einzige

Midjourney is destroyed by the ecosystem around stable diffusion, especially all the features and extensions in automatic1111. It’s not even close

WXLCKNO

You still have to run midjourney through discord right? There isn't even an official API. Feels like a joke.

ffadaie

Been using https://omnibridge.io Pretty stable!

hospitalJail

MJ quality is significantly worse. Everything has the Pixar look and barely follows the prompt. Its nice for a toy, but SD with Automatic1111 is miles ahead of MJ.

skybrian

I tried it in dreamstudio. Like all the other image generators I've tried, it's rubbish at drawing a piano keyboard or an accordion. (Those are my tests to see if it understands the geometry of machines.)

A couple of accordion pictures do look passable at a distance.

Another test: how well does it do at drawing a woman waving a flag?

One thing that strikes me is that it generates four images at a time, but there is little variety. It's a similar looking woman wearing a similar color and style of clothing, a similar street, and a large American flag. (In one case drawn wrong.) I guess if you want variety you have to specify it yourself?

AI models seem to be getting ever better in resolution and at portraits.

methyl

My go-to test is "elephant riding unicycle". Neither Midjourney nor Stable Diffusion XL is capable of doing this.

jrflowers

I hope someday there’s a version of this or something comparable to it that can run on <8gb consumer hardware. The main selling point of Stable Diffusion was its ability to run in that environment.

naillo

You can do this if you select the `pipe.enable_model_cpu_offload()` option. See this https://huggingface.co/stabilityai/stable-diffusion-xl-base-...

dragonwriter

> I hope someday there’s a version of this or something comparable to it that can run on <8gb consumer hardware.

Someday is today: from the official announcement: “SDXL 1.0 should work effectively on consumer GPUs with 8GB VRAM or readily available cloud instances.” https://stability.ai/blog/stable-diffusion-sdxl-1-announceme...

cmdr2

Easy Diffusion (previously cmdr2 UI) can run SDXL at 768x768 in about 7 GB of VRAM, and SDXL at 512x512 in about 5 GB of VRAM.

Regular SD can run in less than 2 GB of VRAM with Easy Diffusion.

1. Installation (no dependencies, python etc): https://github.com/easydiffusion/easydiffusion#installation

2. Enable beta to get access to SDXL: https://github.com/easydiffusion/easydiffusion/wiki/The-beta...

3. Use the "Low" VRAM Usage model in the Settings tab.

liuliu

SDXL 0.9 runs on iPad Pro 8GiB just fine.

JeffeFawkes

Is this using Draw Things, or another app? Did you have to quantize the model first?

liuliu

Yeah, Draw Things. It will be submitted as soon as the SDXL v1.0 weights are available. A quantized model should run on iPhones (4GiB / 6GiB models), but we haven't done that yet. So no, these are just typical FP16 weights on iPad.

minsc_and_boo

I feel like this is the greatest demand for LLMs at the moment too.

It's hard to believe we're only 8 months into this industry, so I imagine we'll start seeing smaller footprints soon.

simbolit

8 months from what point?

GPT-3 is 36 months old. DALL-E is 28 months old. Even Stable Diffusion is like 11 months old.

minsc_and_boo

Fair, I should have said 8 months since the market exploded.

brucethemoose2

We already do. MLC-LLM and llama.cpp have Vulkan/OpenCL/Metal 3-bit implementations. That can run LLaMA 7B (or maybe even 13B?) in 8GB.

TBH devices just need more RAM for coherent output though. LLaMA 13B and 33B are so much "smarter" and more coherent than 7B with 3-bit quant.

TeddyDD

13B Q5 llama reports "total VRAM used: 8321 MB", so 3-bit will most likely fit into 8GB.
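A quick way to sanity-check numbers like that: the weights take roughly params × bits/8 bytes, plus runtime overhead for activations and the KV cache. The 1 GB overhead below is a guess for illustration, not a measured value:

```python
# Rough VRAM estimate for a quantized model: weight bytes plus a
# fudge factor for activations, KV cache, and runtime overhead.
def vram_gb(params: float, bits: int, overhead_gb: float = 1.0) -> float:
    return params * bits / 8 / 1e9 + overhead_gb

print(f"13B @ 5-bit: {vram_gb(13e9, 5):.1f} GB")  # ~9.1 GB
print(f"13B @ 3-bit: {vram_gb(13e9, 3):.1f} GB")  # ~5.9 GB
```

The 5-bit weight payload alone is ~8.1 GB, consistent with the 8321 MB figure quoted; at 3 bits the weights drop to ~4.9 GB, comfortably under an 8 GB card.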

capybara_2020

Give InvokeAI a try.

https://github.com/invoke-ai/InvokeAI

Edit: Spec required from the documentation

You will need one of the following:

    An NVIDIA-based graphics card with 4 GB or more VRAM memory. 6-8 GB of VRAM is highly recommended for rendering using the Stable Diffusion XL models
    An Apple computer with an M1 chip.
    An AMD-based graphics card with 4GB or more VRAM memory (Linux only), 6-8 GB for XL rendering.

kristianp

Thanks for the recommendation.

As an aside, does this irritate anyone else?

"You must have Python 3.9 or 3.10 installed on your machine. Earlier or later versions are not supported. Node.js also needs to be installed along with yarn"

I don't like having to install npm when an existing dev stack (python) is already present.

jrflowers

I should clarify that by <8gb I meant “less than 8gb”, which is what SD 1.5 and 2 were able to do. I’m aware that it can run on ==8gb.

brucethemoose2

There are several papers on 4/8 bit quantization, and a few implementations for Vulkan/CUDA/ROCm compilation.

TBH the UIs people run for SD 1.5 are pretty unoptimized.

PeterStuer

Let's see whether derived models will suffer less from the 'same face actor' response to every portrait prompt. It's not trivial to get photoreal models to not look alike without resorting to specific, typically celeb-based, finetunes.

freediver

I am completely uninformed in this space.

Would someone be kind to explain what the current state of the art in image generation is (how does this compare to Midjourney and others)?

How do open source models stack up?

Also what are the most common use cases for image generation?

orbital-decay

SDXL is in roughly the same ballpark as MJ 5 quality-wise, but the main value is in the array of tooling immediately available for it, and the license. You can fine-tune it on your own pictures, use higher order input (not just text), and daisy-chain various non-imagegen models and algorithms (object/feature segmentation, depth detection, processing, subject control etc) to produce complex images, either procedural or one-off. It's all experimental and very improvised, but is starting to look like a very technical CGI field separate from the classic 3D CGI.

sdflhasjd

For bland stock photos and other "general-purpose" image generation, DALLE-2/Bing/Adobe etc are... the okayest. SD (with just standard model weights) is particularly weak here because of the small model size.

If you want to get arty, then state of the art for out-of-the-box typing in a prompt and clicking "generate" is probably MidJourney.

But if you're willing to spend some more time playing around with the open-source tooling, community finetunes, model augmentations (LyCORIS, etc), SD is probably going to get you the farthest.

> Also what are the most common use cases for image generation?

By sheer number of image generations? Take a guess...

BudaDude

> By sheer number of image generations? Take a guess...

Cat images right?

birracerveza

Well, yes, kind of.

Catgirl images, to be precise.

liuliu

SDXL 0.9 should be the state-of-the-art image generation model (in the open). It generates at a large 1024x1024 resolution, with high coherency and a good selection of styles out of the box. It also has reasonable text understanding compared to other models.

That being said, based on the configurations of these models, we are far from saturating what the best model can do. The problem is, FID is a terrible metric for evaluating these models, so as with LLMs, we are a bit clueless about how to evaluate them now.

GaggiX

Why do you think FID is a terrible metric? What don't you like in particular about it?

liuliu

I overspoke. FID is a fine metric for observing the training progress of your own model, and it correlates well with some coherency issues of generative models. But for cross-model comparisons, especially between models that generally do well under FID, it is not discriminative enough to separate good from better.
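For readers unfamiliar with it: FID fits a Gaussian to the feature statistics of each image set and measures the Fréchet distance between the two Gaussians. A one-dimensional toy version (real FID uses Inception features and full covariance matrices) shows why clearly different distributions can still produce tiny score gaps:

```python
import statistics

# 1-D Fréchet distance between Gaussians fitted to two samples:
# (mu1 - mu2)^2 + (sigma1 - sigma2)^2
def fid_1d(xs: list[float], ys: list[float]) -> float:
    mu_x, mu_y = statistics.fmean(xs), statistics.fmean(ys)
    sd_x, sd_y = statistics.pstdev(xs), statistics.pstdev(ys)
    return (mu_x - mu_y) ** 2 + (sd_x - sd_y) ** 2

a = [0.0, 1.0, 2.0]
b = [0.1, 1.1, 2.1]            # shifted copy of a
print(fid_1d(a, a))            # 0.0 - identical distributions
print(round(fid_1d(a, b), 4))  # 0.01 - tiny score gap for a visible shift
```

Once two models both match the reference statistics closely, their FID scores cluster near zero, and small differences say little about which one actually looks better.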


NoMoreNicksLeft

I don't know what the use case is for other people, but I've been playing around with book covers. This one took about two weeks, but it was my first real try and I was still learning how. Composition is a little off. The one I'm working on now is going faster (and better).

https://imgur.com/a/CxX5eYj

I've found that I rarely get a usable image completely as-is. It might take 5 or 10 generations to find something sort of ok, and even then I end up erasing the bad parts and letting it in-paint (which again takes multiple attempts). The T-rex had like 7 legs and two jaws, but was otherwise close to what I wanted... just keep erasing extra body parts until the in-painter finally takes a hint.

I was also going to do a few book covers for some Babylon 5 books, but it does so bad on celebrity faces. Looked like Koenig's mutant love child with Ernest Borgnine. Dunno what to do about that. I keep wondering if I shouldn't spend the next 10 years putting together my own training set of fantasy and science fiction art.

brucethemoose2

Midjourney may be better for plain prompts, but Stable Diffusion is SOTA because of the tooling and finetuning surrounding it.

hospitalJail

Idk, Midjourney ignores prompts.

For the longest time I thought it was Google Image-searching things and doing some Photoshop to make them look like Pixar, because it was so bad.
