kmeisthax
Jackson__
The only truly successful commercial use of SDXL I know of is by NovelAI. Said company appears to have used a 256xH100 cluster to finetune it to produce anime art.
Open source efforts to produce a similar model seem to have failed due to the extreme compute requirements for finetuning. For example, Waifu Diffusion, using 8xA40s[0], has not managed to bend SDXL to its will after potentially months of training.
If you need 256xH100s even to finetune the model for your use case, what's stopping you from just training your own base model? Not much, as it turns out: a few weeks ago, NovelAI's developers stated they'll train the next version of their model from scratch.
So I agree, even with the licensing changes things might be looking somewhat dire for SAI.
https://gist.github.com/harubaru/f727cedacae336d1f7877c4bbe2...
viraptor
What do you mean by extreme requirements? There are lots of SDXL fine-tunes available on Civitai, like https://civitai.com/models/119012/bluepencil-xl for anime. The relevant Discords for models/apps are full of people doing this at home.
Or are you looking at some very specific definition / threshold for fine tuning here?
Jackson__
There is something I find rather hard to communicate about the difference between these models on Civitai and what I think a competent model should be able to do.
I'd describe them as "a bike with no handlebars": they are incredibly difficult to steer to where you want.
For example if you look at the preview images like this one: https://civitai.com/images/3615715
The model seems to have completely ignored a good 35% of the text input, most egregiously the (flat chest:2.0), the parentheses denoting a strengthening of that specific part of the prompt. The values I see people use with good general models range from 1.05~1.15; 2.0, by comparison, is an extremely large value, and it _still did not work at all_ if you take a look at the actual image.
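For anyone unfamiliar with that syntax, here's a rough sketch of what the (text:weight) emphasis means. This is only illustrative, not A1111's actual parser, and the helper name is made up: the idea is that the weighted span's token embeddings get scaled by the factor before cross-attention, which is why 1.05~1.15 is a gentle nudge and 2.0 is an extreme push.

    # Illustrative only: parse "(text:weight)" emphasis chunks from a prompt.
    # Real UIs (A1111, ComfyUI) then scale the corresponding token embeddings
    # by the weight before feeding them to the model's cross-attention.
    import re

    def parse_emphasis(prompt: str):
        """Split a prompt into (text, weight) chunks; hypothetical helper."""
        chunks, pos = [], 0
        for m in re.finditer(r"\(([^():]+):([0-9.]+)\)", prompt):
            if m.start() > pos:
                chunks.append((prompt[pos:m.start()], 1.0))
            chunks.append((m.group(1), float(m.group(2))))
            pos = m.end()
        if pos < len(prompt):
            chunks.append((prompt[pos:], 1.0))
        return chunks

    print(parse_emphasis("1girl, (flat chest:2.0), outdoors"))
    # [('1girl, ', 1.0), ('flat chest', 2.0), (', outdoors', 1.0)]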
donkeyd
I'm just going out on a limb here, but a paid service needs to be good with limited input. I've used SD locally quite a lot and it takes quite a bit of work through x/y plots to find combinations of settings that produce good images somewhat consistently. Even when using decent fine tunings from CivitAI.
When I use a decent paid service, pretty much every prompt gives me a good response out of the box. Which is good, because otherwise I'd have no use for paid services, since I can run it all locally. This causes me to go to a paid service whenever I want something quick, but don't need full control. When I do want full control, I stick to my local solution, but that takes a lot more time.
dannyw
They rented 8xA40s for $3.1k. That is actually kinda peanuts; I spent more on my gaming PC. I think there were Kickstarter projects for AI finetunes that raised $200k before Kickstarter banned them?
refulgentis
Is that the correct link? I've never heard of A40s; the link is to release notes from a year and two months ago, and SD XL just came out a month or two ago. Hard for me to get to "SD XL cannot [be finetuned effectively]" from there.
Jackson__
A40: https://www.techpowerup.com/gpu-specs/a40-pcie.c3700
I have not heard about the team upgrading or downgrading from the hardware mentioned there, so I assumed it's still the same hardware they use.
>SD XL just came out a month or two ago
About 4.5 months actually.
For the SDXL cannot be finetuned efficiently claim, an attempt at a finetune was released here: https://huggingface.co/hakurei/waifu-diffusion-xl
The team was given early access to SDXL 0.9 by StabilityAI for this. You'll have to test it out for yourself if you're interested in comparing. From my experience, there is a world of difference between the NovelAI and WaifuDiffusion models in both quality and prompt understanding.
Note, the very baseline I set for the WaifuDiffusion SDXL model was to beat their SD2.1-based model[0], which, in my opinion, it did not.
DeathArrow
>Open source efforts to produce a similar model seem to have failed due to the extreme compute requirements for finetuning.
A distributed computing project similar to SETI@home wouldn't help with training?
ozr
Not really with our current techniques. The increased latency and low bandwidth between nodes make it absurdly slow.
dreampen
> Porn. It's always porn.
I posted about my AI porn site pornpen.ai here last year and it reached the top of the front page. And yes, it's still going strong :D (and we've integrated SDXL and videos recently)
Unfrozen0688
oh wow you just got another user sir/madam GYAAAT
Edit: okay, a lot of these look just bizarre as hell and seem to favour massive, grotesque breast sizes.
I wonder what damage this does to women.
orbital-decay
Emad just tweeted about the future monetization of their core models. Seems they want to use the Unity model - the original one, not the recent trick Unity pulled. AKA free to use until you make lots of money with it.
PeterStuer
Are you suggesting the only use for locally run free SD derived models is porn?
Creating illustrations for articles/presentations and stock photo alternatives are huge!
The ability to run for free on your local machine allows for far more iterations than using SaaS, and the checkpoint/finetune ecosystem that this openness has sprouted has created models that perform way better for these use cases than standard SD.
kmeisthax
No, I'm suggesting that the only models you can use for porn are locally-run.
In particular the people offering hosted models do not want to touch porn, because the first thing people do with these things is try to make nonconsensual porn of real people, which is absolutely fucking disgusting. Hence why they have several layers of filtering.
SD also has a safety filter, but it's trivially removable, and the people who make nonconsensual porn do not care about trivial things like licensing terms. My assumption is that switching to a noncommercial license would mean that Stability could later add further restrictions to the commercial use terms, i.e. "if you're licensing the model for a generator app like Draw Things, you have to package it up in such a way that removing the safety filter is difficult or impossible".
emadm
We build the best video, image and other models with more downloads and usage than anyone.
It is quite revolutionary for the creative industry, which is a few hundred billion in size - a reasonable market globally.
godelski
> Porn. It's always porn.
I've been surprised at the explosion of porn. Well, not actually. Automatic1111 made that easy, and anyone who's browsed CivitAI knows all too well what those models are being used for. I mean, when you give teenagers the ability to undress their crushes[0], what do you think is going to happen (do laws adequately protect people (kids)? Can they? Will this force a shift towards actually chasing producers, distributors, and diddlers?)?
Porn is clearly an in-demand market. What does surprise me: there's been a lot of work on depth maps and 3D rendering from 2D images in the past few years, and VR headsets are apparently quite popular (according to today's LTT episode, half as many Meta Quest 2s as PS5s have been sold?!). If VR headsets are actually that prolific, it seems like there'd be a good market for even just turning a bunch of videos into VR videos, not to mention porn (I don't have a VR headset, but I hear a lot of porn is watched on them. No Linus, I'm not going to buy second hand...). I think all it takes is for some group to optimize these models like they have for LLaMA and SD (because as a researcher I can sure tell you, we're not the optimizers; use our work as ballpark figures (e.g. GANs 10x faster than diffusion), but there's a lot of performance left on the table). You could definitely convert video frames to 3D on prosumer-grade hardware (say, a 90-minute movie in <8hrs, aka while you sleep).
There are a lot of wild things that I think AI is going to change that I'm not sure people are really considering (average people anyway, or at least it's not making it into popular conversation). ResNet-50 is still probably the most used model, btw. Not sure why, but just about every project I see that's not diffusion or an LLM is using it as a backbone, despite research models that are smaller, faster, and better (at least on ImageNet-22k and COCO).
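For anyone curious what "ResNet-50 as a backbone" looks like in practice, here's a minimal sketch with torchvision (assuming torchvision >= 0.13; the input size is arbitrary): the classifier head is dropped and the pooled 2048-d features feed whatever downstream head the project needs.

    # Minimal backbone sketch: pretrained ResNet-50 with the ImageNet
    # classification head replaced by an identity, leaving 2048-d features.
    import torch
    import torch.nn as nn
    from torchvision.models import resnet50, ResNet50_Weights

    backbone = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2)
    backbone.fc = nn.Identity()  # keep the features, drop the classifier

    with torch.no_grad():
        feats = backbone(torch.randn(1, 3, 224, 224))
    print(feats.shape)  # torch.Size([1, 2048])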
brucethemoose2
SDXL and ControlNet are already optimized, if that's what you mean: https://github.com/chengzeyi/stable-fast
(Note the links to various SD compilers).
But the whole field is moving so fast that people aren't even adopting the compilers and optimized implementations at large.
godelski
Not really what I mean. I mean, TensorRT is faster than that according to their README. By optimized I'm specifically pointing to llama.cpp, because 1) it's in C, 2) it uses quantized models, and 3) there's a hell of a lot of optimizations in there. The thing runs on a Raspberry Pi! I mean, not well, but damn. SD is still pushing my 3080Ti, for comparison.
But I wasn't thinking diffusion. Those models are big and slow. GANs still reign in terms of speed and model size. I mean, the StyleGAN-T model is 75M params (lightweight) or 1bn (full) (with 123M for text). That paper notes that the 56 images they use in Fig 2 take 6 seconds on a 3090 at 512 resolution. I have a 3080Ti and I can tell you that's about how long it takes me to generate a batch of 4 with an optimized TensorRT model. That's a big difference, especially considering those are done with interpolations. I mean, the GAN vs diffusion debate is often a little silly, as realistically it's more a matter of application: I'll take diffusion in my Photoshop, but I'll take StyleGAN for my real-time video upscaling.
But yes, I do understand how fast the field is moving. You can check my comment history to verify if register isn't sufficient indication.
kmeisthax
>do laws adequately protect people (kids)? Can they? Will this force a shift towards actually chasing producers, distributors, and diddlers?
It's extremely complicated. Actual CSAM is very illegal, and for good reason. However, artistic depictions of such are... protected 1st Amendment expression[0]. So there's an argument - and I really hate that I'm even saying this - that AI generated CSAM is not prosecutable, as if the law works on SCP-096 rules or something. Furthermore, that's just a subset of all revenge porn, itself a subset of nonconsensual porn. In the US, there's no specific law banning this behavior unless children are involved. The EU doesn't have one either. A specific law targeted at nonconsensual porn is drastically needed, but people keep failing to draft one that isn't either a generalized censorship device or a damp squib.
You can cobble together other laws to target specific behavior - for example, there was a wave of women in the US copyrighting their nudes so they could file DMCA 512 takedown requests at Facebook. But that's got problems - first off, you have to put your nudes in the Library of Congress, which is an own goal; and it only works for revenge porn that the (adult) victim originally made, not all nonconsensual porn. I imagine EU GDPR might be usable for getting nonconsensual porn removed from online platforms, but I haven't seen this tried yet.
I'm disgusted, but not surprised, that teenage kids are generating CSAM like this. Even before we had diffusion models, we had GANs and deepfakes, which were almost immediately used for generating shittons of nonconsensual porn[1].
[0] https://en.wikipedia.org/wiki/Ashcroft_v._Free_Speech_Coalit... and the later https://en.wikipedia.org/wiki/United_States_v._Handley
CaptainFever
> AI generated CSAM is not prosecutable
This is true, though "AI CSAM" is an oxymoron. There is no abuse in the creation of such works, and as such it is not abuse material, unless of course real children are involved.
numpad0
Does non-consensual porn not qualify as defamation? That, plus obscenity laws where they exist, should be able to handle most hyperrealistic porn, so that only speech remains.
michaelbrave
Could this perhaps fall under something like trademark, like an unauthorized use of one's likeness? I'm sure I've heard of some celebrity cases along similar lines.
godelski
> I'm disgusted, but not surprised, that teenage kids are generating CSAM like this. Even before we had diffusion models, we had GANs and deepfakes, which were almost immediately used for generating shittons of nonconsensual porn
I think the big difference now is that 1) it's much easier to do now, and 2) the computational requirements and (more importantly) technical skills have dramatically dropped.
We should also be explicitly aware that deep fakes are still new. GANs in 2014 were not creating high definition images. They were doing fuzzy black-and-white 28x28 faces, poorly, and 32x32 color images in which, if you squinted hard enough, you could see a dog (https://arxiv.org/abs/1406.2661). MNIST was a hard problem at that time, and that's only 10 years ago. It took another 4 years to get realistic faces and objects (https://arxiv.org/abs/1710.10196) (mind you, those images are not random samples), another year to get to high resolution, another 2 to get to diffusion, and another 2 before those exploded. Deep fakes were really only a thing within the last 5 years, and certainly not on consumer hardware. I don't think the legal system moves much in 10 years, let alone 5 or 2. I think a lot of us have not accurately encoded how quickly this whole space has changed. (Image synthesis is my research area, btw.)
I'm not surprised that these teenagers in a small town did this. But the fact that all those adjectives exist in that order is distinct. Discussions of deep fakes like that Tom Scott video were barely a warning (5 years is not a long time). It quickly went from researchers thinking it can happen in the next decade and starting discussions to real world examples making the news in under their prediction time (I don't think anyone expected how much money and man hours would be dumped into AI).
fnordpiglet
OpenAI has a pretty robust and profitable business without Microsoft. In every enterprise I've been involved with over the last few years, we have had some incredibly material and important use cases for OpenAI LLMs (as well as Claude). They aren't spewing slop or whatever; they're genuinely achieving valuable and foundational business outcomes. I've been a bit stunned at how fast we've achieved these things, and it tells me that the AI hype isn't hype, and that if we have done these things in a year, it's hard to estimate how much impact the technologies will have in five, but I think it's substantial. So is our spend with OpenAI. Or rather, with Azure on OpenAI products. The only value Microsoft adds to our experience is IAM - which is sufficient, frankly.
Stability and Midjourney are also making money, but it's largely with amateurs and people prototyping content for their own creations. A lot of single-person indie game developers are using these tools to generate assets, or at minimum first-pass assets. I think a lot of media companies are producing the art for their articles or newsletters etc. using these tools. Whether this is enough, I don't know.
__loam
Sigh, I'm really tired of seeing people assume OpenAI is profitable. We have no idea if they are or not, and we have some indication that they're incinerating money on ChatGPT to the point that they're turning off sign-ups because they're out of compute.
refulgentis
I'm not sure that's a useful argument, insomuch as A) it's unfalsifiable until they IPO (reporting indicates they are very, very profitable), and B) running out of GPUs seems like an odd thing to name as an indicator that you're _losing_ money.
People are genuinely way, way, way underestimating how intense the GPU shortage is.
fnordpiglet
My understanding, which I can't prove other than to say it comes from folks affiliated with OpenAI, is that ChatGPT doesn't make money but also doesn't lose money (in aggregate; some accounts use way more than others, but many accounts are fairly idle), and their API business is profitable and accounts for most of their GPU utilization. I have no insight into why they would turn off signups for ChatGPT, other than that they may need the capacity for their enterprise customers, where they make a decent margin.
torginus
Isn't literally every imagegen AI that's not DALL-E or Midjourney based on Stable Diffusion?
cubefox
There are exceptions, e.g. https://generated.photos/human-generator uses a GAN based model.
Edit: Also, Adobe uses its own model for Photoshop integration (inpainting via cloud). That model seems to be the same as this one: https://www.adobe.com/sensei/generative-ai/firefly.html
hospitalJail
Are we sure that those aren't based on Stable Diffusion?
No code, black box, and we get to tease the closed-source companies for wrapping FOSS stuff.
Midjourney, I'm most convinced, is just SD with a fine-tuned model. That would explain why everything looks like Pixar and it can't follow the prompt.
fragmede
Given that Midjourney predates Stable Diffusion, that seems unlikely, though it is possible they threw away all the hard work of creating their own model to use one that's available to other people for free, and then charge money for it.
minimaxir
Hugging Face released a Colab Notebook for generation from SDXL Turbo using the diffusers library: https://colab.research.google.com/drive/1yRC3Z2bWQOeM4z0FeJ0...
Playing around with the generation params a bit, Colab's T4 GPU can batch-generate up to 6 images at a time at roughly the same speed as one.
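For those who'd rather skip the notebook, here's a minimal diffusers sketch along the same lines (assuming diffusers >= 0.24 and a CUDA GPU; the prompt and batch size are just examples):

    # SDXL Turbo via diffusers: 1 step, guidance disabled (CFG = 0.0).
    import torch
    from diffusers import AutoPipelineForText2Image

    pipe = AutoPipelineForText2Image.from_pretrained(
        "stabilityai/sdxl-turbo", torch_dtype=torch.float16, variant="fp16"
    ).to("cuda")

    prompt = "a photo of a red fox in a snowy forest"
    image = pipe(prompt, num_inference_steps=1, guidance_scale=0.0).images[0]
    image.save("fox.png")

    # Batch generation, as described above: several prompts in one call.
    images = pipe([prompt] * 6, num_inference_steps=1, guidance_scale=0.0).images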
sp332
Is Euler Ancestral recommended for this? I thought the Ancestral samplers added noise to every step, preventing convergence.
albert_e
thanks for sharing
I got to experience the power of current models with just 5 lines of code
the pace of change is stressing me out :)
webmaven
Does Turbo (the model, not this particular notebook) support negative prompts?
thot_experiment
I've been mucking with this stuff again, and SDXL + LCM sampling & LoRA makes 1280x800 images in like 2 seconds, so about a ~5x speed increase for me (so this would be roughly 2x faster than LCM (??, napkin math)). I've found that the method isn't as good at complex prompts. They claim here this can outperform SDXL 1.0 WRT prompt alignment, but I'm curious what their test methodology is; I searched the paper and couldn't immediately find how it was evaluated. I think these sorts of subjective measurements are fiendishly hard to quantify given the infinity of possible prompts. Still, exciting stuff always happening here, what a time to be alive.
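For reference, the LCM setup being described is roughly the standard LCM-LoRA recipe for SDXL; a sketch with diffusers (the prompt, step count, and resolution are illustrative, not the parent's exact config):

    # SDXL + LCM-LoRA: swap in the LCM scheduler, load the LCM LoRA,
    # then sample in ~4 steps at low CFG.
    import torch
    from diffusers import DiffusionPipeline, LCMScheduler

    pipe = DiffusionPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0",
        torch_dtype=torch.float16, variant="fp16",
    ).to("cuda")
    pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
    pipe.load_lora_weights("latent-consistency/lcm-lora-sdxl")

    image = pipe(
        "a cinematic photo of a lighthouse at dusk",
        num_inference_steps=4,   # LCM works well in the ~2-8 step range
        guidance_scale=1.0,      # low CFG is typical for LCM
        width=1280, height=800,
    ).images[0]
    image.save("lighthouse.png")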
htrp
the use case here is really for segmented inpainting.
don't like a part of an image? replace it instantly
thot_experiment
There are SO many use cases! I maintain we're not even scratching the surface here. You could programmatically reskin video based on crowd participation, you could re-texture VR spaces on the fly. The space of cool shit that you can do with this stuff is growing far faster than we're able to explore it right now.
zorgmonkey
It has been a while since I last tried, but I never had very good results inpainting with SDXL compared to the SD1.5 inpainting models.
gliptic
They seem to be comparing against SDXL 1.0 at 512x512 which makes no sense to me as SDXL 1.0 is horrible at 512x512.
> Using four sampling steps, ADD-XL outperforms its teacher model SDXL-Base at a resolution of 512^2 px
minimaxir
The license is non-commercial, but:
> For clarity, Derivative Works do not include the output of any Model.
https://huggingface.co/stabilityai/sdxl-turbo/blob/main/LICE...
Doesn't that mean that generated images from it should be fine for commercial use?
nperez
I'm not a lawyer, but I believe there was a court ruling that says AI-generated works cannot be copyrighted. So you could use them, but couldn't stop anyone else from doing what they want with them.
yreg
> there was a court ruling
What jurisdiction? USA?
kmeisthax
No, it just means that if they sue you, they're pre-committing to not try and foreclose on your own generated outputs by claiming they're derivatives that they would then own.
Of course, this is a water sandwich. If model outputs are derivatives of the model, it'd be difficult to argue that the model itself isn't a derivative of all the training data, most of which isn't licensed. So if anything, this covers Stability's ass, not yours. There's also the related question of if AI models - not their outputs, just the models themselves - have any copyright at all. The logic behind the non-copyrightability of AI art would also apply to the AI training process, so the only way you could get copyright would be a particularly creative way of organizing and compiling the training dataset.
Remember: while the "AI art isn't art" argument reeks to high heavens of artistic snobbery, it's not entirely wrong. There isn't a lot of creative control in the process. Furthermore, we don't give copyright to monkeys[2], so why should we give it to AI models?
"Noncommercial" isn't actually a thing in copyright law. Copyrighted works are inherently commercial artifacts[0], so if you just say "noncommercial use is fine", you've said nothing - and you've invited the legal equivalent of nasal goblins[1] into the courtroom. Creative Commons gets around this by defining their own concept of NonCommercial use. So what did Stability's lawyers cook up?
> “Non-Commercial Uses” means exercising any of the rights granted herein for the purpose of research or non-commercial purposes. Non-Commercial Uses does not include any production use of the Software Products or any Derivative Works.
Uh... yeah. That's replacing a meaningless phrase with a tautology. Fun. The only concrete grant of rights is research use, and they categorically reject "any production use", which is awfully close to all uses. Even using this to generate funny fanart mashups for your own personal enjoyment could be construed as a 'production use'. Stability could actually sue you for that (however unlikely that would be).
[0] In the eyes of the law. I actually hate this opinion, but it's the opinion the law takes.
[1] Under ISO 9899, it is entirely legal for C programs with undefined behavior to make goblins fly out of your nose.
[2] https://en.wikipedia.org/wiki/Monkey_selfie_copyright_disput...
__loam
I think the artists get to say whatever they want about this bullshit considering it's entirely dependent on their work to even function. I don't think the AI community gets to call the people who make the models possible at all snobs or anything else. The AI people didn't make shit. They should have some respect for the people that do.
senseiV
The so-called "AI people" built the entire architecture, something people didn't think was possible at this scale and quality a year ago, and "artists should get whatever they want" because it trained on their works isn't the point. Diffusion models don't rip parts of pictures together; they happen to be trained to make art out of noise, finding patterns in art. The same thing is happening with LLMs in court, with book authors making similar claims about the LLaMA model.
I still remember that piece of art that was submitted to an art contest and won, only for it to be announced later as an SD generation.
esjeon
IANAL but it sounds more like SAI is not responsible for any outputs generated using their models and how those are used.
Severian
Works with Automatic1111. Generated 20 512x512 images on a lowly RTX 2070S with 8GB VRAM.
Prompt: a man
Steps: 1, Sampler: Euler a, CFG scale: 1, Seed: -1, Size: 512x512, Model hash: e869ac7d69, Model: sd_xl_turbo_1.0_fp16, Clip skip: 2, RNG: NV, Version: v1.6.0
Examples: https://imgur.com/a/UuuT9qu
thot_experiment
I've done a bit of fiddling around with it and am definitely holding back judgement for now. The 1- and 2-step images seem WAY more coherent than LCM, but they're kinda trash for any kind of prompt complexity, so you start having to use more steps. Since the individual steps take the same amount of time (I think there's a specific sampler for this which may be faster & better?), by the time you start prompting details you end up using 4 steps and the perf is about the same as LCM, and that breaks down the same way as you go for more complexity (text, coherent bg details, etc.), because you end up needing 10-15 steps, and at that point you're going to get a much better result from full-fat SDXL x dpmpp3msdee (lol)
Curious to see the bigbrain people tackle this over the next few days and wring all the perf out of it, maybe samplers tailored to this model will give a notable boost.
radicality
Are you finding dpm++ 3M SDE better than dpm++ 2M SDE in sdxl?
Afaik the second order (2M) version is the recommended one to use for guided sampling vs the 3rd order one.
From here: https://huggingface.co/docs/diffusers/v0.23.1/en/api/schedul...
> It is recommended to set solver_order to 2 for guide sampling, and solver_order=3 for unconditional sampling.
Filligree
It might be placebo, but I find 3M better for upscaling, where I usually set CFG quite low and use a generic prompt that doesn't describe any localised element of the picture.
Which is what it's meant for, I suppose.
thot_experiment
In my tests it's basically been 50/50. I probably did ~40 or so comparisons when I was testing samplers, and I felt like there were a couple that seemed really good on the 3rd-order one, but idfk, it was very very close; I don't know if I saw a single gen where one of the two was bad but the other wasn't.
techbro92
I just want to point out that I've noticed SDXL isn't good at producing images that are 512x512 for some reason. It works much better with at least 768x768 resolution.
minimaxir
Normal SDXL requires 1024x1024 output or the quality degrades significantly.
doctorhandshake
Hi Max! Thanks for all the tuts! Small correction - SDXL wants ~1 megapixel resolutions at a variety of aspect ratios.
Severian
Indeed, however one of the listed limitations is: "The generated images are of a fixed resolution (512x512 pix), and the model does not achieve perfect photorealism."
Ologn
I did this with Automatic1111 as well. With an RTX 4090 with 24GB VRAM, steps 1, CFG scale 1, size 512x512, model sd_xl_turbo_1.0, I was generating more than four images a second.
liuliu
Instruction here for how to use it on iPhone / iPad / Mac today: https://twitter.com/drawthingsapp/status/1729633231526404400 (note that you need to convert 8-bit version yourself if you want to use it on < 8GiB devices).
donkeyd
I remember when you shared the first version on HN, I was totally impressed with what my little phone could do. It was my first step of many into the incredible world of SD. I never expected this app to be maintained this long and especially this actively. And all of this for free.
So thank you very much for your work!
PS, will you be adding Turbo directly to the app in the near future?
liuliu
> PS, will you be adding Turbo directly to the app in the near future?
I need to get some clarity from Stability given the new Membership thing. I personally see no reason why I cannot as long as I tell people these weights are non-commercial / academic only.
ilaksh
Does anyone have any idea of when/if there will be a simple way to get commercial access in one step? Like an API or something? Or if they want to charge for it, then maybe a web check out?
It's interesting that they finally decided to try to make something commercially restricted.
Do they have a watermark or anything that they can use to track down people who use without a license?
Also, is there anything like an open source (commercial allowed) effort at reproducing this?
zaptrem
Open version predates this one: https://huggingface.co/blog/lcm_lora
ilaksh
Yes but Turbo is 4 times faster.
Filligree
It's... fast, yeah, but the quality is low. Not sure what I'd be using this for.
LordDragonfang
Based on the demo, that's... incredibly fast. Literally generating images faster than I can type a prompt. They've clearly got a set seed, so they're probably caching requests, but even with prompts that they couldn't possibly have cached it's within a second or so.
gremlinsinc
I hope playgroundai.com adopts this asap, but not sure they can with that non-commercial bit...
stets
This isn't cached. I'm running it locally on a 4060 TI 16 GB and it's just as fast. Image gens in .6-.8 seconds. Each word or character I type is a new image gen and it's INSTANT.
lemoncookiechip
The Clipdrop demo doesn't inspire much confidence with how bad the generations are; we're talking 2021 levels. Not to mention everything is NSFW somehow - they should probably work on those filters.
minimaxir
I tested a bit and the quality for photorealistic images is surprisingly bad, and definitely worse than LCM and of course normal SDXL. For more artistic images, SDXL Turbo fares better.
Unlike normal SDXL, you're required here to use the old-fashioned syntactic sugar like "8k hd" and "hyperrealistic" to align things.
dragonwriter
In my own testing (using ComfyUI), the best of the "fast gen" techniques for SDXL is using the Turbo model[0], but with the LCM sampler and the sgm_uniform scheduler (which is normal for LCM), running it for 4-10 iterations instead of just one. I think the StabilityAI demos are using Euler A with the normal base scheduler and running a single iteration (which is cool for a max-speed demo, and it's awesome for that speed, but it's leaving a lot of quality on the table that you can get with a few more iterations, especially with the LCM/sgm_uniform sampler/scheduler combo). Bumping CFG up slightly helps too (but I think that adds another performance hit, because I think the demos are running at CFG 1, which AIUI disables CFG and reduces computations per iteration).
> Unlike normal SDXL, you're required here to use the old-fashioned syntatic sugar like "8k hd" and "hyperrealistic" to align things.
That's not "syntactic sugar", and it's not particularly my experience that it's needed with SDXL Turbo.
[0] actually, differencing the base sdxl model from the turbo model to get a "turbo modifier", and then combining that with a good SDXL-based checkpoint, because StabilityAI's base models are pretty ho-hum compared to decent community checkpoints derived from them, but that is kind of a peripheral issue.
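For anyone wanting to try that "turbo modifier" idea outside ComfyUI, it's just an add-difference merge; here's a rough sketch over raw state dicts (file names and alpha are placeholders, and in practice you'd load safetensors and mind dtypes/key mismatches):

    # Add-difference merge: checkpoint + alpha * (turbo - base_sdxl).
    import torch

    base = torch.load("sdxl_base_state_dict.pt")        # placeholder paths
    turbo = torch.load("sdxl_turbo_state_dict.pt")
    checkpoint = torch.load("community_finetune_state_dict.pt")

    alpha = 1.0  # how much of the "turbo modifier" to apply
    merged = {}
    for key, w in checkpoint.items():
        if key in base and key in turbo and base[key].shape == w.shape:
            merged[key] = w + alpha * (turbo[key] - base[key])
        else:
            merged[key] = w  # keys unique to the checkpoint pass through

    torch.save(merged, "community_finetune_plus_turbo.pt")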
jauntywundrkind
Are there any good resources for learning a lot of "syntactic sugar" terms? This is new to me, but I'd love to know more.
genewitch
It is completely dependent on the model. Civit dot ai has model showcases as well as fine-tune showcases, and you can click any image or press the (i) to see the generation info.
Some models like natural language prompts - "draw me a pterodactyl tanning at a beach", some prefer shorthand (danbooru style clip) - "1man, professor, classroom, chalkboard, white_hair, suit", and some work with a mixture of the above as well as the syntactical sugar -"masterpiece, 8k, trending on artstation, space image, a man floating next to a spaceship in space, bokeh, rim lighting, cinematic lighting, Nikon D60, f / 2"
Fine-tuning models - LoRA, etc, allow one to convert prompts from one style to another if they wish, but usually it's to compress an idea, style, person, object, etc in to a single "token", so you can work on other aspects of the image.
Check out civit AI and you can sort of get an idea of the cargo cultism as well as what sort of keywords actually make a difference.
Jackson__
Yeah, it looks like the enshittification of StabilityAI is in full force by now. Especially considering the continually worse licensing.
I expect that if they ever manage to release an image gen model that's an objective improvement, let's say 80% as good as DALL-E 3, it will be subscription API only.
BoorishBears
Are you serious?
I'm using Stability in production: they kept their SDXL beta model, which was capable of SDXL 1.0-level prompt adherence at a fraction of the cost, up for months longer than was reasonable for a one-off undocumented beta, and it was a huge boon to my product.
Then a few weeks back they went and quietly cut prices to 1/5th or so of what they were for SDXL, and released a model (SD 1.6) that produces similar-quality outputs to SDXL for my specific use case in a fraction of the time.
They're on fire as far as I'm concerned, just quietly making their product cheaper and faster.
—
Also, DALL-E 3 is in a very awkward place for programmatic access, so awkward I wouldn't call it competitive with SD for many use cases: it's got a layer of prompt interference baked in, it's expensive, and latency is not very consistent. Text is a cool trick, but it's still not reliable enough to expose as a core part of the generation for an end user.
ShamelessC
Sounds like they’re doing the same thing OpenAI is doing. Claiming to favor open models but the reality is they’re pumping growth by reducing costs and this lowering prices. They want a massive chunk of this new market, all of it if they can get it. Their perceived valuation then becomes a matter of how many eyeballs they have looking at segments of their website to advertise to, or how many data points they can collect on their users to sell to advertisers. It’s unlikely they can capture the whole market and still make a chunky enough profit to satisfy investors if they also intend to keep prices high enough without needing to resort to enshitification.
DeathArrow
That's really great, we can generate porn faster and better. :)
iAkashPaul
From testing out the 1024px outputs, it seems to work okay for text2img tasks only; other tasks get deep-fried results. Also got it working using SDXL-LCM, with similar results.
Noncommercial use - aside from being one of my licensing pet peeves - seems to indicate that the money is drying up. My guess is that the investors over at Stability are tired of subsidizing the part of the generative AI market that OpenAI refuses to touch[0].
The thing is, I'm not entirely sure there's a paying portion of the market? Yes, I've heard of people paying for ChatGPT because it answers programming questions really well, but that's individual users, who are cost sensitive. The real money was supposed to be selling these things as worker replacement, but Hollywood unions have (rightfully) shut AI companies out of the markets where a machine that can write endless slop might have made lots of money.
OpenAI can remain in Microsoft's orbit for as long as the stupid altruist / accelerationist doomer debate doesn't tear them apart[1]. Google has at least a few more years of monopoly money before either the US breaks them up or the EU bankrupts them with fees. I don't know who the hell is pumping more money into either Anthropic or Stability.
[0] Porn. It's always porn. OpenAI doesn't want to touch it for very obvious reasons.
[1] For what it's worth, Microsoft has shown that all the AI safety guardrails can be ripped out by the money people at a moment's notice, given how quickly they were able to make the OpenAI board blink.