Hacker News Daily Digest

btbuildem

In the video towards the bottom of the page, there are two birds (blue jays), but in the background there are two identical buildings (which look a lot like the CN Tower). CN Tower is the main landmark of Toronto, whose baseball team happens to be the Blue Jays. It's located near the main sportsball stadium downtown.

I vaguely understand how text-to-image works, so it makes sense that the vector space for "blue jays" would be near "toronto" or "cn tower". The improvements in scale and speed (image -> now video) are impressive, but given how incredibly capable the image generation models are, they simultaneously feel crippled and limited by their lack of editing / iteration ability.

Has anyone come across a solution where model can iterate (eg, with prompts like "move the bicycle to the left side of the photo")? It feels like we're close.

TacticalCoder

> Has anyone come across a solution where model can iterate (eg, with prompts like "move the bicycle to the left side of the photo")? It feels like we're close.

I feel like we're close too, but for another reason.

For although I love SD and these video examples are great... it's a flawed method: they never get lighting right, and there are incoherent things just about everywhere. Any 3D artist or photographer can spot that immediately.

However I'm willing to bet that we'll soon have something much better: you'll describe something and you'll get a full 3D scene, with 3D models, source of lights set up, etc.

And the scene shall be sent into Blender and you'll click on a button and have an actual rendering made by Blender, with correct lighting.

Wanna move that bicycle? Move it in the 3D scene exactly where you want.

That is coming.

And for audio it's the same: why generate an audio file when soon models shall be able to generate the various tracks, with all the instruments and whatnot, allowing you to create the audio file?

That is coming too.

epr

> you'll describe something and you'll get a full 3D scene, with 3D models, source of lights set up, etc.

I'm always confused why I don't hear more about projects going in this direction. Controlnets are great, but there's still quite a lot of hallucination and other tiny mistakes that a skilled human would never make.

boppo1

Blender files are dramatically more complex than any image format; images are basically just 2D arrays of 3-value vectors. The Blender filetype uses a weird DNA/RNA struct system that would probably require its own training run.

More on the Blender file format: https://fossies.org/linux/blender/doc/blender_file_format/my...
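For the curious, the 12-byte header at the start of a .blend file is actually simple; the complexity lives in the DNA-described blocks that follow it. A minimal sketch of parsing just the header, with the field layout as documented in the linked format doc:

```python
# Parse the 12-byte .blend header described in the Blender file-format doc:
#   bytes 0-6  : magic "BLENDER"
#   byte  7    : pointer size, '_' = 4-byte pointers, '-' = 8-byte pointers
#   byte  8    : endianness, 'v' = little-endian, 'V' = big-endian
#   bytes 9-11 : version digits, e.g. "293" for Blender 2.93
def parse_blend_header(header: bytes) -> dict:
    if header[:7] != b"BLENDER":
        raise ValueError("not a .blend file")
    return {
        "pointer_size": 4 if header[7:8] == b"_" else 8,
        "endianness": "little" if header[8:9] == b"v" else "big",
        "version": header[9:12].decode("ascii"),
    }

# Fabricated example header: 64-bit pointers, little-endian, version 2.93
print(parse_blend_header(b"BLENDER-v293"))
# → {'pointer_size': 8, 'endianness': 'little', 'version': '293'}
```

Everything after those 12 bytes is a sequence of typed blocks whose layouts are described by the embedded DNA structs, which is the part that would be hard for a model to emit directly.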

dragonwriter

> I'm always confused why I don't hear more about projects going in this direction.

Probably because they aren't as advanced and the demos aren't as impressive to nontechnical audiences who don't understand the implications: there’s lots of work on text-to-3d-model generation, and even plugins for some stable diffusion UIs (e.g., MotionDiff for ComfyUI).

lairv

I think the bottleneck is data.

For single 3D objects, the biggest dataset is Objaverse-XL, with 10M samples.

For full 3D scenes you could at best get ~1,000 scenes with datasets like ScanNet, I guess.

Text2Image models are trained on datasets with 5 billion samples.

jowday

There are a lot of issues with it, but perhaps the biggest is that there aren't troves of easily scrapable and digestible 3D models lying around on the internet to train on, like we have with text, images, and video.

Almost all of the generative 3D models you see are actually generative image models that essentially (a very crude simplification) perform something like photogrammetry to generate a 3D model: "does this 3D object, rendered from 25 different views, match the text prompt as evaluated by this model trained on text-image pairs?"

This is a shitty way to generate 3D models, and it's why they almost all look kind of malformed.
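The loop being described is roughly DreamFusion-style score distillation. As a toy illustration only: `render_view` and `clip_score` below are hypothetical stand-ins (real systems backprop through a differentiable renderer and a CLIP-like text-image model), but the shape of the optimization is the same.

```python
import numpy as np

# Toy sketch of multi-view optimization: render the "object" from many
# angles, score each view against what the prompt wants, and ascend the
# score. All components here are stand-ins, not real models.

rng = np.random.default_rng(0)
target = rng.normal(size=8)          # stand-in for "what the prompt wants"

def render_view(params, angle):
    # Hypothetical "render": the params plus a tiny angle-dependent term.
    return params + 0.01 * np.sin(angle)

def clip_score(view):
    # Hypothetical text-image score: higher when the view matches target.
    return -np.sum((view - target) ** 2)

params = np.zeros(8)                 # the "3D object" being optimized
for step in range(200):
    total_grad = np.zeros_like(params)
    for angle in np.linspace(0, 2 * np.pi, 25, endpoint=False):
        view = render_view(params, angle)
        # Analytic gradient of clip_score w.r.t. params; real systems
        # get this by backprop through the renderer and scorer.
        total_grad += -2 * (view - target)
    params += 0.001 * total_grad     # gradient ascent on the score

print(float(np.abs(params - target).max()))  # small: views now match
```

The failure mode the comment points at lives in the stand-ins: when the only supervision is "do 2D renders score well", nothing forces the recovered geometry to be clean, which is why the results often look malformed.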

sanitycheck

From my very clueless perspective, it seems very possible to train an AI to use Blender to create images in a mostly unsupervised way.

So we could have something to convert AI-generated image output into 3D scenes without having to explicitly train the "creative" AI for that.

Probably much more viable, because the quantity of 3D models out in the wild is far far lower than that of bitmap images.

eigenvalue

I think this recent Gaussian Splatting technique could end up working really well for generative models, at least once there is a big corpus of high quality scenes to train on. Seems almost ideal for the task because it gets photorealistic results from any angle, but in a sparse, data efficient way, and it doesn’t require a separate rendering pipeline.

bozhark

One was on the front page the other day, I’ll search for a link

insanitybit

I assume because it's still extremely early.

bob1029

> However I'm willing to bet that we'll soon have something much better: you'll describe something and you'll get a full 3D scene, with 3D models, source of lights set up, etc.

I agree with this philosophy: teach the AI to work with the same tools the human does. We already have a lot of human experts to refer to. Training material is everywhere.

There isn't a "text-to-video" expert we can query to help us refine the capabilities around SD. It's a one-shot, Jupiter-scale model with incomprehensible inertia. Contrast this with an expert-tuned model (i.e. natural language instructions) that can be nuanced precisely, to the point of imperceptibility, with a single sentence.

The other cool thing about the "use existing tools" path is that if the AI fails part way through, it's actually possible for a human operator to step in and attempt recovery.

whywhywhywhy

Nah, I disagree; this feels like glorification of the process, not the end result. Having the 3D model in the scene with all the lighting makes the end result feel more solid to you only because you can see the work that's going into it.

In the end diffusion technology can make a more realistic image faster than a rendering engine can.

I feel pretty strongly that this pipeline will be the foundation for most of the next decade of graphics, and making things by hand in 3D will become extremely niche, because let's face it: anyone who has worked in 3D knows it's tedious, it's time-consuming, it takes large teams, and it's not even well paid.

The future is just tools that give us better controls and every frame will be coming from latent space not simulated photons.

I say this as someone who had done 3D professionally in the past.

pegasus

Nah, I agree with GP, who didn't suggest making 3D scenes by hand but the opposite: create those 3D scenes using the generative method, then use ray-tracing or the like to render the image. Maybe have another pass through a model to apply touch-ups to make it more gritty and less artificial. This way things stay consistent and sane, avoiding all those flaws that are so easy to spot today.

bbor

I find that very unlikely. LLMs seem capable of simulating human intuition, but not great at simulating real, complex physics. Human intuition of how a scene “should” look isn’t always the effect you want to create, and I'm guessing it is rarely accurate.

coldtea

>For although I love SD and these video examples are great... It's a flawed method: they never get lighting correctly and there are many incoherent things just about everywhere. Any 3D artist or photographer can immediately spot that.

The question is whether 99% of the audience would even care...

COAGULOPATH

Of course they would. The internet spent a solid month laughing at the Sonic the Hedgehog movie because Sonic had weird-looking teeth.

atentaten

What's your reasoning for feeling that we're close?

cptaj

We do it for text, audio and bitmapped images. A 3D scene file format is no different; you could train a model to output a Blender file format instead of a bitmap.

It can learn anything you have data for.

Heck, we do it with geospatial data already, generating segmentation vectors. Why not 3D?

btbuildem

That indeed sounds like a very plausible solution -- working with AI on the level of scene definitions, model geometries etc.

However, 3D is just one approach to rendering visuals. There are so many other styles and methods by which people create images, and if I understand correctly, we can do image-to-text to analyze image content, as well as text-to-image to generate it, regardless of the original method (3D render or paintbrush or camera lens). There are some "fuzzy primitives" in the layers there that translate to the visual elements.

I'm hoping we see "editors" that let us manipulate / edit / iterate over generated images in terms of those.

wruza

Not that I’m against the described 3d way, but personally I don’t care about light and shadows until it’s so bad that I do. This obsession with realism is irrational in video games. In real life people don’t understand why light works like this or like that. We just accept it. And if you ask someone to paint how it should work, the result is rarely physical but acceptable. It literally doesn’t matter until it’s very bad.

Kuinox

This isn't coming, it's already here: https://github.com/gsgen3d/gsgen Yes, it's just 3D models for now, but it can do whole-scene generation; it's just not great at it yet. The tech is there, it just needs to improve.

xianshou

Emu edit should be exactly what you're looking for: https://ai.meta.com/blog/emu-text-to-video-generation-image-...

smcleod

It doesn’t look like the code for that is available anywhere though?

01100011

I recently tried to generate clip art for a presentation using GPT-4/DALL-E 3. I found it could handle some updates but the output generally varied wildly as I tried to refine the image. For instance, I'd have a cartoon character checking its watch and also wearing a pocket watch. Trying to remove the pocket watch resulted in an entirely new cartoon with little stylistic continuity to the first.

Also, I originally tried to get the 3 characters in the image to be generated simultaneously, but eventually gave up as DALL-E had a hard time understanding how I wanted them positioned relative to each other. I just generated 3 separate characters and positioned them in the same image using Gimp.

btbuildem

Yes that's exactly what I'm referring to! It feels as if there is no context continuity between the attempts.

filterfiber

> Has anyone come across a solution where model can iterate (eg, with prompts like "move the bicycle to the left side of the photo")? It feels like we're close.

Emu can do that.

The blue jay/Toronto thing may be addressable later (I suspect via more detailed annotations a la DALL-E 3) - these current video models are highly focused on figuring out temporal coherence.

amoshebb

I wonder what other odd connections are made due to city-name almost certainly being the most common word next to sportsball-name.

Do the parameters think that Jazz musicians are mormon? Padres often surf? Wizards like the Lincoln Memorial?

dsmmcken

Adobe is doing some great work here, in my opinion, in terms of building AI tools that make sense for artist workflows. This "sneak peek" demo from the recent Adobe Max conference is pretty much exactly what you described; actually better, because you can just click on an object in the image and drag it.

See video: https://www.adobe.com/max/2023/sessions/project-stardust-gs6...

btbuildem

Right, that's embedded directly into the existing workflow. Looks like a very powerful feature indeed.

thatoneguy

Makes me wonder if they train their data on everything anyone has ever uploaded to Creative Cloud.

undefined

[deleted]

achileas

> Has anyone come across a solution where model can iterate (eg, with prompts like "move the bicycle to the left side of the photo")? It feels like we're close.

Nearly all of the available models have this, even the highly commercialized ones like Adobe Firefly and Canva; it's called inpainting in most tools.

btbuildem

I think that's more "inpainting", where an existing software solution uses AI to accelerate certain image editing tasks. I was looking for whole-image manipulation at the "conceptual" level.

achileas

They have this. Inpainting is just a subset of the image-to-image workflow and you don't have to provide a region if you want to do whole-image manipulation.

omneity

Nice eye!

As for your last question yes that exists. There are two models from Meta that do exactly this, instruction based iteration on photos, Emu Edit[0], and videos, Emu Video[1].

There's also LLaVa-interactive[2] for photos where you can even chat with the model about the current image.

[0]: https://emu-edit.metademolab.com/

[1]: https://emu-video.metademolab.com/

[2]: https://llava-vl.github.io/llava-interactive/

valine

The rate of progress in ML this past year has been breathtaking.

I can’t wait to see what people do with this once controlnet is properly adapted to video. Generating videos from scratch is cool, but the real utility of this will be the temporal consistency. Getting stable video out of stable diffusion typically involves lots of manual post processing to remove flicker.

alberth

What was the big “unlock” that allowed so much progress this past year?

I ask as a noob in this area.

4death4

I think these are the main drivers behind the progress:

- Unsupervised learning techniques, e.g. transformers and diffusion models. You need unsupervised techniques in order to utilize enough data. There have been other unsupervised techniques in the past, e.g. GANs, but they don't work as well.

- Massive amounts of training data.

- The belief that training these models will produce something valuable. It costs between hundreds of thousands to millions of dollars to train these models. The people doing the training need to believe they're going to get something interesting out at the end. More and more people and teams are starting to see training a large model as something worth pursuing.

- Better GPUs, which enables training larger models.

- Honestly the fall of crypto probably also contributed, because miners were eating a lot of GPU time.

mkaic

I don't think transformers or diffusion models are inherently "unsupervised", especially not the way they're used in Stable Diffusion and related models (which are very much trained in a supervised fashion). I agree with the rest of your points though.

JCharante

> The belief that training these models will produce something valuable

Exactly. The growth in the next decade is going to be unimaginable, because now governments and MNCs believe that progress can realistically be made in this field.

Cyphase

One factor is that Stable Diffusion and ChatGPT were released within three months of each other – August 22, 2022 and November 30, 2022, respectively. That brought a lot of attention and excitement to the field. More excitement, more people, more work being done, more progress.

Of course those two releases didn't fall out of the sky.

JCharante

DALL-E 2 also went viral around the same time.

mlboss

The Stable Diffusion open source release and the LLaMA release.

alberth

But what technically allowed for so much progress?

There’s been open source AI/ML for 20+ years.

Nothing comes close to the massive milestones over the past year.

password54321

There has been massive progress in ML every year since 2013, partly due to better GPUs and lots of training data. Many are only taking notice now that it is in products, but it wasn't that long ago that there was skepticism on HN, even when software like Codex existed in 2021.

moritonal

Where do you want to start? The Internet collecting and structuring the world's knowledge into a few key repositories? The focus on GPUs in gaming, and then the crypto market, creating a suite of libraries dedicated to hard-scaling math? Or the miniaturization and focus on energy efficiency driven by phones, making scaled training cost-effective? Finally, the papers released by Google and co, which didn't seem to recognize quite how easy it would be to build on and replicate them. Nothing was unlocked; a lot of people suddenly noticed how doable all this already was.

marricks

I mean, you probably didn't pay much attention to battery capacity before phones, laptops, and electric cars, right? Battery capacity was steadily improving before you paid attention, though. It's just when something actually becomes relevant that we notice.

Not that more advances don't happen with sustained hype; it's just that there's some sort of tipping point involving usefulness, based either on improvement of the thing in question or its utility elsewhere.

throwaway290

MS subsidizing it with 10 billion USD, and an (un)healthy contempt for copyright.

Der_Einzige

ControlNet is adapted to video today; the issue is just that it's very slow. Haven't you seen the insane quality of videos on civitai?

valine

I have seen them; the workflows to create those videos are extremely labor-intensive. ControlNet lets you maintain poses between frames; it doesn't solve the temporal consistency of small details.

mattnewton

People use AnimateDiff's motion module (or other models that have cross-frame attention layers). Consistency is close to being solved.

capableweb

> Haven't you seen the insane quality of videos on civitai?

I have not, so I went to https://civitai.com/ which I guess is what you're talking about? But I cannot find a single video there, just images and models.

adventured

https://civitai.com/images

Go there, in the top right of the content area it has two drop-downs: Most Reactions | Filters

Under filters, change the media setting to video.

Civitai has a notoriously poor layout for finding/browsing things unfortunately.

dragonwriter

A small percentage of the images are animations. This is (for obvious reasons) particularly common for images used on the catalog pages for animation-related tools and models, but it's also not uncommon for (mostly AnimateDiff-based) animations to be used to demo the output of other models.

kornesh

Yeah, solving the flickering problem and achieving temporal consistency will be the key to realizing the full potential of generative video models.

Right now, AnimateDiff is leading the way in consistency but I'm really excited to see what people will do with this new model.

hanniabu

> but the real utility of this will be the temporal consistency

The main utility will be misinformation.

undefined

[deleted]

firefoxd

I understand the magnitude of the innovation that's going on here. But it still feels like we are generating these videos with both hands tied behind our backs. In other words, it's nearly impossible to edit the videos within these constraints. (Imagine trying to edit the blue jays to get the perfect view.)

Since videos are rarely consumed raw, what if this becomes a pipeline in Blender instead? (Blender the 3d software). Now the video becomes a complete scene with all the key elements of the text input animated. You have your textures, you have your animation, you have your camera, you have all the objects in place. We can even have the render engine in the pipeline to increase the speed of video generation.

It may sound like I'm complaining, but I'm just making a feature request...

huytersd

What would solve all these issues is full generation of 3D models, which we'll hopefully get a chance to see over the next decade. I’ve been advocating for a solid LiDAR camera on the iPhone so there is a lot of training data for these models.

ricardobeat

> I’ve been advocating for a solid LiDAR camera on the iPhone

What do you mean by “advocating”? The iPhone has had a LiDAR camera since 2020.

xvector

That's probably why they qualified with "solid", the iPhone's LiDAR camera is quite terrible.

jwoodbridge

we're working on this - dream3d.com

ericpauley

I'm still puzzled as to how these "non-commercial" model licenses are supposed to be enforceable. Software licenses govern the redistribution of the software, not products produced with it. An image isn't GPL'd because it was produced with GIMP.

yorwba

The license is a contract that allows you to use the software provided you fulfill some conditions. If you do not fulfill the conditions, you have no right to a copy of the software and can be sued. This enforcement mechanism is the same whether the conditions are that you include source code with copies you redistribute, or that you may only use it for evil, or that you must pay a monthly fee. Of course this enforcement mechanism may turn out to be ineffective if it's hard to discover that you're violating the conditions.

comex

It also somewhat depends on open legal questions like whether models are copyrightable and, if so, whether model outputs are derivative works of the model. Suppose that models are not copyrightable, due to their not being the product of human creativity (this is debatable). Then the creator can still require people to agree to contractual terms before downloading the model from them, presumably including the usage limitations as well as an agreement not to redistribute the model to anyone else who does not also agree. Agreement can happen explicitly by pressing a button, or potentially implicitly just by downloading the model from them, if the terms are clearly disclosed beforehand. But if someone decides on their own (not induced by you in any way) to violate the contract by uploading it somewhere else, and you passively download it from there, then you may be in the clear.

ronsor

> Then the creator can still require people to agree to contractual terms before downloading the model from them, presumably including the usage limitations as well as an agreement not to redistribute the model to anyone else who does not also agree.

I don't think it's possible to invent copyright-like rights.

SXX

It doesn't have to be enforceable. This licensing model works exactly the same as Microsoft Windows licensing or WinRAR licensing. Lots and lots of people have pirated Windows or bought cheap keys off eBay, but no one in their sane mind would use anything like that at their company.

In the same way, you can easily violate any "non-commercial" clauses of models like this one as a private person or some tiny startup, but a company that decides to use them for its business will more likely just go and pay.

So it's possible to ignore the license, but the legal and financial risks are not worth it for businesses.

taberiand

I've heard companies also intentionally do not go after individuals pirating software, e.g. Adobe Photoshop. It benefits them to have students pirate and skill up on their software and then enter companies that buy Photoshop because their employees know it, rather than locking down and having those students, and then the businesses, switch to open source.

Duanemclemore

I'm sure there are plenty of other examples, but in my personal experience this was Autodesk's strategy with AutoCAD: get market saturation by being extremely lax on piracy, then, once you're the only one standing, lower the boom. It was almost like flipping a switch on a single DAY in the mid-00s when they went from totally lax on unpaid users to suing the bejeezus out of anyone they had good enough documentation on.

One smart thing they did was check online job listings: if a firm advertised for AutoCAD experience, they'd check its licenses. I knew firms that got calls from Autodesk legal the DAY AFTER posting an opening.

dist-epoch

Visual Studio Community (and many other products) only allows "non-commercial" usage. Sounds like it limits what you can do with what you produce with it.

At the end of the day, a license is a legal contract. If you agree that an image which you produce with some software will be GPL'ed, it's enforceable.

As an example, see the Creative Commons license, ShareAlike clause:

> If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.

blibble

> At the end of the day, a license is a legal contract. If you agree that an image which you produce with some software will be GPL'ed, it's enforceable.

you can put whatever you want in a contract, doesn't mean it's enforceable

antonyt

Do you have link for the VS Community terms you're describing? What I've found is directly contradictory: "Any individual developer can use Visual Studio Community to create their own free or paid apps." From https://visualstudio.microsoft.com/vs/community/

dist-epoch

Enterprise organizations are not allowed to use VS Community for commercial purposes:

> In enterprise organizations (meaning those with >250 PCs or >$1 Million US Dollars in annual revenue), no use is permitted beyond the open source, academic research, and classroom learning environment scenarios described above.

kmeisthax

So, there's a few different things interacting here that are a little confusing.

First off, you have copyright law, which grants monopolies on the act of copying to the creators of the original. In order to legally make use of that work you need to either have permission to do so (a license), or you need to own a copy of the work that was made by someone with permission to make and sell copies (a sale). For the purposes of computer software, you will almost always get rights to the software through a license and not a sale. In fact, there is an argument that usage of computer software requires a license and that a sale wouldn't be enough because you wouldn't have permission to load it into RAM[0].

Licenses are, at least under US law, contracts. These are Turing-complete priestly rites written in a special register of English that legally bind people to do or not do certain things. A license can grant rights, or, confusingly, take them away. For example, you could write a license that takes away your fair use rights[1], and courts will actually respect that. So you can also have a license that says you're only allowed to use software for specific listed purposes but not others.

In copyright you also have the notion of a derivative work. This was invented whole-cloth by the US Supreme Court, who needed a reason to prosecute someone for making a SSSniperWolf-tier abridgement[2] of someone else's George Washington biography. Normal copyright infringement is evidenced by substantial similarity and access: i.e. you saw the original, then you made something that's nearly identical, ergo infringement. The law regarding derivative works goes a step further and counts hypothetical works that an author might make - like sequels, translations, remakes, abridgements, and so on - as requiring permission in order to make. Without that permission, you don't own anything and your work has no right to exist.

The GPL is the anticopyright "judo move", invented by a really ornery computer programmer that was angry about not being able to fix their printer drivers. It disclaims almost the entire copyright monopoly, but it leaves behind one license restriction, called a "copyleft": any derivative work must be licensed under the GPL. So if you modify the software and distribute it, you have to distribute your changes under GPL terms, thus locking the software in the commons.

Images made with software are not derivative works of the software, nor do they contain a substantially similar copy of the software in them. Ergo, the GPL copyleft does not trip. In fact, even if it did trip, your image is still not a derivative work of the software, so you don't lose ownership over the image because you didn't get permission. This also applies to model licenses on AI software, inasmuch as the AI companies don't own their training data[3].

However, there's still something that licenses can take away: your right to use the software. If you use the model for "commercial" purposes - whatever those would be - you'd be in breach of the license. What happens next is also determined by the license. It could be written to take away your noncommercial rights if you breach the license, or it could preserve them. In either case, however, the primary enforcement mechanism would be a court of law, and courts usually award money damages. If particularly justified, they could demand you destroy all copies of the software.

If it went to SCOTUS (unlikely), they might even decide that images made by software are derivative works of the software after all, just to spite you. The Betamax case said that advertising a copying device with potentially infringing scenarios was fine as long as that device could be used in a non-infringing manner, but then the Grokster case said it was "inducement" and overturned it. Static, unchanging rules are ultimately a polite fiction, and the law can change behind your back if the people in power want or need it to. This is why you don't talk about the law in terms of something being legal or illegal, you talk about it in terms of risk.

[0] Yes, this is a real argument that courts have actually made. Or at least the Ninth Circuit.

The actual facts of the case are even more insane: basically a company trying to sue former employees for fixing its customers' computers. Imagine if Apple sued Louis Rossmann for pirating macOS every time he turned on a customer's laptop. The only reason they can't is because Congress actually created a special exemption for computer repair and made it part of the DMCA.

[1] For example, one of the things you agree to when you buy Oracle database software is to give up your right to benchmark the software. I'm serious! The tech industry is evil and needs to burn down to the ground!

[2] They took 300 pages worth of material from 12 books and copied it into a separate, 2 volume work.

[3] Whether or not copyright on the training data images flows through to make generated images a derivative work is a separate legal question in active litigation.

dragonwriter

> Licenses are, at least under US law, contracts

Not necessarily; gratuitous licenses are not contracts. Licenses which happen to also meet the requirements for contracts (or be embedded in agreements that do) are contracts or components of contracts, but that's not all licenses.

rperez333

If a company trains a model from scratch, on its own dataset, could the resulting model be used commercially?

cubefox

Nobody claimed otherwise?

not2b

There are sites that make Stable Diffusion-derived models available, along with GPU resources, and they sell the service of generating images from the models. The company isn't permitting that use, and it seems that they could find violators and shut them down.

littlethoughts

Fantasy.ai was subject to controversy for attempting to license models.

Der_Einzige

They're not enforceable.

stevage

A software licence can definitely govern who can use it and what they can do with it.

> An image isn't GPL'd because it was produced with GIMP.

That's because of how the GPL is written, not because of some limitation of software licences.

accrual

Fascinating leap forward.

It makes me think of the difference between ancestral and non-ancestral samplers, e.g. Euler vs Euler Ancestral. With Euler, the output is somewhat deterministic and doesn't vary with increasing sampling steps, but with Ancestral, noise is added to each step which creates more variety but is more random/stochastic.
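That deterministic-vs-ancestral distinction can be sketched in a few lines. This is a toy stand-in, not a real denoiser (the hypothetical `model` just points toward a target value), but it shows why plain Euler gives the same result every run while the ancestral variant, which re-injects fresh noise at each step, does not:

```python
import random

# Toy sampler sketch: walk a value from "pure noise" toward the data.
# `model` is a hypothetical stand-in for a learned denoiser.

def model(x):
    return 1.0 - x  # hypothetical "predicted direction toward the data"

def sample(steps, ancestral, seed=0):
    rng = random.Random(seed)
    x = 5.0  # start from "pure noise"
    for _ in range(steps):
        x += 0.2 * model(x)              # deterministic Euler step
        if ancestral:
            x += 0.05 * rng.gauss(0, 1)  # ancestral: fresh noise each step
    return x

euler_a = sample(30, ancestral=False)
euler_b = sample(30, ancestral=False)
anc_a = sample(30, ancestral=True, seed=1)
anc_b = sample(30, ancestral=True, seed=2)
print(euler_a == euler_b)  # Euler repeats exactly
print(anc_a == anc_b)      # ancestral depends on the noise draws
```

Both variants converge near the target; the ancestral runs just land in different places depending on the injected noise, which is the "more variety but more stochastic" behavior described above.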

I assume that to create video, the sampler needs to lean heavily on the previous frame while injecting some kind of sub-prompt, like "rotate <object> to the left by 5 degrees", etc. I like the phrase another commenter used: "temporal consistency".

Edit: Indeed the special sauce is "temporal layers". [0]

> Recently, latent diffusion models trained for 2D image synthesis have been turned into generative video models by inserting temporal layers and finetuning them on small, high-quality video datasets

[0] https://stability.ai/research/stable-video-diffusion-scaling...

adventured

The hardest problem the Stable Diffusion community has dealt with in terms of quality has been in the video space, largely in relation to the consistency between frames. It's probably the most commonly discussed problem for example on r/stablediffusion. Temporal consistency is the popular term for that.

So this example was posted an hour ago, and it's jumping all over the place frame to frame (somewhat weak temporal consistency). The author appears to have used pretty straightforward text2img + Animatediff:

https://www.reddit.com/r/StableDiffusion/comments/180no09/on...

Fixing that frame to frame jitter related to animation is probably the most in-demand thing around Stable Diffusion right now.

Animatediff motion painting made a splash the other day:

https://www.reddit.com/r/StableDiffusion/comments/17xnqn7/ro...

It's definitely an exciting time around SD + animation. You can see how close it is to reaching the next level of generation.

shaileshm

This field moves so fast. Blink an eye and there is another new paper. This is really cool, and the learning speed of us humans is insane! Really excited about using it for downstream tasks! I wonder how easy it would be to integrate AnimateDiff with this model?

Also, can someone benchmark it on M3 devices? It would be cool to see if it's worth getting one to run these diffusion inferences and do development. If the M3 Pro allows finetuning, it would be amazing to use it on downstream tasks!

awongh

It makes sense that they had to take out all of the cuts and fades from the training data to improve results.

In the background section of the research paper they mention "temporal convolution layers" — can anyone explain what that is? What sort of training data is the input to represent temporal states between images that make up a video? Or does that mean something else?

flaghacker

It means that instead of (only) doing convolution in spatial dimensions, it also(/instead) happens in the temporal dimension.

A good resource for the "instead" case: https://unit8.com/resources/temporal-convolutional-networks-...

The "also" case is an example of 3D convolution, an example of a paper that uses it: https://www.cv-foundation.org/openaccess/content_iccv_2015/p...
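A minimal numpy sketch of the "also" case: spatial-only processing treats each frame independently, while a temporal convolution slides a kernel along the time axis, so each output frame blends information from neighboring input frames. (Illustrative only — real models use learned 3D kernels over feature channels, not a fixed 3-tap filter.)

```python
import numpy as np

# Fake video: 8 frames of 4x4 "pixels", shape (T, H, W).
video = np.arange(8 * 4 * 4, dtype=float).reshape(8, 4, 4)

# Spatial-only processing: each frame handled independently
# (a stand-in for a per-frame 2D convolution).
per_frame = np.stack([f * 2.0 for f in video])

# Temporal convolution: a 3-tap kernel applied along the time axis
# at every spatial location ("valid" padding -> 6 output frames).
kernel = np.array([0.25, 0.5, 0.25])
out = np.zeros((video.shape[0] - 2, 4, 4))
for t in range(out.shape[0]):
    out[t] = (kernel[0] * video[t]
              + kernel[1] * video[t + 1]
              + kernel[2] * video[t + 2])

print(out.shape)  # (6, 4, 4): each output frame mixes 3 input frames
```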

machinekob

I would assume it's something similar to joining multiple frames/attentions in the channel dimension and then shifting values around so the convolution has access to some channels from other video frames.

I was working on a similar idea a few years ago using this paper as a reference, and it worked extremely well for consistency while also helping with flicker. https://arxiv.org/abs/1811.08383
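The channel-shifting idea described above can be sketched in a few lines of numpy — a simplified take on the temporal-shift trick from the linked paper; the shapes and shift fraction here are arbitrary choices for illustration:

```python
import numpy as np

def temporal_shift(x, shift_frac=0.25):
    """Shift a fraction of channels one frame back/forward in time,
    in the spirit of TSM (arXiv:1811.08383). x has shape (T, C, H, W)."""
    t, c, h, w = x.shape
    n = int(c * shift_frac)
    out = np.zeros_like(x)
    out[1:, :n] = x[:-1, :n]            # these channels see the PAST frame
    out[:-1, n:2 * n] = x[1:, n:2 * n]  # these channels see the FUTURE frame
    out[:, 2 * n:] = x[:, 2 * n:]       # the rest stay on the current frame
    return out

x = np.random.default_rng(0).standard_normal((5, 8, 2, 2))
y = temporal_shift(x)
# After the shift, an ordinary per-frame 2D convolution over y mixes
# information from neighboring frames essentially for free.
print(y.shape)  # (5, 8, 2, 2): same shape, temporally mixed channels
```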

epiccoleman

This is really, really cool. A few months ago I was playing with some of the "video" generation models on Replicate, and I got some really neat results[1], but it was very clear that the resulting videos were made from prompting each "frame" with the previous one. This looks like it can actually figure out how to make something that has a higher level context to it.

It's crazy to see this level of progress in just a bit over half a year.

[1]: https://epiccoleman.com/posts/2023-03-05-deforum-stable-diff...

christkv

Looks like I'm still good for my bet with some friends that before 2028, a team of 5-10 people will create, on a shoestring budget, a blockbuster-style movie that today costs $100+ million, and we won't be able to tell the difference.

ben_w

I wouldn't bet either way.

Back in the mid 90s to 2010 or so, graphical improvements were hailed as photorealistic only to be improved upon with each subsequent blockbuster game.

I think we're in a similar phase with AI[0]: every new release in $category is better, gets hailed as super fantastic world changing, is improved upon in the subsequent Two Minute Papers video on $category, and the cycle repeats.

[0] all of them: LLMs, image generators, cars, robots, voice recognition and synthesis, scientific research, …

Keyframe

Your comment reminded me of this: https://www.reddit.com/r/gaming/comments/ktyr1/unreal_yes_th...

Many more examples, of course.

ben_w

Yup, that castle flyby, those reflections. I remember being mesmerised by the sequence as a teenager.

Big quality improvement over Marathon 2 on a mid-90s Mac, which itself was a substantial boost over the Commodore 64 and NES I'd been playing on before that.

Sohcahtoa82

> Back in the mid 90s to 2010 or so, graphical improvements were hailed as photorealistic

Whenever I saw anybody calling those graphics "photorealistic", I always had to roll my eyes and question if those people were legally blind.

Like, c'mon. Yeah, they could be large leaps ahead of the previous generation, but photorealistic? Get real.

Even today, I'm not sure there's a single game that I would say has photo-realistic graphics.

ben_w

> Even today, I'm not sure there's a single game that I would say has photo-realistic graphics.

Looking just at the videos (because I don't have time to play the latest games any more and even if I did it's unreleased), I think that "Unrecord" is also something I can't distinguish from a filmed cinematic experience[0]: https://store.steampowered.com/app/2381520/Unrecord/

Though there are still caveats even there, as the pixelated faces are almost certainly necessary given the state of the art; and because cinematic experiences are themselves fake, I can't tell if the guns are "really-real" or "Hollywood".

Buuuuut… I thought much the same about Myst back in the day, and even the bits that stayed impressive for years (the fancy bedroom in the Stoneship age), don't stand out any more. Riven was better, but even that's not really realistic now. I think I did manage to fool my GCSE art teacher at the time with a printed screenshot from Riven, but that might just have been because printers were bad at everything.

deckard1

I'm imagining more of an AI that takes a standard movie screenplay and a sidecar file, similar to a CSS file for the web, and generates the movie. This sidecar file would contain the "direction" of the movie: camera angles, shot length and speed, color grading, etc. Don't like how the new Dune movie looks? Edit the stylesheet and make it your own. Personalized, remixed blockbusters.

On a more serious note, I don't think Roger Deakins has anything to worry about right now. Or maybe ever. We've been here before. DAWs opened up an entire world of audio production to people that could afford a laptop and some basic gear. But we certainly do not have a thousand Beatles out there. It still requires talent and effort.

timeon

> thousand Beatles out there. It still requires talent and effort

As well as marketing.

CamperBob2

It'll happen, but I think you're early. 2038 for sure, unless something drastic happens to stop it (or is forced to happen).

marcusverus

I'm pumped for this future, but I'm not sure that I buy your optimistic timeline. If the history of AI has taught us anything, it is that the last 1% of progress is the hardest half. And given the unforgiving nature of the uncanny valley, the video produced by such a system will be worthless until it is damn-near perfect. That's a tall order!

accrual

The first full-length AI generated movie will be an important milestone for sure, and will probably become a "required watch" for future AI history classes. I wonder what the Rotten Tomatoes page will look like.

jjkaczor

As for the reviews, it will be hard to say, as both positive and negative takes will be uploaded by ChatGPT bots (or its myriad descendants).

qiine

"I wonder what the Rotten Tomatoes page will look like"

Surely it will be written using machine vision and llms !

throwaway743

Definitely a big first for benchmarks. After that, hyper-personalized content/media generated on demand.

henriquecm8

What I'm really looking forward to is a Star Trek-style holodeck, but I guess we'll start with it in VR headsets first.

Geordi: "Computer, in the Holmesian style, create a mystery to confound Data with an opponent who has the ability to defeat him"

rbhuta

VRAM requirements are big for this launch. We're hosting this for free at https://app.decoherence.co/stablevideo. Disclaimer: Google log-in required to help us reduce spam.

xena

How big is big?

whywhywhywhy

40 GB, although I'm hearing reports that a 3090 can handle low frame counts.

zvictor

It's worth paying for a subscription just for these free videos. Would those have the watermark removed if I go "Basic"?

spaceman_2020

A seemingly off topic question, but with enough compute and optimization, could you eventually simulate “reality”?

Like, at this point, what are the technical counters to the assertion that our world is a simulation?

KineticLensman

(disclaimer: worked in the sim industry for 25 years, still active in terms of physics-based rendering).

First off, there are zero technical proofs that we are in a sim, just a number of philosophical arguments.

In practical terms, we cannot yet simulate a single human cell at the molecular level, given the massive number of interactions that occur every microsecond. Simulating our entire universe is not technically possible within the lifetime of our universe, according to our current understanding of computation and physics. You either have to assume that ‘the sim’ is very narrowly focussed in scope and fidelity, and / or that the outer universe that hosts ‘the sim’ has laws of physics that are essentially magic from our perspective. In which case the simulation hypothesis is essentially a religious argument, where the creator typed 'let there be light' into his computer. If there isn't such a creator, the sim hypothesis 'merely' suggests that our universe, at its lowest levels, looks somewhat computational, which is an entirely different argument.

freedomben

I don't think you would need to simulate the entire universe, just enough of it that the consciousness receiving sense data can't encounter any missing info or "glitches" in the metaphorical matrix. Still hard of course, but substantially less compute intensive than every molecule in the universe.

gcanyon

And if you’re in charge of the simulation, you get to decide how many “consciousnesses” there are, constraining them to be within your available compute. Maybe that’s ~8 billion — maybe it’s 1. Yeah, I’m feeling pretty Boltzmann-ish right now…

KineticLensman

> but substantially less compute intensive than every molecule in the universe

Very true, but to me this view of the universe and one's existence within it as a sort of second-rate solipsist bodge isn't a satisfyingly profound answer to the question of life the universe and everything.

Although put like that it explains quite a lot.

[Edit] There is also a sense in which the sim-as-a-focussed-mini-universe view is even less falsifiable, because sim proponents address any doubt about the sim by moving the goal posts to accommodate what they claim is actually achievable by the putative creator/hacker on Planet Tharg or similar.

kaashif

And you don't have to simulate it in real time, maybe 1 second here takes years or centuries to simulate outside the simulation. It's not like we'd have any way to tell.

jdaxe

Maybe something like quantum mechanics are an "optimization" of the sim, i.e the sim doesn't actually compute the locations, spin etc of subatomic particles but instead just uses probabilities to simulate it. Only when a consciousness decides to look more closely does it retroactively decide what those properties really were.

Kind of like how video games won't render the full resolution textures when the character is far away or zoomed out.

I'm sure I'm not the first person to have thought this.

tracerbulletx

The brain does simulate reality in the sense that what you experience isn't direct sensory input, but more like a dream being generated to predict what it thinks is happening based on conflicting and imperfect sensory input.

accrual

To illustrate your point, an easily accessible example of this is how the second hand on clocks appears to freeze for longer than a second when you quickly glance at it. The brain is predicting/interpolating what it expects to see, creating the illusion of a delay.

https://www.popsci.com/how-time-seems-to-stop/

danielbln

Take vision, for example: it comes in from the optic nerve warped and upside down, as small patches of high resolution captured by the eyes zigzagging across the visual field (saccades), all of which is assembled and integrated into a coherent field of vision by our trusty old grey blob.

beepbooptheory

Why does it matter? Not trying to dismiss, but truly, what would it mean to you if you could somehow verify the "simulation"?

If it would mean something drastic to you, I would be very curious to hear your preexisting existential beliefs/commitments.

People say this sometimes, and it's slowly been revealed to me that it's just a new kind of geocentrism: it's not just a simulation people have in mind, but one where earth/humans are centered, and the rest of the universe is just for the benefit of "our" part of the simulation.

Which is a fine theory I guess, but is also just essentially wanting God to exist with extra steps!

2-718-281-828

> Like, at this point, what are the technical counters to the assertion that our world is a simulation?

How about: this theory is neither verifiable nor falsifiable.

vidarh

The general concept is not falsifiable, but many variations might be, or their inverse might be. E.g. the theory that we are not in a simulation would in general be falsifiable by finding an "escape" from a simulation and so showing we are in one (but not finding an escape of course tells us nothing).

It's not a very useful endeavour to worry about, but it can be fun to speculate about what might give rise to testable hypotheses and what that might tell us about the world.

undefined

[deleted]

sesm

There can be no technical counters to the assertion that our world is a simulation. If our world is a simulation, then the hardware/software that simulates it is outside of our world, and its technical constitution is inaccessible to us.

It's purely a religious question. When humanity invented the wheel, religion described the world as a giant wheel rotating in cycles. When humanity invented books, religion described the world as a book, and God as its writer. When humanity invented complex mechanisms, religion described the world as a giant mechanism and God as a watchmaker. Then computers were invented, and you can guess what happened next.

refulgentis

A little too freshman's-first-bong-hit for me. There are, of course, substantial differences between video and reality.

Let's steel-man it: you mean 3D VR. Let's stipulate there's a headset today that renders 3D visually indistinguishable from reality. We're still short the other four senses.

Much like faith, there's always a way to sort of escape the traps here and say "can you PROVE this is base reality"

The general technical argument against "brain in a vat being stimulated" would be the computational expense of doing so, but you can also write that off with the equivalent of foveated rendering, applied to all senses and entities.

SXX

Actually it was already done by sentdex with GAN Theft Auto:

https://youtu.be/udPY5rQVoW0

To an extent...

PS: Video is 2 years old, but still really impressive.

justanotherjoe

That theory was never meant to be so airtight such that it 'needs' to be refuted.

aliljet

I've been following this space very closely, and the killer feature would be the ability to generate these full-featured videos for longer than a few seconds with consistently shaped "characters" (e.g., flowers, grass, houses, cars, actors, etc.). Right now, it's not clear to me that this achieves that objective. This feels like it could be great for creating short GIFs, but at what cost?

To be clear, this remains wicked, wicked, wicked exciting.


Stable Video Diffusion - Hacker News