
How Imagen Works



·June 23, 2022


> The central intuition in using T5 is that extremely large language models, by virtue of their sheer size alone, may still learn useful representations despite the fact that they are not explicitly trained with any text/image task in mind. [...] Therefore, the central question being addressed by this choice is whether or not a massive language model trained on a massive dataset independent of the task of image generation is a worthwhile trade-off for a non-specialized text encoder. The Imagen authors bet on the side of the large language model, and it is a bet that seems to pay off well.

The way out of this dilemma is to fine-tune T5 on the caption dataset instead of keeping it frozen. The paper notes that they don't do fine-tuning, but does not provide any ablation or other justification. I wonder if it would help or not.




> is trained on hundreds of millions of images and their associated captions

So how do you get access to hundreds of millions of images and use them to create derivative works? Did they get consent from millions of authors?

Or is something like that only available to the rich with access to lawyers on tap?

I mean I can imagine if a nobody wanted to do something like this, they'd get bankrupted by having to deal with all the photographers / artists spotting a tiny sliver of their art in the image produced by the model.

Furthermore, would something like this work with music? For instance, train the model on all Spotify songs and then generate songs based on "Get me a Bach symphony played on sticks with someone rapping like Dr Dre with a lisp." Or does the music industry have enough money to bully anyone into not doing that?


There are open datasets with that many image-text pairs. E.g. There is even a dataset with 5 billion image-text pairs if you're feeling adventurous:


I didn't know about this! Thank you


Presumably Google's terms of service or fair use laws. The real restriction is that, even if you had the dataset, training costs tens of thousands of dollars. Only corporations can really afford to train these things.

Regarding music - audio generation with Diffusion Models (the main component of Imagen and DALL-E 2) has been done, but not sure about music specifically. We will definitely reach the point where most e.g. pop beats will be able to be made by AI relatively soon.

All a producer has to do is generate 100 beats and select the one s/he likes, potentially interpolate between 2 or finetune it.
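To make the "interpolate between 2" step concrete, here is a minimal sketch. It assumes each candidate beat comes from a latent noise vector that the model decodes, which is an assumption for illustration and not something stated above. The `slerp` helper and the 8-dimensional latents are hypothetical; spherical interpolation is a common choice for Gaussian latents, but plain linear interpolation is also used.

```python
import math
import random

def slerp(a, b, t):
    """Spherically interpolate between equal-length vectors a and b (0 <= t <= 1)."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    cos_omega = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
    cos_omega = max(-1.0, min(1.0, cos_omega))  # guard against rounding error
    omega = math.acos(cos_omega)
    if omega < 1e-6:
        # Nearly parallel vectors: fall back to linear interpolation.
        return [(1 - t) * x + t * y for x, y in zip(a, b)]
    s = math.sin(omega)
    weight_a = math.sin((1 - t) * omega) / s
    weight_b = math.sin(t * omega) / s
    return [weight_a * x + weight_b * y for x, y in zip(a, b)]

# Two hypothetical latents standing in for the two beats the producer liked.
rng = random.Random(0)
latent_a = [rng.gauss(0.0, 1.0) for _ in range(8)]
latent_b = [rng.gauss(0.0, 1.0) for _ in range(8)]

# The halfway latent would then be decoded by the (not shown) generative model.
halfway = slerp(latent_a, latent_b, 0.5)
```

Sweeping `t` from 0 to 1 would give a series of latents that move gradually from one beat to the other.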


This is a real issue, but it's solvable with work.

It's claimed that ML models' output isn't copyrightable because it's fair use, but that's hard to believe; a large model can easily memorise and output exactly one of its inputs again. This is easier to see with text, where GPT and Copilot both do it, but images can do it too.

> So how do you get access to hundreds of millions of images and use them to create derivative works? Did they get consent from millions of authors?

Build the model out of Creative Commons images only. There's a lot of 'em and it's good enough. You may need to exclude CC-BY since they currently can't follow the attribution requirement.

> Or is something like that only available to the rich with access to lawyers on tap?

More likely companies willing to license a stock photography database.


I've seen an image generated by AI contain an "Alamy" watermark before.


Is there a compare and contrast between Imagen and Parti anywhere? I realize the paper came out yesterday, but maybe other people remember what "autoregressive" means better than I do.


Upon first inspection, Parti is not as good. This is perhaps unsurprising: for the DALL-E 2 prior, both autoregressive and diffusion models were tested, and the diffusion model performed better.
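As a reminder of what "autoregressive" means: the output is produced one token at a time, and each new token is sampled conditioned on everything generated so far. Below is a minimal toy sketch of that loop. The `next_token_probs` table is made up and stands in for a trained model; it uses words for readability, whereas a model like Parti would be predicting image tokens.

```python
import random

def next_token_probs(prefix):
    """Toy stand-in for a trained model: p(next token | all tokens so far)."""
    last = prefix[-1]
    if last == "<start>":
        return {"a": 1.0}
    if last == "a":
        # Same last token, different continuation depending on earlier context.
        if "rides" in prefix:
            return {"bicycle": 0.5, "skateboard": 0.5}
        return {"cat": 0.5, "dog": 0.5}
    if last in ("cat", "dog"):
        return {"rides": 1.0}
    if last == "rides":
        return {"a": 1.0}
    return {"<end>": 1.0}

def sample_autoregressively(rng, max_len=10):
    """Generate tokens one by one, feeding the growing prefix back in."""
    tokens = ["<start>"]
    while tokens[-1] != "<end>" and len(tokens) < max_len:
        probs = next_token_probs(tokens)
        choices, weights = zip(*probs.items())
        tokens.append(rng.choices(choices, weights=weights)[0])
    return [t for t in tokens if t not in ("<start>", "<end>")]

sentence = sample_autoregressively(random.Random(0))
print(" ".join(sentence))
```

A diffusion model, by contrast, starts from noise for the whole output and refines all of it over many denoising steps instead of committing to one token at a time.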


I have shown imagen (and dalle2) to a number of people now (non-tech, just everyday friends, family, co-workers) and I have been pretty stunned by the response I get from most people:

"Meh, that's kinda cool? I guess?" or "What am I looking at?"..."Ok? So a computer made it? That seems neat"

To me I am still trying to get my jaw off the floor from 2 months ago. But the responses have been so muted and shoulder shrugging that I think either I am missing something or they are missing something. Even really drilling in, practically shaking them "DO YOU NOT UNDERSTAND THAT THIS IS AN ORIGINAL IMAGE CONSTRUCTED ENTIRELY BY AN AI?!?!" and people just seem to see it as a party trick at best.


I think I can explain this: for most people the whole world is basically magic anyway. They don’t understand any of the details about how any digital tech works, so they have no framework for which things are impressive and which things are not. They just know that computers can do a great many things that they know nothing about. “Oh I can bank online? Ok.” “Oh, I can have the computer write my book report for me? Ok.” “Oh, this McDonalds is fully staffed by sentient robots? Ok.”


A pretty common pattern I've witnessed among non-technical people (even tech-savvy people with no CS background do this) is assuming that a feature that is in reality quite difficult to implement won't take much effort, and vice versa.






I think that hits home.

A lot of people would just answer something to the likes of "Well, they made The Matrix with a computer 20 years ago", and technically that's just as true.

From their remote viewpoint on what's happening in IT, the rest is an implementation detail to them.


This is the other side of the classic XKCD "Tasks".

A non-technical person in 2014 (when the above was originally published) would likely have the same conception of the difficulty of recognizing a bird from an image as they would in 2022, even though the task itself has gone from near-insurmountable to off-the-shelf-library in eight years.

Even as Imagen and Dall-E 2 amaze us today, these feats will likely be commonplace in a few years. The non-technical may have only a vague sense that their new TikTok filter is doing something that was impossible only a few years prior.


Exactly and I was thinking of that XKCD. Very much case in point, I have the Merlin Bird ID app which can determine species from ridiculously blurry photos and can also identify hundreds of birds from their calls alone in noisy environments. In 2014 I would have sworn this would be impossible.


The tooltip you get when you hover your cursor over the comic:

"In the 60s, Marvin Minsky assigned a couple of undergrads to spend the summer programming a computer to use a camera to identify objects in a scene. He figured they'd have the problem solved by the end of the summer. Half a century later, we're still working on it."

I'm working with his son Henry Minsky and other great people at Leela AI on that same old problem, applying hybrid symbolic-connectionist constructivist AI by combining neat neural networks with scruffy symbolic logic to understand video, and it's mind boggling what is possible now:

>Our AI system, Leela, is motivated by intrinsic curiosity. Leela creates theories about cause and effect in her world, and then conducts experiments to test these theories. Leela can connect all her knowledge and use this network to make plans, reason about goals, and communicate using grounded natural language.

>Leela has at her core a hybrid symbolic-connectionist network. This means that she uses a dynamic combination of artificial neural networks and symbol networks to learn. Hybrid networks open the door to AI agents that can build their own abstractions on the fly, while still taking full advantage of the power of deep learning.

>Neats and scruffies: Neat and scruffy are two contrasting approaches to artificial intelligence (AI) research. The distinction was made in the 70s and was a subject of discussion until the middle 80s. In the 1990s and 21st century AI research adopted "neat" approaches almost exclusively and these have proven to be the most successful.

>"Neats" use algorithms based on formal paradigms such as logic, mathematical optimization or neural networks. Neat researchers and analysts have expressed the hope that a single formal paradigm can be extended and improved to achieve general intelligence and superintelligence.

>"Scruffies" use any number of different algorithms and methods to achieve intelligent behavior. Scruffy programs may require large amounts of hand coding or knowledge engineering. Scruffies have argued that the general intelligence can only be implemented by solving a large number of essentially unrelated problems, and that there is no magic bullet that will allow programs to develop general intelligence autonomously.

>The neat approach is similar to physics, in that it uses simple mathematical models as its foundation. The scruffy approach is more like biology, where much of the work involves studying and categorizing diverse phenomena.

We're looking for talented engineers and designers to help, including neats and scruffies working together!


That is exactly what Will Wright (the creator of SimCity and The Sims, and Robot Wars / Battle Bots contestant) was getting at when we made these one-minute robot reality videos about "Empathy" and "Servitude".

His idea was to probe just how much random people on the street (or in a diner) would believe about autonomous intelligent robots operating in the real world.

Of course we were actually hiding behind the scenes tele-operating the robots through hidden cameras and a wireless web interface, listening to what the people said and making the robots respond with a voice synthesizer and sound effects, clicking on pre-written phrases and typing ad-libbed responses.

Empathy (a broken down robot begs for help from passers by on the streets of Oakland):

Servitude (a robot waiter takes orders and serves food in a diner in Oakland, making stupid mistakes and asking for a good review):

All his robots aren't as harmless, non-violent, polite, and obsequious as those two. Here's an old interview with Will at Robot Wars 1997:

Here is Super ChiaBot and her MiniBots, created by Will and his daughter Cassidy, getting its leaves shredded and body slammed at BattleBots:

Here's a more recent video of Will throwing a tantrum about the failure of SimSandwich, destroying his old creations because they're pixely and poorly rendered, then complaining about how those jerks at EA hate him:


"Oh, they have the internet on computers now!" Homer J Simpson.


I think if you've been paying attention to the space, this generation of image diffusion is shocking in how quickly it has improved on what we had a year ago.

But if you've never considered that a computer can produce an original image, this is just a new thing computers can do. OTOH I think it's also a lack of imagination about how useful this is. So far the output has been kind of random, so it seems a little gimmicky. Already "Parti" has gotten much closer to allowing a user to describe exactly what they want in the image, and as people start to see the use cases for them personally, it will hit them that they no longer have to hire someone; they can just type a request into a box.


You can just type a request in the box if you don't particularly care what the result looks like and also don't care that some of the features might be copyrighted (since large models are quite capable of memorizing their training data.)

Asking for two different images in a series that have similar "art styles" is going to be enough work to still need a specialist aka an artist; it'll be most useful in cases you never would've bothered finding one before.


> Asking for two different images in a series that have similar "art styles" is going to be enough work to still need a specialist aka an artist

Running a separate style transfer network on the generated images is currently possible, although won't achieve the best possible results.

I wouldn't be surprised in the near future to see generation models that can take a text prompt and an image to mimic the style of, which could let it take style into account when generating the image rather than at just the surface level.


I'm not sure there has been a period of more rapid development in DL than Diffusion Models (maybe transformers?). The next few years will be really interesting.


It's because people have been able to do this for years now, and so have you. You can try right now. Go to Google, type "cat on a bicycle" and hit image search. TADA, computer-made images of cats on bicycles appear! Where's the magic in that?


Yeah, about that. Ask it to draw you a fast inverse square root.


I love this comment, and I'd love to see an AI conjecture an explanation as to why I love it...


People don't care because all their text-to-image needs are well covered by Google Images.


Perhaps it's the combination of AI being so overhyped in the general public plus media that's already inundated with CGI, that it just doesn't blow them away?


I've made perhaps overly absolutist statements like "don't you see! this kills artists jobs!" and it was shrugged off as if I was insane. I probably could've phrased it differently, but to me this is game changing in several fields. Granted, it will open up a new field of "generative artists" but, having played with these things, this is a pretty trivial job, and their training nets are only going to get better.


I’ve had a lot of fun playing with Disco Diffusion prompts, but I agree that the people excited about “a generation of prompt artists” are a bit misguided. Soon an AI will emerge that can come up with “better” prompts than you, and the “art” of creating prompts will have a lower skill ceiling.


Like a neural network just for making prompts that result in aesthetically pleasing Imagen images? And then maybe we can come up with a neural net that can decide which pictures are good and which aren't. Then we can just have robots making art for the sake of consumption solely by robots.


The GPT models are actually pretty good at making detailed image generation prompts if you ask them to describe in detail the general idea you want.




To me, it paves the way for creative prototyping. I don't see this as a zero-sum game between artists and AI. Instead, I could see artists using this for some serious time saving, and leveraging that extra time and energy for creating better results.


It could also be used for more nefarious reasons like disinformation campaigns though... it will be interesting to see what the next few years have in store


You don't need good-looking pictures for propaganda. Old people (the main targets) believe literally anything they see on Facebook, especially if it confirms their priors aka fits their worldview, and prefer it to look bad because that's more authentic. For anyone else, the point is to make them disbelieve everything, not to believe you specifically.


Over a decade ago, Will Wright (of SimCity fame) faked conversational AGI robots in the streets and restaurants of Oakland. It consistently took people 2.4 seconds to go from “Oh look. The robots have arrived.” to “And, I’ll have fries with that.”

Hollywood and the media have taught the public that tech is literally magic and can do literally anything. “Anything” is expected and pedestrian.


I often think a similar thing about aliens. That is, instead of the panicking and hysteria or whatever that fiction imagines might accompany the discovery of aliens I fully expect that people will mostly go "Oh, neat. Aliens." And go on with their lives.


Well, I'm still in awe that I have a bunch of walls around me and can cover my body with clothes, or that I'm still alive after all this time, and that I can even rest most of the day and not spend body energy running after or from animals. Amazing stuff.

A program that transforms text to an image? Huh.


Is this by a person that knows or is guessing?


The paper is very well explained and, reading this post, they seem to mostly make its content accessible to non-domain experts.


The important part seems to be the diffusion model.

Explanation linked from same page:
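Separately from that link, the forward (noising) half of a diffusion model is simple enough to sketch in a few lines. This is a minimal sketch assuming the standard DDPM formulation with a linear beta schedule; the schedule values and the function names `alpha_bars` and `add_noise` are illustrative assumptions, not details taken from the Imagen paper.

```python
import math
import random

def alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule."""
    values, product = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        product *= 1.0 - beta
        values.append(product)
    return values

def add_noise(x0, alpha_bar_t, rng):
    """Sample x_t ~ N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I)."""
    signal_scale = math.sqrt(alpha_bar_t)
    noise_scale = math.sqrt(1.0 - alpha_bar_t)
    return [signal_scale * x + noise_scale * rng.gauss(0.0, 1.0) for x in x0]

schedule = alpha_bars()
rng = random.Random(0)
clean = [0.5, -0.25, 1.0, 0.0]  # stand-in for a few pixel values

slightly_noisy = add_noise(clean, schedule[10], rng)   # mostly signal
almost_noise = add_noise(clean, schedule[-1], rng)     # almost pure Gaussian noise
```

The generative part is the reverse direction: a network is trained to predict the noise that was added at each step, so that sampling can start from pure noise and denoise step by step, with the text embedding as conditioning.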


I guess he read the research paper.


Google published these implementation details


I wonder how developers can monetise this? What use cases does it have?




> Imagen, released just last month, can generate high-quality, high-resolution images given only a description of a scene

“Released”? What? Papers are published. Websites are published. Tools are “released.”

Where has Imagen been released?


This implementation popped up on hacker news not too long ago. I got it working on Colab first, and then my own GPU at home. But just barely. Need more memory :)


The value is in the data and the trained weights; the implementation is not where the bottleneck is in terms of reproducing those models.

Still great work from the author though, but we most definitely cannot say that imagen is released.


Are there any large publicly available models, ready to fine tune and deploy, that were trained on massive data sets?

I really want to build services with these.




Wait, so I can try this on Colab right now?


No. Something that's been causing a lotta confusion in AI art is that people stand up quick implementations generally matching the general description in the paper, but they're not really investing in training them. Then people see "imagen-pytorch" on GitHub and get confused, and think it's either Imagen itself or a suitable replica of it.

There's like 3 projects named DallE, and then the 2 real DallEs...frustrating.


Super interesting


If Google has something similar or better it definitely makes it look like OpenAI is wasting its time. None of this relates to AGI.


I don't think anyone is saying that humanity is close to AGI, but check out DeepMind's Gato work for a more well-rounded agent:


I think we're past a certain threshold, maybe not AGI but some definite qualitative change is happening.


I mean DALL-E 2 was the first time my jaw really hit the floor, although in fairness GPT-3 probably should've done that, but it's easier to do with images.

And then for this to drop just a month later? Insane. It makes you wonder if they're actually releasing cutting edge, or Google decided to write this paper just because of the publication of DALL-E 2. Maybe they've had this model in the bag for a year.


Lots of people are saying that. I am saying that. OpenAI has it as a foundational mid-term goal.


What's the highest price paid for an AI-generated image NFT?


Unfortunately it seems like it's greater than 0...

If we ignore the procedurally generated NFTs created from mixing and matching various assets and go with ones where AI is the selling point, we're left with a few notable ones: Sophia, a robot w/ some low-level AI sold a single piece for 689k USD [^1]. Botto, a VQGAN-based algorithm sold a single piece for 430k USD and has sold multiple other pieces for tens to hundreds of thousands of dollars. Slightly more modest are some other projects like Metascapes [^3] and Eponym [^4], which produced some really tedious pieces that managed to sell for 3.5k USD and 10k USD respectively. That said, the Eponym piece seems to be some sort of self promotion, so maybe we can say that the actual prices for these collections are somewhere in the fraction of an ETH range if they can be sold at all.

Honestly, only the Botto piece is remotely interesting to look at, and even then it has the blurred, "dreamy" aesthetic that seems to be in so many different AI painting approaches (style transfer, VQGANs, DALL-E, maybe others I'm not aware of). I think it was more interesting back when we could pretend that these were the electric sheep at the fringes of some deep-sleeping latent intelligent potential, but now they just feel kinda arbitrary and lacking deliberation. I absolutely love the field and think these researchers have done tremendous work, but I feel as though all the lay news attention is on the art, and not on the algorithm that generated it. The fascinating thing is that we have a machine that can produce something novel from words or basic ideas and that the output's content retains these ideas, not so much that the art itself has that much compositional or stylistic merit.






AI is now creative