&gt; The central intuition in using T5 is that extremely large language models, by virtue of their sheer size alone, may still learn useful representations despite the fact that they are not explicitly trained with any text&#x2F;image task in mind. [...] Therefore, the central question being addressed by this choice is whether or not a massive language model trained on a massive dataset independent of the task of image generation is a worthwhile trade-off for a non-specialized text encoder. The Imagen authors bet on the side of the large language model, and it is a bet that seems to pay off well.
The way out of this dilemma is to fine-tune T5 on the caption dataset instead of keeping it frozen. The paper notes that they don&#x27;t do fine-tuning, but does not provide any ablation or other justification. I wonder if it would help or not.

I didn&#x27;t know about this! Thank you

There are open datasets with that many image-text pairs. E.g. <a href="https:&#x2F;&#x2F;laion.ai&#x2F;blog&#x2F;laion-400-open-dataset&#x2F;" rel="nofollow">https:&#x2F;&#x2F;laion.ai&#x2F;blog&#x2F;laion-400-open-dataset&#x2F;</a> There is even a dataset with 5 billion image-text pairs if you&#x27;re feeling adventurous: <a href="https:&#x2F;&#x2F;laion.ai&#x2F;blog&#x2F;laion-5b&#x2F;" rel="nofollow">https:&#x2F;&#x2F;laion.ai&#x2F;blog&#x2F;laion-5b&#x2F;</a>

Presumably Google&#x27;s terms of service or fair use laws. The real restriction is that, even if you had the dataset, training costs tens of thousands of dollars. Only corporations can really afford to train these things.
Regarding music - audio generation with Diffusion Models (the main component of Imagen and DALL-E 2) has been done, but not sure about music specifically. We will definitely reach the point where most e.g. pop beats will be able to be made by AI relatively soon.
All a producer has to do is generate 100 beats and select the one s&#x2F;he likes, potentially interpolate between 2 or fine-tune it.
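For what it is worth, the interpolation step could be sketched roughly as below. This is only an illustration under assumptions: the two vectors stand in for the latent or noise vectors behind two generated samples, `slerp` is a hypothetical helper name, and spherical interpolation is a common choice for Gaussian latents rather than something Imagen or DALL-E 2 is documented to expose.

```python
import math


def slerp(a, b, t):
    """Spherically interpolate between vectors a and b for t in [0, 1].

    a and b are hypothetical latent vectors (plain lists of floats).
    """
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    # Cosine of the angle between the two vectors, clamped for safety.
    cos_omega = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
    cos_omega = max(-1.0, min(1.0, cos_omega))
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)
    if sin_omega < 1e-6:
        # Nearly parallel or anti-parallel: fall back to linear blending.
        return [(1.0 - t) * x + t * y for x, y in zip(a, b)]
    weight_a = math.sin((1.0 - t) * omega) / sin_omega
    weight_b = math.sin(t * omega) / sin_omega
    return [weight_a * x + weight_b * y for x, y in zip(a, b)]


# Blend two toy latents halfway; a real model would decode the result.
halfway = slerp([1.0, 0.0], [0.0, 1.0], 0.5)
```

Sweeping `t` from 0 to 1 and decoding each blended vector would give a series of in-between samples to pick from.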

This is a real issue, but it&#x27;s solvable with work.
It&#x27;s claimed that ML models&#x27; output isn&#x27;t copyrightable because it&#x27;s fair use, but that&#x27;s hard to believe; a large model can easily memorise and output exactly one of its inputs again. This is easier to see with text, where GPT and Copilot both do it, but images can do it too.
&gt; So how do you get access to hundreds of millions of images and use them to create derivative works? Did they get consent from millions of authors?
Build the model out of Creative Commons images only. There&#x27;s a lot of &#x27;em and it&#x27;s good enough. You may need to exclude CC-BY since they currently can&#x27;t follow the attribution requirement.
&gt; Or is something like that only available to the rich with access to lawyers on tap?
More likely companies willing to license a stock photography database.

I&#x27;ve seen an image generated by AI contain an &quot;Alamy&quot; watermark before.

&gt; is trained on hundreds of millions of images and their associated captions
So how do you get access to hundreds of millions of images and use them to create derivative works? Did they get consent from millions of authors?
Or is something like that only available to the rich with access to lawyers on tap?
I mean I can imagine if a nobody wanted to do something like this, they&#x27;d get bankrupted by having to deal with all the photographers &#x2F; artists spotting a tiny sliver of their art in the image produced by the model.
Furthermore, would something like this work with music? For instance, train the model on all Spotify songs and then generate songs based on &quot;Get me a Bach symphony played on sticks with someone rapping like Dr Dre with a lisp.&quot;
Or does the music industry have enough money to bully anyone into not doing that?

Upon first inspection, Parti is not as good. This is perhaps unsurprising - for the prior in DALL-E 2, both autoregressive and diffusion models were tested, and the diffusion model outperformed the autoregressive one.
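As a rough illustration of what "autoregressive" means here: the output is produced one token at a time, and each new token is sampled conditioned on everything generated so far. The sketch below is a toy under stated assumptions, not Parti itself; `toy_next_token_probs` is a hypothetical stand-in for a trained network, and the integer tokens stand in for discrete image tokens.

```python
import random


def toy_next_token_probs(prefix, vocab_size):
    """Hypothetical stand-in for a trained model: returns a probability
    for every token given the tokens generated so far."""
    last = prefix[-1] if prefix else 0
    favoured = (last + 1) % vocab_size
    # Give the "next" token extra weight so the prefix visibly matters.
    weights = [5.0 if tok == favoured else 1.0 for tok in range(vocab_size)]
    total = sum(weights)
    return [w / total for w in weights]


def sample_autoregressive(length, vocab_size, rng):
    """Generate tokens one at a time, each conditioned on the prefix."""
    tokens = []
    for _ in range(length):
        probs = toy_next_token_probs(tokens, vocab_size)
        next_token = rng.choices(range(vocab_size), weights=probs, k=1)[0]
        tokens.append(next_token)
    return tokens


sequence = sample_autoregressive(length=8, vocab_size=4, rng=random.Random(0))
```

A diffusion model, by contrast, starts from noise and refines the whole image over many denoising steps instead of committing to one token at a time.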

Is there a compare and contrast between Imagen and Parti anywhere? I realize the paper came out yesterday, but maybe other people remember what &quot;autoregressive&quot; means better than I do.

A pretty common pattern I&#x27;ve witnessed among non-technical people (even people who are tech savvy but have no CS background do this) is assuming that a feature that is in reality quite difficult to implement won&#x27;t take much effort, and vice versa.

I think that hits home.
A lot of people would just answer something along the lines of &quot;Well, they made The Matrix with a computer 20 years ago&quot;, and technically that&#x27;s just as true.
From their remote viewpoint on what&#x27;s happening in IT, the rest is an implementation detail to them.

This is the other side of the classic XKCD &quot;Tasks&quot; (<a href="https:&#x2F;&#x2F;xkcd.com&#x2F;1425&#x2F;" rel="nofollow">https:&#x2F;&#x2F;xkcd.com&#x2F;1425&#x2F;</a>).
A non-technical person in 2014 (when the above was originally published) would likely have the same conception of the difficulty of recognizing a bird from an image as they would in 2022, even though the task itself has gone from near-insurmountable to off-the-shelf-library in eight years.
Even as Imagen and Dall-E 2 amaze us today, these feats will likely be commonplace in a few years. The non-technical may have only a vague sense that their new TikTok filter is doing something that was impossible only a few years prior.

That is exactly what Will Wright (the creator of SimCity and The Sims, and Robot Wars &#x2F; Battle Bots contestant) was getting at when we made these one-minute robot reality videos about &quot;Empathy&quot; and &quot;Servitude&quot;.
His idea was to probe just how much random people on the street (or in a diner) would believe about autonomous intelligent robots operating in the real world.
Of course we were actually hiding behind the scenes tele-operating the robots through hidden cameras and a wireless web interface, listening to what the people said and making the robots respond with a voice synthesizer and sound effects, clicking on pre-written phrases and typing ad-libbed responses.
Empathy (a broken down robot begs for help from passers by on the streets of Oakland): <a href="https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=KXrbqXPnHvE" rel="nofollow">https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=KXrbqXPnHvE</a>
Servitude (a robot waiter takes orders and serves food in a diner in Oakland, making stupid mistakes and asking for a good review): <a href="https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=NXsUetUzXlg" rel="nofollow">https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=NXsUetUzXlg</a>
Not all of his robots are as harmless, non-violent, polite, and obsequious as those two.
Here&#x27;s an old interview with Will at Robot Wars 1997: <a href="https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=5nmbs0WqDQM" rel="nofollow">https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=5nmbs0WqDQM</a>
Here is Super ChiaBot and her MiniBots, created by Will and his daughter Cassidy, getting its leaves shredded and body slammed at BattleBots: <a href="https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=DrArvRG2yQA" rel="nofollow">https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=DrArvRG2yQA</a>
Here&#x27;s a more recent video of Will throwing a tantrum about the failure of SimSandwich, destroying his old creations because they&#x27;re pixely and poorly rendered, then complaining about how those jerks at EA hate him: <a href="https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=i-7F7s46-9A" rel="nofollow">https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=i-7F7s46-9A</a>

&quot;Oh, they have the internet on computers now!&quot; Homer J Simpson.

I think I can explain this: for most people the whole world is basically magic anyway. They don’t understand any of the details about how any digital tech works, so they have no framework for which things are impressive and which things are not. They just know that computers can do a great many things that they know nothing about. “Oh I can bank online? Ok.” “Oh, I can have the computer write my book report for me? Ok.” “Oh, this McDonalds is fully staffed by sentient robots? Ok.”

You can just type a request in the box if you don&#x27;t particularly care what the result looks like and also don&#x27;t care that some of the features might be copyrighted (since large models are quite capable of memorizing their training data).
Asking for two different images in a series that have similar &quot;art styles&quot; is going to be enough work to still need a specialist aka an artist; it&#x27;ll be most useful in cases you never would&#x27;ve bothered finding one before.

I&#x27;m not sure there has been a period of more rapid development in DL than Diffusion Models (maybe transformers?). The next few years will be really interesting.

I think if you&#x27;ve been paying attention to the space, this generation of image diffusion is shocking in how quickly it has improved on what we had a year ago.
But if you&#x27;ve never considered that a computer can produce an original image, this is just a new thing computers can do. OTOH I think it&#x27;s also a lack of imagination in how useful this is; so far the output has been kind of random, so it seems a little gimmicky. Already &quot;Parti&quot; has gotten much closer to allowing a user to describe exactly what they want in the image, and as people start to see the use cases for them personally, it will hit them that they no longer have to hire someone, they can just type a request into a box.

I love this comment, and I&#x27;d love to see an AI conjecture an explanation as to why I love it...

It&#x27;s because people have been able to do this for years now, and so have you. You can try right now. Go to Google, type &quot;cat on a bicycle&quot; and hit image search. TADA, computer made cats on bicycles images appear! Where&#x27;s the magic in that?
&gt; THIS IS AN ORIGINAL IMAGE
Yeah, about that. Ask it to draw you a fast inverse square root.

People don&#x27;t care because all their text-to-image needs are well covered by Google Images.

Perhaps it&#x27;s the combination of AI being so overhyped in the general public plus media that&#x27;s already inundated with CGI, that it just doesn&#x27;t blow them away?

I’ve had a lot of fun playing with Disco Diffusion prompts, but I agree that the people excited about “a generation of prompt artists” are a bit misguided. Soon an AI will emerge that can come up with “better” prompts than you, and the “art” of creating prompts will have a lower skill ceiling.

To me, it paves the way for creative prototyping. I don&#x27;t see this as a zero-sum game between artists and AI. Instead, I could see artists using this for some serious time saving, and leveraging that extra time and energy for creating better results.

It could also be used for more nefarious reasons like disinformation campaigns though... it will be interesting to see what the next few years have in store

I&#x27;ve made perhaps overly absolutist statements like &quot;don&#x27;t you see! this kills artists&#x27; jobs!&quot; and it was shrugged off as if I was insane. I probably could&#x27;ve phrased it differently, but to me this is game changing in several fields. Granted, it will open up a new field of &quot;generative artists&quot; but, having played with these things, this is a pretty trivial job, and their training nets are only going to get better.

I often think a similar thing about aliens. That is, instead of the panicking and hysteria or whatever that fiction imagines might accompany the discovery of aliens I fully expect that people will mostly go &quot;Oh, neat. Aliens.&quot; And go on with their lives.

Over a decade ago, Will Wright (of SimCity fame) faked conversational AGI robots in the streets and restaurants of Oakland. It consistently took people 2.4 seconds to go from “Oh look. The robots have arrived.” to “And, I’ll have fries with that.”
Hollywood and the media have taught the public that tech is literally magic and can do literally anything. “Anything” is expected and pedestrian.

Well, I&#x27;m still in awe that I have a bunch of walls around me and can cover my body with clothes, or that I&#x27;m still alive after all this time, and that I can even rest most of the day and not spend body energy running after or from animals. Amazing stuff.
A program that transforms text to an image? Huh.

I have shown Imagen (and DALL-E 2) to a number of people now (non-tech, just everyday friends, family, co-workers) and I have been pretty stunned by the response I get from most people:
&quot;Meh, that&#x27;s kinda cool? I guess?&quot; or &quot;What am I looking at?&quot;...&quot;Ok? So a computer made it? That seems neat&quot;
Personally, I am still trying to get my jaw off the floor from 2 months ago. But the responses have been so muted and shoulder shrugging that I think either I am missing something or they are missing something. Even really drilling in, practically shaking them &quot;DO YOU NOT UNDERSTAND THAT THIS IS AN ORIGINAL IMAGE CONSTRUCTED ENTIRELY BY AN AI?!?!&quot; and people just seem to see it as a party trick at best.

The paper is very well explained and, reading this post, it seems to mostly make the paper&#x27;s content accessible to non-domain experts.

The important part seems to be the diffusion model.
Explanation linked from same page: <a href="https:&#x2F;&#x2F;www.assemblyai.com&#x2F;blog&#x2F;diffusion-models-for-machine-learning-introduction&#x2F;" rel="nofollow">https:&#x2F;&#x2F;www.assemblyai.com&#x2F;blog&#x2F;diffusion-models-for-machine...</a>
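For anyone who wants the gist before reading the link, the forward (noising) half of a diffusion model can be sketched in a few lines. This is a minimal DDPM-style sketch under assumptions: the linear beta schedule and its endpoints are common textbook defaults and not necessarily what Imagen uses, and the function names are made up for illustration. The learned part, a network trained to undo this noise step by step, is left out entirely.

```python
import math
import random


def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances on a linear schedule (assumed values)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar[t] = product of (1 - beta[s]) for s up to and including t."""
    alpha_bars, product = [], 1.0
    for beta in betas:
        product *= 1.0 - beta
        alpha_bars.append(product)
    return alpha_bars


def add_noise(x0, t, alpha_bars, rng):
    """Sample x_t from q(x_t | x_0) in closed form:
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps."""
    signal_scale = math.sqrt(alpha_bars[t])
    noise_scale = math.sqrt(1.0 - alpha_bars[t])
    return [signal_scale * x + noise_scale * rng.gauss(0.0, 1.0) for x in x0]


alpha_bars = cumulative_alphas(linear_betas(1000))
pixels = [0.5, -0.25, 1.0]  # toy "image" of three values
slightly_noisy = add_noise(pixels, 10, alpha_bars, random.Random(0))
almost_pure_noise = add_noise(pixels, 999, alpha_bars, random.Random(0))
```

At small `t` the sample is still close to the original values; by the last step almost none of the original signal remains, which is why generation can start from pure noise and run the learned denoiser backwards.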

Google published these implementation details

Is this by a person that knows or is guessing?

I wonder how developers can monetise this?
What use cases does it have?

The value is in the data and the trained weights; the implementation is not where the bottleneck is in terms of reproducing those models.
Still great work from the author though, but we most definitely cannot say that Imagen is released.

Are there any large publicly available models, ready to fine tune and deploy, that were trained on massive data sets?
I really want to build services with these.

Wait, so I can try this on Colab right now?

This implementation popped up on Hacker News not too long ago. I got it working on Colab first, and then my own GPU at home. But just barely. Need more memory :)
<a href="https:&#x2F;&#x2F;github.com&#x2F;lucidrains&#x2F;imagen-pytorch" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;lucidrains&#x2F;imagen-pytorch</a>

&gt; Imagen, released just last month, can generate high-quality, high-resolution images given only a description of a scene
“Released”? What? Papers are published. Websites are published. Tools are “released.”
Where has Imagen been released?

I think we&#x27;re past a certain threshold, maybe not AGI but some definite qualitative change is happening.

Lots of people are saying that. I am saying that. OpenAI has it as a foundational mid-term goal.

I don&#x27;t think anyone is saying that humanity is close to AGI, but check out DeepMind&#x27;s Gato work for a more well-rounded agent: <a href="https:&#x2F;&#x2F;www.deepmind.com&#x2F;publications&#x2F;a-generalist-agent" rel="nofollow">https:&#x2F;&#x2F;www.deepmind.com&#x2F;publications&#x2F;a-generalist-agent</a>

If Google has something similar or better it definitely makes it look like OpenAI is wasting its time. None of this relates to AGI.

Unfortunately it seems like it&#x27;s greater than 0...
If we ignore the procedurally generated NFTs created from mixing and matching various assets and go with ones where AI is the selling point, we&#x27;re left with a few notable ones: Sophia, a robot w&#x2F; some low-level AI, sold a single piece for 689k USD [^1]. Botto, a VQGAN-based algorithm, sold a single piece for 430k USD [^2] and has sold multiple other pieces for tens to hundreds of thousands of dollars. Slightly more modest are some other projects like Metascapes [^3] and Eponym [^4], which produced some really tedious pieces that managed to sell for 3.5k USD and 10k USD respectively. That said, the Eponym piece seems to be some sort of self promotion, so maybe we can say that the actual prices for these collections are somewhere in the fraction of an ETH range if they can be sold at all.
Honestly, only the Botto piece is remotely interesting to look at, and even then it has the blurred, &quot;dreamy&quot; aesthetic that seems to be in so many different AI painting approaches (style-transfer, VQGANs, DALL-E, maybe others I&#x27;m not aware of). I think it was more interesting back when we could pretend that these were the electric sheep at the fringes of some deep-sleeping latent intelligent potential but now they just feel kinda arbitrary and lacking deliberation. I absolutely love the field and think these researchers have done tremendous work, but I feel as though all the lay news attention is on the art, and not on the algorithm that generated it.
The fascinating thing is that we have a machine that can produce something novel from words or basic ideas and that the output&#x27;s content retains these ideas, not so much that the art itself has that much compositional or stylistic merit.
[^1]: <a href="https:&#x2F;&#x2F;niftygateway.com&#x2F;itemdetail&#x2F;primary&#x2F;0xbe60d0a37ebde6f4ad22ceb311a28b3c53efe4e5&#x2F;1" rel="nofollow">https:&#x2F;&#x2F;niftygateway.com&#x2F;itemdetail&#x2F;primary&#x2F;0xbe60d0a37ebde6...</a>
[^2]: <a href="https:&#x2F;&#x2F;superrare.com&#x2F;artwork-v2&#x2F;scene-precede-29922" rel="nofollow">https:&#x2F;&#x2F;superrare.com&#x2F;artwork-v2&#x2F;scene-precede-29922</a>
[^3]: <a href="https:&#x2F;&#x2F;opensea.io&#x2F;assets&#x2F;ethereum&#x2F;0x75d639e5e52b4ea5426f2fb46959b9c3099b729a&#x2F;1420" rel="nofollow">https:&#x2F;&#x2F;opensea.io&#x2F;assets&#x2F;ethereum&#x2F;0x75d639e5e52b4ea5426f2fb...</a>
[^4]: <a href="https:&#x2F;&#x2F;opensea.io&#x2F;assets&#x2F;ethereum&#x2F;0xaa20f900e24ca7ed897c44d92012158f436ef791&#x2F;591" rel="nofollow">https:&#x2F;&#x2F;opensea.io&#x2F;assets&#x2F;ethereum&#x2F;0xaa20f900e24ca7ed897c44d...</a>

What&#x27;s the highest price paid for an AI-generated image NFT?