The stagnation has been very curious. They are part of a large & generally competent org which has otherwise remained far ahead of the competition, as GPT-4 shows. Except... for DALL-E 2, which did not just stagnate for over a year (on top of its bizarre blind spots, like garbage anime generation) but actually seemed to get worse. They have an experimental model of some sort that some people have access to, but even that is nothing to write home about compared to the best models like Parti or eDiff-I.
I suspect that they now consider txt2img more of a curiosity. Sure, it's transformative; it's going to upend whole markets (and make some people a lot of money in the process) - but it's just producing images. Contrast that with LLMs, which have already proven generally applicable in a great many domains, and which, if you squint, are probably capturing the basic mechanisms of thinking. OpenAI lost the lead in txt2img, but GPT-4 is still way ahead of every other LLM. It makes sense for them to focus pretty much 100% on that.
I find it curious because (a) if they don't care about text2image, why launch it as a service to begin with? (b) if they don't care now, why keep it up and let it keep consuming resources, human & GPU? (c) if they do still care - because, as other models & services have demonstrated, there's a ton of interest in text2image - why not invest the relatively minor amount of resources it would take to keep it competitive (look how few people work at Midjourney, or are authors on imagegen papers)? It may have cost >$100m to make GPT-4, but making a decent imagegen model costs a lot less than that! (Even now, you could probably create a SOTA model for <$10m, especially if you have GPT-3/4 available for the text-encoding part.)
But launching it and then letting it stagnate indefinitely, getting worse every day relative to its increasingly popular competitors, seems like the worst of all worlds, and I can't see what the OpenAI strategy there is.
Maybe they keep it up just so that they have something in the txt2img space? It may not be the best, or even good, but you don't know that until you try it, and until then, it just enhances the value of the OpenAI platform. E.g. if you're building something backed by OpenAI LLMs and are thinking about future txt2img integration, the existence of DALL-E might stop you from shopping around for txt2img services in advance.
The way I see it, they don't need txt2img at this moment - GPT-4 ensures they're the #1 name both in the industry and in AI-related news stories. But that doesn't mean they won't come back to it. A couple of observations:
- OpenAI isn't a "release early, release often" shop. They might already be working on something, but they'll release it only when it is a qualitative improvement over everyone else (or at least over DALL-E).
- A bunch of hobbyists are doing all this work for free anyway. Stable Diffusion itself may not be SOTA, but the totality of hundreds of different fine-tunes on Civitai very much is. With all those models being shared in the open and being relatively easy/cheap to recreate, it would make sense for OpenAI to just stand by and watch, and only invest resources once the hobbyists hit a plateau.
- Looking at those Civitai models, it seems to me that OpenAI could beat txt2img SOTA easily, at any moment, by taking (or re-creating, depending on the license) the best five to ten SD derivatives and putting them behind GPT-4, or even GPT-3.5, fine-tuned to 1) choose the best SD derivative for the user's prompt, and 2) transform the user's prompt into a set of parameters (positive & negative prompts, diffuser algo, numeric params) crafted with the choice from 1) in mind. It's a black box. On the Internet, no one can tell you're an ensemble model.
- They could even be doing it as we speak - the addition of function calling is aligned with this direction, fine-tuning for good prompt generation is mostly a txt2txt exercise, and again, hobbyists around the world are busy building a high-quality, human-curated dataset of {what I want} x {model + positive prompt + negative prompt + diffuser + other params} -> {is this any good?}. If I were them, I'd just mine this and not say anything.
- Overall, I think that in the txt2img space, the hard part currently isn't the "img" part but the "txt" part. OpenAI has a huge advantage here, and as long as that's true, they're in a position to instantly overtake everyone else in this space. That is, they have an "ultimate attack" charged and ready, and are patiently waiting for a good moment to trigger it.
- Didn't they hint that GPT-4's successor will be multimodal? That could end up being their comeback to txt2img. And img2txt. And a bunch of other modalities.
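To make the ensemble-router idea concrete, here's a minimal sketch of such a dispatcher. Everything in it is my invention for illustration - the checkpoint descriptions, parameter names, prompt wording, and the stub standing in for a fine-tuned GPT are all assumptions, not any real OpenAI or Civitai API (the two checkpoint names are ones mentioned elsewhere in this thread):

```python
from dataclasses import dataclass
import json

# Hypothetical catalog of SD derivatives; descriptions are invented.
CHECKPOINTS = {
    "CyberRealistic": "photorealistic people and scenes",
    "Deliberate": "stylized illustration and concept art",
}

@dataclass
class DiffusionJob:
    checkpoint: str
    positive_prompt: str
    negative_prompt: str
    sampler: str
    steps: int
    cfg_scale: float

def route_prompt(user_prompt: str, llm) -> DiffusionJob:
    """Step 1: ask the LLM to pick the best checkpoint for this prompt.
    Step 2: ask it to expand the prompt into full generation settings,
    crafted with that checkpoint in mind. `llm` is any callable that
    takes an instruction string and returns a JSON string."""
    choice = json.loads(llm(
        f"Pick one checkpoint for: {user_prompt!r}. "
        f"Options: {list(CHECKPOINTS)}. Reply as JSON with key 'checkpoint'."))
    params = json.loads(llm(
        f"Expand {user_prompt!r} into generation settings for "
        f"{choice['checkpoint']}. Reply as JSON with keys positive_prompt, "
        f"negative_prompt, sampler, steps, cfg_scale."))
    return DiffusionJob(checkpoint=choice["checkpoint"], **params)

# Stub standing in for the fine-tuned GPT; a real system would call
# the chat API (with function calling) here instead.
def fake_llm(prompt: str) -> str:
    if "Pick one checkpoint" in prompt:
        return json.dumps({"checkpoint": "CyberRealistic"})
    return json.dumps({
        "positive_prompt": "haunted house at dusk, photo, cinematic lighting",
        "negative_prompt": "cartoon, lowres, watermark",
        "sampler": "DPM++ 2M", "steps": 30, "cfg_scale": 7.0,
    })

job = route_prompt("a spooky haunted house", fake_llm)
print(job.checkpoint, job.sampler)
```

From the user's side this is a single black box taking one prompt; the checkpoint selection and prompt engineering are invisible implementation details.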
EDIT: As if on cue, the very thing I was speculating about above is being discussed wrt. LLMs right now:
Why do they need to have something in text2image? It in no way builds lock-in to the API or anything, especially with how gimped it is.
1. Yes, they are. Look at the constant iterative rollouts of GPTs
2. Most of which is useless to them, not that they have made any use of it
3. the fact that it would be so easy to improve, and they haven't, only emphasizes my point.
4. sure, that could be useful. Except there's zero integration or mention. (They haven't even opened up the vision part of GPT-4 yet.)
5. the fact that it would be so easy to improve, and they haven't, only emphasizes my point.
6. why wait for GPT-5 possibly years from now?
> Why do they need to have something in text2image?
So they're "on the list". So whenever journalists and bloggers write articles about text2image, they're listed as a player in this space. For the vast majority of such articles, neither the authors nor the audience will be able to tell that OpenAI's offering is far behind and that they're basically keeping a token presence in the space.
At least that's my hypothesis. I'm neither a domain expert nor a business expert - I just feel that, for OpenAI, having laymen view them as an industry leader in AI in general is worth the price of keeping DALL-E available. In fact, as more and more users realize there are better models available elsewhere, that price goes down, while the effect on the lay audience stays the same.
(Note: the term "laymen", as I use it here, specifically includes most entrepreneurs, managers and investors, in tech or otherwise. If I'm being honest with myself, I belong to that category too; it's in fact this conversation and some recent threads that made me realize just how weak OpenAI is in the image generation space.)
> Look at the constant iterative rollouts of GPTs
You mean some unannounced ones, or the pinned models? Because AFAIK GPT-3.5 had two updates after release (the turbo model and the current one), and GPT-4 had one. I mean public releases; how often they updated GPT-4 back before it was public, e.g. when Microsoft was building Bing Chat, is not relevant in this context.
Also compare that with how, going by HN submissions alone, every other day someone releases some improved LLaMA-derived LLM.
> 2. Most of which is useless to them, not that they have made any use of it 3. the fact that it would be so easy to improve, and they haven't, only emphasizes my point. 4. sure, that could be useful. Except there's zero integration or mention. (...) 5. the fact that it would be so easy to improve, and they haven't, only emphasizes my point.
There's little for them to gain by openly using all that work now. At the moment, they can just keep an eye on what's posted to Civitai, paying particular attention to how different model derivatives respond to prompts (think e.g. CyberRealistic vs. Deliberate) and why, and build up a training corpus of prompts and settings, helpfully provided by the community, complete with quality ratings. They can do that using a small fraction of the resources they have available - so that when the time comes, they can use their full resources to quickly train and deploy a model that blows everyone else out of the water.
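The corpus-mining described above would yield records pairing user intent with the settings the community rated highly - which is what makes the fine-tuning a plain txt2txt exercise. A purely illustrative sketch, with every field name and value invented:

```python
import json

# One hypothetical mined record, following the shape
# {what I want} x {model + prompts + diffuser + params} -> {is this any good?}
record = {
    "intent": "photorealistic portrait, golden hour",   # what the user wanted
    "checkpoint": "CyberRealistic",                     # SD derivative used
    "positive_prompt": "portrait photo, golden hour, 85mm",
    "negative_prompt": "cartoon, deformed hands, watermark",
    "sampler": "DPM++ 2M",
    "steps": 28,
    "cfg_scale": 6.5,
    "community_rating": 4.6,                            # the quality label
}

# A fine-tuning pair maps the intent (input) to the settings that scored
# well (target) - text in, text out.
example = {
    "input": record["intent"],
    "target": json.dumps({k: record[k] for k in
                          ("checkpoint", "positive_prompt", "negative_prompt",
                           "sampler", "steps", "cfg_scale")}),
}
print(example["input"])
```

Filtering on `community_rating` before emitting pairs is how the crowd's curation effort becomes the supervision signal, at essentially zero labeling cost.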
Also, as an organization, they can focus only on so many things at a time. GPT-4 is buying them some space, and I believe they're currently focusing primarily on their cooperation with Microsoft, and/or other things involving LLMs. Given the relative usefulness and potential of LLMs vs. image generation, both short- and long-term, doing more than the bare minimum in image generation right now might be too much of a distraction for an organization this size.
> (They haven't even opened up the vision part of GPT-4 yet.)
They're in the lead. They're not in a hurry. They're likely giving Microsoft a head start.
> 6. why wait for GPT-5 possibly years from now?
Why do it earlier? What could they possibly gain by jumping back into the text2image space now? At this point, compared to LLMs, text2image seems neither profitable nor particularly relevant for x-risk, so whichever way you cut it, I can't see why they would want to prioritize it.
Nobody is able to use Parti or eDiff. Among the models you can actually use, the experimental DALL-E / Bing Image Creator is second only to Midjourney in my experience.
Parti/eDiff show that it is relatively easy to do much better than the experimental model, which presumably represents their best effort, never mind the hot garbage you see in the OP from DALL-E 2. And it's not a calculated degree of low quality enabled by those models being unreleased and having no competition, because competitors like Stable Diffusion or Midjourney are beating the heck out of DALL-E 2 in popular usage.
I haven't tried those two, but I'd be surprised if they were better than Stable Diffusion. Which is free, runnable (and trainable!) locally, and already has a large ecosystem of frontends, tweaks and customized models.
Believe me, I know all about SD's possible customization and tweaks.
I would still easily put both ahead of the base models. You won't match the quality of those models without fine-tuning. And when you do fine-tune, it'll be for a particular aesthetic, and you still won't match them in terms of prompt understanding and adherence.
I don't know, what I saw in there (particularly with the haunted house) was a far broader POTENTIAL RANGE of outputs. I get that they were cheesier outputs, but it seems to me that those outputs were just as capable of coming from the other 'AIs'… if you let them.
It's like each of these has a hidden giant pile of negative prompts, or additional positive prompts, that greatly narrow down the range of output. There are contexts where the Dall-E 'spoopy haunted house ooooo!' imagery would be exactly right… like 'show me halloweeny stock art'.
That haunted house prompt didn't explicitly SAY 'oh, also make it look like it's a photo out of a movie and make it look fantastic'. But something in the more 'competitive' AIs knew to go for that. So if you wanted to go for the spoopy cheesey 'collective unconscious' imagery, would you have to force the more sophisticated AIs to go against their hidden requirements?
Mind you if you added 'halloween postcard from out of a cheesey old store' and suddenly the other ones were doing that vibe six times better, I'd immediately concede they were in fact that much smarter. I've seen that before, too, in different Stable Diffusion models. I'm just saying that the consistency of output in the 'smarter' ones can also represent a thumb on the scale.
They've got to compete by looking sophisticated, so the 'Greg Rutkowskification' effect will kick in: you show off by picking a flashy style to depict rather than going for something equally valid, but less commercial.
It's not just about the haunted house. Just look closely at the DALL-E 2 living room pictures. None of it makes any sense. And we're not even talking about subtle details: all of the first three pictures have a central object that the eye should be drawn to that's just a total mess. (The table that's being subsumed by a bunch of melting brown chairs in the first one, the I-don't-even-know-what that seems to be the second picture, and the whatever-this-is on the blue carpet.)
OpenAI screwed that one up by trying to control it. Stable Diffusion, on the other hand, gives me hope that AI can be high quality and open (not only in name).
Can't wait to have something like Stable Diffusion but for LLMs.