It supports both txt2img and img2img. (Not affiliated.)
Edit: Incidentally, I tried running it on a CPU. It is possible, but it took 3 minutes instead of 10 seconds to produce an image. It also required me to hack up the script in a really gross way. Perhaps there is a script somewhere that properly supports this.
I do runs at 384px by 384px with a batch size of 1. The sampling method has almost no impact on memory. Using k_euler with 30 steps renders an image in 10 to 20 seconds. The biggest things that affect rendering speed are the step count and the resolution, so 512x512 with C 50 using ddim is much slower than 256x256 with C 25 using k_euler.
The sampling methods run on mostly the same timescales per step, but k_euler can produce viable output at lower C values, meaning it is faster than the rest in practice.
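As a rough back-of-the-envelope (assuming render time scales linearly with pixel count and step count, and that "C" here refers to the step count):

```python
def relative_cost(width, height, steps):
    """Rough work estimate: total pixels times sampling steps."""
    return width * height * steps

slow = relative_cost(512, 512, 50)  # 512x512 at 50 steps
fast = relative_cost(256, 256, 25)  # 256x256 at 25 steps
print(slow / fast)  # 8.0 -- the big render is roughly 8x the work
```

So halving the resolution and halving the steps together buys you roughly an 8x speedup, which matches what I see in practice.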
Don't add gfpgan to the same pipeline, as it takes more VRAM.
I'm running it on Windows 10 with latest drivers. I set the python process to Realtime priority in task manager (makes a slight difference!). Have not tried it on Linux.
A good way to find out (if it's not documented online) is to start at 512x512 if you have a card with 12GB VRAM, increment it (the UI slider increases it by 64px per step), and backtrack when you start getting "CUDA out of memory" errors. I've seen some renders on Discord where the sizes are well above 1000px, so they must have had 16/24GB cards or something similar.
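That increment-and-backtrack probe can be sketched like this (a hedged sketch only: `render` here is a stand-in for whatever actually runs your txt2img call, and I'm assuming the OOM surfaces as a RuntimeError, as it does with PyTorch):

```python
def find_max_size(render, start=512, step=64, limit=2048):
    """Probe resolutions upward until render() fails, then report
    the last size that worked. `render` is a placeholder callable."""
    size = start
    best = None
    while size <= limit:
        try:
            render(size)       # attempt a render at this resolution
            best = size        # it worked: remember it and go bigger
            size += step
        except RuntimeError:   # e.g. "CUDA out of memory"
            break              # backtrack: `best` is the safe maximum
    return best
```

In practice one successful render at a size doesn't guarantee every prompt fits, so leaving one step of headroom below the reported maximum is safer.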
In the research context, they are used to using 40GB/80GB hardware (and perhaps multiples) to train and render. So quite remarkable that it works on consumer hardware at this point.
edit: on second thought, they most likely rendered at 512px but then ran it through an upscaler model. I've been meaning to hook mine up but kinda forgot to try.
Nah that's normal. It's why GPUs are the usual thing for AI. Any crap, old, weak gpu with 4gb memory would run circles around a cpu
It's often easier to actually get models to run on a CPU, due to simpler install configs and more available memory. It's just painful to get a result out of it. That slowness might even help keep installs simple, because the CPU path isn't worth optimizing.
I’m not even sure it works well - if at all - with 4GB.
In any case, it’s impressive even if it takes minutes. And it’s not like you need to be there to make it work. You can create a list of prompts, let it do its thing and check the results later.
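A minimal sketch of that batching idea in Python (the script path and flag names here are hypothetical placeholders - every fork spells them differently, so check your repo's README):

```python
def build_commands(prompts, outdir="out"):
    """Build one txt2img command line per prompt.

    "scripts/txt2img.py", "--prompt" and "--outdir" are placeholders;
    substitute whatever your fork actually uses.
    """
    return [
        ["python", "scripts/txt2img.py",
         "--prompt", prompt,
         "--outdir", f"{outdir}/batch_{i:03d}"]
        for i, prompt in enumerate(prompts)
    ]

cmds = build_commands([
    "a watercolor painting of a lighthouse at dawn",
    "a photo of a corgi wearing a spacesuit",
])
print(len(cmds))  # 2
# Then kick them off before bed and check the outdirs in the morning:
# import subprocess
# for cmd in cmds:
#     subprocess.run(cmd, check=True)
```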
Good to know it works. That’s not a crap, old, weak gpu, I guess.
I’ve not tried that size but I tried 256x256 and it was too small to get interesting results - maybe there are some parameters that can be adjusted to improve it though.
How do you set up the model? Instructions only say "Download the model checkpoint. (e.g. from huggingface)", but I can't find instructions there on how to find a ckpt file, nor exactly what file should I look for.
Takes 3 minutes (for a prompt resulting in a set of 4 images) on my 1080 as well. Really astonished that it takes GP about the same time using just a CPU. Seems like the older generation of GPUs isn't much better than CPUs in regards to ML stuff.
To add another data point, my GTX1080 takes ~60 sec to generate a pair of 500x500 images using txt2img. Haven't tried img2img yet as the UI package I went with is a bit buggy with it
I'm using lstein's repository which loads the models and keeps them in memory. Then, for some diffusers 16 steps is enough to come up with a usable image (and using more steps will only add details, most of the time won't change much).
This is for the initial "exploration" step of the process. Once I like an image I typically play with the settings, then in the final step run with a large number of steps (and maybe even use upscaling).
So, default 512x512 size, 16 steps and default for the rest of the settings (I believe 7.5 scale, 0.75 strength).
Having said that, I also tried the official Docker image for stable diffusion and with the default values it generated an image in about 40 seconds.
Not sure of exact pricing, but I'd bet on a used Maxwell (GeForce 900 series) NVIDIA GPU. A Quadro M2000 with 4GB of RAM was about $100 on eBay a short while ago.
Not necessarily, GPU compute can be non-deterministic due to scheduling: the result of A+B+C is subtly different when it's (A+B)+C or when it's A+(B+C), which can get amplified in a long processing pipeline
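You can see the non-associativity even in plain Python floats; a GPU's parallel reductions just reorder sums like this at massive scale:

```python
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # one summation order
right = a + (b + c)  # another order, as a parallel reduction might choose

print(left)           # 0.6000000000000001
print(right)          # 0.6
print(left == right)  # False -- same math, different rounding
```

Over thousands of such operations per pixel, those last-bit differences can accumulate into visibly different outputs.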
On the forked repo with the webui (https://github.com/hlky/stable-diffusion) it seems that the same inputs result in the same output, so maybe it has been resolved there?
old.reddit is truly horrible on mobile. Once you click on an image you can't go back. Off topic, but what is the other alternative UI called that people sometimes use?
I've been playing with this for a few hours. It's slow going -- you really need a fast GPU with a lot of RAM to make this very usable.
I ended up paying the $10 for Google Colab Pro and that's how I've been using this. Maybe I'll figure out how to get this working on my old 1080 TI to see if it's faster.
What I really wish was that the img2img tool could be used to take a text2img output and then "refine" it further. As it is, the img2img tool doesn't seem particularly great.
People on Reddit are talking about "I just generate 100 images and pick the best one"... but this is incredibly slow on the P100 GPU that Google has me on. Does this just require a monster GPU like a 3080/3090 in order to get any decent results?
Also how slow is your p100? I'm usually getting around 3 it/s. Maybe it's just because I'm used to disco diffusion where a single image took over an hour, but this is ungodly fast to me
FWIW I'm using an old gtx 1080 Ti to play around, it takes about 21 seconds per image. You can make it go even faster by lowering the timesteps taken from the default 50 (--ddim_steps), though both lowering and raising the value can result in quite different first-iteration images (though they tend to be similar) and seems to guarantee totally different further iteration images (as counted by --n_iter)... I'm with you on the feeling that it's hard to control, whether in refinement or in other ways, but I suspect that'll get a lot better in the next couple years (if not weeks or dare I say days).
You're probably using the default PLMS sampler with 50 steps. There are better samplers; the best seem to be Euler (more predictable with regard to the number of steps) and Euler ancestral (gives more variation). Both typically need far fewer steps to converge, speeding up generation.
HuggingFace is a company that mainly builds open-source libraries and platforms to support open-source ML projects. They started out with their famous Transformers library and maintain many others, including the diffusers library that this application is actually using. They also have a model/dataset hub and an interactive application platform known as "Spaces". Their goal is to be the "GitHub of machine learning".
Their business model is basically supporting enterprise and private use-cases. For example, getting expert support for using these libraries, or hosting models and datasets privately. You can see more information about the pricing here:
https://huggingface.co/pricing
They reached a $2 billion valuation after a recent round of funding so overall they're probably pretty flush with cash lol
complaining about the price of getting their images done. So part of it may actually be exchanging money for a service. I bet a good chunk of it is investor cash though.
I've been trying to get some sensible images out of my descriptions, but I fail miserably.
In this case I had the prompt "cow chewing bone" with 4 squares representing the two pairs of feet, the body, and the head. None of them cared about chewing on a bone.
With DALL·E 2 I tried to get an image of a little girl building sandcastles and a monster threatening her:
"little scared girl building a sandcastle and a big angry monster is looking at her."
"little scared girl building a sandcastle six damaged sandcastles are to her side. a big angry monster is threatening her. it is dark." https://imgur.com/a/f5FFKOi
"little scared girl building a sandcastle with six damaged sandcastles to her side and a big angry monster threatening her"
Is there some kind of structure the sentences should follow?
Yes, check out examples on lexica or use a prompt builder to help, like promptmania.
Also, most of the good ones you see online are cherry-picked from hundreds of runs, so set your batch size to 1000 and go to bed! After that, people tend to run some of the good results through img2img, also with a lot of variations produced from a single image. Finally, some people run them at higher resolutions if they have enough VRAM, as smaller resolutions can distort or generate rubbish. For the messed-up faces, they run the image through gfpgan a few times to get prettier faces. Other than that, it is pure luck (using random seeds) to figure out what works and what doesn't. Use the two sites above to help you improve your prompts.
Just know that if you often let it run overnight, you will see it on your electricity bill. My GTX 1660 runs at max while rendering, which is 125W. Leaving it running overnight can easily eat 2 to 6 kWh, depending on your system.
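Rough math, using the 125W figure above and treating the total system draw as an assumption you'd measure for your own box:

```python
def energy_kwh(watts, hours):
    """Energy drawn by a constant load, in kilowatt-hours."""
    return watts * hours / 1000

print(energy_kwh(125, 8))  # 1.0 kWh for the GPU alone over an 8-hour night
print(energy_kwh(400, 8))  # 3.2 kWh if the whole system pulls ~400W
```

Multiply by your local price per kWh to see what an overnight batch actually costs.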
I managed to get one that was correct with "A little scared girl is building a sandcastle, while a monster is looking at her. Award-winning photograph.", but I couldn't figure out a phrasing where it wouldn't most of the time get confused thinking that the sandcastle is the monster, or that the girl is the monster.
DALL·E is bad at being instructed to have an exact count of items in the picture. Ask for 6 kittens and you get 7, and each kitten will be much more "wrong" than a picture of a single kitten.
DALL·E is also bad at positional prompts. Ask for something to be in the top right-hand corner and it will appear bottom centre.
Thank you, but this notebook appears to use images generated from text prompts. I was interested in interpolating between two given images, without generating from text.
You would need to run CLIP and generate CLIP embeddings from the images first, then feed them into the samplers. If you give it the starting images and caption them yourself, then it would also work.
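For the interpolation step itself, spherical interpolation (slerp) between the two embeddings is the usual choice over plain lerp, since these embeddings live roughly on a hypersphere. A minimal pure-Python sketch (the vectors here are tiny stand-ins for real CLIP embeddings):

```python
import math

def slerp(t, v0, v1):
    """Spherical interpolation between two vectors at fraction t in [0, 1]."""
    dot = sum(a * b for a, b in zip(v0, v1))
    norm0 = math.sqrt(sum(a * a for a in v0))
    norm1 = math.sqrt(sum(b * b for b in v1))
    # Clamp to avoid math.acos domain errors from rounding.
    cos_omega = max(-1.0, min(1.0, dot / (norm0 * norm1)))
    omega = math.acos(cos_omega)
    if omega < 1e-8:  # vectors nearly parallel: fall back to plain lerp
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]
    s0 = math.sin((1 - t) * omega) / math.sin(omega)
    s1 = math.sin(t * omega) / math.sin(omega)
    return [s0 * a + s1 * b for a, b in zip(v0, v1)]

# Endpoints reproduce the inputs; t=0.5 lands between them.
print(slerp(0.0, [1.0, 0.0], [0.0, 1.0]))  # [1.0, 0.0]
print(slerp(0.5, [1.0, 0.0], [0.0, 1.0]))  # [0.707..., 0.707...]
```

Sweeping t from 0 to 1 over the two image embeddings and decoding each intermediate point gives the interpolation frames.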
Seems like if you draw something and hit start, but get the queue error, you lose the image prompt you drew as the frame is replaced with the first steps of diffusion noise? None of the results have a composition anywhere close to my image prompt, presumably because I can't get a run on first attempt.
I'm not quite grasping how to use this. I tried uploading a photograph and erasing part of it. But instead of painting in the erased portion, it left the erased area blank and replaced my photograph with an entirely new image.
There you go. We were promised quite far-fetched things like 'flying cars', 'time travel', 'life extension' and 'universal basic income'.
Instead we get deep learning AIs being used for generating and faking sentences, images, videos, voices, code and digital art, all trained on mountains of data in data centers, all significantly contributing to the burning up of the planet, to no benefit and with no efficient alternatives.
A dystopian future with deepfakes and easy fake news creation thanks to the technologists who helped create those things for others to create a new griftopia on top of that.
So you will own nothing, believe everything that you see on the internet, and be very happy.
If we had time travel, they’d say now someone can kill our granddads and rewrite history. If we had flying cars, they’d say now our flats are worthless because everyone can fly by and gaze. If we had universal basic income, they’d say something about it too.
It’s really nice that we only got AIs that draw pictures and bitcoin that helps with stolen money.
Tune out the news and look for a futurism themed news source[0], and you can get your techno-optimistic pipe dreams back, right now. What you describe as "we were promised" and "we got" is just you seeing a bit more of the world. It was filled with empty promises and horrors in the beforetimes too, today's flavor is nothing special - just in its specifics for today's culture. And there's also plenty of people doing great things in the past too, perhaps not promising those things, just living their life doing something they believe in, and getting results.
My point being, there's enough good things and bad things to fit any mood and worldview. Everyone can basically pick as they'd like.
You must be kidding. It's hard labor creating images, and you are complaining that it's going to be easier in the future? You are complaining that the skill gap is eradicated and it now only depends on your ideas to produce compelling work? That's absolutely insane. Everything that takes labor-intensive tasks and makes them basically free is the future. This is not dystopian, it's our future: a future in which you and your capabilities don't matter. It's exactly the future we need, and doing it open source is exactly the way to keep it out of harm's way of turbo capitalism. You are missing the point so hard it's not even funny. Flying cars for the 1% is dystopian.
The skill gap was what made it valuable. There is not that much value in flooding everyone with generated images that have no story, no person behind it, no effort, no reason to exist, out of place. “Art” is as much about the final image as it is about the person who made it, why she made it and how she got there.
Also there is a pretty dystopian angle, because these tools blatantly stole all the work of artists by "learning" on their work (without permission). Calling it learning is too nice; we all know it's just a huge visual pattern copy-mashup-paste machine. People are not going to invent with this - they just add the name of an artist they like to the prompt and get to call that their work.
It's not liberation - it's exploitation. It's going to destroy people's lives by making their hard-earned skills obsolete.
Then again, it probably won't be so bad. Artists won't disappear, and people won't suddenly become artists just because they can type into a prompt.
You in 1894: Why does no one think about all the horses and stable boys.
The value of art is also not in the hours you put into creating it; it's about the idea behind it and how it's conveyed. Artists are also always stealing, especially concept artists - the only thing to say is "photobashing". You talk like someone who has absolutely no idea how the sausage is made.
Yes, yes, and the invention of the camera will kill painters. Sure, everybody knows this.
But there is an important difference between someone stealing and an algorithm copy-replicating anything from the past in an instant. One is a remix that brings something new (even if the author doesn't want it); the other is static, it's conservation. It will create side effects that will impact our (visual) culture. But who knows what those will be.
Not sure why I deserve so much toxicity. We obviously work in different fields and have different views of what art actually is. No, I am not happy to jump on the AI bandwagon that will surely mean an even faster pace and more precarious working conditions, because individuals surely won't reap the benefits of automation. It's easy not to care when you are not affected.
None of that means that I deserve the hate. It's just a different opinion.
Is there a way to run Stable Diffusion in C++ only? I would like to build the program statically. I have been searching on GitHub but so far have been unable to find anyone who has ported it.
I typed in "space nerds in space fighting the zarkons", and it gave me a picture that looked similar to NASA's astronaut group photos, with 3 astronauts, 2 human, and one 3-eyed ninja-turtle-looking alien in a spacesuit. Behind them was a space scene with a huge, earthlike planet. The faces of the two human astronauts looked eerily similar to the face of a friend of mine, a friend who happens to be the 2nd most prolific contributor of code to my open source game called "Space Nerds in Space". One of the human astronauts was bald, the other had a mullet, and the long portion of their hair was not entirely unlike my own hair.
Why were you disappointed? I haven't tried that with img2img, but as just a regular text prompt without any fancy prompt engineering I get results in line with what I'd expect (kind of a Hugo Boss/Cowboy crossover, e.g. this: https://imgur.com/a/gcKi3WG)
I'm one of the maintainers (in charge of the UI) for hlky webui repo!
And we just updated our own colab; you can find it here:
https://github.com/altryne/sd-webui-colab
I'm not gonna lie, I'm quite disappointed by the willful obstinance on HN whenever there is a discussion of the anti-NSFW or anti-racism filters present in today's artificial intelligence research.
First, the filters are only on the interface. The research is public. That's why there has been an explosion of new implementations. You are welcome to run the code yourself and make the horniest model you like.
But more importantly, this comes up every time, and everyone acts SO confused. But why? Artificial Intelligence is MASSIVE mainstream news. New advancements are arriving daily.
Do you want the news stories to be dominated by whatever heinous shit some bad-faith giggly teenager with a call of the void to offend as much as possible (I know, I was one) is able to generate?
Do you want the discourse to be dominated with "Won't someone think of the children?" blocking legitimate research and progress?
Europeans always like to dunk on Americans for being "prudes", but only a tiny section of Western Europe is progressive enough to not mind random nipples on their bus ads and television. All of Eastern Europe, Africa, India, and China are culturally still fairly conservative about sex - at least out in the open.
People think that preventing racist or sexual use of these models is indulging the prudish mores of America. But that itself is a very narrow perspective that ignores the perspectives of billions of people in the world.
I can generate many violent prompts right now, and the model knows the face of hundreds of celebrities (you would think they would garble those to prevent PR disasters). It knows what the 9/11 attack was, it knows about nuclear bombs, horror, gore. Yet somehow it doesn't know what a blowjob is. This is 100% about narrow-minded, prudish conservative values. And yes, most of them come from the US right now, because these models are not being trained in Africa, India or China.
Your argument is entirely descriptive. We know why they do this. What should be argued here is how stupid and sad it is to block progress due to nothing more than religious or moral values.
Tell me, earnestly, what progress is being blocked by your inability to use AI to generate an image of a blowjob.
I don't mean this in a moralistic sense. I have no qualms about images of blowjobs. I have no doubt that the porn industry has already deployed these models and is experimenting with this without filters. As a monetizable way to reduce the human costs, it's an entirely logical step, and adult performers should be as nervous as artists, and talking to lawyers.
It's funny because you spell out a use case that can represent undeniable progress: artificial porn. Porn with no component of human suffering, yet realistic and personalized. A dream product, no doubt. By what metric is this not progress, except a puritanical/religious one?
Moreover, sexuality is perhaps the most important theme in art, historically. It is a very legitimate thing to want to include in a tool such as this one. Its censorship is akin to "moral codes" of the past, completely regressive.
I'm not as sure that artificial porn is undeniable progress.
While I am entirely pro-liberty in terms of consumption of pornography, it is not a replacement for human interaction, and it can be used in conjunction with other tools to dehumanize and isolate individuals - some of whom then make negative contributions to the rest of society at large. These are not isolated incidents, and they've been rising: https://en.wikipedia.org/wiki/Misogynist_terrorism
There are many kinds of porn - some that exploit women, some that empower them in their production. And there are also many kinds of products - ones that focus on a human connection (whether vanilla or kinky) and ones that don't.
I would absolutely worry that artificial porn would be good enough to meet some of these demands but not most, and would ultimately be a net negative for society.
> Moreover, sexuality is perhaps the most important theme in art, historically.
I mean, it's demonstrably not, religious art imagery dominates by volume. But that's also not necessarily important - that's just who happened to be patrons of arts and had the means to commission them.
I'm not gonna deny that humans are horny and want to make lots of sexy art.
But you're also not going to get porn on basic cable. You have to seek that out extra for yourself, and that's going to be the case with AI-generated porn art also.
I think people can both understand the motives of the companies and even choose to do the same themselves, but still be disappointed in the state of society where this is required.
I agree with you that this probably is required while the tech is new and then eventually won't be when everyone and their dog can run these things locally on their phone.
>Do you want the news stories to be dominated by whatever heinous shit some bad-faith giggly teenager with a call of the void to offend as much as possible (I know, I was one) is able to generate?
It’s not the only alternative. It’s the alternative that will absolutely happen eventually - but later, and to a lesser degree, because of all the crude, annoying, imperfect filters that have been holding it back so far.
Stable Diffusion was released for free and open source with a totally unenforceable license (don't do anything illegal or unethical). People right now are likely generating the most heinous of things. Some piece-of-shit bully is almost certainly, as we speak, taking pictures of their target and making the worst compromising, cruel thing imaginable.
That discussion is over. You shouldn't care about a company trying to filter out bad shit on their own platform.
If you don't have anti-NSFW stuff your service will generate 95% porn as users who want it flock to you. No one wants a reputation as the porn AI (although I'm sure it will exist eventually).
I think of this like a "first post" filter, to cut down on the noise. Drawing dicks on things is something a lot of kids do. You want the trolls to have to work a little.
It's interesting that the people who write things like "look at this stream of words, they have to be coming from a sentient being" do not seem to care much about the "intelligence" generating these images.
https://github.com/hlky/stable-diffusion