
If you have a GPU with >4GB of VRAM and you want to run this locally, here's a fork of the Stable Diffusion repo with a convenient web UI:

https://github.com/hlky/stable-diffusion

It supports both txt2img and img2img. (Not affiliated.)

Edit: Incidentally, I tried running it on a CPU. It is possible, but it took 3 minutes instead of 10 seconds to produce an image. It also required me to hack up the script in a really gross way. Perhaps there is a script somewhere that properly supports this.



There are also forks for the GPU in Apple's M1 chips:

https://github.com/magnusviri/stable-diffusion


Anyone know how fast this runs on an M1 MacBook Air?


Takes anywhere from 30s to 1.5m on my M1 Max.


Nvidia GTX 1660 Super with 6GB of VRAM.

I do runs at 384x384, with a batch size of 1. Sampling method has almost no impact on memory. Using k_euler with 30 steps renders an image in 10 to 20 seconds. The biggest factors affecting rendering speed are the step count and the resolution, so 512x512 with C 50 using ddim is much slower than 256x256 with C 25 using k_euler.

The sampling methods run in roughly the same amount of time, but k_euler can produce viable output at lower C values, which makes it effectively faster than the rest.
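A rough sanity check of that scaling (render cost growing with pixel count times step count) can be sketched in a few lines; the linear model is an approximation for intuition, not a benchmark:

```python
def relative_cost(width, height, steps, base=(256, 256, 25)):
    """Rough relative render cost: diffusion runs `steps` denoising
    passes over the image, so cost scales roughly with pixels * steps.
    Sampler choice mostly shifts the constant factor, not the scaling."""
    bw, bh, bs = base
    return (width * height * steps) / (bw * bh * bs)

# 512x512 at 50 steps vs 256x256 at 25 steps: roughly 8x the work
print(relative_cost(512, 512, 50))  # → 8.0
```

Which matches the observation above: the ddim run isn't slow because of the sampler, it's slow because it does about 8x the work.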

Don't add GFPGAN to the same pipeline, as it takes more VRAM.

I'm running it on Windows 10 with the latest drivers. I set the Python process to Realtime priority in Task Manager (makes a slight difference!). Have not tried it on Linux.


I'm running 1660 ti on Windows 11.

I'm thinking about getting a 3090 so that I can make higher resolution images.

GFPGAN runs much faster for me: about 5 seconds per picture.


What resolution could you get on a 3090?


A good way to find out (if not documented online) is to start at 512x512 if you have a card with 12GB of VRAM, increment the size (the UI slider increases it by 64px per step), and backtrack when you start getting "CUDA out of memory" errors. I've seen some renders on Discord well above 1000px, so those must have come from 16/24GB cards or something similar. In a research context, people are used to 40GB/80GB hardware (and perhaps multiples) to train and render. So it's quite remarkable that this works on consumer hardware at this point.

edit: on second thought, they most likely rendered at 512px but then ran it through an upscaler model. I've been meaning to hook mine up but kinda forgot to try.
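That increment-and-backtrack search could be automated along these lines; `try_render` and `fake_render` are hypothetical stand-ins for a render call that raises on out-of-memory:

```python
def find_max_resolution(try_render, start=512, step=64, limit=2048):
    """Increase the square render size by `step` until the render
    raises (e.g. "CUDA out of memory"), then return the last size
    that worked. `try_render(size)` is a hypothetical callback."""
    best = None
    size = start
    while size <= limit:
        try:
            try_render(size)
            best = size
            size += step
        except RuntimeError:  # PyTorch raises RuntimeError on CUDA OOM
            break
    return best

def fake_render(size):
    """Fake renderer that 'runs out of memory' above 768px."""
    if size > 768:
        raise RuntimeError("CUDA out of memory")

print(find_max_resolution(fake_render))  # → 768
```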


No idea, things are changing very quickly now


I'm impressed that running it on CPU only made it ~20x slower. How did you do it?


Nah, that's normal. It's why GPUs are the usual choice for AI. Any crappy, old, weak GPU with 4GB of memory would run circles around a CPU.

It's often easier to actually get models to run on CPU, due to simpler install configs and more available memory; it's just painful to get a result out of it. Which might help keep the install simple, because it's not even worth optimizing.


> Any crap, old, weak gpu with 4gb memory would run circles around a cpu

Not really.

https://news.ycombinator.com/item?id=32635086

I'm not even sure it works well - if at all - with 4GB.

In any case, it’s impressive even if it takes minutes. And it’s not like you need to be there to make it work. You can create a list of prompts, let it do its thing and check the results later.
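A minimal sketch of that queue-and-walk-away workflow, assuming the optimizedSD fork's CLI mentioned elsewhere in the thread (the prompts and script path here are placeholders); it just prints the commands rather than running them:

```python
import shlex

# Hypothetical prompt queue; swap in your own.
PROMPTS = [
    "a lighthouse in a storm, oil painting",
    "a fox reading a newspaper, watercolor",
]

def build_command(prompt, steps=50, size=512):
    """Build one txt2img invocation. Flags follow the optimizedSD
    fork's CLI; adjust the script path to your checkout."""
    return [
        "python", "optimizedSD/optimized_txt2img.py",
        "--prompt", prompt,
        "--H", str(size), "--W", str(size),
        "--n_iter", "1", "--n_samples", "1",
        "--ddim_steps", str(steps),
    ]

for prompt in PROMPTS:
    # Print the commands; pipe to `sh` (or use subprocess.run)
    # to actually queue the renders overnight.
    print(shlex.join(build_command(prompt)))
```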


I have 4GB and it takes about 9 seconds for me :) (Though at 448x448, but there's no real difference in quality.)


Good to know it works. That's not a crappy, old, weak GPU, I guess.

I've not tried that size, but I tried 256x256 and it was too small to get interesting results - maybe there are some parameters that can be adjusted to improve it, though.


The GP is asking why it's only 20x slower, rather than slower still.


Ha, yep, I see that now on a reread. My bad.


It's OK, I missed the "only" the first time around and read it the same way as you.


Just tried it on Ubuntu 22.04. And it's working! Had to install python-is-python3 and conda, but it's up now. Fun, thanks.


How do you set up the model? Instructions only say "Download the model checkpoint. (e.g. from huggingface)", but I can't find instructions there on how to find a ckpt file, nor exactly what file should I look for.



Thank you, this saved me a lot of frustration.


Can you share your script for running on CPU?


python optimizedSD/optimized_txt2img.py --device cpu --precision full --prompt "…" --H 512 --W 512 --n_iter 1 --n_samples 1 --ddim_steps 50


Thank you. There was also a related submission for running on CPU: https://news.ycombinator.com/item?id=32642255

I'm planning to try it out this weekend, but I'm not really hopeful. I only have 8GB of RAM, and mine is a cheap Intel CPU that doesn't even have AVX.


What kind of GPU do you have? It takes several minutes to produce an image on my 1070.


Takes 3 minutes (for a prompt resulting in a set of 4 images) on my 1080 as well. Really astonished that it takes GP about the same time using just a CPU. Seems like the older generation of GPUs isn't much better than CPUs when it comes to ML.


> Really astonished that it takes GP about the same time using just a CPU.

GP talks about generating a single image while you talk about generating 4.


A 1070 isn’t very powerful for ML compared to more recent GPUs, so several minutes sounds about right.


To add another data point, my GTX 1080 takes ~60 sec to generate a pair of 500x500 images using txt2img. Haven't tried img2img yet, as the UI package I went with is a bit buggy with it.


Also on a 1070, I can generate an image in ~15 seconds; surely you're doing something wrong.


What settings are you using if the people with 1080s above you are taking 3 minutes?


I'm using lstein's repository, which loads the models and keeps them in memory. Then, for some samplers, 16 steps is enough to come up with a usable image (using more steps will only add details and most of the time won't change much).

This is for the initial "exploration" step of the process. Once I like an image I typically play with the settings, then in the final step run with a large number of steps (and maybe even use upscaling).

So: default 512x512 size, 16 steps, and defaults for the rest of the settings (I believe 7.5 scale, 0.75 strength).

Having said that, I also tried the official Docker image for stable diffusion and with the default values it generated an image in about 40 seconds.
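That two-phase workflow (cheap passes to find a promising image, then one expensive pass on the keeper) could be captured as a pair of presets; the values mirror the defaults mentioned above, while the preset names are my own:

```python
# Shared defaults from the comment above: 512x512, scale 7.5, strength 0.75.
BASE = {"width": 512, "height": 512, "scale": 7.5, "strength": 0.75}

EXPLORE = {**BASE, "steps": 16}  # fast: enough to judge composition
FINAL = {**BASE, "steps": 50}    # slow: adds detail, keeps the layout

def settings_for(stage):
    """Return the settings dict for a workflow stage ('explore' or 'final')."""
    return {"explore": EXPLORE, "final": FINAL}[stage]

print(settings_for("explore")["steps"])  # → 16
```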


What are good and reasonably priced GPUs for this (<$250, possibly less)?


Not sure of exact pricing, but I'd look for a used Maxwell (GeForce 900 series) Nvidia GPU. A Quadro M2000 with 4GB of RAM was about $100 on eBay a short while ago.


Thanks! A Quadro M2000 can still be found for around $100 today, but I found an M4000 for $150 and ordered that... We'll see...


Amazing. Anyone know of a fork where this is hosted on a cloud GPU? Or any existing hosted version of this?


There’s a Colab implementation (Google hosted GPU) linked to from https://www.youtube.com/watch?v=Xur1JeRjjOI


Here is a serverless GPU template for Stable Diffusion hosted on Banana's cloud platform. Template: https://github.com/bananaml/serverless-template-stable-diffu... Setup demo: https://www.banana.dev/blog/how-to-deploy-stable-diffusion-t...


Got this working on my 8GB 3070. 7-8s per image with default settings. Thanks for posting this!



