
If you have a GPU with >4GB of VRAM and you want to run this locally, here's a fork of the Stable Diffusion repo with a convenient web UI:

https://github.com/hlky/stable-diffusion

It supports both txt2img and img2img. (Not affiliated.)

Edit: Incidentally, I tried running it on a CPU. It is possible, but it took 3 minutes instead of 10 seconds to produce an image. It also required me to hack up the script in a really gross way. Perhaps there is a script somewhere that properly supports this.



There are also forks for the GPU in Apple's M1 chips:

https://github.com/magnusviri/stable-diffusion


Anyone know how fast this runs on an M1 MacBook Air?


Takes anywhere from 30s to 1.5m on my M1 Max.


Nvidia GTX 1660 Super with 6GB of VRAM.

I do runs at 384x384, with a batch size of 1. Sampling method has almost no impact on memory. Using k_euler with 30 steps renders an image in 10 to 20 seconds. The biggest factors affecting rendering speed are the step count and the resolution, so 512x512 with C 50 using ddim is much slower than 256x256 with C 25 using k_euler.

The sampling methods run in roughly the same amount of time, but k_euler can produce viable output at lower C values, which makes it effectively faster than the rest.
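A rough sanity check of that scaling (render cost growing with pixel count times step count) can be sketched in a few lines; the linear model is an approximation for intuition, not a benchmark:

```python
def relative_cost(width, height, steps, base=(256, 256, 25)):
    """Rough relative render cost: diffusion runs `steps` denoising
    passes over the image, so cost scales roughly with pixels * steps.
    Sampler choice mostly shifts the constant factor, not the scaling."""
    bw, bh, bs = base
    return (width * height * steps) / (bw * bh * bs)

# 512x512 at 50 steps vs 256x256 at 25 steps: roughly 8x the work
print(relative_cost(512, 512, 50))  # → 8.0
```

Which matches the observation above: the ddim run isn't slow because of the sampler, it's slow because it does about 8x the work.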

Don't add GFPGAN to the same pipeline, as it takes more VRAM.

I'm running it on Windows 10 with the latest drivers. I set the Python process to Realtime priority in Task Manager (makes a slight difference!). Have not tried it on Linux.


I'm running 1660 ti on Windows 11.

I'm thinking about getting a 3090 so that I can make higher resolution images.

GFPGAN runs much faster for me: about 5 seconds per picture.


What resolution could you get on a 3090?


A good way to find out (if not documented online) is to start at 512x512 if you have a card with 12GB of VRAM, increment the size (the UI slider increases it by 64px per step), and backtrack when you start getting "CUDA out of memory" errors. I've seen some renders on Discord well above 1000px, so those must have come from 16/24GB cards or something similar. In a research context, people are used to 40GB/80GB hardware (and perhaps multiples) to train and render. So it's quite remarkable that this works on consumer hardware at this point.

edit: on second thought, they most likely rendered at 512px but then ran it through an upscaler model. I've been meaning to hook mine up but kinda forgot to try.
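That increment-and-backtrack search could be automated along these lines; `try_render` and `fake_render` are hypothetical stand-ins for a render call that raises on out-of-memory:

```python
def find_max_resolution(try_render, start=512, step=64, limit=2048):
    """Increase the square render size by `step` until the render
    raises (e.g. "CUDA out of memory"), then return the last size
    that worked. `try_render(size)` is a hypothetical callback."""
    best = None
    size = start
    while size <= limit:
        try:
            try_render(size)
            best = size
            size += step
        except RuntimeError:  # PyTorch raises RuntimeError on CUDA OOM
            break
    return best

def fake_render(size):
    """Fake renderer that 'runs out of memory' above 768px."""
    if size > 768:
        raise RuntimeError("CUDA out of memory")

print(find_max_resolution(fake_render))  # → 768
```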


No idea, things are changing very quickly now


I'm impressed that running it on CPU only made it ~20x slower. How did you do it?


Nah, that's normal. It's why GPUs are the usual choice for AI. Any crappy, old, weak GPU with 4GB of memory would run circles around a CPU.

It's often easier to actually get models to run on CPU, due to simpler install configs and more available memory; it's just painful to get a result out of it. Which might help keep the install simple, because it's not even worth optimizing.


> Any crap, old, weak gpu with 4gb memory would run circles around a cpu

Not really.

https://news.ycombinator.com/item?id=32635086

I'm not even sure it works well - if at all - with 4GB.

In any case, it’s impressive even if it takes minutes. And it’s not like you need to be there to make it work. You can create a list of prompts, let it do its thing and check the results later.
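A minimal sketch of that queue-and-walk-away workflow, assuming the optimizedSD fork's CLI mentioned elsewhere in the thread (the prompts and script path here are placeholders); it just prints the commands rather than running them:

```python
import shlex

# Hypothetical prompt queue; swap in your own.
PROMPTS = [
    "a lighthouse in a storm, oil painting",
    "a fox reading a newspaper, watercolor",
]

def build_command(prompt, steps=50, size=512):
    """Build one txt2img invocation. Flags follow the optimizedSD
    fork's CLI; adjust the script path to your checkout."""
    return [
        "python", "optimizedSD/optimized_txt2img.py",
        "--prompt", prompt,
        "--H", str(size), "--W", str(size),
        "--n_iter", "1", "--n_samples", "1",
        "--ddim_steps", str(steps),
    ]

for prompt in PROMPTS:
    # Print the commands; pipe to `sh` (or use subprocess.run)
    # to actually queue the renders overnight.
    print(shlex.join(build_command(prompt)))
```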


I have 4GB and it takes about 9 seconds for me :) (Though at 448x448, but there's no real difference in quality.)


Good to know it works. That's not a crappy, old, weak GPU, I guess.

I've not tried that size, but I tried 256x256 and it was too small to get interesting results - maybe there are some parameters that can be adjusted to improve it, though.


The GP is asking why it's only 20x slower, rather than slower still.


Ha, yep, I see that now on a reread. My bad.


It's OK, I missed the "only" the first time around and read it the same way as you.


Just tried it on Ubuntu 22.04. And it's working! Had to install python-is-python3 and conda, but it's up now. Fun, thanks.


How do you set up the model? Instructions only say "Download the model checkpoint. (e.g. from huggingface)", but I can't find instructions there on how to find a ckpt file, nor exactly what file should I look for.



Thank you, this saved me a lot of frustration.


Can you share your script for running on CPU?


python optimizedSD/optimized_txt2img.py --device cpu --precision full --prompt "…" --H 512 --W 512 --n_iter 1 --n_samples 1 --ddim_steps 50


Thank you. There was also a related submission for running on CPU: https://news.ycombinator.com/item?id=32642255

I'm planning to try it out this weekend, but I'm not really hopeful. I only have 8GB of RAM, and mine is a cheap Intel CPU that doesn't even have AVX.


What kind of GPU do you have? It takes several minutes to produce an image on my 1070.


Takes 3 minutes (for a prompt resulting in a set of 4 images) on my 1080 as well. Really astonished that it takes GP about the same time using just a CPU. Seems like the older generation of GPUs isn't much better than CPUs when it comes to ML.


> Really astonished that it takes GP about the same time using just a CPU.

GP talks about generating a single image while you talk about generating 4.


A 1070 isn’t very powerful for ML compared to more recent GPUs, so several minutes sounds about right.


To add another data point, my GTX 1080 takes ~60 sec to generate a pair of 500x500 images using txt2img. Haven't tried img2img yet, as the UI package I went with is a bit buggy with it.


Also on a 1070, I can generate an image in ~15 seconds; surely you're doing something wrong.


What settings are you using if the people with 1080s above you are taking 3 minutes?


I'm using lstein's repository, which loads the models and keeps them in memory. Then, for some samplers, 16 steps is enough to come up with a usable image (using more steps will only add details and most of the time won't change much).

This is for the initial "exploration" step of the process. Once I like an image I typically play with the settings, then in the final step run with a large number of steps (and maybe even use upscaling).

So: default 512x512 size, 16 steps, and defaults for the rest of the settings (I believe 7.5 scale, 0.75 strength).

Having said that, I also tried the official Docker image for stable diffusion and with the default values it generated an image in about 40 seconds.
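That two-phase workflow (cheap passes to find a promising image, then one expensive pass on the keeper) could be captured as a pair of presets; the values mirror the defaults mentioned above, while the preset names are my own:

```python
# Shared defaults from the comment above: 512x512, scale 7.5, strength 0.75.
BASE = {"width": 512, "height": 512, "scale": 7.5, "strength": 0.75}

EXPLORE = {**BASE, "steps": 16}  # fast: enough to judge composition
FINAL = {**BASE, "steps": 50}    # slow: adds detail, keeps the layout

def settings_for(stage):
    """Return the settings dict for a workflow stage ('explore' or 'final')."""
    return {"explore": EXPLORE, "final": FINAL}[stage]

print(settings_for("explore")["steps"])  # → 16
```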


What are good and reasonably priced GPUs for this (<$250, possibly less)?


Not sure of exact pricing, but I'd look for a used Maxwell (GeForce 900 series) Nvidia GPU. A Quadro M2000 with 4GB of RAM was about $100 on eBay a short while ago.


Thanks! A Quadro M2000 can still be found for around $100 today, but I found an M4000 for $150 and ordered that... We'll see...


Amazing. Anyone know of a fork where this is hosted on a cloud GPU? Or any existing hosted version of this?


There’s a Colab implementation (Google hosted GPU) linked to from https://www.youtube.com/watch?v=Xur1JeRjjOI


Here is a serverless GPU template for Stable Diffusion hosted on Banana's cloud platform. Template: https://github.com/bananaml/serverless-template-stable-diffu... Setup demo: https://www.banana.dev/blog/how-to-deploy-stable-diffusion-t...


Got this working on my 8GB 3070. 7-8s per image with default settings. Thanks for posting this!



