
> likely could outperform this setup in terms of tokens per second

I've heard arguments both for and against this, but they always lack concrete numbers.

I'd love something like "Here is Qwen2.5 at Q4 quantization running via Ollama + these settings, and M4 24GB RAM gets X tokens/s while RTX 3090ti gets Y tokens/s", otherwise we're just propagating mostly anecdotes without any reality-checks.



On an M1 Max 64GB laptop running gemma2:27b, with the same prompt and settings as the blog post:

    total duration:       24.919887458s
    load duration:        39.315083ms
    prompt eval count:    37 token(s)
    prompt eval duration: 963.071ms
    prompt eval rate:     38.42 tokens/s
    eval count:           441 token(s)
    eval duration:        23.916616s
    eval rate:            18.44 tokens/s
I have a gaming PC with a 4090 I could try, but I don't think this model would fit


On a 3090 (24gb vram), same prompt & quant, I can report more than double the tokens per second, and significantly faster prompt eval.

    total_duration:       10530451000
    load_duration:        54350253
    prompt_eval_count:    36
    prompt_eval_duration: 29000000
    prompt_token/s:       1241.38
    eval_count:           460
    eval_duration:        10445000000
    response_token/s:     44.04
Fast prompt eval is important when feeding larger contexts into these models, which is required for almost anything useful. GPUs have other advantages for traditional ML, whisper models, vision, and image generation. There's a lot of flexibility that doesn't really get discussed when folks trot out the 'just buy a mac' line.
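
For anyone comparing the two outputs: the 3090 numbers above are raw fields rather than the formatted `--verbose` summary, so the rates have to be derived. A minimal sketch, assuming the duration fields are in nanoseconds (which is consistent with the rates reported in that same block); the helper name `tokens_per_second` is made up for illustration:

```python
NS_PER_SECOND = 1_000_000_000


def tokens_per_second(token_count: int, duration_ns: int) -> float:
    """Convert a token count and a nanosecond duration into tokens/s."""
    if duration_ns <= 0:
        raise ValueError("duration_ns must be positive")
    return token_count * NS_PER_SECOND / duration_ns


# Raw fields copied from the 3090 run quoted above.
stats = {
    "prompt_eval_count": 36,
    "prompt_eval_duration": 29_000_000,
    "eval_count": 460,
    "eval_duration": 10_445_000_000,
}

prompt_rate = tokens_per_second(stats["prompt_eval_count"], stats["prompt_eval_duration"])
eval_rate = tokens_per_second(stats["eval_count"], stats["eval_duration"])

# Prints: prompt: 1241.38 tokens/s, eval: 44.04 tokens/s
print(f"prompt: {prompt_rate:.2f} tokens/s, eval: {eval_rate:.2f} tokens/s")
```

Both values match the `prompt_token/s` and `response_token/s` lines in the post, so the two report styles are directly comparable.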

Anecdotally I can share my revealed preference. I have both an M3 (36gb) as well as a GPU machine, and I went through the trouble of putting my GPU box online because it was so much faster than the mac. And doubling up the GPUs allows me to run models like the deepseek-tuned llama 3.3, with which I have completely replaced my use of chatgpt 4o.


Thanks for the numbers! I think people should include their LLM runner as well, since hardware optimization support differs between runners. I haven't tested it, but I've heard MLX is noticeably faster than Ollama on Macs.


> gemma2:27b

What quantization are you using? What's the runtime+version you run this with? And the rest of the settings?

Edit: Turns out parent is using Q4 for their test. Doing the same test with LM Studio and a 3090ti + Ryzen 5950X (with 44 layers on GPU, 2 on CPU) I get ~15 tokens/second.


Fresh install from brew, ollama version is 0.5.7

Only settings I did were the ones shown in the blog post

    OLLAMA_FLASH_ATTENTION=1
    OLLAMA_KV_CACHE_TYPE=q8_0
Ran the model like

    ollama run gemma2:27b --verbose
With the same prompt, "Can you write me a story about a tortoise and a hare, but one that involves a race to get the most tokens per second?"


When you run that, what quantization do you get? Ollama's library website (https://ollama.com/library/gemma2:27b) isn't exactly good at surfacing useful information like what the default quantization is.


If you leave the :27b off from that URL you'll see the default size which is 9b. Ollama seems to always use Q4_0 even if other quants are better.


not sure how to tell, but here's the full output from ollama serve https://pastes.io/ollama-run-gemma2-27b


If you hit the drop-down menu for the size of the model, then tap “view all”, you will see the size and hash of the model you have selected and can compare it to the full list below it that has the quantization specs in the name.


Still, I don't see a way (from the web library) to see the default quantization (from Ollama's POV) at all, is that possible somehow?


The model displayed in the drop-down when you access the web library is the default that will be pulled. Compare the size and hash to the more detailed model listing below it and you will see what quantization you have.

Example: the default model weights for Llama 3.3 70b, after hitting the “view all” have this hash and size listed next to it - a6eb4748fd29 • 43GB

Now scroll down through the list and you will find the one that matches that hash and size is “70b-instruct-q4_K_M”. That tells you that the default weights for Llama 3.3 70B from Ollama are 4-bit quantized (q4) while the “K_M” tells you a bit about what techniques were used during quantization to balance size and performance.


Thanks, that seems to indicate Q4 for the quantization, you're probably able to run that on the 4090 as well FWIW, the size of the model is just 14.55 GiB.
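
A quick sanity check on that size, as a rough sketch rather than an exact accounting. The assumptions are mine, not from the thread: about 27.2e9 parameters for gemma2:27b, Q4_0 at 4.5 bits per weight (32 four-bit weights plus one 16-bit scale per block), and a hypothetical 2 GiB of headroom for KV cache and buffers.

```python
GIB = 2**30


def estimated_weight_bytes(param_count: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in bytes."""
    return param_count * bits_per_weight / 8


def fits_in_vram(weight_bytes: float, vram_bytes: float, headroom_bytes: float) -> bool:
    """True if the weights plus some headroom fit in the given VRAM."""
    return weight_bytes + headroom_bytes <= vram_bytes


PARAMS = 27.2e9             # assumed parameter count for gemma2:27b
Q4_0_BITS_PER_WEIGHT = 4.5  # 18 bytes per block of 32 weights

size_gib = estimated_weight_bytes(PARAMS, Q4_0_BITS_PER_WEIGHT) / GIB
print(f"estimated weights: {size_gib:.1f} GiB")

# Using the 14.55 GiB figure reported above against a 24 GiB card,
# with 2 GiB of headroom (an illustrative guess, not a measurement).
print(fits_in_vram(14.55 * GIB, 24 * GIB, 2 * GIB))
```

The estimate comes out a little over 14 GiB, slightly under the reported 14.55 GiB, presumably because not every tensor is stored at 4.5 bits. Either way it leaves room on a 24 GB card.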


gemma2:27b-instruct-q4_0 (checksum 53261bc9c192)


7800X3D, 32GB DDR5, 4090:

    total duration:       10.5922028s
    load duration:        21.1739ms
    prompt eval count:    36 token(s)
    prompt eval duration: 546ms
    prompt eval rate:     65.93 tokens/s
    eval count:           467 token(s)
    eval duration:        10.023s
    eval rate:            46.59 tokens/s


I think we are still somewhat at the “fuzzy super early adopter” stage of this local LLM game, and hard data is not going to be easy to come by. I almost want to use the phrase “hobbyist stage”, where almost all of the “data” and “best practice” is anecdotal, but I think we are a step above that.

Still, it’s way too early, and there are simply way too many hardware and software combinations that change almost weekly, to establish “the best practice hardware configuration for training / inferencing large language models locally”.

Some day there will be established guides with solid data. In fact, someday there will be PCs that specifically target LLMs and feature all kinds of stats aimed at getting you to bust out your wallet. And I even predict they’ll come up with metrics that all the players will chase well beyond when those metrics make sense (megapixels, clock frequency, etc.)… but we aren’t there yet!


> I think we are somewhat still at the “fuzzy super early adopter” stage of this local LLM game and hard data is not going to be easy to come by.

What's hard about it? You get the hardware, you run the software, you take measurements.


Yes, but we don't have enough people doing that to get quality data. Not many people are building this kind of setup, and even fewer are publishing their results. Additionally, if I just run a test a couple of times and then average the results, that is still far from a solid measurement.
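
One cheap way to make a handful of runs more useful is to report the spread along with the average. A minimal sketch using Python's standard `statistics` module; the sample rates are made-up placeholders, not measurements from this thread, and `summarize` is a hypothetical helper name:

```python
import statistics


def summarize(rates: list[float]) -> dict[str, float]:
    """Summarize repeated tokens/s measurements of the same setup."""
    if len(rates) < 2:
        raise ValueError("need at least two runs to estimate the spread")
    mean = statistics.mean(rates)
    stdev = statistics.stdev(rates)  # sample standard deviation
    return {
        "n": len(rates),
        "mean": mean,
        "stdev": stdev,
        "cv_percent": 100 * stdev / mean,  # spread relative to the mean
    }


# Placeholder values for illustration only.
runs = [44.0, 43.1, 45.2, 44.6, 43.8]
summary = summarize(runs)
print(f"{summary['mean']:.2f} ± {summary['stdev']:.2f} tokens/s over {summary['n']} runs")
```

A large relative spread is a hint that something else (thermal throttling, background load, a cold cache) is affecting the runs.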


> but we don't have enough people doing that to get quality data

But how are we supposed to get enough people doing those things if everyone says "There isn't enough data right now for it to be useful"? We have to start somewhere.


I don't think they're saying anything counter to that. The people who don't require the volume of data will run these, i.e. the super early adopters.


We've already started, we just haven't finished yet


Right, but how are we supposed to be getting anywhere else unless people start being more specific and stop leaning on anecdotes or repeating what they've heard elsewhere?

Saying "Apple seems to be somewhat equal to this other setup" doesn't really help anyone get an accurate picture of whether it is equal or not, unless we start including raw numbers, even if they aren't directly comparable.

I don't think it's too early to say "I get X tokens/second with this setup + these settings" because then we can at least start comparing, instead of just guessing which seems to be the current SOTA.


A great thread with the type of info you're looking for lives here: https://github.com/ggerganov/whisper.cpp/issues/89

But you can likely find similar threads for the llama.cpp benchmark here: https://github.com/ggerganov/llama.cpp/tree/master/examples/...

These are good examples because the llama.cpp and whisper.cpp benchmarks take full advantage of the Apple hardware but also take full advantage of non-Apple hardware with GPU support, AVX support etc.

It’s been true for a while now that the memory bandwidth of modern Apple systems, in tandem with the neural cores and GPU, has made them very competitive with Nvidia for local inference and even basic training.


I guess I'm mostly lamenting how unscientific these discussions are in general, on HN and elsewhere (besides specific GitHub repositories). Every community is filled with anecdotal stories, or with numbers that leave out the settings, model, and runtime details people would need to at least compare them to something.

Still, thanks for the links :)


In fairness, it’s more difficult now than ever before. A comparable result needs all of these:

* hardware spec

* inference engine

* specific model - differences in tokenizer will make models faster/slower with equivalent parameter count

* quantization used - and you need to be aware of hardware specific optimizations for particular quants

* kv cache settings

* input context size

* output token count

This is probably not a complete list either.
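
The list above maps fairly naturally onto a small record that could be pasted alongside any number. A sketch only: the class and field names are invented for illustration, and the values are the M1 Max run reported earlier in this thread (with the q4_0 quantization and Ollama 0.5.7 as stated in the follow-up comments).

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class BenchmarkReport:
    """One benchmark result plus the context needed to compare it."""
    hardware: str
    inference_engine: str
    model: str
    quantization: str
    kv_cache_type: str
    prompt_tokens: int
    output_tokens: int
    prompt_tokens_per_s: float
    output_tokens_per_s: float


report = BenchmarkReport(
    hardware="M1 Max 64GB",
    inference_engine="Ollama 0.5.7",
    model="gemma2:27b",
    quantization="q4_0",
    kv_cache_type="q8_0",
    prompt_tokens=37,
    output_tokens=441,
    prompt_tokens_per_s=38.42,
    output_tokens_per_s=18.44,
)

# JSON keeps the report both human-readable and easy to aggregate later.
print(json.dumps(asdict(report), indent=2))
```

Anything missing from a report (for example an unknown KV cache setting) is then visible as a gap instead of being silently assumed.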


Best place to get that kinda info is gonna be /r/LocalLlama


As someone who is paying $0.50 per kWh, I'd also like to include kWh per 1000 tokens or something similar, to give me a sense of the cost of ownership of these local systems.
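
The arithmetic for that metric is simple once you have an average power draw. A rough sketch: the 350 W figure is a hypothetical placeholder (nobody in this thread measured power), while 44.04 tokens/s is the 3090 eval rate reported above and $0.50 is the price quoted in this comment. It also ignores idle draw and prompt processing.

```python
JOULES_PER_KWH = 3_600_000  # 1 kWh = 1000 W * 3600 s


def kwh_per_1000_tokens(power_watts: float, tokens_per_second: float) -> float:
    """Energy used to generate 1000 tokens at a steady power draw."""
    seconds = 1000 / tokens_per_second
    return power_watts * seconds / JOULES_PER_KWH


def cost_per_1000_tokens(power_watts: float, tokens_per_second: float,
                         price_per_kwh: float) -> float:
    """Electricity cost for 1000 tokens, in the currency of price_per_kwh."""
    return kwh_per_1000_tokens(power_watts, tokens_per_second) * price_per_kwh


ASSUMED_POWER_WATTS = 350.0  # hypothetical whole-system draw under load
energy = kwh_per_1000_tokens(ASSUMED_POWER_WATTS, 44.04)
cost = cost_per_1000_tokens(ASSUMED_POWER_WATTS, 44.04, 0.50)
print(f"{energy:.5f} kWh and ${cost:.5f} per 1000 tokens")
```

To make this a real number, the power draw would need to be measured at the wall during generation.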


That would be an awesome thing across the industry -- even for the big commercial models -- for those who care not only about price but also carbon footprint.


Per the screenshot, this is DeepSeek running on a 192GB M2 Studio https://nitter.poast.org/ggerganov/status/188461277009384272...

The same on Nvidia (various models) https://github.com/ggerganov/llama.cpp/issues/11474

[1] this is the model: https://huggingface.co/unsloth/DeepSeek-R1-GGUF/tree/main/De...


So Apple M2 Studio does ~15 tks/second and A100-SXM4-80GB does 9 tks/second?

I'm not sure if I'm reading the results wrong or missing some vital context, but that sounds unlikely to me.


The studio has a lot more ram available to the GPU (up to 192gb) than the a100 (80gb), and iirc at least comparable memory bandwidth -- those are what matter when you're doing LLM inference, so the studio tends to win out there.

Where the a100 and other similar chips dominate is in training &c, which is mostly a question of flops.


> and iirc at least comparable memory bandwidth

I don't think they do.

From Wikipedia:

> the M2 Pro, M2 Max, and M2 Ultra have approximately 200 GB/s, 400 GB/s, and 800 GB/s respectively

From techpowerup:

> NVIDIA A100 SXM4 80 GB - Memory bandwidth - 2.04 TB/s

That is roughly 2.5x the M2 Ultra's bandwidth and about 10x the M2 Pro's, and that's just the bandwidth.
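
To put the bandwidth figures in context, a common rule of thumb is that single-stream token generation has to read all the weights once per token, so memory bandwidth divided by model size gives a rough upper bound on tokens/s. A sketch under that assumption, using the bandwidth numbers quoted above and the 14.55 GiB gemma2:27b size from earlier in the thread. It only applies when the whole model fits in the device's memory, which may not hold for the 80 GB A100 with the DeepSeek model discussed here.

```python
GIB = 2**30


def decode_upper_bound_tokens_per_s(bandwidth_bytes_per_s: float,
                                    weight_bytes: float) -> float:
    """Rough ceiling on tokens/s if every token reads all weights once."""
    return bandwidth_bytes_per_s / weight_bytes


M2_ULTRA_BW = 800e9   # ~800 GB/s, per the Wikipedia quote above
A100_BW = 2.04e12     # 2.04 TB/s, per the techpowerup quote above
WEIGHTS = 14.55 * GIB  # gemma2:27b q4_0 size reported in this thread

m2_bound = decode_upper_bound_tokens_per_s(M2_ULTRA_BW, WEIGHTS)
a100_bound = decode_upper_bound_tokens_per_s(A100_BW, WEIGHTS)
print(f"M2 Ultra ceiling: {m2_bound:.0f} tokens/s")
print(f"A100 ceiling:     {a100_bound:.0f} tokens/s")
print(f"ratio:            {a100_bound / m2_bound:.2f}x")
```

Real throughput lands below these ceilings, and once part of a model spills out of GPU memory the slower link dominates, which could explain an A100 result that looks worse than the Studio's.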



