> likely could outperform this setup in terms of tokens per second
I've heard arguments both for and against this, but they always lack concrete numbers.
I'd love something like "Here is Qwen2.5 at Q4 quantization running via Ollama + these settings, and M4 24GB RAM gets X tokens/s while RTX 3090ti gets Y tokens/s", otherwise we're just propagating mostly anecdotes without any reality-checks.
Fast prompt eval is important when feeding larger contexts into these models, which is required for almost anything useful. GPUs have other advantages for traditional ML, whisper models, vision, and image generation. There's a lot of flexibility that doesn't really get discussed when folks trot out the 'just buy a mac' line.
Anecdotally I can share my revealed preference. I have both an M3 (36gb) as well as a GPU machine, and I went through the trouble of putting my GPU box online because it was so much faster than the mac. And doubling up the GPUs allows me to run models like the deepseek-tuned llama 3.3, with which I have completely replaced my use of chatgpt 4o.
Thanks for numbers! People should include their LLM runner as well I think, as there are differences in hardware optimization support. Like I haven't tested it but I've heard MLX is noticeably faster than Ollama on Macs.
What quantization are you using? What's the runtime+version you run this with? And the rest of the settings?
Edit: Turns out parent is using Q4 for their test. Doing the same test with LM Studio and a 3090ti + Ryzen 5950X (with 44 layers on GPU, 2 on CPU) I get ~15 tokens/second.
When you run that, what quantization do you get? The library website of Ollama (https://ollama.com/library/gemma2:27b) isn't exactly good at surfacing useful information like what the default quantization is.
If you hit the drop-down menu for the size of the model, then tap “view all”, you will see the size and hash of the model you have selected and can compare it to the full list below it that has the quantization specs in the name.
The model displayed in the drop-down when you access the web library is the default that will be pulled. Compare the size and hash to the more detailed model listing below it and you will see what quantization you have.
Example: the default model weights for Llama 3.3 70b, after hitting “view all”, have this hash and size listed next to them: a6eb4748fd29 • 43GB
Now scroll down through the list and you will find the one that matches that hash and size is “70b-instruct-q4_K_M”. That tells you that the default weights for Llama 3.3 70B from Ollama are 4-bit quantized (q4) while the “K_M” tells you a bit about what techniques were used during quantization to balance size and performance.
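If you'd rather not eyeball the tag, here is a minimal sketch of pulling the quantization out of a tag name like the one above. `parse_quant` is a hypothetical helper, not part of Ollama, and it assumes the quantization is the last dash-separated part of the tag and looks like `q<bits>` with an optional `_<variant>` suffix.

```python
import re


def parse_quant(tag):
    """Return quantization info from a tag such as '70b-instruct-q4_K_M'.

    Assumption: the quantization label is the last dash-separated part.
    Returns None when the last part does not look like 'q<bits>[_variant]'.
    """
    last = tag.rsplit("-", 1)[-1]
    match = re.fullmatch(r"[qQ](\d+)(?:_(\w+))?", last)
    if match is None:
        return None
    return {
        "label": last,                 # e.g. 'q4_K_M'
        "bits": int(match.group(1)),   # e.g. 4
        "variant": match.group(2),     # e.g. 'K_M', or None if absent
    }


print(parse_quant("70b-instruct-q4_K_M"))
```

Tags without a quantization suffix (for example a bare `70b`) return `None`, which is exactly the case where you still have to compare hash and size by hand.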
Thanks, that seems to indicate Q4 for the quantization. You can probably run that on the 4090 as well, FWIW; the model is just 14.55 GiB.
I think we are somewhat still at the “fuzzy super early adopter” stage of this local LLM game and hard data is not going to be easy to come by. I almost want to use the words “hobbyist stage”, where almost all of the “data” and “best practice” is anecdotal, but I think we are a step above that.
Still, it’s way too early and there are simply way too many hardware and software combinations that change almost weekly to establish “the best practice hardware configuration for training / inferencing large language models locally”.
Some day there will be established guides with solid data. In fact, someday there will be PCs that specifically target LLMs and will feature all kinds of stats aimed at getting you to bust out your wallet. And I even predict they’ll come up with metrics that all the players will chase well beyond when those metrics make sense (megapixels, clock frequency, etc)… but we aren’t there yet!
Yes, but we don't have enough people doing that to get quality data. Not many people are building this kind of setup, and even fewer are publishing their results. Additionally, if I just run a test a couple of times and then average the results, this is still far from a solid measurement.
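For what it's worth, even a handful of runs gets more useful if you post the spread along with the average. A rough sketch, assuming you already have a list of tokens/s readings from repeated runs of the same prompt and settings (the numbers below are placeholders, not measurements):

```python
import statistics


def summarize(samples):
    """Summarize repeated tokens/s readings so others can judge the noise."""
    return {
        "n": len(samples),
        "mean": statistics.mean(samples),
        # Sample standard deviation needs at least two readings.
        "stdev": statistics.stdev(samples) if len(samples) > 1 else 0.0,
        "min": min(samples),
        "max": max(samples),
    }


# Placeholder readings, NOT real benchmark results.
runs = [14.0, 15.0, 16.0]
print(summarize(runs))
```

It's not a rigorous benchmark, but "mean, stdev, n" alongside the settings lets someone else tell whether a difference between two setups is bigger than run-to-run noise.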
> but we don't have enough people doing that to get quality data
But how are we supposed to get enough people doing those things if everyone says "There isn't enough data right now for it to be useful"? We have to start somewhere.
Right, but how are we supposed to be getting anywhere else unless people start being more specific and stop leaning on anecdotes or repeating what they've heard elsewhere?
Saying "Apple seems to be somewhat equal to this other setup" doesn't really help anyone get an accurate picture of whether it is equal or not, unless we start including raw numbers, even if they aren't directly comparable.
I don't think it's too early to say "I get X tokens/second with this setup + these settings" because then we can at least start comparing, instead of just guessing which seems to be the current SOTA.
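To make "X tokens/second with this setup + these settings" concrete, here is a small sketch for Ollama users. It relies on the timing fields Ollama includes in the final response of its generate API (`prompt_eval_count`, `prompt_eval_duration`, `eval_count`, `eval_duration`, with durations in nanoseconds). The `report` format and the sample values are my own assumptions, just placeholders to show the shape of a postable result.

```python
NS_PER_SECOND = 1_000_000_000  # Ollama reports durations in nanoseconds


def tokens_per_second(resp):
    """Compute prompt-eval and generation speed from an Ollama response dict."""
    prompt_tps = resp["prompt_eval_count"] / (resp["prompt_eval_duration"] / NS_PER_SECOND)
    gen_tps = resp["eval_count"] / (resp["eval_duration"] / NS_PER_SECOND)
    return {"prompt_tps": prompt_tps, "generation_tps": gen_tps}


def report(hardware, runtime, model, quant, resp):
    """Format one comparable line: setup, settings, and measured speeds."""
    tps = tokens_per_second(resp)
    return (
        f"{hardware} | {runtime} | {model} {quant} | "
        f"prompt {tps['prompt_tps']:.1f} tok/s | gen {tps['generation_tps']:.1f} tok/s"
    )


# Placeholder response values, NOT a real measurement.
sample = {
    "prompt_eval_count": 1000,
    "prompt_eval_duration": 2_000_000_000,
    "eval_count": 300,
    "eval_duration": 20_000_000_000,
}
print(report("<your hardware>", "<runtime + version>", "<model>", "<quant>", sample))
```

Reporting prompt eval and generation separately matters because the two can differ a lot between setups, and lumping them together hides that.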
These are good examples because the llama.cpp and whisper.cpp benchmarks take full advantage of the Apple hardware but also take full advantage of non-Apple hardware with GPU support, AVX support etc.
It’s been true for a while now that the memory bandwidth of modern Apple systems in tandem with the neural cores and GPU has made them very competitive with Nvidia for local inference and even basic training.
I guess I'm mostly lamenting how unscientific these discussions are in general, on HN and elsewhere (besides specific GitHub repositories). Every community is filled with just anecdotal stories, or with some numbers that leave out a bunch of settings + model + runtime details, so people can't even compare them to anything.
As someone who is paying $0.50 per kWh, I'd also like to include kWh per 1000 tokens or something similar, to give me a sense of the cost of ownership of these local systems.
That would be an awesome thing across the industry -- even for the big commercial models -- for those who care not only about price but also carbon footprint.
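The energy-per-1000-tokens figure is easy to derive if you know the average power draw while generating and the generation speed. A back-of-the-envelope sketch; the wattage and speed below are made-up placeholders, and it assumes power draw is roughly constant during generation and ignores idle consumption:

```python
JOULES_PER_KWH = 3_600_000  # 1 kWh = 1000 W * 3600 s


def kwh_per_1000_tokens(avg_watts, tokens_per_second):
    """Energy used to generate 1000 tokens at a steady power draw."""
    seconds = 1000 / tokens_per_second
    return avg_watts * seconds / JOULES_PER_KWH


def cost_per_1000_tokens(avg_watts, tokens_per_second, price_per_kwh):
    """Electricity cost of 1000 tokens at the given price per kWh."""
    return kwh_per_1000_tokens(avg_watts, tokens_per_second) * price_per_kwh


# Placeholder inputs, NOT measurements: 360 W at the wall, 10 tokens/s, $0.50 per kWh.
print(kwh_per_1000_tokens(360, 10))
print(cost_per_1000_tokens(360, 10, 0.50))
```

Measuring the wattage at the wall (rather than trusting a GPU's reported power) would capture the whole system, which is what you actually pay for.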
The studio has a lot more ram available to the GPU (up to 192gb) than the a100 (80gb), and iirc at least comparable memory bandwidth -- those are what matter when you're doing LLM inference, so the studio tends to win out there.
Where the a100 and other similar chips dominate is in training &c, which is mostly a question of flops.
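A common rule of thumb behind the "bandwidth is what matters" argument: for single-stream generation with a dense model, every new token has to read roughly all the weights once, so memory bandwidth divided by model size gives a rough ceiling on tokens/s. A sketch of that arithmetic, with the caveat that it is only an upper bound under that assumption and the bandwidth figure below is a placeholder rather than any specific chip's spec:

```python
BYTES_PER_GIB = 2**30
BYTES_PER_GB = 10**9


def gib_to_gb(size_gib):
    """Convert GiB (binary) to GB (decimal), since bandwidth is usually quoted in GB/s."""
    return size_gib * BYTES_PER_GIB / BYTES_PER_GB


def max_decode_tps(bandwidth_gb_s, model_size_gb):
    """Rough upper bound on generation speed if each token reads all weights once."""
    return bandwidth_gb_s / model_size_gb


# Placeholder bandwidth (1000 GB/s) with a 14.55 GiB model file.
print(max_decode_tps(1000, gib_to_gb(14.55)))
```

Real numbers land below this ceiling, and the model also has to fit in the memory the GPU can address in the first place, which is where the larger unified memory comes in.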