
I love how well Intel's Arc iGPU and AMD's Strix Point iGPU are doing. I am planning to get an iGPU laptop with 64 GB of RAM. I plan on using local LLMs and image generators, and hopefully with that much shared RAM that shouldn't be too much of a problem. But I am worried that all LLM tools today are pretty much Nvidia-specific, and I wouldn't be able to get my local setup going.


I've noticed some BIOSes do not allow the full capacity of unified memory to be allocated, so check before you buy: some let you allocate 16GB, while others are limited to 2 or 4GB, seemingly unnecessarily.


Apparently this is a legacy holdover and you should choose the smallest size in the BIOS. Fully unified memory is the norm now; you don't need to do static memory splitting that way.


You'll be limited by memory bandwidth more than compute.


Anyone who uses a CPU for inference is severely compute constrained. Nobody cares about tokens per second the moment inference is faster than you can read, but staring down a blank screen for 5 minutes? Yikes.


Just as a point of reference, this is what a 65W power-limited 7940HS (Radeon 780M) with 64GB of DDR5-5600 looks like w/ a 7B Q4_K_M model atm w/ llama.cpp. While it's not amazing, at 240 t/s prefill, it means that at 4K context, you'll wait about 17 seconds before token generation starts, which isn't awful. The 890M should have about 20% better compute, so roughly 290 t/s prefill, and with LPDDR5-7500/8000, you should get to about 20 t/s generation.

  ./llama-bench -m /data/ai/models/llm/gguf/mistral-7b-instruct-v0.1.Q4_K_M.gguf
  ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
  ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
  ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon 780M, compute capability 11.0, VMM: no
  | model                          |       size |     params | backend    | ngl |          test |              t/s |
  | ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | ---------------: |
  | llama 7B Q4_K - Medium         |   4.07 GiB |     7.24 B | ROCm       |  99 |         pp512 |    242.69 ± 0.99 |
  | llama 7B Q4_K - Medium         |   4.07 GiB |     7.24 B | ROCm       |  99 |         tg128 |     15.33 ± 0.03 |

  build: e11bd856 (3620)
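
To sanity-check those numbers: the wait before the first token is roughly prompt tokens divided by prefill speed, and streaming the reply is output tokens divided by generation speed. A quick back-of-the-envelope sketch using the figures from the llama-bench run above:

```python
# Rough latency estimates from llama-bench numbers (pp = prefill, tg = generation).
def time_to_first_token(prompt_tokens, pp_tps):
    """Seconds spent processing the prompt before the first token appears."""
    return prompt_tokens / pp_tps

def generation_time(output_tokens, tg_tps):
    """Seconds to stream out the response after prefill finishes."""
    return output_tokens / tg_tps

# 780M figures from the benchmark above: ~243 t/s prefill, ~15 t/s generation.
ttft = time_to_first_token(4096, 242.69)  # 4K context
gen = generation_time(256, 15.33)         # a 256-token reply

print(f"time to first token: {ttft:.1f}s")
print(f"generation: {gen:.1f}s")
```

Which works out to the ~17 second wait mentioned above, plus a similar amount of time again to stream a modest reply.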


> you'll wait about 17 seconds before token generation starts, which isn't awful

Let’s be honest, it might not be awful but it’s a nonstarter for encouraging local LLM adoption, and most will prefer to pay pennies for API access instead (friction aside).


I don't know why anyone would think a meh performing iGPU would encourage local LLM adoption at all? A 7B local model is already not going to match frontier models for many use cases - if you don't care about using a local model (don't have privacy or network concerns) then I'd argue you probably should use an API. If you care about using a capable local LLM comfortably, then you should get as powerful a dGPU as your power/dollar budget allows. Your best bang/buck atm will probably be Nvidia consumer Ada GPUs (or used Ampere models).

However, for anyone looking to use a local model on a chip with the Radeon 890M:

- look into implementing (or waiting for) NPU support - XDNA2's 50 TOPS should provide more raw compute than the 890M for tensor math (w/ Block FP16)

- use a smaller, more appropriate model for your use case - 3B or smaller can fulfill most simple requests, and will of course be faster

- don't use long conversations - a fresh conversation starts with 0 context and nothing to prefill, so there's no waiting

- use `cache_prompt` - for bs=1 interactive use, you can save input/generations to the cache so they don't have to be reprocessed
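
As a sketch of that last point: llama.cpp's server accepts a `cache_prompt` flag in its `/completion` request body, so a repeated prefix (e.g., a fixed system prompt plus earlier turns) doesn't get re-prefilled on every request. A minimal example, assuming a llama-server at its default localhost:8080 address (adjust for your setup):

```python
import json
from urllib import request

# Request body for llama.cpp's /completion endpoint; cache_prompt=True asks the
# server to reuse its KV cache for the longest matching prefix of the prompt.
payload = {
    "prompt": "You are a helpful assistant.\nUser: What is GTT memory?\nAssistant:",
    "n_predict": 128,
    "cache_prompt": True,
}

body = json.dumps(payload).encode()
req = request.Request(
    "http://localhost:8080/completion",  # default llama-server address (assumption)
    data=body,
    headers={"Content-Type": "application/json"},
)

# Uncomment to actually send (requires a running llama-server):
# with request.urlopen(req) as resp:
#     print(json.loads(resp.read())["content"])
print(body.decode())
```

On the next turn, resend the prompt with the new text appended; only the unseen suffix should need prefill.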


For a lot of use cases it is actually awful


The problem is memory bandwidth. There is a reason Apple MacBooks do relatively well with LLMs: it’s not that the GPU is any better than Zen 5, but 4-6x the memory bandwidth is huge (~80 GB/s vs ~400 GB/s).
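
The rule of thumb behind this: batch-1 token generation is memory-bound, since every generated token has to stream all of the model's weights through memory once, so the ceiling is roughly bandwidth divided by model size. A quick sketch with the ballpark figures above and the ~4 GB Q4 7B model from the earlier benchmark (bandwidth numbers are approximate):

```python
def max_tg_tps(bandwidth_gb_s, model_size_gb):
    """Upper bound on generation tokens/s for memory-bound inference:
    each token requires reading all weights once."""
    return bandwidth_gb_s / model_size_gb

model_gb = 4.07  # 7B Q4_K_M, size taken from the llama-bench output above

print(f"~80 GB/s (dual-channel DDR5):   {max_tg_tps(80, model_gb):.0f} t/s")
print(f"~400 GB/s (Apple M-series Max): {max_tg_tps(400, model_gb):.0f} t/s")
```

The ~20 t/s ceiling for the DDR5 case lines up with the ~15 t/s measured on the 780M above (real systems don't hit the theoretical peak).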


>Nobody cares about tokens per second the moment inference is faster than you can read, but staring down a blank screen for 5 minutes? Yikes.

I don't think so. Humans scan for keywords very often; nobody really reads every word. Faster-than-reading-speed inference is definitely beneficial.


And thank you for making me conscious of my reading while reading your comment. May you become aware of your breathing.


Can they access the full RAM? AFAIK they get capped to a portion of total available RAM.

But to your other point, very little of the current popular ML stack supports more than CUDA and MPS. Some will do ROCm, but I don’t know if the AMD iGPUs are guaranteed to support it? There’s not much for Intel GPUs.


It depends on the API used, whether the data is in the region considered "GPU memory" or whether it's shared with the compute API from the app's memory space. Support is somewhat in flux and I haven't been following closely, but if you're curious, this is my bookmarked jumping-off point (a PyTorch ticket about this):

https://github.com/pytorch/pytorch/issues/107605


My understanding is that as of Linux 6.10, the driver will now dynamically allocate more memory for the iGPU [1]. The driver team apparently reused a strategy that had been developed for MI300A.

I'm hoping that in combination with the gfx11-generic ISA introduced in LLVM 18, this will make it straightforward to enable compute applications on both Phoenix and Strix (even if they are not officially supported by ROCm).

[1]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...


One issue is that even if you are using GTT (dynamically allocated memory), this is still limited as a percentage of total RAM. Eg, currently on my 7940HS, I have 64GB of memory, 8GB of dedicated VRAM (GART), and then a limit of 28GB of GTT - there is an amdgpu.gttsize parameter to "Restrict the size of GTT domain in MiB for testing. The default is -1 (It’s VRAM size if 3GB < VRAM < 3/4 RAM, otherwise 3/4 RAM size)", but I'm not sure what the practical/effective limits are.
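
If you want to see what your system actually allows, the amdgpu driver exposes these limits via sysfs as `mem_info_vram_total` and `mem_info_gtt_total` (in bytes) under the card's device directory. A small sketch, assuming the typical `/sys/class/drm/card0/device` path (the card index may differ on your system):

```python
from pathlib import Path

def read_mem_info(path):
    """Parse an amdgpu mem_info_* sysfs file (a byte count) and return GiB."""
    raw = Path(path).read_text().strip()
    return int(raw) / (1024 ** 3)

if __name__ == "__main__":
    dev = Path("/sys/class/drm/card0/device")  # card index may differ
    for name in ("mem_info_vram_total", "mem_info_gtt_total"):
        f = dev / name
        if f.exists():
            print(f"{name}: {read_mem_info(f):.1f} GiB")
        else:
            print(f"{name}: not found (not an amdgpu device?)")
```

On the 64GB / 8GB VRAM setup described above, this should report roughly 8 GiB VRAM and 28 GiB GTT.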


FWIW, it’s one of the patches linked in that issue.

However the commits do have some caveats called out, as do the techniques they use to achieve the higher allocations.


I appreciate the link. The GTT and SDMA tricks mentioned there don’t really increase the shared RAM use, imho. They just increase the virtual memory the GPU can address, but with several tradeoffs in terms of allocation and copy operations.

As an aside, it just feels like a lot of hacks that AMD and Intel should have handled ages ago for their iGPUs instead of letting them languish.


> Some will do rocm but I don’t know if the AMD iGPUs are guaranteed to support it?

If you only care about inference, llama.cpp supports Vulkan on any iGPU with Vulkan drivers. On my laptop with a crap BIOS that does not allow changing any video RAM settings, reserved "VRAM" is 2GB, but llama.cpp-vulkan can access 16GB of "VRAM" (half of physical RAM). 16GB is sufficient to run any model that has even remotely practical execution speed on my bottom-of-the-line Ryzen 3 3250U (Picasso/Raven 2); you can always offload some layers to the CPU to run even larger models.

Vulkan support (on Debian stable):

  apt install libvulkan1 mesa-vulkan-drivers vulkan-tools

Build deps for llama.cpp:

  apt install libshaderc-dev glslang-dev libvulkan-dev

Build llama.cpp with the Vulkan back-end (the `make clean` is in case you previously built with a different back-end):

  make clean
  make LLAMA_VULKAN=1

If you have more than one GPU, set GGML_VK_VISIBLE_DEVICES to the indices of the devices you want, e.g.:

  export GGML_VK_VISIBLE_DEVICES=0,1,2

The indices correspond to the device order in:

  vulkaninfo --summary

By default llama.cpp will only use the first device it finds.

llama.cpp-vulkan has worked really well for me. But, per benchmarks from back when Vulkan support was first released, the CUDA back-end was faster than the Vulkan back-end on NVIDIA GPUs; probably the same for ROCm vs Vulkan on AMD. But zero non-free / binary blobs are required for Vulkan, and Vulkan supports more devices (e.g., my iGPU is not supported by ROCm). I haven't tried, but you can probably even mix GPUs from different manufacturers using Vulkan.


Be careful: most BIOSes will let you use only 1/4 of the total RAM for the integrated GPU. Some really bad BIOSes even limit it to 2GB, totally ignoring how much RAM is available.


I set up both Stable Diffusion and LLMs on my desktop without an Nvidia GPU. Everything works well. Stable Diffusion runs on an ONNX backend on my AMD GPU, and LLMs run in GGUF format through ollama on the CPU, though model scale and speed are limited.


The problem here is the slow memory: the iGPU is already limited by slow RAM, and with LLMs, memory bandwidth is king.



