UMA mode uses HIP runtime managed memory, which is only efficient when the device is in xnack+ mode. That said, as the flag name suggests, using managed memory never makes sense for llama.cpp on a device with dedicated video memory.
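To check which mode a device is in, the target ID on each `ggml_cuda_init` device line can be read directly (for example `gfx906:sramecc+:xnack-` in the logs from the question). A minimal sketch, assuming the log format shown in this thread; the helper name `xnack_modes` is made up for illustration and is not part of llama.cpp:

```python
import re

# Matches a ggml_cuda_init device line and captures the device index,
# the gfx architecture, and the trailing ":feature+/-" flags, e.g.
# "Device 0: AMD Instinct MI50/MI60, gfx906:sramecc+:xnack- (0x906), ..."
DEVICE_RE = re.compile(r"Device (\d+): .*?(gfx[0-9a-f]+)((?::\w+[+-])*)")


def xnack_modes(log_text):
    """Hypothetical helper: map device index -> True (xnack+),
    False (xnack-), or None (xnack not reported on that line)."""
    modes = {}
    for match in DEVICE_RE.finditer(log_text):
        index = int(match.group(1))
        features = match.group(3)
        if ":xnack+" in features:
            modes[index] = True
        elif ":xnack-" in features:
            modes[index] = False
        else:
            modes[index] = None
    return modes


if __name__ == "__main__":
    sample = (
        "ggml_cuda_init: found 2 ROCm devices:\n"
        "Device 0: AMD Instinct MI50/MI60, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64\n"
        "Device 1: AMD Instinct MI50/MI60, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64\n"
    )
    for index, enabled in xnack_modes(sample).items():
        print(f"device {index}: xnack enabled = {enabled}")
```

If every device reports `xnack-`, as in the logs in this thread, the managed-memory path is running in the mode described above as inefficient.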
Hi, I am trying out LLMs with multiple AMD MI50 cards. I compiled llama.cpp in a few configurations to compare performance and ran into unexpected behaviour.
First, with a ROCm build and a varying number of offloaded layers, the results look roughly as expected: performance increases with the number of offloaded layers:
bin/llama-bench -t 24 -m ~/models/DeepSeek-R1-Distill-Llama-8B-Q8_0/DeepSeek-R1-Distill-Llama-8B-Q8_0.gguf -ngl 0,10,20,30,40,50,60,70 -ts 1/1/1/1 --main-gpu 0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 4 ROCm devices:
Device 0: AMD Instinct MI50/MI60, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
Device 1: AMD Instinct MI50/MI60, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
Device 2: AMD Instinct MI50/MI60, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
Device 3: AMD Instinct MI50/MI60, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
Then, compiling with GGML_HIP_UMA=1, I get the following: pp512 increases with the number of offloaded layers, but tg128 decreases:
bin/llama-bench -t 24 -m ~/models/DeepSeek-R1-Distill-Llama-8B-Q8_0/DeepSeek-R1-Distill-Llama-8B-Q8_0.gguf -ngl 0,10,20,30,40,50,60,70 -ts 1/1/1/1 --main-gpu 0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 4 ROCm devices:
Device 0: AMD Instinct MI50/MI60, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
Device 1: AMD Instinct MI50/MI60, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
Device 2: AMD Instinct MI50/MI60, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
Device 3: AMD Instinct MI50/MI60, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
Is this effect expected? It looks odd to me.