Question about GGML_HIP_UMA performance #11960

cdrfvb · 2025-02-19T14:47:33Z

Hi, I am trying out LLMs on multiple AMD MI50 cards. I compiled llama.cpp in a few configurations to compare performance and ran into unexpected behaviour.
First, with a ROCm build and varying numbers of offloaded layers, the results look roughly as expected, with performance increasing as more layers are offloaded:

bin/llama-bench -t 24 -m ~/models/DeepSeek-R1-Distill-Llama-8B-Q8_0/DeepSeek-R1-Distill-Llama-8B-Q8_0.gguf -ngl 0,10,20,30,40,50,60,70 -ts 1/1/1/1 --main-gpu 0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 4 ROCm devices:
Device 0: AMD Instinct MI50/MI60, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
Device 1: AMD Instinct MI50/MI60, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
Device 2: AMD Instinct MI50/MI60, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
Device 3: AMD Instinct MI50/MI60, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64

model	size	params	backend	threads	ts	test	t/s
llama 8B Q8_0	7.95 GiB	8.03 B	ROCm,BLAS,RPC	24	1.00/1.00/1.00/1.00	pp512	25.57 ± 0.36
llama 8B Q8_0	7.95 GiB	8.03 B	ROCm,BLAS,RPC	24	1.00/1.00/1.00/1.00	tg128	12.33 ± 0.00
llama 8B Q8_0	7.95 GiB	8.03 B	ROCm,BLAS,RPC	24	1.00/1.00/1.00/1.00	pp512	36.55 ± 0.49
llama 8B Q8_0	7.95 GiB	8.03 B	ROCm,BLAS,RPC	24	1.00/1.00/1.00/1.00	tg128	15.39 ± 0.04
llama 8B Q8_0	7.95 GiB	8.03 B	ROCm,BLAS,RPC	24	1.00/1.00/1.00/1.00	pp512	64.78 ± 0.68
llama 8B Q8_0	7.95 GiB	8.03 B	ROCm,BLAS,RPC	24	1.00/1.00/1.00/1.00	tg128	20.84 ± 0.11
llama 8B Q8_0	7.95 GiB	8.03 B	ROCm,BLAS,RPC	24	1.00/1.00/1.00/1.00	pp512	263.92 ± 15.06
llama 8B Q8_0	7.95 GiB	8.03 B	ROCm,BLAS,RPC	24	1.00/1.00/1.00/1.00	tg128	30.56 ± 0.24
llama 8B Q8_0	7.95 GiB	8.03 B	ROCm,BLAS,RPC	24	1.00/1.00/1.00/1.00	pp512	776.48 ± 0.98
llama 8B Q8_0	7.95 GiB	8.03 B	ROCm,BLAS,RPC	24	1.00/1.00/1.00/1.00	tg128	47.92 ± 0.01
llama 8B Q8_0	7.95 GiB	8.03 B	ROCm,BLAS,RPC	24	1.00/1.00/1.00/1.00	pp512	775.61 ± 1.33
llama 8B Q8_0	7.95 GiB	8.03 B	ROCm,BLAS,RPC	24	1.00/1.00/1.00/1.00	tg128	47.69 ± 0.01
llama 8B Q8_0	7.95 GiB	8.03 B	ROCm,BLAS,RPC	24	1.00/1.00/1.00/1.00	pp512	775.60 ± 1.19
llama 8B Q8_0	7.95 GiB	8.03 B	ROCm,BLAS,RPC	24	1.00/1.00/1.00/1.00	tg128	47.56 ± 0.07
llama 8B Q8_0	7.95 GiB	8.03 B	ROCm,BLAS,RPC	24	1.00/1.00/1.00/1.00	pp512	777.24 ± 1.03
llama 8B Q8_0	7.95 GiB	8.03 B	ROCm,BLAS,RPC	24	1.00/1.00/1.00/1.00	tg128	48.22 ± 0.05

Then, compiling with GGML_HIP_UMA=1, I get the following. pp512 still increases with the number of offloaded layers, but tg128 decreases:

bin/llama-bench -t 24 -m ~/models/DeepSeek-R1-Distill-Llama-8B-Q8_0/DeepSeek-R1-Distill-Llama-8B-Q8_0.gguf -ngl 0,10,20,30,40,50,60,70 -ts 1/1/1/1 --main-gpu 0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 4 ROCm devices:
Device 0: AMD Instinct MI50/MI60, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
Device 1: AMD Instinct MI50/MI60, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
Device 2: AMD Instinct MI50/MI60, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
Device 3: AMD Instinct MI50/MI60, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64

model	size	params	backend	threads	ts	test	t/s
llama 8B Q8_0	7.95 GiB	8.03 B	ROCm,BLAS,RPC	24	1.00/1.00/1.00/1.00	pp512	25.48 ± 0.17
llama 8B Q8_0	7.95 GiB	8.03 B	ROCm,BLAS,RPC	24	1.00/1.00/1.00/1.00	tg128	12.32 ± 0.00
llama 8B Q8_0	7.95 GiB	8.03 B	ROCm,BLAS,RPC	24	1.00/1.00/1.00/1.00	pp512	32.29 ± 0.64
llama 8B Q8_0	7.95 GiB	8.03 B	ROCm,BLAS,RPC	24	1.00/1.00/1.00/1.00	tg128	6.69 ± 0.00
llama 8B Q8_0	7.95 GiB	8.03 B	ROCm,BLAS,RPC	24	1.00/1.00/1.00/1.00	pp512	44.04 ± 0.52
llama 8B Q8_0	7.95 GiB	8.03 B	ROCm,BLAS,RPC	24	1.00/1.00/1.00/1.00	tg128	4.63 ± 0.00
llama 8B Q8_0	7.95 GiB	8.03 B	ROCm,BLAS,RPC	24	1.00/1.00/1.00/1.00	pp512	71.17 ± 0.87
llama 8B Q8_0	7.95 GiB	8.03 B	ROCm,BLAS,RPC	24	1.00/1.00/1.00/1.00	tg128	3.56 ± 0.00
llama 8B Q8_0	7.95 GiB	8.03 B	ROCm,BLAS,RPC	24	1.00/1.00/1.00/1.00	pp512	81.19 ± 0.31
llama 8B Q8_0	7.95 GiB	8.03 B	ROCm,BLAS,RPC	24	1.00/1.00/1.00/1.00	tg128	3.26 ± 0.00
llama 8B Q8_0	7.95 GiB	8.03 B	ROCm,BLAS,RPC	24	1.00/1.00/1.00/1.00	pp512	81.05 ± 0.33
llama 8B Q8_0	7.95 GiB	8.03 B	ROCm,BLAS,RPC	24	1.00/1.00/1.00/1.00	tg128	3.26 ± 0.00
llama 8B Q8_0	7.95 GiB	8.03 B	ROCm,BLAS,RPC	24	1.00/1.00/1.00/1.00	pp512	81.17 ± 0.49
llama 8B Q8_0	7.95 GiB	8.03 B	ROCm,BLAS,RPC	24	1.00/1.00/1.00/1.00	tg128	3.26 ± 0.00
llama 8B Q8_0	7.95 GiB	8.03 B	ROCm,BLAS,RPC	24	1.00/1.00/1.00/1.00	pp512	81.11 ± 0.37
llama 8B Q8_0	7.95 GiB	8.03 B	ROCm,BLAS,RPC	24	1.00/1.00/1.00/1.00	tg128	3.26 ± 0.00

Is this effect to be expected? It looks weird.
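
To put a number on the effect, here is a minimal sketch that compares the tg128 results of the two builds. The values are copied from the two tables above. The pairing with `-ngl` is an assumption: the tables do not show an ngl column, so the rows are taken to be in the order of the `-ngl` values on the command line.

```python
# tg128 throughput (tokens/s) copied from the two llama-bench runs above.
# Assumption: rows appear in the same order as the -ngl values on the
# command line, since the tables themselves do not show an ngl column.
ngl = [0, 10, 20, 30, 40, 50, 60, 70]
tg_rocm = [12.33, 15.39, 20.84, 30.56, 47.92, 47.69, 47.56, 48.22]
tg_uma = [12.32, 6.69, 4.63, 3.56, 3.26, 3.26, 3.26, 3.26]


def slowdown(baseline, other):
    """Return baseline / other for each pair of measurements."""
    return [b / o for b, o in zip(baseline, other)]


ratios = slowdown(tg_rocm, tg_uma)
for n, r, u, x in zip(ngl, tg_rocm, tg_uma, ratios):
    print(f"ngl={n:2d}  rocm={r:6.2f} t/s  uma={u:5.2f} t/s  slowdown={x:5.1f}x")
```

With no layers offloaded the two builds are on par, and the gap widens as more layers go to the GPUs.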

IMbackK · 2025-02-19T19:17:44Z

IMbackK
Feb 19, 2025
Collaborator

UMA mode uses HIP runtime managed memory, and runtime managed memory is only efficient when the device is in xnack+ mode. That said, as the flag name suggests, using managed memory never makes any sense for llama.cpp on a device with dedicated video memory.

5 replies

cdrfvb Feb 19, 2025
Author

Thank you for the answer. So essentially, with HIP runtime managed memory, offloading to the GPU slows down text generation compared to running on the CPU? I get it. It seemed like a solution to an issue I have when using DeepSeek R1 671B at Q4 or Q5, where llama.cpp crashes with a failed memory allocation after a few prompts, not during initialization, even if I force only one layer with 7GB onto each 16GB MI50 card. With HIP_UMA, the crashes did not happen, but performance was far from good. Is there anything else I could try to prevent llama.cpp from crashing with OOM after a few (admittedly larger) prompts when offloading 7GB or 8GB to each card?

IMbackK Feb 19, 2025
Collaborator

HIP runtime managed memory in xnack- mode has the GPU accessing the data in CPU RAM over the PCIe bus without using its VRAM at all. This is extremely slow, so yes, it is expected to be much slower than the CPU accessing its own memory.
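
A rough back-of-envelope check of this explanation, as a sketch only. It assumes token generation is memory-bound and that every generated token streams the full model weights once, ignoring KV cache and activation traffic. Under that assumption, the measured tg128 rates from the benchmarks in this thread imply an effective weight-read bandwidth:

```python
GIB = 1024 ** 3

# "size" column from the llama-bench output in this thread.
model_size_bytes = 7.95 * GIB


def implied_bandwidth_gib_s(tokens_per_s, bytes_per_token=model_size_bytes):
    """Effective read bandwidth in GiB/s if each token reads bytes_per_token.

    Assumption: generation is purely memory-bound and the full weights are
    read once per token; this is an estimate, not a measurement.
    """
    return tokens_per_s * bytes_per_token / GIB


uma_bw = implied_bandwidth_gib_s(3.26)    # GGML_HIP_UMA build, all layers offloaded
vram_bw = implied_bandwidth_gib_s(48.22)  # regular ROCm build, all layers offloaded

print(f"UMA build:  ~{uma_bw:.0f} GiB/s")
print(f"ROCm build: ~{vram_bw:.0f} GiB/s")
```

The UMA figure comes out more than an order of magnitude below the VRAM-backed one, which is at least consistent with the weights being read over the bus instead of from on-card memory.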

cdrfvb Feb 19, 2025
Author

Oh, so it ignores the VRAM? That is a blow. I assumed that it would prefer VRAM and only allocate in main memory when VRAM runs short.
Is there any way to prevent the OOM crashes without HIP_UMA? In my case it allocates 7 GB per card for the model and a few GB per card for buffers and cache. It seems to fit and run, and only crashes after a few prompts because of additional allocations made later.

IMbackK Feb 19, 2025
Collaborator

Reduce the maximum context length. You could also enable xnack to make UMA perform better, but I still would not recommend it.
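
To illustrate why the context length matters for memory, here is a minimal sketch of how a standard attention KV cache scales with context size (set with `-c` / `--ctx-size` in llama.cpp). The layer count, KV head count, head dimension and f16 element size below are assumptions for a Llama-3-style 8B model, not values read from the GGUF file. DeepSeek R1 671B uses a different architecture, so its numbers will differ, and compute buffers are not included.

```python
def kv_cache_bytes(n_ctx, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Approximate KV cache size in bytes for a given context length.

    All defaults are assumed values for a Llama-3-style 8B model with an
    f16 cache. The leading factor of 2 accounts for keys and values.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_ctx


GIB = 1024 ** 3
for n_ctx in (2048, 8192, 32768):
    print(f"n_ctx={n_ctx:6d}  kv_cache={kv_cache_bytes(n_ctx) / GIB:.2f} GiB")
```

The size grows linearly with the context length, so halving the maximum context halves this part of the allocation.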

cdrfvb Feb 19, 2025
Author

I guess xnack does not work for me...
bin/llama-bench -t 24 -m ~/models/DeepSeek-R1-Distill-Llama-8B-Q8_0/DeepSeek-R1-Distill-Llama-8B-Q8_0.gguf -ngl 0,10,20,30,40,50,60,70 -ts 1/1/1/1 --main-gpu 0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 4 ROCm devices:
Device 0: AMD Instinct MI50/MI60, gfx906:sramecc+:xnack+ (0x906), VMM: no, Wave Size: 64
Device 1: AMD Instinct MI50/MI60, gfx906:sramecc+:xnack+ (0x906), VMM: no, Wave Size: 64
Device 2: AMD Instinct MI50/MI60, gfx906:sramecc+:xnack+ (0x906), VMM: no, Wave Size: 64
Device 3: AMD Instinct MI50/MI60, gfx906:sramecc+:xnack+ (0x906), VMM: no, Wave Size: 64

model	size	params	backend	threads	ts	test	t/s
llama 8B Q8_0	7.95 GiB	8.03 B	ROCm,BLAS,RPC	24	1.00/1.00/1.00/1.00	pp512	22.69 ± 0.20
llama 8B Q8_0	7.95 GiB	8.03 B	ROCm,BLAS,RPC	24	1.00/1.00/1.00/1.00	tg128	12.33 ± 0.00

git/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:73: ROCm error
Could not attach to process. If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user. For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Operation not permitted.
No stack.
The program is not being run.
Aborted (core dumped)
