ggml_cuda_init: found 2 ROCm devices (Total VRAM: 56269 MiB):
Device 0: AMD Radeon RX 7900 XTX, gfx1100 (0x1100), VMM: no, Wave Size: 32, VRAM: 24560 MiB
Device 1: AMD Ryzen 5 7600X 6-Core Processor, gfx1036 (0x1036), VMM: no, Wave Size: 32, VRAM: 31709 MiB
load_backend: failed to find ggml_backend_init in /workspace/llama-cpp-turboquant/build_local/bin/libggml-hip.so
load_backend: failed to find ggml_backend_init in /workspace/llama-cpp-turboquant/build_local/bin/libggml-cpu.so
build_info: b8983-9e3fb40e8
system_info: n_threads = 6 (n_threads_batch = 6) / 12 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
Running without SSL
init: using 11 threads for HTTP server
start: binding port with default address family
main: loading model
srv load_model: loading model '/workspace/models/lmstudio-community/gemma-4-31B-it-GGUF/gemma-4-31B-it-Q4_K_M.gguf'
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
llama_params_fit_impl: projected memory use with initial parameters [MiB]:
llama_params_fit_impl: - ROCm0 (AMD Radeon RX 7900 XTX) : 24560 total, 8501 used, 15846 free vs. target of 1024
llama_params_fit_impl: - ROCm1 (AMD Ryzen 5 7600X 6-Core Processor): 31709 total, 13963 used, 25389 free vs. target of 1024
llama_params_fit_impl: projected to use 22465 MiB of device memory vs. 63701 MiB of free device memory
llama_params_fit_impl: targets for free memory can be met on all devices, no changes needed
llama_params_fit: successfully fit params to free device memory
llama_params_fit: fitting params to free memory took 1.86 seconds
llama_model_load_from_file_impl: using device ROCm0 (AMD Radeon RX 7900 XTX) (0000:03:00.0) - 24348 MiB free
llama_model_load_from_file_impl: using device ROCm1 (AMD Ryzen 5 7600X 6-Core Processor) (0000:0f:00.0) - 39353 MiB free
llama_model_loader: loaded meta data with 43 key-value pairs and 833 tensors from /workspace/models/lmstudio-community/gemma-4-31B-it-GGUF/gemma-4-31B-it-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = gemma4
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.sampling.top_k i32 = 64
llama_model_loader: - kv 3: general.sampling.top_p f32 = 0.950000
llama_model_loader: - kv 4: general.sampling.temp f32 = 1.000000
llama_model_loader: - kv 5: general.name str = Gemma 4 31B
llama_model_loader: - kv 6: general.finetune str = it
llama_model_loader: - kv 7: general.size_label str = 31B
llama_model_loader: - kv 8: gemma4.block_count u32 = 60
llama_model_loader: - kv 9: gemma4.context_length u32 = 262144
llama_model_loader: - kv 10: gemma4.embedding_length u32 = 5376
llama_model_loader: - kv 11: gemma4.feed_forward_length u32 = 21504
llama_model_loader: - kv 12: gemma4.attention.head_count u32 = 32
llama_model_loader: - kv 13: gemma4.attention.head_count_kv arr[i32,60] = [16, 16, 16, 16, 16, 4, 16, 16, 16, 1...
llama_model_loader: - kv 14: gemma4.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 15: gemma4.rope.freq_base_swa f32 = 10000.000000
llama_model_loader: - kv 16: gemma4.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 17: gemma4.attention.key_length u32 = 512
llama_model_loader: - kv 18: gemma4.attention.value_length u32 = 512
llama_model_loader: - kv 19: gemma4.final_logit_softcapping f32 = 30.000000
llama_model_loader: - kv 20: gemma4.attention.sliding_window u32 = 1024
llama_model_loader: - kv 21: gemma4.attention.shared_kv_layers u32 = 0
llama_model_loader: - kv 22: gemma4.embedding_length_per_layer_input u32 = 0
llama_model_loader: - kv 23: gemma4.attention.sliding_window_pattern arr[bool,60] = [true, true, true, true, true, false,...
llama_model_loader: - kv 24: gemma4.attention.key_length_swa u32 = 256
llama_model_loader: - kv 25: gemma4.attention.value_length_swa u32 = 256
llama_model_loader: - kv 26: gemma4.rope.dimension_count u32 = 512
llama_model_loader: - kv 27: gemma4.rope.dimension_count_swa u32 = 256
llama_model_loader: - kv 28: tokenizer.ggml.model str = gemma4
llama_model_loader: - kv 29: tokenizer.ggml.tokens arr[str,262144] = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv 30: tokenizer.ggml.scores arr[f32,262144] = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv 31: tokenizer.ggml.token_type arr[i32,262144] = [3, 3, 3, 3, 3, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 32: tokenizer.ggml.merges arr[str,514906] = ["\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n", ...
llama_model_loader: - kv 33: tokenizer.ggml.bos_token_id u32 = 2
llama_model_loader: - kv 34: tokenizer.ggml.eos_token_id u32 = 1
llama_model_loader: - kv 35: tokenizer.ggml.unknown_token_id u32 = 3
llama_model_loader: - kv 36: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 37: tokenizer.ggml.mask_token_id u32 = 4
llama_model_loader: - kv 38: tokenizer.chat_template str = {%- macro format_parameters(propertie...
llama_model_loader: - kv 39: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 40: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 41: general.quantization_version u32 = 2
llama_model_loader: - kv 42: general.file_type u32 = 15
llama_model_loader: - type f32: 422 tensors
llama_model_loader: - type q4_K: 355 tensors
llama_model_loader: - type q6_K: 56 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 17.39 GiB (4.87 BPW)
load: 0 unused tokens
load: control-looking token: 212 '</s>' was not control-type; this is probably a bug in the model. its type will be overridden
load: control-looking token: 50 '<|tool_response>' was not control-type; this is probably a bug in the model. its type will be overridden
load: printing all EOG tokens:
load: - 1 ('<eos>')
load: - 50 ('<|tool_response>')
load: - 106 ('<turn|>')
load: - 212 ('</s>')
load: special_eog_ids contains '<|tool_response>', removing '</s>' token from EOG list
load: special tokens cache size = 24
load: token to piece cache size = 1.9445 MB
print_info: arch = gemma4
print_info: vocab_only = 0
print_info: no_alloc = 0
print_info: n_ctx_train = 262144
print_info: n_embd = 5376
print_info: n_embd_inp = 5376
print_info: n_layer = 60
print_info: n_head = 32
print_info: n_head_kv = [16, 16, 16, 16, 16, 4, 16, 16, 16, 16, 16, 4, 16, 16, 16, 16, 16, 4, 16, 16, 16, 16, 16, 4, 16, 16, 16, 16, 16, 4, 16, 16, 16, 16, 16, 4, 16, 16, 16, 16, 16, 4, 16, 16, 16, 16, 16, 4, 16, 16, 16, 16, 16, 4, 16, 16, 16, 16, 16, 4]
print_info: n_rot = 512
print_info: n_swa = 1024
print_info: is_swa_any = 1
print_info: n_embd_head_k = 512
print_info: n_embd_head_v = 512
print_info: n_gqa = [2, 2, 2, 2, 2, 8, 2, 2, 2, 2, 2, 8, 2, 2, 2, 2, 2, 8, 2, 2, 2, 2, 2, 8, 2, 2, 2, 2, 2, 8, 2, 2, 2, 2, 2, 8, 2, 2, 2, 2, 2, 8, 2, 2, 2, 2, 2, 8, 2, 2, 2, 2, 2, 8, 2, 2, 2, 2, 2, 8]
print_info: n_embd_k_gqa = [4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048]
print_info: n_embd_v_gqa = [4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048, 4096, 4096, 4096, 4096, 4096, 2048]
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 1.0e+00
print_info: n_ff = 21504
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = -1
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 1000000.0
print_info: freq_scale_train = 1
print_info: freq_base_swa = 10000.0
print_info: freq_scale_swa = 1
print_info: n_embd_head_k_swa = 256
print_info: n_embd_head_v_swa = 256
print_info: n_rot_swa = 256
print_info: n_ctx_orig_yarn = 262144
print_info: rope_yarn_log_mul = 0.0000
print_info: rope_finetuned = unknown
print_info: model type = ?B
print_info: model params = 30.70 B
print_info: general.name = Gemma 4 31B
print_info: vocab type = BPE
print_info: n_vocab = 262144
print_info: n_merges = 514906
print_info: BOS token = 2 '<bos>'
print_info: EOS token = 1 '<eos>'
print_info: UNK token = 3 '<unk>'
print_info: PAD token = 0 '<pad>'
print_info: MASK token = 4 '<mask>'
print_info: LF token = 107 '
'
print_info: EOG token = 1 '<eos>'
print_info: EOG token = 50 '<|tool_response>'
print_info: EOG token = 106 '<turn|>'
print_info: max token length = 93
load_tensors: loading model tensors, this can take a while... (mmap = true, direct_io = false)
load_tensors: offloading output layer to GPU
load_tensors: offloading 59 repeating layers to GPU
load_tensors: offloaded 61/61 layers to GPU
load_tensors: CPU_Mapped model buffer size = 1102.50 MiB
load_tensors: ROCm0 model buffer size = 6681.53 MiB
load_tensors: ROCm1 model buffer size = 11124.82 MiB
............................................................................................
common_init_result: added <eos> logit bias = -inf
common_init_result: added <|tool_response> logit bias = -inf
common_init_result: added <turn|> logit bias = -inf
llama_context: constructing llama_context
llama_context: n_seq_max = 4
llama_context: n_ctx = 65536
llama_context: n_ctx_seq = 16384
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (16384) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
llama_context: ROCm_Host output buffer size = 4.00 MiB
llama_kv_cache_iswa: creating non-SWA KV cache, size = 16384 cells
llama_kv_cache: ROCm0 KV buffer size = 744.00 MiB
llama_kv_cache: ROCm1 KV buffer size = 1116.00 MiB
llama_kv_cache: size = 1860.00 MiB ( 16384 cells, 10 layers, 4/4 seqs), K (q8_0): 1360.00 MiB, V (turbo3): 500.00 MiB
llama_kv_cache: upstream attention rotation disabled (TurboQuant uses kernel-level WHT)
llama_kv_cache: attn_rot_k = 0, n_embd_head_k_all = 512
llama_kv_cache: attn_rot_v = 0, n_embd_head_k_all = 512
llama_kv_cache_iswa: creating SWA KV cache, size = 1536 cells
llama_kv_cache: ROCm0 KV buffer size = 697.50 MiB
llama_kv_cache: ROCm1 KV buffer size = 1046.25 MiB
llama_kv_cache: size = 1743.75 MiB ( 1536 cells, 50 layers, 4/4 seqs), K (q8_0): 1275.00 MiB, V (turbo3): 468.75 MiB
llama_kv_cache: upstream attention rotation disabled (TurboQuant uses kernel-level WHT)
llama_kv_cache: attn_rot_k = 0, n_embd_head_k_all = 256
llama_kv_cache: attn_rot_v = 0, n_embd_head_k_all = 256
llama_context: pipeline parallelism enabled
sched_reserve: reserving ...
sched_reserve: Flash Attention was auto, set to enabled
sched_reserve: resolving fused Gated Delta Net support:
sched_reserve: fused Gated Delta Net (autoregressive) enabled
sched_reserve: fused Gated Delta Net (chunked) enabled
sched_reserve: ROCm0 compute buffer size = 378.57 MiB
sched_reserve: ROCm1 compute buffer size = 634.58 MiB
sched_reserve: ROCm_Host compute buffer size = 161.09 MiB
sched_reserve: graph nodes = 2642
sched_reserve: graph splits = 3
sched_reserve: reserve took 305.87 ms, sched copies = 4
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
/workspace/llama-cpp-turboquant/ggml/src/ggml-cuda/ggml-cuda.cu:99: ROCm error
ggml_cuda_compute_forward: SCALE failed
ROCm error: invalid device function
current device: 0, in function ggml_cuda_compute_forward at /workspace/llama-cpp-turboquant/ggml/src/ggml-cuda/ggml-cuda.cu:2997
err
[New LWP 262859]
[New LWP 262858]
[New LWP 262857]
[New LWP 262856]
[New LWP 262855]
[New LWP 262854]
[New LWP 262853]
[New LWP 262852]
[New LWP 262851]
[New LWP 262850]
[New LWP 262849]
[New LWP 262848]
[New LWP 262847]
[New LWP 262846]
[New LWP 262840]
This GDB supports auto-downloading debuginfo from the following URLs:
<https://debuginfod.fedoraproject.org/>
Enable debuginfod for this session? (y or [n]) [answered N; input not from terminal]
Debuginfod has been disabled.
To make this setting permanent, add 'set debuginfod enabled off' to .gdbinit.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007fb064e89422 in __syscall_cancel_arch () from /lib64/libc.so.6
#0 0x00007fb064e89422 in __syscall_cancel_arch () from /lib64/libc.so.6
#1 0x00007fb064e7d71c in __internal_syscall_cancel () from /lib64/libc.so.6
#2 0x00007fb064e7d764 in __syscall_cancel () from /lib64/libc.so.6
#3 0x00007fb064eedc0f in wait4 () from /lib64/libc.so.6
#4 0x00007fb06be7e37d in ggml_print_backtrace () at /workspace/llama-cpp-turboquant/ggml/src/ggml.c:219
219 waitpid(child_pid, NULL, 0);
#5 0x00007fb06be7e506 in ggml_abort (file=0x7fb0654d168a "/workspace/llama-cpp-turboquant/ggml/src/ggml-cuda/ggml-cuda.cu", line=99, fmt=0x7fb0654ff9b2 "ROCm error") at /workspace/llama-cpp-turboquant/ggml/src/ggml.c:253
253 ggml_print_backtrace();
#6 0x00007fb0656a85a2 in ggml_cuda_error (stmt=0x7fb0654c7f28 "err", func=func@entry=0x7fb06547213c "ggml_cuda_compute_forward", file=0x7fb0654d168a "/workspace/llama-cpp-turboquant/ggml/src/ggml-cuda/ggml-cuda.cu", line=line@entry=2997, msg=0x7fb05bcd3858 "invalid device function") at /workspace/llama-cpp-turboquant/ggml/src/ggml-cuda/ggml-cuda.cu:99
99 GGML_ABORT(GGML_CUDA_NAME " error");
#7 0x00007fb0656b18bc in ggml_cuda_compute_forward (ctx=..., dst=<optimized out>) at /workspace/llama-cpp-turboquant/ggml/src/ggml-cuda/ggml-cuda.cu:2997
2997 CUDA_CHECK(err);
#8 ggml_cuda_graph_evaluate_and_capture (cuda_ctx=0x4ccf25a0, cgraph=cgraph@entry=0x55a845e8, use_cuda_graph=false, cuda_graph_update_required=false, graph_key=<optimized out>) at /workspace/llama-cpp-turboquant/ggml/src/ggml-cuda/ggml-cuda.cu:4149
4149 bool ok = ggml_cuda_compute_forward(*cuda_ctx, node);
#9 0x00007fb0656addb1 in ggml_backend_cuda_graph_compute (backend=<optimized out>, cgraph=0x55a845e8) at /workspace/llama-cpp-turboquant/ggml/src/ggml-cuda/ggml-cuda.cu:4269
4269 ggml_cuda_graph_evaluate_and_capture(cuda_ctx, cgraph, use_cuda_graph, cuda_graph_update_required, graph_key);
#10 0x00007fb06be9a3e8 in ggml_backend_graph_compute_async (backend=0x529df300, cgraph=0x55a845e8) at /workspace/llama-cpp-turboquant/ggml/src/ggml-backend.cpp:452
452 return backend->iface.graph_compute(backend, cgraph);
#11 0x00007fb06be9ef86 in ggml_backend_sched_compute_splits (sched=0x4ca17f70) at /workspace/llama-cpp-turboquant/ggml/src/ggml-backend.cpp:1671
1671 enum ggml_status ec = ggml_backend_graph_compute_async(split_backend, &split->graph);
#12 0x00007fb06be9fdc7 in ggml_backend_sched_graph_compute_async (sched=0x4ca17f70, graph=0x4e9a5de0) at /workspace/llama-cpp-turboquant/ggml/src/ggml-backend.cpp:1894
1894 return ggml_backend_sched_compute_splits(sched);
#13 0x00007fb06b69cdd2 in llama_context::graph_compute (this=0x5661a8a0, gf=0x4e9a5de0, batched=true) at /workspace/llama-cpp-turboquant/src/llama-context.cpp:2191
2191 auto status = ggml_backend_sched_graph_compute_async(sched.get(), gf);
#14 0x00007fb06b6984c9 in llama_context::process_ubatch (this=0x5661a8a0, ubatch=..., gtype=LLM_GRAPH_TYPE_DECODER, mctx=0x3f420110, ret=@0x7ffc9dd46a2c: GGML_STATUS_SUCCESS) at /workspace/llama-cpp-turboquant/src/llama-context.cpp:1231
1231 const auto status = graph_compute(res->get_gf(), ubatch.n_tokens > 1);
#15 0x00007fb06b69a1d7 in llama_context::decode (this=0x5661a8a0, batch_inp=...) at /workspace/llama-cpp-turboquant/src/llama-context.cpp:1692
1692 const auto * res = process_ubatch(ubatch, LLM_GRAPH_TYPE_DECODER, mctx.get(), status);
#16 0x00007fb06b6a11f0 in llama_decode (ctx=0x5661a8a0, batch=...) at /workspace/llama-cpp-turboquant/src/llama-context.cpp:3479
3479 const int ret = ctx->decode(batch);
#17 0x0000000000688507 in common_init_from_params (params=...) at /workspace/llama-cpp-turboquant/common/common.cpp:1374
1374 llama_decode(lctx, llama_batch_get_one(tmp.data(), std::min(tmp.size(), (size_t) params.n_batch)));
#18 0x00000000004f9ebc in server_context_impl::load_model (this=0x4d778ac0, params=...) at /workspace/llama-cpp-turboquant/tools/server/server-context.cpp:749
749 llama_init = common_init_from_params(params_base);
#19 0x00000000004d3af8 in server_context::load_model (this=0x7ffc9dd4d878, params=...) at /workspace/llama-cpp-turboquant/tools/server/server-context.cpp:3067
3067 return impl->load_model(params);
#20 0x0000000000408c1e in main (argc=15, argv=0x7ffc9dd50648) at /workspace/llama-cpp-turboquant/tools/server/server.cpp:282
282 if (!ctx_server.load_model(params)) {
[Inferior 1 (process 262839) detached]
Aborted (core dumped)
Summary
TurboQuant crashes when loading Gemma 4 31B (Q4_K_M): the model tensors load, but the warmup run aborts with a ROCm "invalid device function" error in the SCALE operation.
System Info
Versions
OS: Fedora 42
ROCm: 6.3.1
TurboQuant: 9e3fb40e8bc0f873ad4d3d8329b17dacff28e4ca
Commands
Build
cmake -B build_local \
    -DGGML_HIP=ON \
    -DCMAKE_BUILD_TYPE=Debug \
    -DGGML_HIP_ROCWMMA_FATTN=OFF \
    -DCMAKE_INSTALL_RPATH='$ORIGIN' \
    -DCMAKE_BUILD_WITH_INSTALL_RPATH=ON \
    -DCMAKE_SKIP_BUILD_RPATH=OFF \
    -DGPU_TARGETS=gfx1100
cmake --build build_local --config Debug -j
Run
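The exact run command is not preserved in this report. Reconstructed from the log above (context size 65536, 4 parallel slots, q8_0 K cache and turbo3 V cache), a roughly equivalent invocation would be the sketch below; the binary name and flag values are assumptions taken from the log, not the original command line.

# Sketch only: reconstructed from the server log, not the original invocation
./build_local/bin/llama-server \
    -m /workspace/models/lmstudio-community/gemma-4-31B-it-GGUF/gemma-4-31B-it-Q4_K_M.gguf \
    -c 65536 \
    --parallel 4 \
    --cache-type-k q8_0 \
    --cache-type-v turbo3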
Behaviour
Expected
The server loads the model and starts serving.
Actual
The server crashes during the warmup run with "ROCm error: invalid device function" and dumps core (see the backtrace in the log).
Log
See the full server log and backtrace at the top of this report.
Additional information
An upstream b8902 ROCm build without TurboQuant loads and serves successfully. I have also tried domvox's version, but it fails in the same way.