Unknown model architecture Gemma4 - Pull Gemma4 support from latest llama.cpp code #9

@timematcher

Description

Models such as "Qwen3-Coder-30B-A3B-Instruct-GGUF" load fine, but Gemma 4 (gemma-4-26B-A4B-it-GGUF) fails with the following error:

llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'gemma4'

Compiled on Windows with CUDA 12 enabled.
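
For quick diagnosis, the architecture string a GGUF file declares can be read directly from its header without loading the model, which confirms whether the running build simply does not recognize it. Below is a minimal sketch against ggml's gguf C API (treat the header name and exact signatures as approximate; they vary a bit across llama.cpp versions). Running it on gemma-4-26B-A4B-it-Q4_K_M.gguf should print gemma4, matching kv 0 in the log further down.

```cpp
// Minimal sketch: print the architecture string a GGUF file declares, without
// loading the model. Uses ggml's gguf C API; the declarations live in gguf.h
// in recent llama.cpp checkouts (older trees expose them via ggml.h), so the
// include and signatures may need adjusting to your version.
#include "gguf.h"

#include <cstdio>

int main(int argc, char ** argv) {
    if (argc < 2) {
        std::fprintf(stderr, "usage: %s <model.gguf>\n", argv[0]);
        return 1;
    }

    struct gguf_init_params params = { /*no_alloc =*/ true, /*ctx =*/ nullptr };
    struct gguf_context * ctx = gguf_init_from_file(argv[1], params);
    if (ctx == nullptr) {
        std::fprintf(stderr, "failed to read GGUF header\n");
        return 1;
    }

    const auto key_id = gguf_find_key(ctx, "general.architecture");
    if (key_id < 0) {
        std::fprintf(stderr, "general.architecture key not found\n");
    } else {
        // Prints "gemma4" for this model; the server only loads it if that
        // string is in the architecture table of the build being run.
        std::printf("architecture: %s\n", gguf_get_val_str(ctx, key_id));
    }

    gguf_free(ctx);
    return 0;
}
```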

Environment details:

  • OS: Windows 11 Enterprise x64 on a Dell Precision 7780
  • CUDA: 12.x
  • GPU: NVIDIA RTX 5000 Ada Generation Laptop GPU (16 GB VRAM)
  • System RAM: 64 GB DDR5
  • CPU: Intel Core i9-13950HX

Gemma4 support was recently added to llama.cpp, so I think the feature branch here needs to be updated to pull in and merge those changes, then rebuilt.
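
For context on why a rebuild against current llama.cpp should resolve this: the loader matches general.architecture from the GGUF header against a table that is compiled into the binary (in recent versions it lives in src/llama-arch.cpp), so a build that predates Gemma4 support has no entry for the string gemma4. A simplified illustration of that lookup follows; the names are placeholders, not llama.cpp's real identifiers.

```cpp
// Simplified illustration only; placeholder names, not llama.cpp source.
// It shows why a binary built before Gemma4 support was merged rejects the
// model: the architecture string is resolved through a compile-time table.
#include <map>
#include <stdexcept>
#include <string>

enum class model_arch { LLAMA, GEMMA3, GEMMA4 };

static const std::map<std::string, model_arch> ARCH_NAMES = {
    { "llama",  model_arch::LLAMA  },
    { "gemma3", model_arch::GEMMA3 },
    { "gemma4", model_arch::GEMMA4 },   // missing in builds that predate the support
};

model_arch arch_from_string(const std::string & name) {
    const auto it = ARCH_NAMES.find(name);
    if (it == ARCH_NAMES.end()) {
        // This is the condition behind "unknown model architecture: 'gemma4'".
        throw std::runtime_error("unknown model architecture: '" + name + "'");
    }
    return it->second;
}
```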

Full Log:

```
llama-server.exe -m "H:\LMSTUDIOM_MODELs\lmstudio-community\gemma-4-26B-A4B-it-GGUF\gemma-4-26B-A4B-it-Q4_K_M.gguf" -ngl 99 --cache-type-k planar3 --cache-type-v planar3 --host 127.0.0.1 --port 8080
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 16375 MiB):
Device 0: NVIDIA RTX 5000 Ada Generation Laptop GPU, compute capability 8.9, VMM: yes, VRAM: 16375 MiB
main: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
build: 8689 (20efe75cf) with MSVC 19.44.35225.0 for Windows AMD64
system info: n_threads = 24, n_threads_batch = 24, total_threads = 32

system_info: n_threads = 24 (n_threads_batch = 24) / 32 | CUDA : ARCHS = 890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |

init: using 31 threads for HTTP server
start: binding port with default address family
main: loading model
srv load_model: loading model 'H:\LMSTUDIOM_MODELs\lmstudio-community\gemma-4-26B-A4B-it-GGUF\gemma-4-26B-A4B-it-Q4_K_M.gguf'
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'gemma4'
llama_model_load_from_file_impl: failed to load model
llama_params_fit: encountered an error while trying to fit params to free device memory: failed to load model
llama_params_fit: fitting params to free memory took 0.38 seconds
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA RTX 5000 Ada Generation Laptop GPU) (0000:01:00.0) - 15074 MiB free
llama_model_loader: loaded meta data with 46 key-value pairs and 658 tensors from H:\LMSTUDIOM_MODELs\lmstudio-community\gemma-4-26B-A4B-it-GGUF\gemma-4-26B-A4B-it-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = gemma4
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.sampling.top_k i32 = 64
llama_model_loader: - kv 3: general.sampling.top_p f32 = 0.950000
llama_model_loader: - kv 4: general.sampling.temp f32 = 1.000000
llama_model_loader: - kv 5: general.name str = Gemma 4 26B A4B
llama_model_loader: - kv 6: general.finetune str = it
llama_model_loader: - kv 7: general.size_label str = 26B-A4B
llama_model_loader: - kv 8: gemma4.block_count u32 = 30
llama_model_loader: - kv 9: gemma4.context_length u32 = 262144
llama_model_loader: - kv 10: gemma4.embedding_length u32 = 2816
llama_model_loader: - kv 11: gemma4.feed_forward_length u32 = 2112
llama_model_loader: - kv 12: gemma4.attention.head_count u32 = 16
llama_model_loader: - kv 13: gemma4.attention.head_count_kv arr[i32,30] = [8, 8, 8, 8, 8, 2, 8, 8, 8, 8, 8, 2, ...
llama_model_loader: - kv 14: gemma4.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 15: gemma4.rope.freq_base_swa f32 = 10000.000000
llama_model_loader: - kv 16: gemma4.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 17: gemma4.expert_count u32 = 128
llama_model_loader: - kv 18: gemma4.expert_used_count u32 = 8
llama_model_loader: - kv 19: gemma4.attention.key_length u32 = 512
llama_model_loader: - kv 20: gemma4.attention.value_length u32 = 512
llama_model_loader: - kv 21: gemma4.final_logit_softcapping f32 = 30.000000
llama_model_loader: - kv 22: gemma4.attention.sliding_window u32 = 1024
llama_model_loader: - kv 23: gemma4.attention.shared_kv_layers u32 = 0
llama_model_loader: - kv 24: gemma4.embedding_length_per_layer_input u32 = 0
llama_model_loader: - kv 25: gemma4.attention.sliding_window_pattern arr[bool,30] = [true, true, true, true, true, false,...
llama_model_loader: - kv 26: gemma4.attention.key_length_swa u32 = 256
llama_model_loader: - kv 27: gemma4.attention.value_length_swa u32 = 256
llama_model_loader: - kv 28: gemma4.expert_feed_forward_length u32 = 704
llama_model_loader: - kv 29: gemma4.rope.dimension_count u32 = 512
llama_model_loader: - kv 30: gemma4.rope.dimension_count_swa u32 = 256
llama_model_loader: - kv 31: tokenizer.ggml.model str = gemma4
llama_model_loader: - kv 32: tokenizer.ggml.tokens arr[str,262144] = ["", "", "", "", ...
llama_model_loader: - kv 33: tokenizer.ggml.scores arr[f32,262144] = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv 34: tokenizer.ggml.token_type arr[i32,262144] = [3, 3, 3, 3, 3, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 35: tokenizer.ggml.merges arr[str,514906] = ["\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n", ...
llama_model_loader: - kv 36: tokenizer.ggml.bos_token_id u32 = 2
llama_model_loader: - kv 37: tokenizer.ggml.eos_token_id u32 = 1
llama_model_loader: - kv 38: tokenizer.ggml.unknown_token_id u32 = 3
llama_model_loader: - kv 39: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 40: tokenizer.ggml.mask_token_id u32 = 4
llama_model_loader: - kv 41: tokenizer.chat_template str = {%- macro format_parameters(propertie...
llama_model_loader: - kv 42: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 43: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 44: general.quantization_version u32 = 2
llama_model_loader: - kv 45: general.file_type u32 = 15
llama_model_loader: - type f32: 392 tensors
llama_model_loader: - type q5_0: 32 tensors
llama_model_loader: - type q8_0: 28 tensors
llama_model_loader: - type q4_K: 192 tensors
llama_model_loader: - type q6_K: 14 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 15.63 GiB (5.32 BPW)
llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'gemma4'
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model 'H:\LMSTUDIOM_MODELs\lmstudio-community\gemma-4-26B-A4B-it-GGUF\gemma-4-26B-A4B-it-Q4_K_M.gguf'
srv load_model: failed to load model, 'H:\LMSTUDIOM_MODELs\lmstudio-community\gemma-4-26B-A4B-it-GGUF\gemma-4-26B-A4B-it-Q4_K_M.gguf'
srv operator (): operator (): cleaning up before exit...
main: exiting due to model loading error
```
