ramalama hangs when trying to shutdown #807

Closed · tuananh opened this issue Feb 13, 2025 · 13 comments

tuananh commented Feb 13, 2025

It's taking forever to shut down, so I have to kill the process and container manually.

rhatdan (Member) commented Feb 13, 2025

What command were you running that failed to shut down?

What version of ramalama are you running?

benoitf (Contributor) commented Feb 13, 2025

@tuananh might this be related to #753?

rhatdan (Member) commented Feb 13, 2025

If it is, then you need to get a new version of RamaLama as well as a new version of the container image.
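A rough sketch of what updating both might look like, assuming a pip-based install and the default quay.io/ramalama/ramalama image (both are assumptions; adjust for how you actually installed it):

# update the RamaLama CLI (use your distro's package manager if that is how it was installed)
pip install --upgrade ramalama
# pull a fresh container image so the bundled llama.cpp picks up the fix
podman pull quay.io/ramalama/ramalama:latest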

tuananh (Author) commented Feb 13, 2025

> What command were you running that failed to shut down?
>
> What version of ramalama are you running?

It's the serve command, and I'm using the latest version from the Arch repo.

rhatdan (Member) commented Feb 14, 2025

0.6.0 with what container image? Or are you using --no-container? The fix involved updating RamaLama and the version of llama.cpp.

You should see --init in the podman run command if you run with the --debug flag.
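As a rough check (assuming the debug output goes to stdout/stderr of the serve command), something like:

ramalama --debug serve <model> 2>&1 | grep -e '--init'

should show --init on the generated podman run line if the fix is in place.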

tuananh (Author) commented Feb 14, 2025

The latest version on Arch Linux is 0.5.5:

ramalama --version
ramalama version 0.5.5

The command I use is just ramalama serve <model>, which I guess is the container version, because I can see the container when running podman ps.

ericcurtin (Collaborator) commented

Didn't even realize we were packaged on Arch, neat:

https://aur.archlinux.org/packages/ramalama

I recommend logging this downstream with the Arch package maintainer; they should just update the package, since it's fixed here upstream.

tuananh (Author) commented Feb 14, 2025

So I updated the package myself to 0.6.0 and it still hangs when I try to shut it down.

[screenshots attached]

The command I use and its logs:

ramalama serve ollama://deepseek-r1:32b
build: 4607 (aa6fb132) with cc (GCC) 11.5.0 20240719 (Red Hat 11.5.0-2) for x86_64-redhat-linux
system info: n_threads = 64, n_threads_batch = 64, total_threads = 128

system_info: n_threads = 64 (n_threads_batch = 64) / 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |

main: HTTP server is listening, hostname: 0.0.0.0, port: 8080, http threads: 127
main: loading model
srv    load_model: loading model '/mnt/models/model.file'
llama_model_loader: loaded meta data with 26 key-value pairs and 771 tensors from /mnt/models/model.file (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = DeepSeek R1 Distill Qwen 32B
llama_model_loader: - kv   3:                           general.basename str              = DeepSeek-R1-Distill-Qwen
llama_model_loader: - kv   4:                         general.size_label str              = 32B
llama_model_loader: - kv   5:                          qwen2.block_count u32              = 64
llama_model_loader: - kv   6:                       qwen2.context_length u32              = 131072
llama_model_loader: - kv   7:                     qwen2.embedding_length u32              = 5120
llama_model_loader: - kv   8:                  qwen2.feed_forward_length u32              = 27648
llama_model_loader: - kv   9:                 qwen2.attention.head_count u32              = 40
llama_model_loader: - kv  10:              qwen2.attention.head_count_kv u32              = 8
llama_model_loader: - kv  11:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  12:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  13:                          general.file_type u32              = 15
llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  15:                         tokenizer.ggml.pre str              = deepseek-r1-qwen
llama_model_loader: - kv  16:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  17:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  18:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  19:                tokenizer.ggml.bos_token_id u32              = 151646
llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 151643
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  22:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  23:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  24:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
llama_model_loader: - kv  25:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  321 tensors
llama_model_loader: - type q4_K:  385 tensors
llama_model_loader: - type q6_K:   65 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 18.48 GiB (4.85 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch             = qwen2
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 5120
print_info: n_layer          = 64
print_info: n_head           = 40
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 5
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: n_ff             = 27648
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 32B
print_info: model params     = 32.76 B
print_info: general.name     = DeepSeek R1 Distill Qwen 32B
print_info: vocab type       = BPE
print_info: n_vocab          = 152064
print_info: n_merges         = 151387
print_info: BOS token        = 151646 '<|begin▁of▁sentence|>'
print_info: EOS token        = 151643 '<|end▁of▁sentence|>'
print_info: EOT token        = 151643 '<|end▁of▁sentence|>'
print_info: PAD token        = 151643 '<|end▁of▁sentence|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|end▁of▁sentence|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors:   CPU_Mapped model buffer size = 18926.01 MiB
llama_init_from_model: n_seq_max     = 1
llama_init_from_model: n_ctx         = 2048
llama_init_from_model: n_ctx_per_seq = 2048
llama_init_from_model: n_batch       = 2048
llama_init_from_model: n_ubatch      = 512
llama_init_from_model: flash_attn    = 0
llama_init_from_model: freq_base     = 1000000.0
llama_init_from_model: freq_scale    = 1
llama_init_from_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 2048, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 64, can_shift = 1
llama_kv_cache_init:        CPU KV buffer size =   512.00 MiB
llama_init_from_model: KV self size  =  512.00 MiB, K (f16):  256.00 MiB, V (f16):  256.00 MiB
llama_init_from_model:        CPU  output buffer size =     0.58 MiB
llama_init_from_model:        CPU compute buffer size =   307.00 MiB
llama_init_from_model: graph nodes  = 2246
llama_init_from_model: graph splits = 1
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv          init: initializing slots, n_slots = 1
slot         init: id  0 | task -1 | new slot n_ctx_slot = 2048
main: model loaded
main: chat template, chat_template: {% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% set ns = namespace(is_first=false, is_tool=false, is_output_first=true, system_prompt='') %}{%- for message in messages %}{%- if message['role'] == 'system' %}{% set ns.system_prompt = message['content'] %}{%- endif %}{%- endfor %}{{bos_token}}{{ns.system_prompt}}{%- for message in messages %}{%- if message['role'] == 'user' %}{%- set ns.is_tool = false -%}{{'<|User|>' + message['content']}}{%- endif %}{%- if message['role'] == 'assistant' and message['content'] is none %}{%- set ns.is_tool = false -%}{%- for tool in message['tool_calls']%}{%- if not ns.is_first %}{{'<|Assistant|><|tool▁calls▁begin|><|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\n' + '```json' + '\n' + tool['function']['arguments'] + '\n' + '```' + '<|tool▁call▁end|>'}}{%- set ns.is_first = true -%}{%- else %}{{'\n' + '<|tool▁call▁begin|>' + tool['type'] + '<|tool▁sep|>' + tool['function']['name'] + '\n' + '```json' + '\n' + tool['function']['arguments'] + '\n' + '```' + '<|tool▁call▁end|>'}}{{'<|tool▁calls▁end|><|end▁of▁sentence|>'}}{%- endif %}{%- endfor %}{%- endif %}{%- if message['role'] == 'assistant' and message['content'] is not none %}{%- if ns.is_tool %}{{'<|tool▁outputs▁end|>' + message['content'] + '<|end▁of▁sentence|>'}}{%- set ns.is_tool = false -%}{%- else %}{% set content = message['content'] %}{% if '</think>' in content %}{% set content = content.split('</think>')[-1] %}{% endif %}{{'<|Assistant|>' + content + '<|end▁of▁sentence|>'}}{%- endif %}{%- endif %}{%- if message['role'] == 'tool' %}{%- set ns.is_tool = true -%}{%- if ns.is_output_first %}{{'<|tool▁outputs▁begin|><|tool▁output▁begin|>' + message['content'] + '<|tool▁output▁end|>'}}{%- set ns.is_output_first = false %}{%- else %}{{'\n<|tool▁output▁begin|>' + message['content'] + '<|tool▁output▁end|>'}}{%- endif %}{%- endif %}{%- endfor -%}{% if ns.is_tool %}{{'<|tool▁outputs▁end|>'}}{% endif %}{% if add_generation_prompt and not ns.is_tool %}{{'<|Assistant|>'}}{% endif %}, example_format: 'You are a helpful assistant
<|User|>Hello<|Assistant|>Hi there<|end▁of▁sentence|><|User|>How are you?<|Assistant|>'
main: server is listening on http://0.0.0.0:8080 - starting the main loop
srv  update_slots: all slots are idle
request: GET / 10.0.0.245 200
request: GET /favicon.ico 10.0.0.245 404
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 2048, n_keep = 0, n_prompt_tokens = 10
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 10, n_tokens = 10, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 10, n_tokens = 10
slot      release: id  0 | task 0 | stop processing: n_past = 392, truncated = 0
slot print_timing: id  0 | task 0 |
prompt eval time =     310.63 ms /    10 tokens (   31.06 ms per token,    32.19 tokens per second)
       eval time =   60010.24 ms /   383 tokens (  156.68 ms per token,     6.38 tokens per second)
      total time =   60320.87 ms /   393 tokens
srv  update_slots: all slots are idle
request: POST /v1/chat/completions 10.0.0.245 200
slot launch_slot_: id  0 | task 384 | processing task
slot update_slots: id  0 | task 384 | new prompt, n_ctx_slot = 2048, n_keep = 0, n_prompt_tokens = 27
slot update_slots: id  0 | task 384 | kv cache rm [10, end)
slot update_slots: id  0 | task 384 | prompt processing progress, n_past = 27, n_tokens = 17, progress = 0.629630
slot update_slots: id  0 | task 384 | prompt done, n_past = 27, n_tokens = 17
slot      release: id  0 | task 384 | stop processing: n_past = 122, truncated = 0
slot print_timing: id  0 | task 384 |
prompt eval time =     549.98 ms /    17 tokens (   32.35 ms per token,    30.91 tokens per second)
       eval time =   14705.24 ms /    96 tokens (  153.18 ms per token,     6.53 tokens per second)
      total time =   15255.22 ms /   113 tokens
srv  update_slots: all slots are idle
request: POST /v1/chat/completions 10.0.0.245 200

ericcurtin reopened this Feb 14, 2025
benoitf (Contributor) commented Feb 14, 2025

AFAIK it's fixed in the main branch but not in 0.6.0 (at least this is what I experienced)

tuananh (Author) commented Feb 15, 2025

> AFAIK it's fixed in the main branch but not in 0.6.0 (at least this is what I experienced)

Thanks. I will try to build from main.
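For anyone following along, one possible way to do that, assuming a pip-based install from the upstream repo (github.com/containers/ramalama):

git clone https://github.com/containers/ramalama
cd ramalama
pip install .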

tuananh (Author) commented Feb 15, 2025

@benoitf confirmed it's fixed on main. However, the port forwarding doesn't seem to work; I can't access 8080 from the host, but I guess that's another issue.
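As a quick, generic sanity check (not specific to the root cause here): confirm the port is actually published on the host and probe the server; GET / returned 200 in the logs above, so it makes a reasonable probe.

podman ps --format '{{.Names}}  {{.Ports}}'
curl -i http://127.0.0.1:8080/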

rhatdan (Member) commented Feb 15, 2025

Please open a separate issue for the 8080 problem.

rhatdan closed this as completed Feb 15, 2025
tuananh (Author) commented Feb 15, 2025

Created a new issue for the port forwarding: #823
