I am trying to run this model on my NVIDIA RTX 3090 with 24 GB of VRAM, but it crashes on startup.
❯ ramalama --runtime vllm --gpu serve deepseek-r1
INFO 02-09 18:59:41 __init__.py:190] Automatically detected platform cuda.
INFO 02-09 18:59:43 api_server.py:206] Started engine process with PID 70
INFO 02-09 18:59:49 __init__.py:190] Automatically detected platform cuda.
INFO 02-09 18:59:59 config.py:2382] Downcasting torch.float32 to torch.float16.
INFO 02-09 19:00:06 config.py:2382] Downcasting torch.float32 to torch.float16.
INFO 02-09 19:00:08 config.py:542] This model supports multiple tasks: {'reward', 'score', 'classify', 'generate', 'embed'}. Defaulting to 'generate'.
WARNING 02-09 19:00:08 config.py:621] gguf quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 02-09 19:00:15 config.py:542] This model supports multiple tasks: {'reward', 'embed', 'score', 'classify', 'generate'}. Defaulting to 'generate'.
WARNING 02-09 19:00:15 config.py:621] gguf quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 02-09 19:00:16 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.3.dev272+g1960652a3) with config: model='/mnt/models/model.file', speculative_config=None, tokenizer='/mnt/models/model.file', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.GGUF, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=gguf, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/mnt/models/model.file, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True,
INFO 02-09 19:00:47 cuda.py:230] Using Flash Attention backend.
INFO 02-09 19:00:47 model_runner.py:1110] Starting to load model /mnt/models/model.file...
/opt/vllm/lib64/python3.12/site-packages/torch/nested/__init__.py:226: UserWarning: The PyTorch API of nested tensors is in prototype stage and will change in the near future. (Triggered internally at ../aten/src/ATen/NestedTensorImpl.cpp:178.)
return _nested.nested_tensor(
INFO 02-09 19:01:03 model_runner.py:1115] Loading model weights took 4.3979 GB
INFO 02-09 19:01:08 worker.py:267] Memory profiling takes 4.31 seconds
INFO 02-09 19:01:08 worker.py:267] the current vLLM instance can use total_gpu_memory (23.57GiB) x gpu_memory_utilization (0.90) = 21.21GiB
INFO 02-09 19:01:08 worker.py:267] model weights take 4.40GiB; non_torch_memory takes 0.11GiB; PyTorch activation peak memory takes 1.39GiB; the rest of the memory reserved for KV Cache is 15.31GiB.
INFO 02-09 19:01:08 executor_base.py:110] # CUDA blocks: 17913, # CPU blocks: 4681
INFO 02-09 19:01:08 executor_base.py:115] Maximum concurrency for 2048 tokens per request: 139.95x
INFO 02-09 19:01:12 model_runner.py:1434] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|██████████| 35/35 [00:28<00:00, 1.24it/s]
INFO 02-09 19:01:40 model_runner.py:1562] Graph capturing finished in 28 secs, took 0.12 GiB
INFO 02-09 19:01:40 llm_engine.py:431] init engine (profile, create kv cache, warmup model) took 36.65 seconds
/opt/vllm/lib64/python3.12/site-packages/vllm_tgis_adapter/http.py:49: RuntimeWarning: coroutine 'init_app_state' was never awaited
init_app_state(engine, model_config, app.state, args)
RuntimeWarning: Enable tracemalloc to get the object allocation traceback
INFO 02-09 19:01:40 launcher.py:21] Available routes are:
INFO 02-09 19:01:40 launcher.py:29] Route: /openapi.json, Methods: HEAD, GET
INFO 02-09 19:01:40 launcher.py:29] Route: /docs, Methods: HEAD, GET
INFO 02-09 19:01:40 launcher.py:29] Route: /docs/oauth2-redirect, Methods: HEAD, GET
INFO 02-09 19:01:40 launcher.py:29] Route: /redoc, Methods: HEAD, GET
INFO 02-09 19:01:40 launcher.py:29] Route: /health, Methods: GET
INFO 02-09 19:01:40 launcher.py:29] Route: /ping, Methods: GET, POST
INFO 02-09 19:01:40 launcher.py:29] Route: /tokenize, Methods: POST
INFO 02-09 19:01:40 launcher.py:29] Route: /detokenize, Methods: POST
INFO 02-09 19:01:40 launcher.py:29] Route: /v1/models, Methods: GET
INFO 02-09 19:01:40 launcher.py:29] Route: /version, Methods: GET
INFO 02-09 19:01:40 launcher.py:29] Route: /v1/chat/completions, Methods: POST
INFO 02-09 19:01:40 launcher.py:29] Route: /v1/completions, Methods: POST
INFO 02-09 19:01:40 launcher.py:29] Route: /v1/embeddings, Methods: POST
INFO 02-09 19:01:40 launcher.py:29] Route: /pooling, Methods: POST
INFO 02-09 19:01:40 launcher.py:29] Route: /score, Methods: POST
INFO 02-09 19:01:40 launcher.py:29] Route: /v1/score, Methods: POST
INFO 02-09 19:01:40 launcher.py:29] Route: /rerank, Methods: POST
INFO 02-09 19:01:40 launcher.py:29] Route: /v1/rerank, Methods: POST
INFO 02-09 19:01:40 launcher.py:29] Route: /v2/rerank, Methods: POST
INFO 02-09 19:01:40 launcher.py:29] Route: /invocations, Methods: POST
WARNING 02-09 19:01:40 grpc_server.py:194] TGIS Metrics currently disabled in decoupled front-end mode, set DISABLE_FRONTEND_MULTIPROCESSING=True to enable
ERROR: Traceback (most recent call last):
File "/opt/vllm/lib64/python3.12/site-packages/starlette/datastructures.py", line 673, in __getattr__
return self._state[key]
~~~~~~~~~~~^^^^^
KeyError: 'log_stats'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/vllm/lib64/python3.12/site-packages/starlette/routing.py", line 693, in lifespan
async with self.lifespan_context(app) as maybe_state:
File "/usr/lib64/python3.12/contextlib.py", line 210, in __aenter__
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/vllm/lib64/python3.12/site-packages/fastapi/routing.py", line 133, in merged_lifespan
async with original_context(app) as maybe_original_state:
File "/usr/lib64/python3.12/contextlib.py", line 210, in __aenter__
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/vllm/lib64/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 100, in lifespan
if app.state.log_stats:
^^^^^^^^^^^^^^^^^^^
File "/opt/vllm/lib64/python3.12/site-packages/starlette/datastructures.py", line 676, in __getattr__
raise AttributeError(message.format(self.__class__.__name__, key))
AttributeError: 'State' object has no attribute 'log_stats'
INFO 02-09 19:01:40 grpc_server.py:946] gRPC Server started at 0.0.0.0:8033
ERROR: Application startup failed. Exiting.
Gracefully stopping gRPC server
[rank0]:[W209 19:01:41.297903495 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
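The crash does not look related to the GPU or the model itself: the `RuntimeWarning: coroutine 'init_app_state' was never awaited` from `vllm_tgis_adapter/http.py` means the async setup that populates `app.state` (including `log_stats`) never actually runs, so vLLM's lifespan hook later fails with `AttributeError: 'State' object has no attribute 'log_stats'`. Below is a minimal sketch of that failure pattern, not the adapter's actual code (the real `init_app_state` takes the engine, model config, app state, and args); it only assumes `starlette` is installed:

```python
import asyncio

from starlette.datastructures import State


async def init_app_state(state: State) -> None:
    # Simplified stand-in for vLLM's setup coroutine, which (among other
    # things) sets app.state.log_stats before the server starts serving.
    state.log_stats = True


# Bug pattern: calling a coroutine function without awaiting it just creates
# a coroutine object and discards it -- the body never runs, which is what the
# "coroutine 'init_app_state' was never awaited" warning above is about.
broken = State()
init_app_state(broken)  # not awaited
try:
    print(broken.log_stats)
except AttributeError as exc:
    # Same error as in the traceback:
    # 'State' object has no attribute 'log_stats'
    print("broken:", exc)

# Awaiting the coroutine (vLLM does this inside its running event loop)
# actually executes it, so the attribute exists when the lifespan hook reads it.
fixed = State()
asyncio.run(init_app_state(fixed))
print("fixed:", fixed.log_stats)
```

In other words, the adapter needs to await (or schedule) `init_app_state(...)` before the OpenAI server's lifespan check runs; nothing in the log points at a problem with the 3090 or the GGUF weights.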
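As an aside, the memory-profiling numbers earlier in the log are internally consistent, so the 24 GB card is not the limiting factor here. A quick sanity check of that arithmetic (the KV-cache block size of 16 tokens is an assumption; this log does not print it):

```python
# Reproduces the arithmetic from the worker.py / executor_base.py lines above.
total_gpu_memory = 23.57           # GiB reported for the RTX 3090
gpu_memory_utilization = 0.90
budget = total_gpu_memory * gpu_memory_utilization          # ~21.21 GiB

weights, non_torch, activations = 4.40, 0.11, 1.39          # GiB
kv_cache = budget - weights - non_torch - activations       # ~15.31 GiB

block_size = 16                    # tokens per KV-cache block (assumed default)
num_gpu_blocks = 17913
max_seq_len = 2048
concurrency = num_gpu_blocks * block_size / max_seq_len     # ~139.95x

print(round(budget, 2), round(kv_cache, 2), round(concurrency, 2))
```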