When I tried to spin up the recipe for inference on an H100 GPU, I hit the following GPU memory error. It indicates that the 128k max_model_len is too large: the KV cache needed for a single full-length request does not fit in the available GPU memory.
(EngineCore_DP0 pid=121) ValueError: To serve at least one request with the models's max seq len (131072), (13.66 GiB KV cache is needed, which is larger than the available KV cache memory (8.85 GiB). Based on the available memory, the estimated maximum model length is 68000. Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine.
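
A possible workaround, following the hint in the error message, is to cap `max_model_len` below the estimated maximum of 68000. The sketch below is untested on this recipe and rests on several assumptions: that the recipe's engine is vLLM (suggested by the log format), that the helper `pick_max_model_len` and the rounding to a multiple of 1024 are purely illustrative, and that `<model-id>` stands in for whatever model the recipe loads. Only `max_model_len` and `gpu_memory_utilization` come from the error message itself.

```python
def pick_max_model_len(estimated_max: int, multiple: int = 1024) -> int:
    """Round the engine's estimated maximum length down to a multiple of `multiple`.

    Hypothetical helper: the rounding is a convenience, not an engine requirement.
    """
    if estimated_max < multiple:
        raise ValueError("estimated maximum is smaller than the rounding multiple")
    return (estimated_max // multiple) * multiple


# Value reported in the error message above.
ESTIMATED_MAX_LEN = 68000

engine_kwargs = {
    # Stay at or below the estimate so one full-length request fits in the KV cache.
    "max_model_len": pick_max_model_len(ESTIMATED_MAX_LEN),
    # Assumption: a value slightly above the 0.9 default; raising it leaves less
    # headroom for other GPU allocations, so treat it as a guess to tune.
    "gpu_memory_utilization": 0.95,
}
print(engine_kwargs)

# Assumed usage with vLLM (not run here; requires a GPU and the vllm package):
# from vllm import LLM
# llm = LLM(model="<model-id>", **engine_kwargs)
```

With the estimate from the log, this selects a context length of 67584 tokens, roughly half of the 131072 the recipe requests, so it only helps if the workload does not need the full 128k window.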