Description
Component
Helm Chart
Desired use case or feature
Currently the llm-d Helm chart supports two protocols for `sampleApplication.model.modelArtifactURI`:
- `hf://`: pulls the model just-in-time from HuggingFace when vLLM starts up.
- `pvc://`: pulls the model from a PVC. Optionally, a model can be downloaded with a transfer job from HuggingFace and stored in the specified PVC.
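For reference, a minimal sketch of how these two protocols appear in the chart values today (the model and PVC names are illustrative):

```yaml
sampleApplication:
  model:
    # Pull just-in-time from HuggingFace when vLLM starts up
    modelArtifactURI: hf://meta-llama/Llama-3.2-1B-Instruct
    # Or load from a pre-populated PVC
    # modelArtifactURI: pvc://model-cache/meta-llama/Llama-3.2-1B-Instruct
```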
vLLM also supports streaming a model directly from object storage with higher concurrency via `--load-format runai_streamer`
(docs). This allows loading from an object storage backend or filesystem, rather than using HuggingFace or a PVC with the default loader.
Proposed solution
In order to use the model streamer, vLLM needs additional command line arguments:
- `--load-format runai_streamer`
- Model name: can be specified either through the `--model` argument or directly as a served model name (e.g. `--model=s3://<path-to-model>` or `vllm serve s3://<path-to-model>`).
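Putting these together, a minimal sketch of how the resulting vLLM container arguments might look when streaming from S3 (the bucket path and the exact arg layout are illustrative assumptions, not the chart's current output):

```yaml
# Illustrative container args only; the actual spec rendered by the chart may differ.
args:
  - "serve"
  - "s3://example-bucket/models/llama-3.2-1b"   # served model name / path (assumed bucket layout)
  - "--load-format"
  - "runai_streamer"
```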
I propose to add an optional `.modelService.vllm.loadFormat` parameter to the Helm chart. When set to `runai_streamer`, relax the "Protocol" constraint (remove the model source check when the `runai_streamer` vLLM load format is specified). The `loadFormat` will also pass the `--load-format` command line argument to vLLM.
- When `pvc://` is specified as the protocol, the `pvc://` protocol can continue to be used; the model streamer can simply reference the path, as is done with the default loader today.
- When the protocol is not recognized (e.g. `s3://`), the `modelArtifactsURI` will be used as the model name, passing `s3://<path-to-model>` through as the served model argument to vLLM, as is done today for the PVC case (the PVC protocol path suffix is passed in as the `.ModelPath`); see the values sketch below.
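A minimal sketch of the proposed values covering both cases (`.modelService.vllm.loadFormat` is the new, proposed key; the bucket, PVC, and model names are illustrative):

```yaml
modelService:
  vllm:
    loadFormat: runai_streamer   # proposed key; passed to vLLM as --load-format runai_streamer
sampleApplication:
  model:
    # Case 1: pvc:// continues to work; the streamer reads the mounted path
    # modelArtifactURI: pvc://model-cache/meta-llama/Llama-3.2-1B-Instruct
    # Case 2: an unrecognized protocol is passed through as the served model name
    modelArtifactURI: s3://example-bucket/models/llama-3.2-1b
```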
Additionally, loading can be tuned with the `--model-loader-extra-config` parameter or with environment variables passed to vLLM. Command line args can be passed in through `.sampleApplication.decode.extraArgs` or `.sampleApplication.prefill.extraArgs` today, but there may be a more optimal way of passing these parameters consistently to all instances of vLLM (e.g. `.modelService.vllm.extraArgs` and `.modelService.vllm.extraEnvVars` parameters).
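As a sketch of the two approaches, with the caveat that the `--model-loader-extra-config` payload and the `RUNAI_STREAMER_CONCURRENCY` variable are illustrative tuning knobs, and the `.modelService.vllm.*` keys are hypothetical:

```yaml
# Today: per-role tuning through the sample application
sampleApplication:
  decode:
    extraArgs:
      - "--model-loader-extra-config"
      - '{"concurrency": 32}'   # illustrative streamer tuning payload
  prefill:
    extraArgs:
      - "--model-loader-extra-config"
      - '{"concurrency": 32}'

# Hypothetical: one place that applies to every vLLM instance
modelService:
  vllm:
    extraArgs:
      - "--model-loader-extra-config"
      - '{"concurrency": 32}'
    extraEnvVars:
      - name: RUNAI_STREAMER_CONCURRENCY   # illustrative env var for streamer tuning
        value: "32"
```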
Alternatives
Another option could be to add a new `runai_streamer` "Protocol" to the `modelArtifactsURI` chart parameter. This could encode either the object storage URI or a filesystem path.
- If an object storage system is used, the `runai_streamer` protocol would be unwrapped to identify the underlying model protocol. For example, `runai_streamer://s3://<path_to_model>` would allow the suffix `s3://<path_to_model>` to be used as the model name for the inference server.
- If a local filesystem is used, this complicates things, as the user may want to specify a PVC. So this may require wrapping protocols (e.g. `runai_streamer://pvc://<pvc_name>/<path_to_model>`).
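For comparison, a sketch of what this wrapped-protocol alternative could look like in the chart values (none of this exists today; names are illustrative):

```yaml
sampleApplication:
  model:
    # Object storage wrapped in the runai_streamer protocol
    modelArtifactURI: runai_streamer://s3://example-bucket/models/llama-3.2-1b
    # PVC wrapped in the runai_streamer protocol
    # modelArtifactURI: runai_streamer://pvc://model-cache/models/llama-3.2-1b
```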
I think this option is less intuitive for the end user, as it could lead to a complex `modelArtifactURI` and more challenging unnesting logic in the llm-d launcher script.
Additional context or screenshots
No response