Add support for loading models using the Run:AI Model Streamer #317

@pwschuurman

Description

Component

Helm Chart

Desired use case or feature

Currently, the llm-d Helm chart supports two protocols for sampleApplication.model.modelArtifactURI (example values are sketched after this list):

  • hf://: Pulls the model just-in-time when vLLM starts up.
  • pvc://: Pulls the model from a PVC. Optionally, a model can first be downloaded from HuggingFace with a transfer job and stored in the specified PVC.
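
For reference, a minimal sketch of the two existing forms (the model ID, PVC name, and paths are illustrative placeholders):

```yaml
sampleApplication:
  model:
    # Pulled by vLLM at startup from HuggingFace (illustrative model ID)
    modelArtifactURI: hf://meta-llama/Llama-3.1-8B-Instruct
    # Or served from a pre-populated PVC (illustrative PVC name and path)
    # modelArtifactURI: pvc://model-cache-pvc/models/llama-3.1-8b-instruct
```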

vLLM also supports streaming a model directly from object storage, with higher concurrency, via --load-format runai_streamer (docs). This allows loading from an object storage backend or filesystem rather than relying on HuggingFace or a PVC with the default loader.

Proposed solution

To use the model streamer, vLLM needs additional command line arguments (a sketch follows the list below):

  • --load-format runai_streamer
  • Model name: Can be specified either through the --model argument or directly on the command line as the served model (eg: --model=s3://<path-to-model> or vllm serve s3://<path-to-model>).
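
For illustration, a rough sketch of how the serving container could end up invoking vLLM (the chart actually renders args through its launcher script, and the S3 path is a placeholder):

```yaml
# Illustrative container spec fragment only; the real chart renders these
# through its launcher script. The S3 path is a placeholder.
command: ["vllm", "serve"]
args:
  - "s3://my-bucket/models/llama-3.1-8b-instruct"   # served model name / object storage path
  - "--load-format"
  - "runai_streamer"
```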

I propose adding an optional .modelService.vllm.loadFormat parameter to the Helm chart; a values sketch follows the list below. When set to runai_streamer, the "Protocol" constraint is relaxed (the model source check is skipped when the runai_streamer vLLM load format is specified), and the value is also passed to vLLM as the --load-format command line argument.

  • When pvc:// is specified as the protocol, it continues to work as it does today: the model streamer simply references the same path the default loader uses.
  • When the protocol is not recognized (eg: s3://), the modelArtifactURI is used as the model name, so s3://<path-to-model> is passed through as the served model argument to vLLM, mirroring what is done today for the PVC case (the PVC protocol's path suffix is passed in as .ModelPath).
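
A minimal values sketch of this proposal (loadFormat is the new, hypothetical parameter; the bucket and path are placeholders):

```yaml
# Hypothetical values.yaml excerpt: loadFormat is the proposed new parameter.
modelService:
  vllm:
    loadFormat: runai_streamer   # adds --load-format runai_streamer to the vLLM invocation
sampleApplication:
  model:
    # With loadFormat set, the protocol check is relaxed and this URI is passed
    # through to vLLM as the served model name (placeholder bucket/path).
    modelArtifactURI: s3://my-bucket/models/llama-3.1-8b-instruct
```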

Additionally, loading can be tuned with the --model-loader-extra-config argument or with environment variables passed to vLLM. Command line args can be passed in through .sampleApplication.decode.extraArgs or .sampleApplication.prefill.extraArgs today, but there may be a better way to pass these parameters consistently to all instances of vLLM (eg: .modelService.vllm.extraArgs and .modelService.vllm.extraEnvVars parameters). A sketch of what is possible with today's per-role parameters follows.
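
Below is a rough sketch using today's per-role extraArgs (decode shown; prefill would mirror it). The concurrency value is illustrative, and the JSON key comes from the vLLM Run:AI streamer documentation as I understand it, so it should be verified there:

```yaml
# Rough sketch: per-role tuning via the existing extraArgs parameters.
# The concurrency value is illustrative; verify the JSON keys against the
# vLLM Run:AI streamer docs.
sampleApplication:
  decode:
    extraArgs:
      - "--load-format"
      - "runai_streamer"
      - "--model-loader-extra-config"
      - '{"concurrency": 16}'
```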

Alternatives

Another option could be to add a new runai_streamer "Protocol" to the modelArtifactURI chart parameter. This protocol could encode either an object storage URI or a filesystem path (examples after the list below).

  • If an object storage system is used, the runai_streamer protocol would be unwrapped to identify the underlying model protocol. For example runai_streamer://s3://<path_to_model> would allow the suffix s3://<path_to_model> to be used as the model name for the inference server.
  • If a local filesystem is used, this complicates things, as the user may want to specify a PVC, so this may require wrapping protocols (eg: runai_streamer://pvc://<pvc_name>/<path_to_model>).
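
For comparison, the wrapped URIs under this alternative would look roughly like the following (bucket, PVC name, and paths are placeholders):

```yaml
# Alternative: encode the streamer in the protocol itself (placeholders shown).
sampleApplication:
  model:
    # Object storage wrapped by the streamer protocol
    modelArtifactURI: runai_streamer://s3://my-bucket/models/llama-3.1-8b-instruct
    # PVC wrapped by the streamer protocol (nested protocols get awkward)
    # modelArtifactURI: runai_streamer://pvc://model-cache-pvc/models/llama-3.1-8b-instruct
```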

I think this option is less intuitive for the end user, as it could lead to a complex modelArtifactURI and more challenging un-nesting logic in the llm-d launcher script.

Additional context or screenshots

No response
