Description
Component
Helm Chart
Desired use case or feature
Currently the llm-d Helm chart supports two protocols for `sampleApplication.model.modelArtifactURI`:
- `hf://`: pulls the model just-in-time from HuggingFace when vLLM starts up.
- `pvc://`: pulls the model from a PVC. Optionally, a model can be downloaded with a transfer job from HuggingFace and stored in the specified PVC.
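For reference, a minimal sketch of how these two protocols appear in the chart values today (the model and PVC names are illustrative):

```yaml
sampleApplication:
  model:
    # Pull just-in-time from HuggingFace when vLLM starts up
    modelArtifactURI: hf://meta-llama/Llama-3.2-1B-Instruct
    # Or load from a pre-populated PVC
    # modelArtifactURI: pvc://model-cache/meta-llama/Llama-3.2-1B-Instruct
```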
vLLM also supports streaming a model directly from object storage with higher concurrency via `--load-format runai_streamer`
(docs). This allows loading from an object storage backend or filesystem, rather than using HuggingFace or a PVC with the default loader.
Proposed solution
In order to use the model streamer, vLLM needs additional command line arguments:
- `--load-format runai_streamer`
- Model name: can be specified either through the `--model` argument or directly as a served model name (e.g. `--model=s3://<path-to-model>` or `vllm serve s3://<path-to-model>`).
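Putting these together, a minimal sketch of how the resulting vLLM container arguments might look when streaming from S3 (the bucket path and the exact arg layout are illustrative assumptions, not the chart's current output):

```yaml
# Illustrative container args only; the actual spec rendered by the chart may differ.
args:
  - "serve"
  - "s3://example-bucket/models/llama-3.2-1b"   # served model name / path (assumed bucket layout)
  - "--load-format"
  - "runai_streamer"
```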
I propose to add an optional `.modelService.vllm.loadFormat` parameter to the Helm chart. When set to `runai_streamer`, relax the "Protocol" constraint (remove the model source check when the `runai_streamer` vLLM load format is specified). The `loadFormat` will also pass the `--load-format` command line argument to vLLM.
- When `pvc://` is specified as the protocol, the `pvc://` protocol can continue to be used; the model streamer can simply reference the path, as is done with the default loader today.
- When the protocol is not recognized (e.g. `s3://`), the `modelArtifactsURI` will be used as the model name, passing `s3://<path-to-model>` through as the served model argument to vLLM, as is done today for the PVC case (the PVC protocol path suffix is passed in as the `.ModelPath`); see the values sketch below.
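A minimal sketch of the proposed values covering both cases (`.modelService.vllm.loadFormat` is the new, proposed key; the bucket, PVC, and model names are illustrative):

```yaml
modelService:
  vllm:
    loadFormat: runai_streamer   # proposed key; passed to vLLM as --load-format runai_streamer
sampleApplication:
  model:
    # Case 1: pvc:// continues to work; the streamer reads the mounted path
    # modelArtifactURI: pvc://model-cache/meta-llama/Llama-3.2-1B-Instruct
    # Case 2: an unrecognized protocol is passed through as the served model name
    modelArtifactURI: s3://example-bucket/models/llama-3.2-1b
```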
Additionally, loading can be tuned with the `--model-loader-extra-config` parameter or with environment variables passed to vLLM. Command line args can be passed in through `.sampleApplication.decode.extraArgs` or `.sampleApplication.prefill.extraArgs` today, but there may be a more optimal way of passing these parameters consistently to all instances of vLLM (e.g. `.modelService.vllm.extraArgs` and `.modelService.vllm.extraEnvVars` parameters).
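As a sketch of the two approaches, with the caveat that the `--model-loader-extra-config` payload and the `RUNAI_STREAMER_CONCURRENCY` variable are illustrative tuning knobs, and the `.modelService.vllm.*` keys are hypothetical:

```yaml
# Today: per-role tuning through the sample application
sampleApplication:
  decode:
    extraArgs:
      - "--model-loader-extra-config"
      - '{"concurrency": 32}'   # illustrative streamer tuning payload
  prefill:
    extraArgs:
      - "--model-loader-extra-config"
      - '{"concurrency": 32}'

# Hypothetical: one place that applies to every vLLM instance
modelService:
  vllm:
    extraArgs:
      - "--model-loader-extra-config"
      - '{"concurrency": 32}'
    extraEnvVars:
      - name: RUNAI_STREAMER_CONCURRENCY   # illustrative env var for streamer tuning
        value: "32"
```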
Alternatives
Another option could be to add a new `runai_streamer` "Protocol" to the `modelArtifactsURI` chart parameter. This could encode either the object storage URI or a filesystem path.
- If an object storage system is used, the `runai_streamer` protocol would be unwrapped to identify the underlying model protocol. For example, `runai_streamer://s3://<path_to_model>` would allow the suffix `s3://<path_to_model>` to be used as the model name for the inference server.
- If a local filesystem is used, this complicates things, as the user may want to specify a PVC. So this may require wrapping protocols (e.g. `runai_streamer://pvc://<pvc_name>/<path_to_model>`).
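For comparison, a sketch of what this wrapped-protocol alternative could look like in the chart values (none of this exists today; names are illustrative):

```yaml
sampleApplication:
  model:
    # Object storage wrapped in the runai_streamer protocol
    modelArtifactURI: runai_streamer://s3://example-bucket/models/llama-3.2-1b
    # PVC wrapped in the runai_streamer protocol
    # modelArtifactURI: runai_streamer://pvc://model-cache/models/llama-3.2-1b
```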
I think this option is less intuitive for the end user, as it could lead to a complex `modelArtifactURI` and more challenging unnesting logic in the llm-d launcher script.
Additional context or screenshots
No response