
Commit 4933fa6

Merge pull request #77 from VectorInstitute/feature/update-docs
Update docs
2 parents a339b9c + 6c08b33 commit 4933fa6

4 files changed: +105 −65 lines changed

README.md

Lines changed: 58 additions & 26 deletions
@@ -8,7 +8,7 @@
[![codecov](https://codecov.io/github/VectorInstitute/vector-inference/branch/develop/graph/badge.svg?token=NI88QSIGAC)](https://app.codecov.io/github/VectorInstitute/vector-inference/tree/develop)
![GitHub License](https://img.shields.io/github/license/VectorInstitute/vector-inference)

-This repository provides an easy-to-use solution to run inference servers on [Slurm](https://slurm.schedmd.com/overview.html)-managed computing clusters using [vLLM](https://docs.vllm.ai/en/latest/). **All scripts in this repository runs natively on the Vector Institute cluster environment**. To adapt to other environments, update [`launch_server.sh`](vec_inf/launch_server.sh), [`vllm.slurm`](vec_inf/vllm.slurm), [`multinode_vllm.slurm`](vec_inf/multinode_vllm.slurm) and [`models.csv`](vec_inf/config/models.yaml) accordingly.
+This repository provides an easy-to-use solution to run inference servers on [Slurm](https://slurm.schedmd.com/overview.html)-managed computing clusters using [vLLM](https://docs.vllm.ai/en/latest/). **All scripts in this repository run natively on the Vector Institute cluster environment**. To adapt to other environments, update the environment variables in [`cli/_helper.py`](vec_inf/cli/_helper.py), [`cli/_config.py`](vec_inf/cli/_config.py), [`vllm.slurm`](vec_inf/vllm.slurm), [`multinode_vllm.slurm`](vec_inf/multinode_vllm.slurm) and [`models.yaml`](vec_inf/config/models.yaml) accordingly.

## Installation
If you are using the Vector cluster environment, and you don't need any customization to the inference server environment, run the following to install the package:
@@ -22,8 +22,7 @@ Otherwise, we recommend using the provided [`Dockerfile`](Dockerfile) to set up

### `launch` command

-The `launch` command allows users to deploy a model as a slurm job. If the job successfully launches, a URL endpoint is exposed for
-the user to send requests for inference.
+The `launch` command allows users to deploy a model as a Slurm job. If the job launches successfully, a URL endpoint is exposed for the user to send inference requests.

We will use the Llama 3.1 model as an example. To launch an OpenAI-compatible inference server for Meta-Llama-3.1-8B-Instruct, run:

@@ -32,11 +31,12 @@ vec-inf launch Meta-Llama-3.1-8B-Instruct
```
You should see an output like the following:

-<img width="600" alt="launch_img" src="https://github.com/user-attachments/assets/ab658552-18b2-47e0-bf70-e539c3b898d5">
+<img width="600" alt="launch_img" src="https://github.com/user-attachments/assets/883e6a5b-8016-4837-8fdf-39097dfb18bf">
+

#### Overrides

-Models that are already supported by `vec-inf` would be launched using the [default parameters](vec_inf/config/models.yaml). You can override these values by providing additional parameters. Use `vec-inf launch --help` to see the full list of parameters that can be
+Models that are already supported by `vec-inf` are launched using the cached configuration or the [default configuration](vec_inf/config/models.yaml). You can override these values by providing additional parameters. Use `vec-inf launch --help` to see the full list of parameters that can be
overridden. For example, if `qos` is to be overridden:

```bash
@@ -46,11 +46,11 @@ vec-inf launch Meta-Llama-3.1-8B-Instruct --qos <new_qos>
#### Custom models

You can also launch your own custom model as long as the model architecture is [supported by vLLM](https://docs.vllm.ai/en/stable/models/supported_models.html); make sure to follow the instructions below:
-* Your model weights directory naming convention should follow `$MODEL_FAMILY-$MODEL_VARIANT`.
+* Your model weights directory naming convention should follow `$MODEL_FAMILY-$MODEL_VARIANT` (`$MODEL_VARIANT` is optional).
* Your model weights directory should contain HuggingFace format weights.
-* You should create a custom configuration file for your model and specify its path via setting the environment variable `VEC_INF_CONFIG`
-Check the [default parameters](vec_inf/config/models.yaml) file for the format of the config file. All the parameters for the model
-should be specified in that config file.
+* You should specify your model configuration by:
+  * Creating a custom configuration file for your model and specifying its path by setting the environment variable `VEC_INF_CONFIG`. Check the [default parameters](vec_inf/config/models.yaml) file for the format of the config file. All the parameters for the model should be specified in that config file.
+  * Using launch command options to specify your model setup.
* For other model launch parameters, you can reference the default values for similar models using the [`list` command](#list-command).

Here is an example to deploy a custom [Qwen2.5-7B-Instruct-1M](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-1M) model which is not
@@ -64,7 +64,7 @@ models:
    model_family: Qwen2.5
    model_variant: 7B-Instruct-1M
    model_type: LLM
-    num_gpus: 2
+    gpus_per_node: 1
    num_nodes: 1
    vocab_size: 152064
    max_model_len: 1010000
@@ -74,9 +74,6 @@ models:
    qos: m2
    time: 08:00:00
    partition: a40
-    data_type: auto
-    venv: singularity
-    log_dir: default
    model_weights_parent_dir: /h/<username>/model-weights
```

@@ -86,17 +83,21 @@ You would then set the `VEC_INF_CONFIG` path using:
export VEC_INF_CONFIG=/h/<username>/my-model-config.yaml
```

-Alternatively, you can also use launch parameters to set these values instead of using a user-defined config.
+Note that there are other parameters that can also be added to the config but are not shown in this example, such as `data_type` and `log_dir`.

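Putting the example above together, a minimal sketch of launching the custom model (the config path and the model name follow the example configuration above; adjust them to your setup):

```bash
# Point vec-inf at the custom model configuration (path from the example above)
export VEC_INF_CONFIG=/h/<username>/my-model-config.yaml

# Launch the custom model using its $MODEL_FAMILY-$MODEL_VARIANT name
vec-inf launch Qwen2.5-7B-Instruct-1M
```
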
### `status` command
You can check the inference server status by providing the Slurm job ID to the `status` command:
```bash
-vec-inf status 13014393
+vec-inf status 15373800
```

-You should see an output like the following:
+If the server is pending for resources, you should see an output like this:
+
+<img width="400" alt="status_pending_img" src="https://github.com/user-attachments/assets/b659c302-eae1-4560-b7a9-14eb3a822a2f">

-<img width="400" alt="status_img" src="https://github.com/user-attachments/assets/7385b9ca-9159-4ca9-bae2-7e26d80d9747">
+When the server is ready, you should see an output like this:
+
+<img width="400" alt="status_ready_img" src="https://github.com/user-attachments/assets/672986c2-736c-41ce-ac7c-1fb585cdcb0d">

There are 5 possible states:

@@ -111,19 +112,19 @@ Note that the base URL is only available when model is in `READY` state, and if
### `metrics` command
Once your server is ready, you can check performance metrics by providing the Slurm job ID to the `metrics` command:
```bash
-vec-inf metrics 13014393
+vec-inf metrics 15373800
```

-And you will see the performance metrics streamed to your console, note that the metrics are updated with a 10-second interval.
+And you will see the performance metrics streamed to your console; note that the metrics are updated at a 2-second interval.

-<img width="400" alt="metrics_img" src="https://github.com/user-attachments/assets/e5ff2cd5-659b-4c88-8ebc-d8f3fdc023a4">
+<img width="400" alt="metrics_img" src="https://github.com/user-attachments/assets/3ee143d0-1a71-4944-bbd7-4c3299bf0339">

### `shutdown` command
Finally, when you're finished using a model, you can shut it down by providing the Slurm job ID:
```bash
-vec-inf shutdown 13014393
+vec-inf shutdown 15373800

-> Shutting down model with Slurm Job ID: 13014393
+> Shutting down model with Slurm Job ID: 15373800
```

### `list` command
@@ -133,19 +134,50 @@ vec-inf list
```
<img width="940" alt="list_img" src="https://github.com/user-attachments/assets/8cf901c4-404c-4398-a52f-0486f00747a3">

+NOTE: The above screenshot does not represent the full list of models supported.

You can also view the default setup for a specific supported model by providing the model name, for example `Meta-Llama-3.1-70B-Instruct`:
```bash
vec-inf list Meta-Llama-3.1-70B-Instruct
```
-<img width="400" alt="list_model_img" src="https://github.com/user-attachments/assets/30e42ab7-dde2-4d20-85f0-187adffefc3d">
+<img width="500" alt="list_model_img" src="https://github.com/user-attachments/assets/34e53937-2d86-443e-85f6-34e408653ddb">

The `launch`, `list`, and `status` commands support `--json-mode`, where the command output is structured as a JSON string.

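As a quick illustration of `--json-mode` (the exact flag placement alongside other arguments is an assumption here; consult each command's `--help`):

```bash
# Print the list of supported models as a JSON string instead of a table
vec-inf list --json-mode

# The same flag works with `status`; the job ID follows the examples above
vec-inf status 15373800 --json-mode
```
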
## Send inference requests
-Once the inference server is ready, you can start sending in inference requests. We provide example scripts for sending inference requests in [`examples`](examples) folder. Make sure to update the model server URL and the model weights location in the scripts. For example, you can run `python examples/inference/llm/completions.py`, and you should expect to see an output like the following:
-> {"id":"cmpl-c08d8946224747af9cce9f4d9f36ceb3","object":"text_completion","created":1725394970,"model":"Meta-Llama-3.1-8B-Instruct","choices":[{"index":0,"text":" is a question that many people may wonder. The answer is, of course, Ottawa. But if","logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":8,"total_tokens":28,"completion_tokens":20}}
-
+Once the inference server is ready, you can start sending inference requests. We provide example scripts for sending inference requests in the [`examples`](examples) folder. Make sure to update the model server URL and the model weights location in the scripts. For example, you can run `python examples/inference/llm/chat_completions.py`, and you should expect to see an output like the following:
+
+```json
+{
+  "id": "chatcmpl-387c2579231948ffaf66cdda5439d3dc",
+  "choices": [
+    {
+      "finish_reason": "stop",
+      "index": 0,
+      "logprobs": null,
+      "message": {
+        "content": "Arrr, I be Captain Chatbeard, the scurviest chatbot on the seven seas! Ye be wantin' to know me identity, eh? Well, matey, I be a swashbucklin' AI, here to provide ye with answers and swappin' tales, savvy?",
+        "role": "assistant",
+        "function_call": null,
+        "tool_calls": [],
+        "reasoning_content": null
+      },
+      "stop_reason": null
+    }
+  ],
+  "created": 1742496683,
+  "model": "Meta-Llama-3.1-8B-Instruct",
+  "object": "chat.completion",
+  "system_fingerprint": null,
+  "usage": {
+    "completion_tokens": 66,
+    "prompt_tokens": 32,
+    "total_tokens": 98,
+    "prompt_tokens_details": null
+  },
+  "prompt_logprobs": null
+}
+```
**NOTE**: For multimodal models, currently only `ChatCompletion` is available, and only one image can be provided for each prompt.

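Since the server exposes an OpenAI-compatible API, a request can also be sketched without the example scripts. The `curl` call below is an illustration rather than a repo-provided example: `<base_url>` stands for the base URL reported by `vec-inf status` (assumed here to already include the `/v1` prefix), and the prompt is arbitrary.

```bash
# Hypothetical sketch: <base_url> is the server URL reported by `vec-inf status`
# once the job is READY; /chat/completions is the standard OpenAI-compatible route.
curl -s "<base_url>/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Meta-Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Who are you?"}]
      }'
```
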
## SSH tunnel from your local device

docs/source/index.md

Lines changed: 1 addition & 1 deletion
@@ -11,7 +11,7 @@ user_guide
```

-This repository provides an easy-to-use solution to run inference servers on [Slurm](https://slurm.schedmd.com/overview.html)-managed computing clusters using [vLLM](https://docs.vllm.ai/en/latest/). **All scripts in this repository runs natively on the Vector Institute cluster environment**. To adapt to other environments, update [`launch_server.sh`](https://github.com/VectorInstitute/vector-inference/blob/main/vec_inf/launch_server.sh), [`vllm.slurm`](https://github.com/VectorInstitute/vector-inference/blob/main/vec_inf/vllm.slurm), [`multinode_vllm.slurm`](https://github.com/VectorInstitute/vector-inference/blob/main/vec_inf/multinode_vllm.slurm) and [`models.yaml`](https://github.com/VectorInstitute/vector-inference/blob/main/vec_inf/config/models.yaml) accordingly.
+This repository provides an easy-to-use solution to run inference servers on [Slurm](https://slurm.schedmd.com/overview.html)-managed computing clusters using [vLLM](https://docs.vllm.ai/en/latest/). **All scripts in this repository run natively on the Vector Institute cluster environment**. To adapt to other environments, update the environment variables in [`cli/_helper.py`](https://github.com/VectorInstitute/vector-inference/blob/main/vec_inf/cli/_helper.py), [`cli/_config.py`](https://github.com/VectorInstitute/vector-inference/blob/main/vec_inf/cli/_config.py), [`vllm.slurm`](https://github.com/VectorInstitute/vector-inference/blob/main/vec_inf/vllm.slurm), [`multinode_vllm.slurm`](https://github.com/VectorInstitute/vector-inference/blob/main/vec_inf/multinode_vllm.slurm), and model configurations in [`models.yaml`](https://github.com/VectorInstitute/vector-inference/blob/main/vec_inf/config/models.yaml) accordingly.

## Installation
