* Going forward, [async mode](https://github.com/deepjavalibrary/djl-serving/blob/0.34.0-dlc/serving/docs/lmi/user_guides/vllm_user_guide.md#async-mode-configurations) is the officially recommended configuration for the vLLM handler (see the request sketch after this list)
* Async vLLM handler now supports custom [input](https://github.com/deepjavalibrary/djl-serving/blob/0.34.0-dlc/serving/docs/lmi/user_guides/input_formatter_schema.md) and [output](https://github.com/deepjavalibrary/djl-serving/blob/0.34.0-dlc/serving/docs/lmi/user_guides/output_formatter_schema.md) formatters
* Async vLLM handler now supports [multi-adapter](https://github.com/deepjavalibrary/djl-serving/blob/0.34.0-dlc/serving/docs/adapters.md) (LoRA) serving
* Async vLLM handler now supports session-based [sticky routing](https://github.com/deepjavalibrary/djl-serving/blob/0.34.0-dlc/serving/docs/stateful_sessions.md)
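Because async mode routes requests through vLLM's OpenAI modules, the request shape is the OpenAI-style chat completions schema. The snippet below is a minimal, illustrative sketch only: the endpoint URL, route, and model name are placeholders that depend on how the container is deployed; consult the async mode configuration guide linked above for the supported settings.

```python
# Minimal sketch of calling an LMI async vLLM endpoint with an
# OpenAI-style chat completions payload. The URL, route, and model
# name are placeholders, not values defined by LMI itself.
import requests

payload = {
    "model": "my-model",  # placeholder model identifier
    "messages": [
        {"role": "user", "content": "Summarize the new async mode features."}
    ],
    "max_tokens": 128,
    "temperature": 0.7,
}

# Placeholder endpoint; the actual path depends on your deployment
# (for example, a local djl-serving container or a SageMaker endpoint).
response = requests.post("http://localhost:8080/invocations", json=payload, timeout=60)
response.raise_for_status()
print(response.json())
```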
## LMI V15 (DJL-Serving 0.33.0)
#### TensorRT-LLM Container - Coming Soon
We plan to update our TensorRT-LLM integration in LMI v15.
**serving/docs/lmi/user_guides/tool_calling.md** (1 addition, 1 deletion)
@@ -2,7 +2,7 @@
Tool calling is currently supported in LMI through the [vLLM](vllm_user_guide.md) backend only.
-Details on vLLM's tool calling support can be found [here](https://docs.vllm.ai/en/v0.8.4/features/tool_calling.html#how-to-write-a-tool-parser-plugin).
+Details on vLLM's tool calling support can be found [here](https://docs.vllm.ai/en/v0.10.2/features/tool_calling.html#how-to-write-a-tool-parser-plugin).
To enable tool calling in LMI, you must set the following environment variable configurations:
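For context on the linked tool calling documentation: once tool calling is enabled, requests follow vLLM's OpenAI-compatible tools schema. The sketch below is illustrative only; the endpoint URL, model name, and tool definition are placeholders rather than anything defined by LMI, and the required environment variables are listed in tool_calling.md itself.

```python
# Hedged sketch of an OpenAI-style tool-calling request.
# Endpoint URL, model name, and the tool definition are placeholders.
import requests

payload = {
    "model": "my-model",
    "messages": [
        {"role": "user", "content": "What is the weather in Seattle today?"}
    ],
    # Tools are declared with a JSON Schema describing their parameters.
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Look up the current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ],
    "tool_choice": "auto",
}

response = requests.post("http://localhost:8080/invocations", json=payload, timeout=60)
response.raise_for_status()
# If the model decides to call the tool, the assistant message carries a
# `tool_calls` entry (name plus JSON arguments) instead of plain text.
print(response.json())
```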
**serving/docs/lmi/user_guides/vllm_user_guide.md** (5 additions, 12 deletions)
@@ -8,11 +8,11 @@ vLLM expects the model artifacts to be in the [standard HuggingFace format](../d
**Text Generation Models**
-Here is the list of text generation models supported in [vLLM 0.8.4](https://docs.vllm.ai/en/v0.8.4/models/supported_models.html#decoder-only-language-models).
+Here is the list of text generation models supported in [vLLM 0.10.2](https://docs.vllm.ai/en/v0.10.2/models/supported_models.html#decoder-only-language-models).
**Multi Modal Models**
-Here is the list of multi-modal models supported in [vLLM 0.8.4](https://docs.vllm.ai/en/v0.8.4/models/supported_models.html#decoder-only-language-models).
+Here is the list of multi-modal models supported in [vLLM 0.10.2](https://docs.vllm.ai/en/v0.10.2/models/supported_models.html#decoder-only-language-models).
### Model Coverage in CI
@@ -34,7 +34,7 @@ The following set of models are tested in our nightly tests
## Quantization Support
-The quantization techniques supported in vLLM 0.8.4 are listed [here](https://docs.vllm.ai/en/v0.8.4/quantization/supported_hardware.html).
+The quantization techniques supported in vLLM 0.10.2 are listed [here](https://docs.vllm.ai/en/v0.10.2/quantization/supported_hardware.html).
We recommend that, regardless of which quantization technique you are using, you pre-quantize the model.
Runtime quantization adds additional overhead to the endpoint startup time.
@@ -53,7 +53,7 @@ The following quantization techniques are supported for runtime quantization:
You can leverage these techniques by specifying `option.quantize=<fp8|bitsandbytes>` in serving.properties, or `OPTION_QUANTIZE=<fp8|bitsandbytes>` environment variable.
Other quantization techniques supported by vLLM require ahead of time quantization to be served with LMI.
-You can find details on how to leverage those quantization techniques from the vLLM docs [here](https://docs.vllm.ai/en/v0.8.4/quantization/supported_hardware.html).
+You can find details on how to leverage those quantization techniques from the vLLM docs [here](https://docs.vllm.ai/en/v0.10.2/quantization/supported_hardware.html).
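To make the runtime quantization settings mentioned above concrete, here is a small sketch covering the two configuration surfaces the guide names: `option.quantize` in serving.properties and the `OPTION_QUANTIZE` environment variable. The model identifier and file location are placeholders, and other properties a real deployment would need are omitted.

```python
# Sketch of the two ways the guide names for enabling runtime quantization.
# The model identifier and file path are placeholders; other required
# properties for a real deployment are intentionally omitted.
from pathlib import Path
import os

# Option 1: set option.quantize in a serving.properties file placed
# alongside the model artifacts.
serving_properties = "\n".join([
    "option.model_id=my-org/my-model",  # placeholder model identifier
    "option.quantize=fp8",              # or: bitsandbytes
])
Path("serving.properties").write_text(serving_properties + "\n")

# Option 2: the equivalent environment variable, for example when
# configuring the container environment instead of serving.properties.
os.environ["OPTION_QUANTIZE"] = "fp8"
```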
### Ahead of time (AOT) quantization
@@ -73,11 +73,6 @@ Async mode integrates with the vLLM Async Engine via the OpenAI modules.
This ensures that LMI's vLLM support is always in parity with upstream vLLM with respect to both engine-configurations and API schemas.
Async mode will become the default, and only supported mode, in an upcoming release.
-Currently, async mode does not support multi-lora hosting.
-This functionality will be added to async mode soon.
-If you are not using LMI's vLLM implementation to deploy multiple lora adapters,