Commit 2a4a59f

[docs] Update release notes for LMI V16 (#2909)
1 parent 51ceabb · commit 2a4a59f

3 files changed: +34 −27 lines changed

serving/docs/lmi/release_notes.md

Lines changed: 28 additions & 14 deletions
@@ -1,20 +1,34 @@
-# LMI V15 DLC containers release
+# Release Notes

-This document will contain the latest releases of our LMI containers for use on SageMaker.
-For details on any other previous releases, please refer our [github release page](https://github.com/deepjavalibrary/djl-serving/releases)
+Below are the release notes for recent Large Model Inference (LMI) images for use on SageMaker.
+For details on historical releases, refer to the [Github Releases page](https://github.com/deepjavalibrary/djl-serving/releases).

-## Release Notes
+## LMI V16 (DJL-Serving 0.34.0)

-### Key Features
+Meet your brand new image! 💿

-#### LMI Container (vllm) - Release 4-17-2025
-* vLLM updated to version 0.8.4
-* Llama4 Model Support
-* Updated Async Implementation, please see the [vLLM async user guide here](user_guides/vllm_user_guide.md#async-mode-configurations).
+#### LMI (vLLM) Image – 9-30-2025
+```
+763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.34.0-lmi16.0.0-cu128
+```
+* vLLM version upgraded to `0.10.2`
+* Going forward, [async mode](https://github.com/deepjavalibrary/djl-serving/blob/0.34.0-dlc/serving/docs/lmi/user_guides/vllm_user_guide.md#async-mode-configurations) is the officially recommended configuration for the vLLM handler
+* Async vLLM handler now supports custom [input](https://github.com/deepjavalibrary/djl-serving/blob/0.34.0-dlc/serving/docs/lmi/user_guides/input_formatter_schema.md) and [output](https://github.com/deepjavalibrary/djl-serving/blob/0.34.0-dlc/serving/docs/lmi/user_guides/output_formatter_schema.md) formatters
+* Async vLLM handler now supports [multi-adapter](https://github.com/deepjavalibrary/djl-serving/blob/0.34.0-dlc/serving/docs/adapters.md) (LoRA) serving
+* Async vLLM handler now supports session-based [sticky routing](https://github.com/deepjavalibrary/djl-serving/blob/0.34.0-dlc/serving/docs/stateful_sessions.md)
+
+## LMI V15 (DJL-Serving 0.33.0)

-#### TensorRT-LLM Container - Coming Soon
-We plan to update our TensorRT-LLM integration in LMI v15.
-This update will include
+#### LMI (vLLM) Image – 4-17-2025
+```
+763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.33.0-lmi15.0.0-cu128
+```
+* vLLM version upgraded to `0.8.4`
+* Llama4 Model Support
+* Updated Async Implementation, please see the [vLLM async user guide here](user_guides/vllm_user_guide.md#async-mode-configurations)

-* Integration with TensorRT-LLM version 0.18.2
-* Deprecation of Rolling Batch support, and replacement with Async Engine support
+#### TensorRT-LLM Image – 6-24-2025
+```
+763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.33.0-tensorrtllm0.21.0-cu128
+```
+* TensorRT-LLM version upgraded to `0.21.0rc1`
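For orientation (not part of the commit diff above), here is a minimal sketch of how an image URI like the LMI V16 one is typically consumed with the SageMaker Python SDK. The role ARN, model id, instance type, and endpoint name are placeholders, and the `OPTION_ASYNC_MODE` / `OPTION_ROLLING_BATCH` values are assumptions based on the async-mode recommendation in these notes; check the vLLM user guide before relying on them.

```python
# Minimal sketch (not from this commit): deploying the LMI V16 image on
# SageMaker. Role ARN, model id, instance type, and endpoint name are
# placeholders; the OPTION_* values reflect the async-mode recommendation
# above and should be verified against the vLLM user guide.
import sagemaker
from sagemaker.model import Model

LMI_V16_IMAGE = (
    "763104351884.dkr.ecr.us-west-2.amazonaws.com/"
    "djl-inference:0.34.0-lmi16.0.0-cu128"
)

model = Model(
    image_uri=LMI_V16_IMAGE,
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",  # placeholder role
    env={
        "HF_MODEL_ID": "meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
        "OPTION_ASYNC_MODE": "true",        # assumed async-mode toggle
        "OPTION_ROLLING_BATCH": "disable",  # assumed: async replaces rolling batch
    },
    sagemaker_session=sagemaker.Session(),
)

model.deploy(
    initial_instance_count=1,
    instance_type="ml.g6.12xlarge",  # placeholder instance type
    endpoint_name="lmi-v16-demo",    # placeholder endpoint name
)
```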

serving/docs/lmi/user_guides/tool_calling.md

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@

Tool calling is currently supported in LMI through the [vLLM](vllm_user_guide.md) backend only.

-Details on vLLM's tool calling support can be found [here](https://docs.vllm.ai/en/v0.8.4/features/tool_calling.html#how-to-write-a-tool-parser-plugin).
+Details on vLLM's tool calling support can be found [here](https://docs.vllm.ai/en/v0.10.2/features/tool_calling.html#how-to-write-a-tool-parser-plugin).

To enable tool calling in LMI, you must set the following environment variable configurations:

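As a further hedged illustration, once the required environment variables are in place, a tool-calling request follows the OpenAI-style chat-completions schema that vLLM's tool calling expects. The endpoint name and tool definition below are placeholders, and the exact request/response schema accepted by a given LMI configuration should be confirmed against the guide.

```python
# Hedged sketch: invoking a tool-calling-enabled LMI endpoint with an
# OpenAI-style chat-completions payload (the schema vLLM tool calling uses).
# Endpoint name and tool definition are placeholders.
import json
import boto3

runtime = boto3.client("sagemaker-runtime")

payload = {
    "messages": [
        {"role": "user", "content": "What is the weather in Seattle today?"}
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",  # placeholder tool
                "description": "Look up current weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ],
    "tool_choice": "auto",
}

response = runtime.invoke_endpoint(
    EndpointName="lmi-v16-demo",  # placeholder endpoint
    ContentType="application/json",
    Body=json.dumps(payload),
)
print(response["Body"].read().decode("utf-8"))
```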
serving/docs/lmi/user_guides/vllm_user_guide.md

Lines changed: 5 additions & 12 deletions
@@ -8,11 +8,11 @@ vLLM expects the model artifacts to be in the [standard HuggingFace format](../d

**Text Generation Models**

-Here is the list of text generation models supported in [vLLM 0.8.4](https://docs.vllm.ai/en/v0.8.4/models/supported_models.html#decoder-only-language-models).
+Here is the list of text generation models supported in [vLLM 0.10.2](https://docs.vllm.ai/en/v0.10.2/models/supported_models.html#decoder-only-language-models).

**Multi Modal Models**

-Here is the list of multi-modal models supported in [vLLM 0.8.4](https://docs.vllm.ai/en/v0.8.4/models/supported_models.html#decoder-only-language-models).
+Here is the list of multi-modal models supported in [vLLM 0.10.2](https://docs.vllm.ai/en/v0.10.2/models/supported_models.html#decoder-only-language-models).

### Model Coverage in CI

@@ -34,7 +34,7 @@ The following set of models are tested in our nightly tests

## Quantization Support

-The quantization techniques supported in vLLM 0.8.4 are listed [here](https://docs.vllm.ai/en/v0.8.4/quantization/supported_hardware.html).
+The quantization techniques supported in vLLM 0.10.2 are listed [here](https://docs.vllm.ai/en/v0.10.2/quantization/supported_hardware.html).

We recommend that regardless of which quantization technique you are using that you pre-quantize the model.
Runtime quantization adds additional overhead to the endpoint startup time.
@@ -53,7 +53,7 @@ The following quantization techniques are supported for runtime quantization:
You can leverage these techniques by specifying `option.quantize=<fp8|bitsandbytes>` in serving.properties, or `OPTION_QUANTIZE=<fp8|bitsandbytes>` environment variable.

Other quantization techniques supported by vLLM require ahead of time quantization to be served with LMI.
-You can find details on how to leverage those quantization techniques from the vLLM docs [here](https://docs.vllm.ai/en/v0.8.4/quantization/supported_hardware.html).
+You can find details on how to leverage those quantization techniques from the vLLM docs [here](https://docs.vllm.ai/en/v0.10.2/quantization/supported_hardware.html).

### Ahead of time (AOT) quantization

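As a concrete illustration of the runtime-quantization option named in this hunk, here is a minimal environment-variable sketch; the model id is a placeholder, and the serving.properties equivalent is `option.quantize=fp8`.

```python
# Sketch: runtime fp8 quantization via the container environment, the
# OPTION_QUANTIZE form of option.quantize described above. Model id is a
# placeholder; pre-quantized weights are preferred to avoid startup overhead.
quantized_env = {
    "HF_MODEL_ID": "meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    "OPTION_QUANTIZE": "fp8",  # or "bitsandbytes"
}
```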
@@ -73,11 +73,6 @@ Async mode integrates with the vLLM Async Engine via the OpenAI modules.
This ensures that LMI's vLLM support is always in parity with upstream vLLM with respect to both engine-configurations and API schemas.
Async mode will become the default, and only supported mode, in an upcoming release.

-Currently, async mode does not support multi-lora hosting.
-This functionality will be added to async mode soon.
-If you are not using LMI's vLLM implementation to deploy multiple lora adapters,
-async mode is the recommended deployment mode.
-
### Async Mode Configurations

**serving.properties**
@@ -125,8 +120,6 @@ OPTION_MAX_ROLLING_BATCH_SIZE=64

### LoRA Adapter Support

-**Note: LoRA adapter support is only supported in rolling batch mode. It will be added to async mode soon.**
-
vLLM has support for LoRA adapters using the [adapters API](../../adapters.md).
In order to use the adapters, you must begin by enabling them by setting `option.enable_lora=true`.
Following that, you can configure the LoRA support through the additional settings:
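To make the `option.enable_lora=true` toggle above concrete, here is a small hedged sketch in environment-variable form; the additional LoRA settings referenced in the guide are intentionally not reproduced here, and the model id is a placeholder.

```python
# Sketch: enabling LoRA adapter support, the uppercase OPTION_ form of
# option.enable_lora=true described above. Further LoRA tuning settings from
# the guide are omitted here.
lora_env = {
    "HF_MODEL_ID": "meta-llama/Llama-3.1-8B-Instruct",  # placeholder base model
    "OPTION_ENABLE_LORA": "true",
}
```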
@@ -161,7 +154,7 @@ Those situations will be called out specifically.

In addition to the configurations specified in the table above, LMI supports all additional vLLM EngineArguments in Pass-Through mode.
Pass-Through configurations are not processed or validated by LMI.
-You can find the set of EngineArguments supported by vLLM [here](https://docs.vllm.ai/en/v0.8.4_a/serving/engine_args.html).
+You can find the set of EngineArguments supported by vLLM [here](https://docs.vllm.ai/en/v0.10.2_a/serving/engine_args.html).

You can specify these pass-through configurations in the serving.properties file by prefixing the configuration with `option.<config>`,
or as environment variables by prefixing the configuration with `OPTION_<CONFIG>`.
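To make the prefixing convention concrete, here is a short sketch for one real vLLM EngineArgument, `max_model_len`; the value is illustrative.

```python
# Sketch: pass-through of the vLLM EngineArgument max_model_len.
# serving.properties form:             option.max_model_len=8192
# environment-variable form (as here): OPTION_MAX_MODEL_LEN=8192
passthrough_env = {
    "OPTION_MAX_MODEL_LEN": "8192",  # illustrative value
}
```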
