* Going forward, [async mode](https://github.com/deepjavalibrary/djl-serving/blob/0.34.0-dlc/serving/docs/lmi/user_guides/vllm_user_guide.md#async-mode-configurations) is the officially recommended configuration for the vLLM handler (see the request sketch after this list)
* Async vLLM handler now supports custom [input](https://github.com/deepjavalibrary/djl-serving/blob/0.34.0-dlc/serving/docs/lmi/user_guides/input_formatter_schema.md) and [output](https://github.com/deepjavalibrary/djl-serving/blob/0.34.0-dlc/serving/docs/lmi/user_guides/output_formatter_schema.md) formatters
* Async vLLM handler now supports [multi-adapter](https://github.com/deepjavalibrary/djl-serving/blob/0.34.0-dlc/serving/docs/adapters.md) (LoRA) serving
* Async vLLM handler now supports session-based [sticky routing](https://github.com/deepjavalibrary/djl-serving/blob/0.34.0-dlc/serving/docs/stateful_sessions.md)
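Because async mode routes requests through vLLM's OpenAI modules, the request shape is the OpenAI-style chat completions schema. The snippet below is a minimal, illustrative sketch only: the endpoint URL, route, and model name are placeholders that depend on how the container is deployed; consult the async mode configuration guide linked above for the supported settings.

```python
# Minimal sketch of calling an LMI async vLLM endpoint with an
# OpenAI-style chat completions payload. The URL, route, and model
# name are placeholders, not values defined by LMI itself.
import requests

payload = {
    "model": "my-model",  # placeholder model identifier
    "messages": [
        {"role": "user", "content": "Summarize the new async mode features."}
    ],
    "max_tokens": 128,
    "temperature": 0.7,
}

# Placeholder endpoint; the actual path depends on your deployment
# (for example, a local djl-serving container or a SageMaker endpoint).
response = requests.post("http://localhost:8080/invocations", json=payload, timeout=60)
response.raise_for_status()
print(response.json())
```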
## LMI V15 (DJL-Serving 0.33.0)
#### TensorRT-LLM Container - Coming Soon
We plan to update our TensorRT-LLM integration in LMI v15.
**serving/docs/lmi/user_guides/tool_calling.md** (1 addition, 1 deletion)
@@ -2,7 +2,7 @@
Tool calling is currently supported in LMI through the [vLLM](vllm_user_guide.md) backend only.
-Details on vLLM's tool calling support can be found [here](https://docs.vllm.ai/en/v0.8.4/features/tool_calling.html#how-to-write-a-tool-parser-plugin).
+Details on vLLM's tool calling support can be found [here](https://docs.vllm.ai/en/v0.10.2/features/tool_calling.html#how-to-write-a-tool-parser-plugin).
To enable tool calling in LMI, you must set the following environment variable configurations:
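For context on the linked tool calling documentation: once tool calling is enabled, requests follow vLLM's OpenAI-compatible tools schema. The sketch below is illustrative only; the endpoint URL, model name, and tool definition are placeholders rather than anything defined by LMI, and the required environment variables are listed in tool_calling.md itself.

```python
# Hedged sketch of an OpenAI-style tool-calling request.
# Endpoint URL, model name, and the tool definition are placeholders.
import requests

payload = {
    "model": "my-model",
    "messages": [
        {"role": "user", "content": "What is the weather in Seattle today?"}
    ],
    # Tools are declared with a JSON Schema describing their parameters.
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Look up the current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ],
    "tool_choice": "auto",
}

response = requests.post("http://localhost:8080/invocations", json=payload, timeout=60)
response.raise_for_status()
# If the model decides to call the tool, the assistant message carries a
# `tool_calls` entry (name plus JSON arguments) instead of plain text.
print(response.json())
```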
**serving/docs/lmi/user_guides/vllm_user_guide.md** (5 additions, 12 deletions)
@@ -8,11 +8,11 @@ vLLM expects the model artifacts to be in the [standard HuggingFace format](../d
**Text Generation Models**
-Here is the list of text generation models supported in [vLLM 0.8.4](https://docs.vllm.ai/en/v0.8.4/models/supported_models.html#decoder-only-language-models).
+Here is the list of text generation models supported in [vLLM 0.10.2](https://docs.vllm.ai/en/v0.10.2/models/supported_models.html#decoder-only-language-models).
**Multi Modal Models**
-Here is the list of multi-modal models supported in [vLLM 0.8.4](https://docs.vllm.ai/en/v0.8.4/models/supported_models.html#decoder-only-language-models).
+Here is the list of multi-modal models supported in [vLLM 0.10.2](https://docs.vllm.ai/en/v0.10.2/models/supported_models.html#decoder-only-language-models).
### Model Coverage in CI
@@ -34,7 +34,7 @@ The following set of models are tested in our nightly tests
## Quantization Support
-The quantization techniques supported in vLLM 0.8.4 are listed [here](https://docs.vllm.ai/en/v0.8.4/quantization/supported_hardware.html).
+The quantization techniques supported in vLLM 0.10.2 are listed [here](https://docs.vllm.ai/en/v0.10.2/quantization/supported_hardware.html).
We recommend that, regardless of which quantization technique you are using, you pre-quantize the model.
Runtime quantization adds additional overhead to the endpoint startup time.
@@ -53,7 +53,7 @@ The following quantization techniques are supported for runtime quantization:
You can leverage these techniques by specifying `option.quantize=<fp8|bitsandbytes>` in serving.properties, or `OPTION_QUANTIZE=<fp8|bitsandbytes>` environment variable.
Other quantization techniques supported by vLLM require ahead of time quantization to be served with LMI.
-You can find details on how to leverage those quantization techniques from the vLLM docs [here](https://docs.vllm.ai/en/v0.8.4/quantization/supported_hardware.html).
+You can find details on how to leverage those quantization techniques from the vLLM docs [here](https://docs.vllm.ai/en/v0.10.2/quantization/supported_hardware.html).
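To make the runtime quantization settings mentioned above concrete, here is a small sketch covering the two configuration surfaces the guide names: `option.quantize` in serving.properties and the `OPTION_QUANTIZE` environment variable. The model identifier and file location are placeholders, and other properties a real deployment would need are omitted.

```python
# Sketch of the two ways the guide names for enabling runtime quantization.
# The model identifier and file path are placeholders; other required
# properties for a real deployment are intentionally omitted.
from pathlib import Path
import os

# Option 1: set option.quantize in a serving.properties file placed
# alongside the model artifacts.
serving_properties = "\n".join([
    "option.model_id=my-org/my-model",  # placeholder model identifier
    "option.quantize=fp8",              # or: bitsandbytes
])
Path("serving.properties").write_text(serving_properties + "\n")

# Option 2: the equivalent environment variable, for example when
# configuring the container environment instead of serving.properties.
os.environ["OPTION_QUANTIZE"] = "fp8"
```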
### Ahead of time (AOT) quantization
@@ -73,11 +73,6 @@ Async mode integrates with the vLLM Async Engine via the OpenAI modules.
This ensures that LMI's vLLM support is always in parity with upstream vLLM with respect to both engine-configurations and API schemas.
Async mode will become the default, and only supported mode, in an upcoming release.
-Currently, async mode does not support multi-lora hosting.
-This functionality will be added to async mode soon.
-If you are not using LMI's vLLM implementation to deploy multiple lora adapters,