[lmi][docs] update to 0.28.0 in lmi docs (deepjavalibrary#2063)
siddvenk authored Jun 14, 2024
1 parent 15c358b commit c5fdfec
Showing 13 changed files with 57 additions and 36 deletions.
6 changes: 3 additions & 3 deletions serving/docs/configurations_global.md
@@ -156,7 +156,7 @@ system environment variables that users can set for DJL Serving:
| Key | Type | Description |
|-------------------|---------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| JAVA_HOME | env var | JDK home path |
| MODEL_SERVER_HOME | env var | DJLServing home directory, default: Installation directory (e.g. /usr/local/Cellar/djl-serving/0.27.0/) |
| MODEL_SERVER_HOME | env var | DJLServing home directory, default: Installation directory (e.g. /usr/local/Cellar/djl-serving/<djl-version>/) |
| DEFAULT_JVM_OPTS | env var | default: `-Dlog4j.configurationFile=${APP_HOME}/conf/log4j2.xml`<br>Override default JVM startup options and system properties. |
| JAVA_OPTS | env var | default: `-Xms1g -Xmx1g -XX:+ExitOnOutOfMemoryError`<br>Add extra JVM options. |
| SERVING_OPTS | env var | default: N/A<br>Add serving-related JVM options.<br>Some DJL configurations can only be set through JVM system properties; set the DEFAULT_JVM_OPTS environment variable to configure them.<br>- `-Dai.djl.pytorch.num_interop_threads=2` overrides the interop threads for PyTorch<br>- `-Dai.djl.pytorch.num_threads=2` overrides OMP_NUM_THREADS for PyTorch<br>- `-Dai.djl.logging.level=debug` changes the DJL logging level |
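
For example, these variables can be exported in the shell before starting DJL Serving (a rough sketch; the heap size and system-property values below are illustrative, not recommended defaults):

```
# illustrative values; adjust for your environment
export JAVA_OPTS="-Xms1g -Xmx4g -XX:+ExitOnOutOfMemoryError"
export SERVING_OPTS="-Dai.djl.pytorch.num_threads=2 -Dai.djl.logging.level=debug"
# then start the server
djl-serving
```
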
@@ -210,12 +210,12 @@ DJLServing provides a few built-in `log4j2-XXX.xml` files in DJLServing containe
Use the following environment variable to print the HTTP access log to the console:

```
export DEFAULT_JVM_OPTS="-Dlog4j.configurationFile=/usr/local/djl-serving-0.27.0/conf/log4j2-access.xml
export DEFAULT_JVM_OPTS="-Dlog4j.configurationFile=/usr/local/djl-serving-0.28.0/conf/log4j2-access.xml"
```

Use the following environment variable to print the access log, server metrics, and model metrics to the console:

```
export DEFAULT_JVM_OPTS="-Dlog4j.configurationFile=/usr/local/djl-serving-0.27.0/conf/log4j2-console.xml
export DEFAULT_JVM_OPTS="-Dlog4j.configurationFile=/usr/local/djl-serving-0.28.0/conf/log4j2-console.xml"
```

10 changes: 5 additions & 5 deletions serving/docs/lmi/README.md
@@ -93,11 +93,11 @@ This information is also available on the SageMaker DLC [GitHub repository](http

| Backend | SageMakerDLC | Example URI |
|------------------------|-----------------|------------------------------------------------------------------------------------------|
| `vLLM` | djl-lmi | 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.27.0-deepspeed0.12.6-cu121 |
| `lmi-dist` | djl-lmi | 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.27.0-deepspeed0.12.6-cu121 |
| `hf-accelerate` | djl-lmi | 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.27.0-deepspeed0.12.6-cu121 |
| `tensorrt-llm` | djl-tensorrtllm | 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.27.0-tensorrtllm0.8.0-cu122 |
| `transformers-neuronx` | djl-neuronx | 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.27.0-neuronx-sdk2.18.0 |
| `vLLM` | djl-lmi | 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.28.0-lmi10.0.0-cu124 |
| `lmi-dist` | djl-lmi | 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.28.0-lmi10.0.0-cu124 |
| `hf-accelerate` | djl-lmi | 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.28.0-lmi10.0.0-cu124 |
| `tensorrt-llm` | djl-tensorrtllm | 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.28.0-tensorrtllm0.9.0-cu122 |
| `transformers-neuronx` | djl-neuronx | 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.28.0-neuronx-sdk2.18.2 |
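
For example, any of the images above can be pulled locally after authenticating with the DLC registry (the region and tag below come from the table; adjust them for your region and backend):

```
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-east-1.amazonaws.com
docker pull 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.28.0-lmi10.0.0-cu124
```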

## Advanced Features

19 changes: 19 additions & 0 deletions serving/docs/lmi/announcements/deepspeed-deprecation.md
@@ -8,6 +8,25 @@ The `deepspeed` container has been renamed to the `lmi` container.
As part of this change, we have decided to discontinue integration with the DeepSpeed inference library.
You can continue to use vLLM or the LMI-dist library with the LMI container. If you plan to use the DeepSpeed library, please follow the steps below, or use LMI V9 (0.27.0).

## Fetching the container from SageMaker Python SDK

As part of changing the container name, we have updated the framework tag in the [SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk).

To fetch the new image uri from the SageMaker Python SDK:

```python
from sagemaker import image_uris

# New Usage: For the 0.28.0 and future containers
inference_image_uri = image_uris.retrieve(framework="djl-lmi", version="0.28.0", region=region)

# Old Usage: For the 0.27.0 and previous containers
inference_image_uri = image_uris.retrieve(framework="djl-deepspeed", version="0.27.0", region=region)
```

If you have been using the vllm or lmi-dist inference engine, this is the only change you need to make when using the SageMaker Python SDK.
If you have been using the deepspeed inference engine, continue reading for further migration steps.

## Migrating from DeepSpeed

If you are not using the DeepSpeed library (via direct imports) or DeepSpeed as your inference engine, you can stop reading here.
2 changes: 1 addition & 1 deletion serving/docs/lmi/deployment_guide/README.md
@@ -80,7 +80,7 @@ A more in-depth explanation about configurations is presented in the deployment
| | HuggingFace Accelerate | LMI_dist (9.0.0) | TensorRTLLM (0.8.0) | TransformersNeuronX (2.18.0) | vLLM (0.3.3) |
|---------------------------------------|------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------|
| DLC | LMI | LMI | LMI TRTLLM | LMI Neuron | LMI |
| Default handler | [huggingface](https://github.com/deepjavalibrary/djl-serving/blob/0.27.0-dlc/engines/python/setup/djl_python/huggingface.py) | [huggingface](https://github.com/deepjavalibrary/djl-serving/blob/0.27.0-dlc/engines/python/setup/djl_python/huggingface.py) | [tensorrt-llm](https://github.com/deepjavalibrary/djl-serving/blob/0.27.0-dlc/engines/python/setup/djl_python/tensorrt_llm.py) | [transformers-neuronx](https://github.com/deepjavalibrary/djl-serving/blob/0.27.0-dlc/engines/python/setup/djl_python/transformers_neuronx.py) | [huggingface](https://github.com/deepjavalibrary/djl-serving/blob/0.27.0-dlc/engines/python/setup/djl_python/huggingface.py) |
| Default handler | [huggingface](https://github.com/deepjavalibrary/djl-serving/blob/0.28.0-dlc/engines/python/setup/djl_python/huggingface.py) | [huggingface](https://github.com/deepjavalibrary/djl-serving/blob/0.28.0-dlc/engines/python/setup/djl_python/huggingface.py) | [tensorrt-llm](https://github.com/deepjavalibrary/djl-serving/blob/0.28.0-dlc/engines/python/setup/djl_python/tensorrt_llm.py) | [transformers-neuronx](https://github.com/deepjavalibrary/djl-serving/blob/0.28.0-dlc/engines/python/setup/djl_python/transformers_neuronx.py) | [huggingface](https://github.com/deepjavalibrary/djl-serving/blob/0.28.0-dlc/engines/python/setup/djl_python/huggingface.py) |
| support quantization | BitsandBytes/GPTQ | GPTQ/AWQ | SmoothQuant, AWQ, GPTQ | INT8 | GPTQ/AWQ |
| AWS machine supported | G4/G5/G6/P4D/P5 | G5/G6/P4D/P5 | G5/G6/P4D/P5 | INF2/TRN1 | G4/G5/G6/P4D/P5 |
| execution mode | Python | MPI | MPI | Python | Python |
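
As a rough sketch of how one of these backends is selected at startup, the container can be launched with environment variables that map to `serving.properties` options (the model id and option values here are placeholders, not recommendations):

```
docker run -it --runtime=nvidia --gpus all --shm-size 12g -p 8080:8080 \
  -e HF_MODEL_ID=TheBloke/Llama-2-7B-fp16 \
  -e OPTION_ROLLING_BATCH=lmi-dist \
  -e TENSOR_PARALLEL_DEGREE=4 \
  763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.28.0-lmi10.0.0-cu124
```
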
8 changes: 4 additions & 4 deletions serving/docs/lmi/deployment_guide/deploying-your-endpoint.md
@@ -51,8 +51,8 @@ sagemaker_session = sagemaker.session.Session()
# region is needed to retrieve the lmi container
region = sagemaker_session._region_name
# get the lmi image uri
# available frameworks: "djl-deepspeed" (for vllm, lmi-dist), "djl-tensorrtllm" (for tensorrt-llm), "djl-neuronx" (for transformers neuronx)
container_uri = sagemaker.image_uris.retrieve(framework="djl-deepspeed", version="0.27.0", region=region)
# available frameworks: "djl-lmi" (for vllm, lmi-dist), "djl-tensorrtllm" (for tensorrt-llm), "djl-neuronx" (for transformers neuronx)
container_uri = sagemaker.image_uris.retrieve(framework="djl-lmi", version="0.28.0", region=region)
# create a unique endpoint name
endpoint_name = sagemaker.utils.name_from_base("my-lmi-endpoint")
# s3 uri object prefix under which the serving.properties and optional model artifacts are stored
@@ -106,8 +106,8 @@ sagemaker_session = sagemaker.session.Session()
# region is needed to retrieve the lmi container
region = sagemaker_session._region_name
# get the lmi image uri
# available frameworks: "djl-deepspeed" (for vllm, lmi-dist, deepspeed), "djl-tensorrtllm" (for tensorrt-llm), "djl-neuronx" (for transformers neuronx)
container_uri = sagemaker.image_uris.retrieve(framework="djl-deepspeed", version="0.27.0", region=region)
# available frameworks: "djl-lmi" (for vllm, lmi-dist), "djl-tensorrtllm" (for tensorrt-llm), "djl-neuronx" (for transformers neuronx)
container_uri = sagemaker.image_uris.retrieve(framework="djl-lmi", version="0.28.0", region=region)
# create a unique endpoint name
endpoint_name = sagemaker.utils.name_from_base("my-lmi-endpoint")
# instance type you will deploy your model to
4 changes: 2 additions & 2 deletions serving/docs/lmi/deployment_guide/testing-custom-script.md
@@ -20,7 +20,7 @@ For example:

```
docker run -it -p 8080:8080 --shm-size=12g --runtime=nvidia -v /home/ubuntu/test.py:/workplace/test.py \
763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.27.0-deepspeed0.12.6-cu121 /bin/bash
763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.28.0-lmi10.0.0-cu124 /bin/bash
```

### Step 2: Install DJLServing Python module
@@ -36,7 +36,7 @@ pip install git+https://github.com/deepjavalibrary/djl-serving.git#subdirectory=
### From a specific DLC version

```
pip install git+https://github.com/deepjavalibrary/djl-serving.git@0.27.0-dlc#subdirectory=engines/python/setup
pip install git+https://github.com/deepjavalibrary/djl-serving.git@0.28.0-dlc#subdirectory=engines/python/setup
```
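
A quick sanity check that the module is importable after installation (illustrative only):

```
python -c "import djl_python; print(djl_python.__file__)"
```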

## Tutorial 1: Running with default handler with rolling batch
4 changes: 2 additions & 2 deletions serving/docs/lmi/tutorials/tnx_aot_tutorial.md
@@ -42,7 +42,7 @@ For example:
aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-west-2.amazonaws.com

# Download docker image
docker pull 763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.27.0-neuronx-sdk2.18.0
docker pull 763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.28.0-neuronx-sdk2.18.2

```

@@ -129,7 +129,7 @@ docker run -t --rm --network=host \
--device /dev/neuron9 \
--device /dev/neuron10 \
--device /dev/neuron11 \
763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.27.0-neuronx-sdk2.18.0 \
763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.28.0-neuronx-sdk2.18.2 \
partition --model-dir /opt/ml/input/data/training --skip-copy
```
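
If the partition job cannot see the Neuron devices, it can help to confirm they are visible on the host first (this assumes the Neuron SDK tools are installed on the instance):

```
# list the Neuron devices available on the host
neuron-ls
```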

4 changes: 2 additions & 2 deletions serving/docs/lmi/tutorials/trtllm_aot_tutorial.md
@@ -42,7 +42,7 @@ Refer [here](https://github.com/aws/deep-learning-containers/blob/master/availab
For example:

```
docker pull 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.27.0-tensorrtllm0.8.0-cu122
docker pull 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.28.0-tensorrtllm0.9.0-cu122
```

### Step 3: Set the environment variables:
@@ -91,7 +91,7 @@ docker run --runtime=nvidia --gpus all --shm-size 12gb \
-e OPTION_TENSOR_PARALLEL_DEGREE=$OPTION_TENSOR_PARALLEL_DEGREE \
-e OPTION_MAX_ROLLING_BATCH_SIZE=$OPTION_MAX_ROLLING_BATCH_SIZE \
-e OPTION_DTYPE=$OPTION_DTYPE \
763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.27.0-tensorrtllm0.8.0-cu122 python /opt/djl/partition/trt_llm_partition.py \
763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.28.0-tensorrtllm0.9.0-cu122 python /opt/djl/partition/trt_llm_partition.py \
--properties_dir $PWD \
--trt_llm_model_repo /tmp/trtllm \
--tensor_parallel_degree $OPTION_TENSOR_PARALLEL_DEGREE
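
After partitioning completes, the compiled repository under `/tmp/trtllm` is typically uploaded to S3 so it can be referenced at deployment time. A minimal sketch, assuming the directory is visible on the host (for example via a volume mount) and using a placeholder bucket:

```
# placeholder bucket/prefix; adjust to your own S3 location
aws s3 sync /tmp/trtllm s3://my-model-bucket/trtllm/compiled-model/
```
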
@@ -123,7 +123,7 @@ docker run -it --runtime=nvidia --gpus all --shm-size 12gb \
-p 8080:8080 \
-v /opt/dlami/nvme/large_store:/opt/djl/large_store \
-v /opt/dlami/nvme/tmp/.cache:/tmp/.cache \
763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.27.0-tensorrtllm0.8.0-cu122 /bin/bash
763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.28.0-tensorrtllm0.9.0-cu122 /bin/bash
```

Here we assume you are using a g5, g6, p4d, p4de, or p5 machine that has an NVMe disk available.
3 changes: 2 additions & 1 deletion serving/docs/lmi/user_guides/chat_input_output_schema.md
@@ -1,7 +1,8 @@
# Chat Completions API Schema

This document describes the API schema for the chat completions endpoints (`v1/chat/completions`) when using the built-in inference handlers in LMI containers.
This schema is supported from v0.27.0 release and is compatible with [OpenAI Chat Completions API](https://platform.openai.com/docs/api-reference/chat/create).
This schema is applicable to our latest release, v0.28.0, and is compatible with [OpenAI Chat Completions API](https://platform.openai.com/docs/api-reference/chat/create).
Documentation for previous releases is available on our GitHub on the relevant version branch (e.g. 0.27.0-dlc).

On SageMaker, Chat Completions API schema is supported with the `/invocations` endpoint without additional configurations.
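
For illustration, a chat-style request sent to a locally running LMI container's `/invocations` endpoint might look like the following (the prompt and generation parameters are placeholders):

```
curl -X POST http://localhost:8080/invocations \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [
          {"role": "system", "content": "You are a helpful assistant."},
          {"role": "user", "content": "What is Deep Learning?"}
        ],
        "max_tokens": 256
      }'
```
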
If the request contains the "messages" field, LMI will treat the request as a chat completions style request, and respond
3 changes: 2 additions & 1 deletion serving/docs/lmi/user_guides/lmi_input_output_schema.md
Original file line number Diff line number Diff line change
@@ -1,7 +1,8 @@
# LMI handlers Inference API Schema

This document provides the default API schema for the inference endpoints (`/invocations`, `/predictions/<model_name>`) when using the built-in inference handlers in LMI containers.
This schema is applicable to our latest release, v0.27.0.
This schema is applicable to our latest release, v0.28.0.
Documentation for previous releases is available on our GitHub on the relevant version branch (e.g. 0.27.0-dlc).
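
For illustration, a minimal text-generation request using this default schema might look like the following (the prompt and parameter values are arbitrary):

```
curl -X POST http://localhost:8080/invocations \
  -H "Content-Type: application/json" \
  -d '{"inputs": "What is Deep Learning?", "parameters": {"max_new_tokens": 128, "do_sample": true}}'
```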

LMI provides two distinct schemas depending on what type of batching you use:

2 changes: 1 addition & 1 deletion serving/docs/lmi/user_guides/starting-guide.md
@@ -29,7 +29,7 @@ sagemaker_session = sagemaker.session.Session()
region = sagemaker_session._region_name

# Fetch the uri of the LMI container that supports vLLM, LMI-Dist, HuggingFace Accelerate backends
lmi_image_uri = image_uris.retrieve(framework="djl-deepspeed", version="0.27.0", region=region)
lmi_image_uri = image_uris.retrieve(framework="djl-lmi", version="0.28.0", region=region)

# Create the SageMaker Model object. In this example we let LMI configure the deployment settings based on the model architecture
model = Model(