[lmi][docs] update to 0.28.0 in lmi docs (deepjavalibrary#2063)
siddvenk authored Jun 14, 2024
1 parent 15c358b commit c5fdfec
Showing 13 changed files with 57 additions and 36 deletions.
6 changes: 3 additions & 3 deletions serving/docs/configurations_global.md
@@ -156,7 +156,7 @@ system environment variables that users can set for DJL Serving:
| Key | Type | Description |
|-------------------|---------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| JAVA_HOME | env var | JDK home path |
| MODEL_SERVER_HOME | env var | DJLServing home directory, default: Installation directory (e.g. /usr/local/Cellar/djl-serving/0.27.0/) |
| MODEL_SERVER_HOME | env var | DJLServing home directory, default: Installation directory (e.g. /usr/local/Cellar/djl-serving/<djl-version>/) |
| DEFAULT_JVM_OPTS | env var | default: `-Dlog4j.configurationFile=${APP_HOME}/conf/log4j2.xml`<br>Override default JVM startup options and system properties. |
| JAVA_OPTS | env var | default: `-Xms1g -Xmx1g -XX:+ExitOnOutOfMemoryError`<br>Add extra JVM options. |
| SERVING_OPTS | env var | default: N/A<br>Add serving-related JVM options.<br>Some DJL configurations can only be set through JVM system properties; set the DEFAULT_JVM_OPTS environment variable to configure them.<br>- `-Dai.djl.pytorch.num_interop_threads=2` overrides the interop threads for PyTorch<br>- `-Dai.djl.pytorch.num_threads=2` overrides OMP_NUM_THREADS for PyTorch<br>- `-Dai.djl.logging.level=debug` changes the DJL logging level |
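
For example, these variables can be exported in the shell before starting DJL Serving (a rough sketch; the heap size and system-property values below are illustrative, not recommended defaults):

```
# illustrative values; adjust for your environment
export JAVA_OPTS="-Xms1g -Xmx4g -XX:+ExitOnOutOfMemoryError"
export SERVING_OPTS="-Dai.djl.pytorch.num_threads=2 -Dai.djl.logging.level=debug"
# then start the server
djl-serving
```
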
@@ -210,12 +210,12 @@ DJLServing provides a few built-in `log4j2-XXX.xml` files in DJLServing containe
Use the following environment variable to print the HTTP access log to the console:

```
export DEFAULT_JVM_OPTS="-Dlog4j.configurationFile=/usr/local/djl-serving-0.27.0/conf/log4j2-access.xml
export DEFAULT_JVM_OPTS="-Dlog4j.configurationFile=/usr/local/djl-serving-0.28.0/conf/log4j2-access.xml"
```

Use the following environment variable to print the access log, server metrics, and model metrics to the console:

```
export DEFAULT_JVM_OPTS="-Dlog4j.configurationFile=/usr/local/djl-serving-0.27.0/conf/log4j2-console.xml
export DEFAULT_JVM_OPTS="-Dlog4j.configurationFile=/usr/local/djl-serving-0.28.0/conf/log4j2-console.xml"
```

10 changes: 5 additions & 5 deletions serving/docs/lmi/README.md
@@ -93,11 +93,11 @@ This information is also available on the SageMaker DLC [GitHub repository](http

| Backend | SageMakerDLC | Example URI |
|------------------------|-----------------|------------------------------------------------------------------------------------------|
| `vLLM` | djl-lmi | 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.27.0-deepspeed0.12.6-cu121 |
| `lmi-dist` | djl-lmi | 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.27.0-deepspeed0.12.6-cu121 |
| `hf-accelerate` | djl-lmi | 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.27.0-deepspeed0.12.6-cu121 |
| `tensorrt-llm` | djl-tensorrtllm | 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.27.0-tensorrtllm0.8.0-cu122 |
| `transformers-neuronx` | djl-neuronx | 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.27.0-neuronx-sdk2.18.0 |
| `vLLM` | djl-lmi | 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.28.0-lmi10.0.0-cu124 |
| `lmi-dist` | djl-lmi | 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.28.0-lmi10.0.0-cu124 |
| `hf-accelerate` | djl-lmi | 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.28.0-lmi10.0.0-cu124 |
| `tensorrt-llm` | djl-tensorrtllm | 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.28.0-tensorrtllm0.9.0-cu122 |
| `transformers-neuronx` | djl-neuronx | 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.28.0-neuronx-sdk2.18.2 |
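
For example, any of the images above can be pulled locally after authenticating with the DLC registry (the region and tag below come from the table; adjust them for your region and backend):

```
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-east-1.amazonaws.com
docker pull 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.28.0-lmi10.0.0-cu124
```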

## Advanced Features

19 changes: 19 additions & 0 deletions serving/docs/lmi/announcements/deepspeed-deprecation.md
@@ -8,6 +8,25 @@ The `deepspeed` container has been renamed to the `lmi` container.
As part of this change, we have decided to discontinue integration with the DeepSpeed inference library.
You can continue to use vLLM or the LMI-dist library with the LMI container. If you plan to use the DeepSpeed library, please follow the steps below, or use LMI V9 (0.27.0).

## Fetching the container from SageMaker Python SDK

As part of changing the container name, we have updated the framework tag in the [SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk).

To fetch the new image uri from the SageMaker Python SDK:

```python
from sagemaker import image_uris

# New Usage: For the 0.28.0 and future containers
inference_image_uri = image_uris.retrieve(framework="djl-lmi", version="0.28.0", region=region)

# Old Usage: For the 0.27.0 and previous containers
inference_image_uri = image_uris.retrieve(framework="djl-deepspeed", version="0.27.0", region=region)
```

If you have been using the vllm or lmi-dist inference engine, this is the only change you need to make when using the SageMaker Python SDK.
If you have been using the deepspeed inference engine, continue reading for further migration steps.

## Migrating from DeepSpeed

If you are not using the DeepSpeed library (via direct imports) or DeepSpeed as your inference engine, you can stop reading here.
2 changes: 1 addition & 1 deletion serving/docs/lmi/deployment_guide/README.md
@@ -80,7 +80,7 @@ A more in-depth explanation about configurations is presented in the deployment
| | HuggingFace Accelerate | LMI_dist (9.0.0) | TensorRTLLM (0.8.0) | TransformersNeuronX (2.18.0) | vLLM (0.3.3) |
|---------------------------------------|------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------|
| DLC | LMI | LMI | LMI TRTLLM | LMI Neuron | LMI |
| Default handler | [huggingface](https://github.com/deepjavalibrary/djl-serving/blob/0.27.0-dlc/engines/python/setup/djl_python/huggingface.py) | [huggingface](https://github.com/deepjavalibrary/djl-serving/blob/0.27.0-dlc/engines/python/setup/djl_python/huggingface.py) | [tensorrt-llm](https://github.com/deepjavalibrary/djl-serving/blob/0.27.0-dlc/engines/python/setup/djl_python/tensorrt_llm.py) | [transformers-neuronx](https://github.com/deepjavalibrary/djl-serving/blob/0.27.0-dlc/engines/python/setup/djl_python/transformers_neuronx.py) | [huggingface](https://github.com/deepjavalibrary/djl-serving/blob/0.27.0-dlc/engines/python/setup/djl_python/huggingface.py) |
| Default handler | [huggingface](https://github.com/deepjavalibrary/djl-serving/blob/0.28.0-dlc/engines/python/setup/djl_python/huggingface.py) | [huggingface](https://github.com/deepjavalibrary/djl-serving/blob/0.28.0-dlc/engines/python/setup/djl_python/huggingface.py) | [tensorrt-llm](https://github.com/deepjavalibrary/djl-serving/blob/0.28.0-dlc/engines/python/setup/djl_python/tensorrt_llm.py) | [transformers-neuronx](https://github.com/deepjavalibrary/djl-serving/blob/0.28.0-dlc/engines/python/setup/djl_python/transformers_neuronx.py) | [huggingface](https://github.com/deepjavalibrary/djl-serving/blob/0.28.0-dlc/engines/python/setup/djl_python/huggingface.py) |
| support quantization | BitsandBytes/GPTQ | GPTQ/AWQ | SmoothQuant, AWQ, GPTQ | INT8 | GPTQ/AWQ |
| AWS machine supported | G4/G5/G6/P4D/P5 | G5/G6/P4D/P5 | G5/G6/P4D/P5 | INF2/TRN1 | G4/G5/G6/P4D/P5 |
| execution mode | Python | MPI | MPI | Python | Python |
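
As a rough sketch of how one of these backends is selected at startup, the container can be launched with environment variables that map to `serving.properties` options (the model id and option values here are placeholders, not recommendations):

```
docker run -it --runtime=nvidia --gpus all --shm-size 12g -p 8080:8080 \
  -e HF_MODEL_ID=TheBloke/Llama-2-7B-fp16 \
  -e OPTION_ROLLING_BATCH=lmi-dist \
  -e TENSOR_PARALLEL_DEGREE=4 \
  763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.28.0-lmi10.0.0-cu124
```
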
8 changes: 4 additions & 4 deletions serving/docs/lmi/deployment_guide/deploying-your-endpoint.md
@@ -51,8 +51,8 @@ sagemaker_session = sagemaker.session.Session()
# region is needed to retrieve the lmi container
region = sagemaker_session._region_name
# get the lmi image uri
# available frameworks: "djl-deepspeed" (for vllm, lmi-dist), "djl-tensorrtllm" (for tensorrt-llm), "djl-neuronx" (for transformers neuronx)
container_uri = sagemaker.image_uris.retrieve(framework="djl-deepspeed", version="0.27.0", region=region)
# available frameworks: "djl-lmi" (for vllm, lmi-dist), "djl-tensorrtllm" (for tensorrt-llm), "djl-neuronx" (for transformers neuronx)
container_uri = sagemaker.image_uris.retrieve(framework="djl-lmi", version="0.28.0", region=region)
# create a unique endpoint name
endpoint_name = sagemaker.utils.name_from_base("my-lmi-endpoint")
# s3 uri object prefix under which the serving.properties and optional model artifacts are stored
@@ -106,8 +106,8 @@ sagemaker_session = sagemaker.session.Session()
# region is needed to retrieve the lmi container
region = sagemaker_session._region_name
# get the lmi image uri
# available frameworks: "djl-deepspeed" (for vllm, lmi-dist, deepspeed), "djl-tensorrtllm" (for tensorrt-llm), "djl-neuronx" (for transformers neuronx)
container_uri = sagemaker.image_uris.retrieve(framework="djl-deepspeed", version="0.27.0", region=region)
# available frameworks: "djl-lmi" (for vllm, lmi-dist), "djl-tensorrtllm" (for tensorrt-llm), "djl-neuronx" (for transformers neuronx)
container_uri = sagemaker.image_uris.retrieve(framework="djl-lmi", version="0.28.0", region=region)
# create a unique endpoint name
endpoint_name = sagemaker.utils.name_from_base("my-lmi-endpoint")
# instance type you will deploy your model to
4 changes: 2 additions & 2 deletions serving/docs/lmi/deployment_guide/testing-custom-script.md
@@ -20,7 +20,7 @@ For example:

```
docker run -it -p 8080:8080 --shm-size=12g --runtime=nvidia -v /home/ubuntu/test.py:/workplace/test.py \
763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.27.0-deepspeed0.12.6-cu121 /bin/bash
763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.28.0-lmi10.0.0-cu124 /bin/bash
```

### Step 2: Install DJLServing Python module
@@ -36,7 +36,7 @@ pip install git+https://github.com/deepjavalibrary/djl-serving.git#subdirectory=
### From a specific DLC version

```
pip install git+https://github.com/deepjavalibrary/djl-serving.git@0.27.0-dlc#subdirectory=engines/python/setup
pip install git+https://github.com/deepjavalibrary/djl-serving.git@0.28.0-dlc#subdirectory=engines/python/setup
```
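
A quick sanity check that the module is importable after installation (illustrative only):

```
python -c "import djl_python; print(djl_python.__file__)"
```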

## Tutorial 1: Running with default handler with rolling batch
4 changes: 2 additions & 2 deletions serving/docs/lmi/tutorials/tnx_aot_tutorial.md
@@ -42,7 +42,7 @@ For example:
aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-west-2.amazonaws.com

# Download docker image
docker pull 763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.27.0-neuronx-sdk2.18.0
docker pull 763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.28.0-neuronx-sdk2.18.2

```

@@ -129,7 +129,7 @@ docker run -t --rm --network=host \
--device /dev/neuron9 \
--device /dev/neuron10 \
--device /dev/neuron11 \
763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.27.0-neuronx-sdk2.18.0 \
763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.28.0-neuronx-sdk2.18.2 \
partition --model-dir /opt/ml/input/data/training --skip-copy
```
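
If the partition job cannot see the Neuron devices, it can help to confirm they are visible on the host first (this assumes the Neuron SDK tools are installed on the instance):

```
# list the Neuron devices available on the host
neuron-ls
```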

4 changes: 2 additions & 2 deletions serving/docs/lmi/tutorials/trtllm_aot_tutorial.md
@@ -42,7 +42,7 @@ Refer [here](https://github.com/aws/deep-learning-containers/blob/master/availab
For example:

```
docker pull 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.27.0-tensorrtllm0.8.0-cu122
docker pull 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.28.0-tensorrtllm0.9.0-cu122
```

### Step 3: Set the environment variables:
@@ -91,7 +91,7 @@ docker run --runtime=nvidia --gpus all --shm-size 12gb \
-e OPTION_TENSOR_PARALLEL_DEGREE=$OPTION_TENSOR_PARALLEL_DEGREE \
-e OPTION_MAX_ROLLING_BATCH_SIZE=$OPTION_MAX_ROLLING_BATCH_SIZE \
-e OPTION_DTYPE=$OPTION_DTYPE \
763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.27.0-tensorrtllm0.8.0-cu122 python /opt/djl/partition/trt_llm_partition.py \
763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.28.0-tensorrtllm0.9.0-cu122 python /opt/djl/partition/trt_llm_partition.py \
--properties_dir $PWD \
--trt_llm_model_repo /tmp/trtllm \
--tensor_parallel_degree $OPTION_TENSOR_PARALLEL_DEGREE
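
After partitioning completes, the compiled repository under `/tmp/trtllm` is typically uploaded to S3 so it can be referenced at deployment time. A minimal sketch, assuming the directory is visible on the host (for example via a volume mount) and using a placeholder bucket:

```
# placeholder bucket/prefix; adjust to your own S3 location
aws s3 sync /tmp/trtllm s3://my-model-bucket/trtllm/compiled-model/
```
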
@@ -123,7 +123,7 @@ docker run -it --runtime=nvidia --gpus all --shm-size 12gb \
-p 8080:8080 \
-v /opt/dlami/nvme/large_store:/opt/djl/large_store \
-v /opt/dlami/nvme/tmp/.cache:/tmp/.cache \
763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.27.0-tensorrtllm0.8.0-cu122 /bin/bash
763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.28.0-tensorrtllm0.9.0-cu122 /bin/bash
```

Here we assume you are using a g5, g6, p4d, p4de, or p5 machine that has an NVMe disk available.
3 changes: 2 additions & 1 deletion serving/docs/lmi/user_guides/chat_input_output_schema.md
@@ -1,7 +1,8 @@
# Chat Completions API Schema

This document describes the API schema for the chat completions endpoints (`v1/chat/completions`) when using the built-in inference handlers in LMI containers.
This schema is supported from v0.27.0 release and is compatible with [OpenAI Chat Completions API](https://platform.openai.com/docs/api-reference/chat/create).
This schema is applicable to our latest release, v0.28.0, and is compatible with [OpenAI Chat Completions API](https://platform.openai.com/docs/api-reference/chat/create).
Documentation for previous releases is available on our GitHub on the relevant version branch (e.g. 0.27.0-dlc).

On SageMaker, Chat Completions API schema is supported with the `/invocations` endpoint without additional configurations.
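
For illustration, a chat-style request sent to a locally running LMI container's `/invocations` endpoint might look like the following (the prompt and generation parameters are placeholders):

```
curl -X POST http://localhost:8080/invocations \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [
          {"role": "system", "content": "You are a helpful assistant."},
          {"role": "user", "content": "What is Deep Learning?"}
        ],
        "max_tokens": 256
      }'
```
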
If the request contains the "messages" field, LMI will treat the request as a chat completions style request, and respond
3 changes: 2 additions & 1 deletion serving/docs/lmi/user_guides/lmi_input_output_schema.md
Original file line number Diff line number Diff line change
@@ -1,7 +1,8 @@
# LMI handlers Inference API Schema

This document provides the default API schema for the inference endpoints (`/invocations`, `/predictions/<model_name>`) when using the built-in inference handlers in LMI containers.
This schema is applicable to our latest release, v0.27.0.
This schema is applicable to our latest release, v0.28.0.
Documentation for previous releases is available on our GitHub on the relevant version branch (e.g. 0.27.0-dlc).
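
For illustration, a minimal text-generation request using this default schema might look like the following (the prompt and parameter values are arbitrary):

```
curl -X POST http://localhost:8080/invocations \
  -H "Content-Type: application/json" \
  -d '{"inputs": "What is Deep Learning?", "parameters": {"max_new_tokens": 128, "do_sample": true}}'
```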

LMI provides two distinct schemas depending on what type of batching you use:

2 changes: 1 addition & 1 deletion serving/docs/lmi/user_guides/starting-guide.md
@@ -29,7 +29,7 @@ sagemaker_session = sagemaker.session.Session()
region = sagemaker_session._region_name

# Fetch the uri of the LMI container that supports vLLM, LMI-Dist, HuggingFace Accelerate backends
lmi_image_uri = image_uris.retrieve(framework="djl-deepspeed", version="0.27.0", region=region)
lmi_image_uri = image_uris.retrieve(framework="djl-lmi", version="0.28.0", region=region)

# Create the SageMaker Model object. In this example we let LMI configure the deployment settings based on the model architecture
model = Model(