[docs][lmi] update lmi docs in general for 0.28.0 (deepjavalibrary#2003)
siddvenk authored Jun 4, 2024
1 parent ee540be commit 2a747c8
Showing 16 changed files with 92 additions and 89 deletions.
2 changes: 1 addition & 1 deletion serving/docker/README.md
@@ -13,7 +13,7 @@ export DJL_VERSION=$(cat ../../gradle.properties | awk -F '=' '/djl_version/ {pr
docker compose build --build-arg djl_version=${DJL_VERSION} <compose-target>
```

You can find different `compose-target` in `docker-compose.yml`, like `cpu`, `deepspeed`...
You can find different `compose-target` in `docker-compose.yml`, like `cpu`, `lmi`...
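
For example, to build the `lmi` target (a sketch assuming `DJL_VERSION` was exported as shown above):

```
docker compose build --build-arg djl_version=${DJL_VERSION} lmi
```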

## Run docker image

4 changes: 2 additions & 2 deletions serving/docs/configurations_model.md
@@ -10,7 +10,7 @@ An example `serving.properties` can be found [here](https://github.com/deepjaval
In `serving.properties`, you can set the following properties. Model properties are accessible to `Translator`
and Python handler functions.

- `engine`: Which Engine to use, values include MXNet, PyTorch, TensorFlow, ONNX, DeepSpeed, etc.
- `engine`: Which Engine to use, values include MXNet, PyTorch, TensorFlow, ONNX, etc.
- `load_on_devices`: A `;`-delimited list of devices on which the model should be loaded; defaults to loading on all devices.
- `translatorFactory`: Specify the TranslatorFactory.
- `job_queue_size`: Specify the job queue size at the model level; this overrides the global `job_queue_size`. Default is `1000`.
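
For illustration, a minimal `serving.properties` sketch using a few of these keys (values are hypothetical, not recommendations):

```
engine=PyTorch
load_on_devices=0;1
job_queue_size=100
```
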
@@ -62,7 +62,7 @@ option.ortDevice=TensorRT/ROCM/CoreML
retry_threshold=10
option.pythonExecutable=python3
option.entryPoint=deepspeed.py
option.entryPoint=huggingface.py
option.handler=handle
option.predict_timeout=120
option.model_loading_timeout=10
16 changes: 7 additions & 9 deletions serving/docs/lmi/README.md
@@ -9,16 +9,16 @@

# Overview - Large Model Inference (LMI) Containers

LMI containers are a set of high performance Docker Containers purpose built for large language model (LLM) inference.
With these containers you can leverage high performance open-source inference libraries like [vLLM](https://github.com/vllm-project/vllm), [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM),
[DeepSpeed](https://github.com/microsoft/DeepSpeed), [Transformers NeuronX](https://github.com/aws-neuron/transformers-neuronx) to deploy LLMs on [AWS SageMaker Endpoints](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints.html).
LMI containers are a set of high-performance Docker containers purpose-built for large language model (LLM) inference.
With these containers, you can leverage high-performance open-source inference libraries like [vLLM](https://github.com/vllm-project/vllm), [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM),
[Transformers NeuronX](https://github.com/aws-neuron/transformers-neuronx) to deploy LLMs on [AWS SageMaker Endpoints](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints.html).
These containers bundle together a model server with open-source inference libraries to deliver an all-in-one LLM serving solution.
We provide quick start notebooks that get you deploying popular open source models in minutes, and advanced guides to maximize performance of your endpoint.

LMI containers provide many features, including:

* Optimized inference performance for popular model architectures like Llama, Bloom, Falcon, T5, Mixtral, and more
* Integration with open source inference libraries like vLLM, TensorRT-LLM, DeepSpeed, and Transformers NeuronX
* Integration with open source inference libraries like vLLM, TensorRT-LLM, and Transformers NeuronX
* Continuous Batching for maximizing throughput at high concurrency
* Token Streaming
* Quantization through AWQ, GPTQ, and SmoothQuant
@@ -76,7 +76,6 @@ It is intended for users moving towards deploying LLMs in production settings.
LMI Containers provide integration with multiple inference libraries.
You can learn more about their integration with LMI from the respective user guides:

* [DeepSpeed - User Guide](user_guides/deepspeed_user_guide.md)
* [vLLM - User Guide](user_guides/vllm_user_guide.md)
* [LMI-Dist - User Guide](user_guides/lmi-dist_user_guide.md)
* [TensorRT-LLM - User Guide](user_guides/trt_llm_user_guide.md)
@@ -94,10 +93,9 @@ This information is also available on the SageMaker DLC [GitHub repository](http

| Backend | SageMakerDLC | Example URI |
|------------------------|-----------------|------------------------------------------------------------------------------------------|
| `vLLM` | djl-deepspeed | 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.27.0-deepspeed0.12.6-cu121 |
| `lmi-dist` | djl-deepspeed | 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.27.0-deepspeed0.12.6-cu121 |
| `hf-accelerate` | djl-deepspeed | 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.27.0-deepspeed0.12.6-cu121 |
| `deepspeed` | djl-deepspeed | 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.27.0-deepspeed0.12.6-cu121 |
| `vLLM` | djl-lmi | 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.27.0-deepspeed0.12.6-cu121 |
| `lmi-dist` | djl-lmi | 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.27.0-deepspeed0.12.6-cu121 |
| `hf-accelerate` | djl-lmi | 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.27.0-deepspeed0.12.6-cu121 |
| `tensorrt-llm` | djl-tensorrtllm | 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.27.0-tensorrtllm0.8.0-cu122 |
| `transformers-neuronx` | djl-neuronx | 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.27.0-neuronx-sdk2.18.0 |
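
As a rough sketch of how one of these image URIs could be used (not part of this commit), the URI can be passed to the SageMaker Python SDK when creating an endpoint; the role ARN, model id, and instance type below are hypothetical:

```python
from sagemaker.model import Model

# Hypothetical execution role; the image URI is the djl-lmi example from the table above.
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"
image_uri = "763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.27.0-deepspeed0.12.6-cu121"

model = Model(
    image_uri=image_uri,
    role=role,
    env={"HF_MODEL_ID": "TheBloke/Llama-2-7B-fp16"},  # hypothetical Hugging Face model id
)

# Deploy to a single GPU instance; the instance type is illustrative.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
)
```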

