From e2aa4bd7a434944f2f95f7f55a8272ebc786bf2f Mon Sep 17 00:00:00 2001 From: Lyu Han Date: Fri, 13 Sep 2024 11:11:06 +0800 Subject: [PATCH] bump version to v0.6.0 (#2445) * bump version to 0.6.0 * update readme * update supported models * update get_started on ascend platform --- README.md | 6 +- README_zh-CN.md | 6 +- docker/Dockerfile_aarch64_ascend | 2 +- docs/en/get_started/ascend/get_started.md | 52 ++++++++--------- docs/en/get_started/installation.md | 2 +- docs/en/supported_models/supported_models.md | 22 +++++++- docs/zh_cn/get_started/ascend/get_started.md | 56 ++++++++----------- docs/zh_cn/get_started/installation.md | 2 +- .../supported_models/supported_models.md | 22 +++++++- lmdeploy/version.py | 2 +- 10 files changed, 97 insertions(+), 75 deletions(-) diff --git a/README.md b/README.md index e26d120a71..49df79a72f 100644 --- a/README.md +++ b/README.md @@ -26,8 +26,10 @@ ______________________________________________________________________
2024

-- \[2024/08\] 🔥🔥 LMDeploy is integrated into [modelscope/swift](https://github.com/modelscope/swift) as the default accelerator for VLMs inference
-- \[2024/07\] 🎉🎉 Support Llama3.1 8B, 70B and its TOOLS CALLING
+- \[2024/09\] LMDeploy PyTorchEngine adds support for [Huawei Ascend](./docs/en/get_started/ascend/get_started.md). See supported models [here](docs/en/supported_models/supported_models.md)
+- \[2024/09\] LMDeploy PyTorchEngine achieves 1.3x faster Llama3-8B inference by introducing CUDA graphs
+- \[2024/08\] LMDeploy is integrated into [modelscope/swift](https://github.com/modelscope/swift) as the default accelerator for VLMs inference
+- \[2024/07\] Support Llama3.1 8B, 70B and their tool calling
- \[2024/07\] Support [InternVL2](docs/en/multi_modal/internvl.md) full-series models, [InternLM-XComposer2.5](docs/en/multi_modal/xcomposer2d5.md) and [function call](docs/en/llm/api_server_tools.md) of InternLM2.5
- \[2024/06\] PyTorch engine support DeepSeek-V2 and several VLMs, such as CogVLM2, Mini-InternVL, LlaVA-Next
- \[2024/05\] Balance vision model when deploying VLMs with multiple GPUs

diff --git a/README_zh-CN.md b/README_zh-CN.md
index 7332241676..0716fcbad9 100644
--- a/README_zh-CN.md
+++ b/README_zh-CN.md
@@ -26,8 +26,10 @@ ______________________________________________________________________
<br>
2024

-- \[2024/08\] 🔥🔥 LMDeploy现已集成至 [modelscope/swift](https://github.com/modelscope/swift),成为 VLMs 推理的默认加速引擎
-- \[2024/07\] 🎉🎉 支持 Llama3.1 8B 和 70B 模型,以及工具调用功能
+- \[2024/09\] LMDeploy PyTorchEngine 增加了对 [华为 Ascend](docs/zh_cn/get_started/ascend/get_started.md) 的支持。支持的模型请见[这里](docs/zh_cn/supported_models/supported_models.md)
+- \[2024/09\] 通过引入 CUDA Graph,LMDeploy PyTorchEngine 在 Llama3-8B 推理上实现了 1.3 倍的加速
+- \[2024/08\] LMDeploy 现已集成至 [modelscope/swift](https://github.com/modelscope/swift),成为 VLMs 推理的默认加速引擎
+- \[2024/07\] 支持 Llama3.1 8B 和 70B 模型,以及工具调用功能
- \[2024/07\] 支持 [InternVL2](docs/zh_cn/multi_modal/internvl.md) 全系列模型,[InternLM-XComposer2.5](docs/zh_cn/multi_modal/xcomposer2d5.md) 模型和 InternLM2.5 的 [function call 功能](docs/zh_cn/llm/api_server_tools.md)
- \[2024/06\] PyTorch engine 支持了 DeepSeek-V2 和若干 VLM 模型推理, 比如 CogVLM2,Mini-InternVL,LlaVA-Next
- \[2024/05\] 在多 GPU 上部署 VLM 模型时,支持把视觉部分的模型均分到多卡上

diff --git a/docker/Dockerfile_aarch64_ascend b/docker/Dockerfile_aarch64_ascend
index 8e1825b37d..fe7f0c8e2a 100644
--- a/docker/Dockerfile_aarch64_ascend
+++ b/docker/Dockerfile_aarch64_ascend
@@ -106,7 +106,7 @@ RUN echo "source /usr/local/Ascend/ascend-toolkit/set_env.sh" >> ~/.bashrc && \
# timm is required for internvl2 model
RUN --mount=type=cache,target=/root/.cache/pip \
pip3 install transformers>=4.41.0 timm && \
-    pip3 install dlinfer-ascend==0.1.0
+    pip3 install dlinfer-ascend==0.1.0.post1

# lmdeploy
FROM build_temp as copy_temp
diff --git a/docs/en/get_started/ascend/get_started.md b/docs/en/get_started/ascend/get_started.md
index eeb1371ea0..a60d85f4d7 100644
--- a/docs/en/get_started/ascend/get_started.md
+++ b/docs/en/get_started/ascend/get_started.md
@@ -1,61 +1,53 @@
-# Get Started with Huawei Ascend (Atlas 800T A2) 
+# Get Started with Huawei Ascend (Atlas 800T A2)

The usage of lmdeploy on a Huawei Ascend device is almost the same as its usage on CUDA with PytorchEngine in lmdeploy. Please read the original [Get Started](../get_started.md) guide before reading this tutorial.

## Installation

-### Environment Preparation
+We highly recommend that users build a Docker image for streamlined environment setup.

-#### Drivers and Firmware
+Clone the source code of lmdeploy; the Dockerfile is located in the `docker` directory:

-The host machine needs to install the Huawei driver and firmware version 23.0.3, refer to
-[CANN Driver and Firmware Installation](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/80RC1alpha003/softwareinst/instg/instg_0019.html)
-and [download resources](https://www.hiascend.com/hardware/firmware-drivers/community?product=4&model=26&cann=8.0.RC3.alpha001&driver=1.0.0.2.alpha).
+```shell
+git clone https://github.com/InternLM/lmdeploy.git
+cd lmdeploy
+```

-#### CANN
+### Environment Preparation
+
+Docker version `18.03` or higher is required, and `Ascend Docker Runtime` should be installed by following [the official guide](https://www.hiascend.com/document/detail/zh/mindx-dl/60rc2/clusterscheduling/clusterschedulingig/clusterschedulingig/dlug_installation_012.html).

-File `docker/Dockerfile_aarch64_ascend` does not provide Ascend CANN installation package, users need to download the CANN (version 8.0.RC3.alpha001) software packages from [Ascend Resource Download Center](https://www.hiascend.com/developer/download/community/result?module=cann&cann=8.0.RC3.alpha001) themselves. And place the Ascend-cann-kernels-910b\*.run and Ascend-cann-toolkit\*-aarch64.run under the directory where the docker build command is executed.
+#### Ascend Drivers, Firmware and CANN

-#### Docker
+The target machine needs the Huawei driver and firmware version 23.0.3 installed. Refer to
+[CANN Driver and Firmware Installation](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/80RC1alpha003/softwareinst/instg/instg_0019.html)
+and the [download resources](https://www.hiascend.com/hardware/firmware-drivers/community?product=4&model=26&cann=8.0.RC3.alpha001&driver=1.0.0.2.alpha).

-Building the aarch64_ascend image requires Docker >= 18.03
+The CANN (version 8.0.RC3.alpha001) software packages should also be downloaded from the [Ascend Resource Download Center](https://www.hiascend.com/developer/download/community/result?module=cann&cann=8.0.RC3.alpha001). Make sure to place `Ascend-cann-kernels-910b*.run` and `Ascend-cann-toolkit*-aarch64.run` in the root directory of the lmdeploy source code.

-#### Reference Command for Building the Image
+#### Build Docker Image

-The following reference command for building the image is based on the lmdeploy source code root directory as the current directory, and the CANN-related installation packages are also placed under this directory.
+Run the following command in the root directory of lmdeploy to build the image:

```bash
-DOCKER_BUILDKIT=1 docker build -t lmdeploy-aarch64-ascend:v0.1 \
+DOCKER_BUILDKIT=1 docker build -t lmdeploy-aarch64-ascend:latest \
    -f docker/Dockerfile_aarch64_ascend .
```

-This image will install lmdeploy to `/workspace/lmdeploy` directory using `pip install --no-build-isolation -e .` command.
-
-#### Using the Image
-
-You can refer to the [documentation](https://www.hiascend.com/document/detail/zh/mindx-dl/60rc1/clusterscheduling/dockerruntimeug/dlruntime_ug_013.html)
-for usage. It is recommended to install Ascend Docker Runtime.
-Here is an example of starting container for Huawei Ascend device with Ascend Docker Runtime installed:
+If the following command executes without any errors, the environment setup is successful.

```bash
-docker run -e ASCEND_VISIBLE_DEVICES=0 --net host -td --entry-point bash --name lmdeploy_ascend_demo \
-    lmdeploy-aarch64-ascend:v0.1  # docker_image_sha_or_name
+docker run -e ASCEND_VISIBLE_DEVICES=0 --rm --name lmdeploy -t lmdeploy-aarch64-ascend:latest lmdeploy check_env
```

-#### Pip install
-
-If you have lmdeploy installed and all Huawei environments are ready, you can run the following command to enable lmdeploy to run on Huawei Ascend devices. (Not necessary if you use the Docker image.)
-
-```bash
-pip install dlinfer-ascend
-```
+For more information about running the Docker client on Ascend devices, please refer to the [guide](https://www.hiascend.com/document/detail/zh/mindx-dl/60rc1/clusterscheduling/dockerruntimeug/dlruntime_ug_013.html).

## Offline batch inference

### LLM inference

-Set `device_type="ascend"`  in the `PytorchEngineConfig`:
+Set `device_type="ascend"` in the `PytorchEngineConfig`:

```python
from lmdeploy import pipeline
diff --git a/docs/en/get_started/installation.md b/docs/en/get_started/installation.md
index 06ef777e2c..7116ab2832 100644
--- a/docs/en/get_started/installation.md
+++ b/docs/en/get_started/installation.md
@@ -23,7 +23,7 @@ pip install lmdeploy
The default prebuilt package is compiled on **CUDA 12**. <br>
If CUDA 11+ (>=11.3) is required, you can install lmdeploy by:

```shell
-export LMDEPLOY_VERSION=0.6.0a0
+export LMDEPLOY_VERSION=0.6.0
export PYTHON_VERSION=38
pip install https://github.com/InternLM/lmdeploy/releases/download/v${LMDEPLOY_VERSION}/lmdeploy-${LMDEPLOY_VERSION}+cu118-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux2014_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu118
```

diff --git a/docs/en/supported_models/supported_models.md b/docs/en/supported_models/supported_models.md
index 961d8af585..8729b43c76 100644
--- a/docs/en/supported_models/supported_models.md
+++ b/docs/en/supported_models/supported_models.md
@@ -1,6 +1,8 @@
# Supported Models

-## Models supported by TurboMind
+The following tables detail the models supported by LMDeploy's TurboMind engine and PyTorch engine across different platforms.
+
+## TurboMind on CUDA Platform

| Model | Size | Type | FP16/BF16 | KV INT8 | KV INT4 | W4A16 |
| :-------------------: | :---------: | :--: | :-------: | :-----: | :-----: | :---: |
@@ -38,7 +40,7 @@
The TurboMind engine doesn't support window attention. Therefore, for models that have applied window attention and have the corresponding switch "use_sliding_window" enabled, such as Mistral, Qwen1.5 and etc., please choose the PyTorch engine for inference.
```

-## Models supported by PyTorch
+## PyTorchEngine on CUDA Platform

| Model | Size | Type | FP16/BF16 | KV INT8 | W8A8 | W4A16 |
| :------------: | :---------: | :--: | :-------: | :-----: | :--: | :---: |
@@ -79,3 +81,19 @@
| Phi-3.5-mini | 3.8B | LLM | Yes | No | No | - |
| Phi-3.5-MoE | 16x3.8B | LLM | Yes | No | No | - |
| Phi-3.5-vision | 4.2B | MLLM | Yes | No | No | - |
+
+## PyTorchEngine on Huawei Ascend Platform
+
+| Model | Size | Type | FP16/BF16 |
+| :------------: | :------: | :--: | :-------: |
+| Llama2 | 7B - 70B | LLM | Yes |
+| Llama3 | 8B | LLM | Yes |
+| Llama3.1 | 8B | LLM | Yes |
+| InternLM2 | 7B - 20B | LLM | Yes |
+| InternLM2.5 | 7B - 20B | LLM | Yes |
+| Mixtral | 8x7B | LLM | Yes |
+| Qwen1.5-MoE | A2.7B | LLM | Yes |
+| Qwen2 | 7B | LLM | Yes |
+| Qwen2-MoE | A14.57B | LLM | Yes |
+| InternVL(v1.5) | 2B-26B | MLLM | Yes |
+| InternVL2 | 1B-40B | MLLM | Yes |
diff --git a/docs/zh_cn/get_started/ascend/get_started.md b/docs/zh_cn/get_started/ascend/get_started.md
index 01626e49d6..ad9ff791ff 100644
--- a/docs/zh_cn/get_started/ascend/get_started.md
+++ b/docs/zh_cn/get_started/ascend/get_started.md
@@ -1,57 +1,47 @@
-# 华为昇腾(Atlas 800T A2) 
+# 华为昇腾(Atlas 800T A2)

-我们采用了LMDeploy中的PytorchEngine后端支持了华为昇腾设备,
-所以在华为昇腾上使用lmdeploy的方法与在英伟达GPU上使用PytorchEngine后端的使用方法几乎相同。
-在阅读本教程之前,请先阅读原版的[快速开始](../get_started.md)。
+我们基于 LMDeploy 的 PytorchEngine,增加了华为昇腾设备的支持。所以,在华为昇腾上使用 LMDeploy 的方法与在英伟达 GPU 上使用 PytorchEngine 后端的方法几乎相同。在阅读本教程之前,请先阅读原版的[快速开始](../get_started.md)。

## 安装

-### 环境准备
+我们强烈建议用户构建一个 Docker 镜像以简化环境设置。

-#### Drivers和Firmware
+克隆 lmdeploy 的源代码,Dockerfile 位于 `docker` 目录中:

-Host需要安装华为驱动程序和固件版本23.0.3,请参考
-[CANN 驱动程序和固件安装](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/80RC1alpha003/softwareinst/instg/instg_0019.html)
-和[下载资源](https://www.hiascend.com/hardware/firmware-drivers/community?product=4&model=26&cann=8.0.RC3.alpha001&driver=1.0.0.2.alpha)。
+```shell
+git clone https://github.com/InternLM/lmdeploy.git
+cd lmdeploy
+```

-#### CANN
+### 环境准备
+
+Docker 版本应不低于 `18.03`。并且需按照[官方指南](https://www.hiascend.com/document/detail/zh/mindx-dl/60rc2/clusterscheduling/clusterschedulingig/clusterschedulingig/dlug_installation_012.html)安装 Ascend Docker Runtime。

-`docker/Dockerfile_aarch64_ascend`没有提供CANN 安装包,用户需要自己从[昇腾资源下载中心](https://www.hiascend.com/developer/download/community/result?module=cann&cann=8.0.RC3.alpha001)下载CANN(8.0.RC3.alpha001)软件包。
-并将Ascend-cann-kernels-910b\*.run 和 Ascend-cann-toolkit\*-aarch64.run 放在执行`docker build`命令的目录下。
+#### Drivers,Firmware 和 CANN

-#### Docker
+目标机器需安装华为驱动程序和固件版本 23.0.3,请参考
+[CANN 驱动程序和固件安装](https://www.hiascend.com/document/detail/zh/CANNCommunityEdition/80RC1alpha003/softwareinst/instg/instg_0019.html)
+和[下载资源](https://www.hiascend.com/hardware/firmware-drivers/community?product=4&model=26&cann=8.0.RC3.alpha001&driver=1.0.0.2.alpha)。

-构建aarch64_ascend镜像需要Docker>=18.03
+另外,`docker/Dockerfile_aarch64_ascend` 没有提供 CANN 安装包,用户需要自己从[昇腾资源下载中心](https://www.hiascend.com/developer/download/community/result?module=cann&cann=8.0.RC3.alpha001)下载 CANN(8.0.RC3.alpha001)软件包。
+并将 `Ascend-cann-kernels-910b*.run` 和 `Ascend-cann-toolkit*-aarch64.run` 放在 lmdeploy 源码根目录下。

-#### 构建镜像的命令
+#### 构建镜像

-请在lmdeploy源代码根目录下执行以下镜像构建命令,CANN相关的安装包也放在此目录下。
+请在 lmdeploy 源代码根目录下执行以下镜像构建命令,CANN 相关的安装包也放在此目录下。

```bash
-DOCKER_BUILDKIT=1 docker build -t lmdeploy-aarch64-ascend:v0.1 \
+DOCKER_BUILDKIT=1 docker build -t lmdeploy-aarch64-ascend:latest \
    -f docker/Dockerfile_aarch64_ascend .
```

-这个镜像将使用`pip install --no-build-isolation -e .`命令将lmdeploy安装到/workspace/lmdeploy目录。
-
-#### 镜像的使用
-
-关于镜像的使用方式,请参考这篇[文档](https://www.hiascend.com/document/detail/zh/mindx-dl/60rc1/clusterscheduling/dockerruntimeug/dlruntime_ug_013.html)。
-并且在使用镜像前安装Ascend Docker Runtime。
-以下是在安装了 Ascend Docker Runtime 的情况下,启动用于华为昇腾设备的容器的示例:
+如果以下命令执行时没有任何错误,则表明环境设置成功。

```bash
-docker run -e ASCEND_VISIBLE_DEVICES=0 --net host -td --entry-point bash --name lmdeploy_ascend_demo \
-    lmdeploy-aarch64-ascend:v0.1  # docker_image_sha_or_name
+docker run -e ASCEND_VISIBLE_DEVICES=0 --rm --name lmdeploy -t lmdeploy-aarch64-ascend:latest lmdeploy check_env
```

-#### 使用Pip安装
-
-如果您已经安装了lmdeploy并且所有华为环境都已准备好,您可以运行以下命令使lmdeploy能够在华为昇腾设备上运行。(如果使用Docker镜像则不需要)
-
-```bash
-pip install dlinfer-ascend
-```
+关于在昇腾设备上运行 `docker run` 命令的详情,请参考这篇[文档](https://www.hiascend.com/document/detail/zh/mindx-dl/60rc1/clusterscheduling/dockerruntimeug/dlruntime_ug_013.html)。

## 离线批处理

diff --git a/docs/zh_cn/get_started/installation.md b/docs/zh_cn/get_started/installation.md
index 42a32b6b3c..30d08cd9ef 100644
--- a/docs/zh_cn/get_started/installation.md
+++ b/docs/zh_cn/get_started/installation.md
@@ -23,7 +23,7 @@ pip install lmdeploy
默认的预构建包是在 **CUDA 12** 上编译的。如果需要 CUDA 11+ (>=11.3),你可以使用以下命令安装 lmdeploy:

```shell
-export LMDEPLOY_VERSION=0.6.0a0
+export LMDEPLOY_VERSION=0.6.0
export PYTHON_VERSION=38
pip install https://github.com/InternLM/lmdeploy/releases/download/v${LMDEPLOY_VERSION}/lmdeploy-${LMDEPLOY_VERSION}+cu118-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux2014_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu118
```

diff --git a/docs/zh_cn/supported_models/supported_models.md b/docs/zh_cn/supported_models/supported_models.md
index 9cf1a1df90..70ba4ed4d4 100644
--- a/docs/zh_cn/supported_models/supported_models.md
+++ b/docs/zh_cn/supported_models/supported_models.md
@@ -1,6 +1,8 @@
# 支持的模型

-## TurboMind 支持的模型
+以下列表分别为 LMDeploy TurboMind 引擎和 PyTorch 引擎在不同软硬件平台下支持的模型。
+
+## TurboMind CUDA 平台

| Model | Size | Type | FP16/BF16 | KV INT8 | KV INT4 | W4A16 |
| :-------------------: | :---------: | :--: | :-------: | :-----: | :-----: | :---: |
@@ -38,7 +40,7 @@
turbomind 引擎不支持 window attention。所以,对于应用了 window attention,并开启了对应的开关"use_sliding_window"的模型,比如 Mistral、Qwen1.5 等,在推理时,请选择 pytorch engine
```

-### PyTorch 支持的模型
+## PyTorchEngine CUDA 平台

| Model | Size | Type | FP16/BF16 | KV INT8 | W8A8 | W4A16 |
| :------------: | :---------: | :--: | :-------: | :-----: | :--: | :---: |
@@ -79,3 +81,19 @@
| Phi-3.5-mini | 3.8B | LLM | Yes | No | No | - |
| Phi-3.5-MoE | 16x3.8B | LLM | Yes | No | No | - |
| Phi-3.5-vision | 4.2B | MLLM | Yes | No | No | - |
+
+## PyTorchEngine 华为昇腾平台
+
+| Model | Size | Type | FP16/BF16 |
+| :------------: | :------: | :--: | :-------: |
+| Llama2 | 7B - 70B | LLM | Yes |
+| Llama3 | 8B | LLM | Yes |
+| Llama3.1 | 8B | LLM | Yes |
+| InternLM2 | 7B - 20B | LLM | Yes |
+| InternLM2.5 | 7B - 20B | LLM | Yes |
+| Mixtral | 8x7B | LLM | Yes |
+| Qwen1.5-MoE | A2.7B | LLM | Yes |
+| Qwen2 | 7B | LLM | Yes |
+| Qwen2-MoE | A14.57B | LLM | Yes |
+| InternVL(v1.5) | 2B-26B | MLLM | Yes |
+| InternVL2 | 1B-40B | MLLM | Yes |
diff --git a/lmdeploy/version.py b/lmdeploy/version.py
index 51c5971332..199e3ce8e0 100644
--- a/lmdeploy/version.py
+++ b/lmdeploy/version.py
@@ -1,7 +1,7 @@
# Copyright (c) OpenMMLab. All rights reserved.
from typing import Tuple

-__version__ = '0.6.0a0'
+__version__ = '0.6.0'
short_version = __version__
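
For reference, the Ascend guides updated by this patch configure the PyTorch engine for offline batch inference as shown below. This is a minimal sketch assuming an lmdeploy v0.6.0 installation with the Ascend environment prepared as above; the model path and prompts are placeholders:

```python
from lmdeploy import pipeline, PytorchEngineConfig

if __name__ == "__main__":
    # device_type="ascend" routes PytorchEngine to the Huawei Ascend backend
    # (backed by dlinfer-ascend); the rest matches the CUDA usage.
    pipe = pipeline(
        "internlm/internlm2_5-7b-chat",  # placeholder model path
        backend_config=PytorchEngineConfig(tp=1, device_type="ascend"),
    )
    prompts = ["Shanghai is", "Please introduce Huawei Ascend"]  # placeholders
    responses = pipe(prompts)
    print(responses)
```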