bump version to v0.4.0 (#1469)
* bump version to v0.4.0

* update

* update news

* update news

* update supported models
lvhan028 authored Apr 23, 2024
1 parent 6b8718d commit 04ba0ff
Showing 6 changed files with 18 additions and 6 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -26,6 +26,7 @@
<details open>
<summary><b>2024</b></summary>

- \[2024/04\] Support Llama3 and more VLMs, such as InternVL v1.1, v1.2, MiniGemini, and InternLM-XComposer2.
- \[2024/04\] TurboMind adds online int8/int4 KV cache quantization and inference for all supported devices. Refer [here](docs/en/quantization/kv_quant.md) for a detailed guide.
- \[2024/04\] The latest TurboMind upgrade boosts GQA, rocketing [internlm2-20b](https://huggingface.co/internlm/internlm2-20b) model inference to 16+ RPS, about 1.8x faster than vLLM.
- \[2024/04\] Support Qwen1.5-MOE and dbrx.
4 changes: 3 additions & 1 deletion README_zh-CN.md
@@ -26,6 +26,7 @@
<details open>
<summary><b>2024</b></summary>

- \[2024/04\] Support Llama3 and VLMs such as InternVL v1.1, v1.2, MiniGemini, and InternLM-XComposer2
- \[2024/04\] TurboMind supports online int4/int8 kv cache quantization and inference on all supported GPU models. Refer [here](docs/zh_cn/quantization/kv_quant.md) for details
- \[2024/04\] TurboMind engine upgrade with optimized GQA inference. [internlm2-20b](https://huggingface.co/internlm/internlm2-20b) inference reaches 16+ RPS, about 1.8x faster than vLLM
- \[2024/04\] Support Qwen1.5-MOE and dbrx.
@@ -93,14 +94,15 @@ The LMDeploy TurboMind engine has excellent inference capability; on models of various scales
| :-----------------: | :---------: |
| Llama | 7B - 65B |
| Llama2 | 7B - 70B |
| Llama3 | 8B, 70B |
| InternLM | 7B - 20B |
| InternLM2 | 7B - 20B |
| InternLM-XComposer | 7B |
| InternLM-XComposer2 | 7B, 4khd-7B |
| QWen | 1.8B - 72B |
| QWen-VL | 7B |
| QWen1.5 | 0.5B - 72B |
| QWen1.5-MoE | A2.7B |
| QWen-VL | 7B |
| Baichuan | 7B - 13B |
| Baichuan2 | 7B - 13B |
| Code Llama | 7B - 34B |
8 changes: 6 additions & 2 deletions docs/en/quantization/kv_quant.md
@@ -1,6 +1,6 @@
# Key-Value (KV) Cache Quantization

The latest main branch of LMDeploy supports **online** key-value (kv) cache quantization with int4 and int8 numerical precision, utilizing an asymmetric quantization method that is applied on a per-head, per-token basis. The original kv offline quantization method has been removed.
Since v0.4.0, LMDeploy has supported **online** key-value (kv) cache quantization with int4 and int8 numerical precision, utilizing an asymmetric quantization method that is applied on a per-head, per-token basis. The original kv offline quantization method has been removed.

Intuitively, quantizing the kv cache is beneficial for reducing memory usage. Compared to FP16, the memory for int4/int8 kv can be reduced to 1/4 and 1/2, respectively. This means that under the same memory conditions, the system can support a significantly increased number of concurrent operations after kv quantization, thereby ultimately enhancing throughput.
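
For a back-of-the-envelope sense of these ratios, here is a minimal sketch (added for illustration, not part of the original guide) that computes the per-token kv cache footprint for a llama2-7b-sized model. The 32-layer, 32-head, 128-dim configuration is the published llama2-7b shape and is an assumption here; quantization scales and zero-points are ignored.

```python
# Rough per-token kv cache size, assuming llama2-7b-style dimensions.
num_layers = 32
num_kv_heads = 32
head_dim = 128

def kv_bytes_per_token(bits_per_element: int) -> int:
    # The factor 2 accounts for storing both the key and the value tensors.
    return 2 * num_layers * num_kv_heads * head_dim * bits_per_element // 8

for name, bits in [("fp16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{name}: {kv_bytes_per_token(bits) / 1024:.0f} KiB per token")
# fp16: 512 KiB, int8: 256 KiB, int4: 128 KiB -> 1/2 and 1/4 of the fp16 footprint
```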

@@ -21,7 +21,11 @@ In summary, LMDeploy kv quantization has the following advantages:
3. KV int8 quantization has almost lossless accuracy, and KV int4 quantization accuracy is within an acceptable range
4. Efficient inference: with int8/int4 kv quantization applied to llama2-7b, RPS is improved by around 30% and 40% respectively compared to fp16

In the next section, we will take the `internlm2-chat-7b` model as an example to introduce the usage of kv quantization and inference in lmdeploy. But before that, please install lmdeploy from source according to the [build](../build.md) guide, because lmdeploy hasn't released this feature yet.
In the next section, we will take the `internlm2-chat-7b` model as an example to introduce the usage of kv quantization and inference in lmdeploy. But before that, please ensure that lmdeploy is installed:

```shell
pip install lmdeploy
```

## Usage

8 changes: 6 additions & 2 deletions docs/zh_cn/quantization/kv_quant.md
@@ -1,6 +1,6 @@
# Key-Value (KV) Cache Quantization

The latest main branch of LMDeploy supports **online** kv cache int4/int8 quantization, using an asymmetric per-head, per-token quantization scheme. The original offline kv quantization method has been removed.
Since v0.4.0, LMDeploy has supported **online** kv cache int4/int8 quantization, using an asymmetric per-head, per-token quantization scheme. The original offline kv quantization method has been removed.

Intuitively, quantizing the kv cache helps reduce memory usage. Compared to fp16, int4/int8 kv memory can be reduced to 1/4 and 1/2 respectively. This means that, under the same memory budget, the system can support far more concurrent requests after kv quantization, ultimately improving throughput.

@@ -21,7 +21,11 @@ LMDeploy kv 4/8-bit quantization and inference support the following NVIDIA GPU models:
3. kv int8 quantization is almost lossless in accuracy, and kv int4 quantization accuracy is within an acceptable range
4. Efficient inference: with int8/int4 kv quantization on llama2-7b, RPS improves by nearly 30% and 40% respectively compared to fp16

Next, we take the internlm2-chat-7b model as an example to introduce several applications of kv quantization and inference. Before that, please first install lmdeploy from source following the [documentation](https://lmdeploy.readthedocs.io/en/latest/build.html), because online 4-bit/8-bit kv cache quantization has not been released yet.
Next, we take the internlm2-chat-7b model as an example to introduce several applications of kv quantization and inference. Before that, please install lmdeploy:

```shell
pip install lmdeploy
```
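
Mirroring the sketch in the English guide, the snippet below (an illustrative assumption, not the document's canonical example) enables online kv int4 quantization via the `pipeline` API by setting `quant_policy=4` in `TurbomindEngineConfig`; use `quant_policy=8` for int8. The model path is likewise only an example.

```python
from lmdeploy import pipeline, TurbomindEngineConfig

# quant_policy=4 enables online kv int4 quantization (use 8 for int8).
engine_config = TurbomindEngineConfig(quant_policy=4)
pipe = pipeline("internlm/internlm2-chat-7b", backend_config=engine_config)

print(pipe(["Hi, please introduce yourself"]))
```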

## Usage Examples

2 changes: 1 addition & 1 deletion lmdeploy/version.py
@@ -1,7 +1,7 @@
# Copyright (c) OpenMMLab. All rights reserved.
from typing import Tuple

__version__ = '0.3.0'
__version__ = '0.4.0'
short_version = __version__


1 change: 1 addition & 0 deletions requirements/runtime.txt
@@ -1,3 +1,4 @@
einops
fastapi
fire
mmengine-lite
