bump version to v0.4.0 (#1469)
* bump version to v0.4.0

* update

* update news

* update news

* update supported models
lvhan028 authored Apr 23, 2024
1 parent 6b8718d commit 04ba0ff
Showing 6 changed files with 18 additions and 6 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -26,6 +26,7 @@
<details open>
<summary><b>2024</b></summary>

- \[2024/04\] Support Llama3 and more VLMs, such as InternVL v1.1, v1.2, MiniGemini, and InternLM-XComposer2.
- \[2024/04\] TurboMind adds online int8/int4 KV cache quantization and inference for all supported devices. Refer [here](docs/en/quantization/kv_quant.md) for a detailed guide.
- \[2024/04\] The latest TurboMind upgrade boosts GQA, rocketing [internlm2-20b](https://huggingface.co/internlm/internlm2-20b) model inference to 16+ RPS, about 1.8x faster than vLLM.
- \[2024/04\] Support Qwen1.5-MOE and dbrx.
4 changes: 3 additions & 1 deletion README_zh-CN.md
@@ -26,6 +26,7 @@
<details open>
<summary><b>2024</b></summary>

- \[2024/04\] Support Llama3 and VLMs such as InternVL v1.1, v1.2, MiniGemini, and InternLM-XComposer2
- \[2024/04\] TurboMind supports online int4/int8 kv cache quantization and inference on all supported GPU models. Refer [here](docs/zh_cn/quantization/kv_quant.md) for details
- \[2024/04\] TurboMind engine upgrade with optimized GQA inference. [internlm2-20b](https://huggingface.co/internlm/internlm2-20b) inference reaches 16+ RPS, about 1.8x faster than vLLM
- \[2024/04\] Support Qwen1.5-MOE and dbrx.
@@ -93,14 +94,15 @@ The LMDeploy TurboMind engine has excellent inference capability; on models of various scales
| :-----------------: | :---------: |
| Llama | 7B - 65B |
| Llama2 | 7B - 70B |
| Llama3 | 8B, 70B |
| InternLM | 7B - 20B |
| InternLM2 | 7B - 20B |
| InternLM-XComposer | 7B |
| InternLM-XComposer2 | 7B, 4khd-7B |
| QWen | 1.8B - 72B |
| QWen-VL | 7B |
| QWen1.5 | 0.5B - 72B |
| QWen1.5-MoE | A2.7B |
| QWen-VL | 7B |
| Baichuan | 7B - 13B |
| Baichuan2 | 7B - 13B |
| Code Llama | 7B - 34B |
8 changes: 6 additions & 2 deletions docs/en/quantization/kv_quant.md
@@ -1,6 +1,6 @@
# Key-Value (KV) Cache Quantization

The latest main branch of LMDeploy supports **online** key-value (kv) cache quantization with int4 and int8 numerical precision, utilizing an asymmetric quantization method that is applied on a per-head, per-token basis. The original kv offline quantization method has been removed.
Since v0.4.0, LMDeploy has supported **online** key-value (kv) cache quantization with int4 and int8 numerical precision, utilizing an asymmetric quantization method that is applied on a per-head, per-token basis. The original kv offline quantization method has been removed.

Intuitively, quantizing the kv cache is beneficial for reducing memory usage. Compared to FP16, the memory for int4/int8 kv can be reduced to 1/4 and 1/2, respectively. This means that under the same memory conditions, the system can support a significantly increased number of concurrent operations after kv quantization, thereby ultimately enhancing throughput.
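
For a back-of-the-envelope sense of these ratios, here is a minimal sketch (added for illustration, not part of the original guide) that computes the per-token kv cache footprint for a llama2-7b-sized model. The 32-layer, 32-head, 128-dim configuration is the published llama2-7b shape and is an assumption here; quantization scales and zero-points are ignored.

```python
# Rough per-token kv cache size, assuming llama2-7b-style dimensions.
num_layers = 32
num_kv_heads = 32
head_dim = 128

def kv_bytes_per_token(bits_per_element: int) -> int:
    # The factor 2 accounts for storing both the key and the value tensors.
    return 2 * num_layers * num_kv_heads * head_dim * bits_per_element // 8

for name, bits in [("fp16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{name}: {kv_bytes_per_token(bits) / 1024:.0f} KiB per token")
# fp16: 512 KiB, int8: 256 KiB, int4: 128 KiB -> 1/2 and 1/4 of the fp16 footprint
```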

@@ -21,7 +21,11 @@ In summary, LMDeploy kv quantization has the following advantages:
3. KV int8 quantization has almost lossless accuracy, and KV int4 quantization accuracy is within an acceptable range
4. Efficient inference: with int8/int4 kv quantization applied to llama2-7b, RPS is improved by around 30% and 40% respectively compared to fp16

In the next section, we will take the `internlm2-chat-7b` model as an example to introduce the usage of kv quantization and inference in lmdeploy. But before that, please install lmdeploy from source according to the [build](../build.md) guide, because lmdeploy hasn't released this feature yet.
In the next section, we will take the `internlm2-chat-7b` model as an example to introduce the usage of kv quantization and inference in lmdeploy. But before that, please ensure that lmdeploy is installed:

```shell
pip install lmdeploy
```

## Usage

8 changes: 6 additions & 2 deletions docs/zh_cn/quantization/kv_quant.md
@@ -1,6 +1,6 @@
# Key-Value (KV) Cache Quantization

The latest main branch of LMDeploy supports **online** kv cache int4/int8 quantization, using an asymmetric per-head, per-token quantization scheme. The original offline kv quantization method has been removed.
Since v0.4.0, LMDeploy has supported **online** kv cache int4/int8 quantization, using an asymmetric per-head, per-token quantization scheme. The original offline kv quantization method has been removed.

Intuitively, quantizing the kv cache helps reduce memory usage. Compared to fp16, int4/int8 kv memory can be reduced to 1/4 and 1/2 respectively. This means that, under the same memory budget, the system can support far more concurrent requests after kv quantization, ultimately improving throughput.

@@ -21,7 +21,11 @@ LMDeploy kv 4/8-bit quantization and inference support the following NVIDIA GPU models:
3. kv int8 quantization is almost lossless in accuracy, and kv int4 quantization accuracy is within an acceptable range
4. Efficient inference: with int8/int4 kv quantization on llama2-7b, RPS improves by nearly 30% and 40% respectively compared to fp16

Next, we take the internlm2-chat-7b model as an example to introduce several applications of kv quantization and inference. Before that, please first install lmdeploy from source following the [documentation](https://lmdeploy.readthedocs.io/en/latest/build.html), because online 4-bit/8-bit kv cache quantization has not been released yet.
Next, we take the internlm2-chat-7b model as an example to introduce several applications of kv quantization and inference. Before that, please install lmdeploy:

```shell
pip install lmdeploy
```
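
Mirroring the sketch in the English guide, the snippet below (an illustrative assumption, not the document's canonical example) enables online kv int4 quantization via the `pipeline` API by setting `quant_policy=4` in `TurbomindEngineConfig`; use `quant_policy=8` for int8. The model path is likewise only an example.

```python
from lmdeploy import pipeline, TurbomindEngineConfig

# quant_policy=4 enables online kv int4 quantization (use 8 for int8).
engine_config = TurbomindEngineConfig(quant_policy=4)
pipe = pipeline("internlm/internlm2-chat-7b", backend_config=engine_config)

print(pipe(["Hi, please introduce yourself"]))
```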

## Usage Examples

2 changes: 1 addition & 1 deletion lmdeploy/version.py
@@ -1,7 +1,7 @@
# Copyright (c) OpenMMLab. All rights reserved.
from typing import Tuple

__version__ = '0.3.0'
__version__ = '0.4.0'
short_version = __version__


1 change: 1 addition & 0 deletions requirements/runtime.txt
@@ -1,3 +1,4 @@
einops
fastapi
fire
mmengine-lite
