# LMDeploy Release V0.4.2
## Highlight

- Support 4-bit weight-only quantization and inference for VLMs, such as InternVL v1.5, LLaVA, and InternLM-XComposer2
**Quantization**

```shell
lmdeploy lite auto_awq OpenGVLab/InternVL-Chat-V1-5 --work-dir ./InternVL-Chat-V1-5-AWQ
```
**Inference with the quantized model**

```python
from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

# model_format='awq' tells the engine to load the 4-bit AWQ weights
pipe = pipeline('./InternVL-Chat-V1-5-AWQ',
                backend_config=TurbomindEngineConfig(tp=1, model_format='awq'))
img = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
out = pipe(('describe this image', img))
print(out)
```
- Balance vision model weights across multiple GPUs when deploying VLMs

```python
from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

# with tp=2, the vision model weights are balanced across both GPUs
pipe = pipeline('OpenGVLab/InternVL-Chat-V1-5',
                backend_config=TurbomindEngineConfig(tp=2))
img = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
out = pipe(('describe this image', img))
print(out)
```
## What's Changed
### 🚀 Features
- PyTorch Engine hash table based prefix caching by @grimoire in #1429
- support phi3 by @grimoire in #1497
- Turbomind prefix caching by @ispobock in #1450 (see the first sketch after this list)
- Enable search scale for awq by @AllentDan in #1545 (see the second sketch after this list)
- [Feature] Support vl models quantization by @AllentDan in #1553
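
Following up on the prefix-caching entries above (#1429 for the PyTorch engine, #1450 for TurboMind), here is a minimal sketch of how the feature can be switched on. It assumes `enable_prefix_caching` is the relevant field on the engine configs and uses `internlm/internlm2-chat-7b` purely as a placeholder model:

```python
from lmdeploy import pipeline, TurbomindEngineConfig

# Assumption: enable_prefix_caching=True turns on KV-block reuse for
# requests that share a common prompt prefix (e.g. the same system prompt).
pipe = pipeline('internlm/internlm2-chat-7b',
                backend_config=TurbomindEngineConfig(enable_prefix_caching=True))

# Both prompts start with the same prefix, so the second request can reuse
# the KV cache blocks computed for the first.
prefix = 'You are a helpful assistant. Answer concisely.\n'
print(pipe([prefix + 'What is prefix caching?',
            prefix + 'Why does it speed up serving?']))
```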
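
Likewise for #1545, a hedged sketch of AWQ scale search. The `--search-scale` flag name is an assumption based on the PR title; the idea is to trade longer calibration time for better quantization scales:

```shell
# Assumption: --search-scale enables searching for better AWQ scaling
# factors during calibration (slower calibration, usually better accuracy).
lmdeploy lite auto_awq internlm/internlm2-chat-7b \
    --search-scale \
    --work-dir ./internlm2-chat-7b-awq
```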
### 💥 Improvements
- make Qwen compatible with Slora when TP > 1 by @jjjjohnson in #1518
- Optimize slora by @grimoire in #1447
- Use a faster format for images in VLMs by @isidentical in #1575
- add chat-template args to chat cli by @RunningLeon in #1566 (see the first sketch after this list)
- Get the max session len from config.json by @AllentDan in #1550
- Optimize w8a8 kernel by @grimoire in #1353
- support python 3.12 by @irexyc in #1605
- Optimize moe by @grimoire in #1520
- Balance vision model weights on multi gpus by @irexyc in #1591
- Support user-specified IMAGE_TOKEN position for deepseek-vl model by @irexyc in #1627 (see the second sketch after this list)
- Optimize GQA/MQA by @grimoire in #1649
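
As a pointer for #1566 above, a minimal sketch of passing a custom chat template to the chat CLI. The `--chat-template` flag and the idea of supplying a JSON template file are assumptions based on the PR title; the file contents are not shown here:

```shell
# Assumption: --chat-template accepts a custom chat template definition
# (e.g. a JSON file) instead of the model's built-in one.
lmdeploy chat internlm/internlm2-chat-7b --chat-template ./my_chat_template.json
```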
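
And for #1627, a sketch of placing the image token explicitly in a deepseek-vl prompt. It assumes `lmdeploy.vl.constants.IMAGE_TOKEN` is the placeholder the pipeline recognizes:

```python
from lmdeploy import pipeline
from lmdeploy.vl import load_image
from lmdeploy.vl.constants import IMAGE_TOKEN  # assumed location of the token

pipe = pipeline('deepseek-ai/deepseek-vl-1.3b-chat')
img = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')

# Place IMAGE_TOKEN where the image should appear, instead of relying on
# the default position chosen by the chat template.
out = pipe((f'{IMAGE_TOKEN}\ndescribe this image', img))
print(out)
```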
### 🐞 Bug fixes
- fix logger init by @AllentDan in #1598
- Bugfix: wrongly assign gen_config with True by @thelongestusernameofall in #1594
- Enable split-kv for attention by @lzhangzz in #1606
- Fix xcomposer2 vision model process by @irexyc in #1640
- Fix NTK scaling by @lzhangzz in #1636
- Fix illegal memory access when seq_len < 64 by @lzhangzz in #1616
- Fix llava vl template by @irexyc in #1620
- [side-effect] fix deepseek-vl when tp is 1 by @irexyc in #1648
- fix logprobs output by @irexyc in #1561
- fix fused-moe in triton2.2.0 by @grimoire in #1654
- Align tokenizers in pipeline and api_server benchmark scripts by @AllentDan in #1650
- [side-effect] fix UnboundLocalError for internlm-xcomposer2-4khd-7b by @irexyc in #1661
- remove paged attention prefill autotune by @grimoire in #1658
- Fix transformers 4.41.0 prompt may differ after encode decode by @AllentDan in #1617
### 📚 Documentation
- Fix typo in w8a8.md by @chg0901 in #1568
- Update doc for prefix caching by @ispobock in #1597
- Update VL document by @AllentDan in #1657
### 🌐 Other
- remove first empty token check and add input validation testcase by @zhulinJulia24 in #1549
- add more model into benchmark and evaluate workflow by @zhulinJulia24 in #1565
- add vl awq testcase and refactor pipeline testcase by @zhulinJulia24 in #1630
- bump version to v0.4.2 by @lvhan028 in #1644
## New Contributors
- @isidentical made their first contribution in #1575
- @chg0901 made their first contribution in #1568
- @thelongestusernameofall made their first contribution in #1594
**Full Changelog**: v0.4.1...v0.4.2