# LMDeploy Release V0.4.2
## Highlight

- Support 4-bit weight-only quantization and inference for VLMs, such as InternVL v1.5, LLaVA, and InternLM-XComposer2
**Quantization**

```shell
lmdeploy lite auto_awq OpenGVLab/InternVL-Chat-V1-5 --work-dir ./InternVL-Chat-V1-5-AWQ
```
**Inference with the quantized model**

```python
from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

# model_format='awq' tells the engine to load the 4-bit AWQ weights
pipe = pipeline('./InternVL-Chat-V1-5-AWQ',
                backend_config=TurbomindEngineConfig(tp=1, model_format='awq'))
img = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
out = pipe(('describe this image', img))
print(out)
```
- Balance vision model weights across multiple GPUs when deploying VLMs

```python
from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

# with tp=2, the vision model weights are balanced across both GPUs
pipe = pipeline('OpenGVLab/InternVL-Chat-V1-5',
                backend_config=TurbomindEngineConfig(tp=2))
img = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
out = pipe(('describe this image', img))
print(out)
```
## What's Changed
### 🚀 Features
- PyTorch Engine hash table based prefix caching by @grimoire in #1429
- support phi3 by @grimoire in #1497
- Turbomind prefix caching by @ispobock in #1450 (see the first sketch after this list)
- Enable search scale for awq by @AllentDan in #1545 (see the second sketch after this list)
- [Feature] Support vl models quantization by @AllentDan in #1553
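
Following up on the prefix-caching entries above (#1429 for the PyTorch engine, #1450 for TurboMind), here is a minimal sketch of how the feature can be switched on. It assumes `enable_prefix_caching` is the relevant field on the engine configs and uses `internlm/internlm2-chat-7b` purely as a placeholder model:

```python
from lmdeploy import pipeline, TurbomindEngineConfig

# Assumption: enable_prefix_caching=True turns on KV-block reuse for
# requests that share a common prompt prefix (e.g. the same system prompt).
pipe = pipeline('internlm/internlm2-chat-7b',
                backend_config=TurbomindEngineConfig(enable_prefix_caching=True))

# Both prompts start with the same prefix, so the second request can reuse
# the KV cache blocks computed for the first.
prefix = 'You are a helpful assistant. Answer concisely.\n'
print(pipe([prefix + 'What is prefix caching?',
            prefix + 'Why does it speed up serving?']))
```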
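
Likewise for #1545, a hedged sketch of AWQ scale search. The `--search-scale` flag name is an assumption based on the PR title; the idea is to trade longer calibration time for better quantization scales:

```shell
# Assumption: --search-scale enables searching for better AWQ scaling
# factors during calibration (slower calibration, usually better accuracy).
lmdeploy lite auto_awq internlm/internlm2-chat-7b \
    --search-scale \
    --work-dir ./internlm2-chat-7b-awq
```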
### 💥 Improvements
- make Qwen compatible with Slora when TP > 1 by @jjjjohnson in #1518
- Optimize slora by @grimoire in #1447
- Use a faster format for images in VLMs by @isidentical in #1575
- add chat-template args to chat cli by @RunningLeon in #1566 (see the first sketch after this list)
- Get the max session len from config.json by @AllentDan in #1550
- Optimize w8a8 kernel by @grimoire in #1353
- support python 3.12 by @irexyc in #1605
- Optimize moe by @grimoire in #1520
- Balance vision model weights on multi gpus by @irexyc in #1591
- Support user-specified IMAGE_TOKEN position for deepseek-vl model by @irexyc in #1627 (see the second sketch after this list)
- Optimize GQA/MQA by @grimoire in #1649
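
As a pointer for #1566 above, a minimal sketch of passing a custom chat template to the chat CLI. The `--chat-template` flag and the idea of supplying a JSON template file are assumptions based on the PR title; the file contents are not shown here:

```shell
# Assumption: --chat-template accepts a custom chat template definition
# (e.g. a JSON file) instead of the model's built-in one.
lmdeploy chat internlm/internlm2-chat-7b --chat-template ./my_chat_template.json
```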
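
And for #1627, a sketch of placing the image token explicitly in a deepseek-vl prompt. It assumes `lmdeploy.vl.constants.IMAGE_TOKEN` is the placeholder the pipeline recognizes:

```python
from lmdeploy import pipeline
from lmdeploy.vl import load_image
from lmdeploy.vl.constants import IMAGE_TOKEN  # assumed location of the token

pipe = pipeline('deepseek-ai/deepseek-vl-1.3b-chat')
img = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')

# Place IMAGE_TOKEN where the image should appear, instead of relying on
# the default position chosen by the chat template.
out = pipe((f'{IMAGE_TOKEN}\ndescribe this image', img))
print(out)
```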
### 🐞 Bug fixes
- fix logger init by @AllentDan in #1598
- Bugfix: wrongly assign gen_config with True by @thelongestusernameofall in #1594
- Enable split-kv for attention by @lzhangzz in #1606
- Fix xcomposer2 vision model process by @irexyc in #1640
- Fix NTK scaling by @lzhangzz in #1636
- Fix illegal memory access when seq_len < 64 by @lzhangzz in #1616
- Fix llava vl template by @irexyc in #1620
- [side-effect] fix deepseek-vl when tp is 1 by @irexyc in #1648
- fix logprobs output by @irexyc in #1561
- fix fused-moe in triton2.2.0 by @grimoire in #1654
- Align tokenizers in pipeline and api_server benchmark scripts by @AllentDan in #1650
- [side-effect] fix UnboundLocalError for internlm-xcomposer2-4khd-7b by @irexyc in #1661
- remove paged attention prefill autotune by @grimoire in #1658
- Fix transformers 4.41.0 prompt may differ after encode decode by @AllentDan in #1617
### 📚 Documentation
- Fix typo in w8a8.md by @chg0901 in #1568
- Update doc for prefix caching by @ispobock in #1597
- Update VL document by @AllentDan in #1657
### 🌐 Other
- remove first empty token check and add input validation testcase by @zhulinJulia24 in #1549
- add more model into benchmark and evaluate workflow by @zhulinJulia24 in #1565
- add vl awq testcase and refactor pipeline testcase by @zhulinJulia24 in #1630
- bump version to v0.4.2 by @lvhan028 in #1644
## New Contributors
- @isidentical made their first contribution in #1575
- @chg0901 made their first contribution in #1568
- @thelongestusernameofall made their first contribution in #1594
**Full Changelog**: v0.4.1...v0.4.2