# LMDeploy Release v0.6.2

## Highlights
- PyTorch engine supports graph mode on the Ascend platform, doubling the inference speed
- Support llama3.2-vision models in the PyTorch engine
- Support Mixtral in the TurboMind engine, achieving 20+ RPS on the ShareGPT dataset with 2 A100-80G GPUs (see the usage sketch after this list)
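
A minimal sketch of trying the three highlights through the `pipeline` API. The model paths, image URL, and `tp` value are illustrative assumptions, and graph mode is assumed to be the default on Ascend unless eager mode is requested:

```python
from lmdeploy import pipeline, PytorchEngineConfig, TurbomindEngineConfig
from lmdeploy.vl import load_image

# PyTorch engine on Ascend; graph mode is assumed to be on by default
# (eager_mode=True would fall back to eager execution).
pipe = pipeline('internlm/internlm2_5-7b-chat',
                backend_config=PytorchEngineConfig(device_type='ascend'))
print(pipe(['Hi, please introduce yourself']))

# llama3.2-vision (mllama) with the PyTorch engine.
vl_pipe = pipeline('meta-llama/Llama-3.2-11B-Vision-Instruct',
                   backend_config=PytorchEngineConfig())
image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
print(vl_pipe(('Describe this image', image)))

# Mixtral with the TurboMind engine, tensor-parallel across 2 GPUs.
moe_pipe = pipeline('mistralai/Mixtral-8x7B-Instruct-v0.1',
                    backend_config=TurbomindEngineConfig(tp=2))
print(moe_pipe(['Hi, please introduce yourself']))
```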
## What's Changed
### 🚀 Features
- support downloading models from openmind_hub by @cookieyyds in #2563
- Support pytorch engine kv int4/int8 quantization by @AllentDan in #2438 (see the sketch after this list)
- feat(ascend): support w4a16 by @yao-fengchen in #2587
- [maca] add maca backend support. by @Reinerzhou in #2636
- Support mllama for pytorch engine by @AllentDan in #2605
- add --eager-mode to cli by @RunningLeon in #2645
- [ascend] add ascend graph mode by @CyCle1024 in #2647
- MoE support for turbomind by @lzhangzz in #2621
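
For the new PyTorch-engine kv cache quantization, a minimal sketch assuming the engine reuses the `quant_policy` convention (4 for kv int4, 8 for kv int8) and that the `eager_mode` field mirrors the new `--eager-mode` CLI switch; the model path is illustrative:

```python
from lmdeploy import pipeline, PytorchEngineConfig

# quant_policy=8 requests an int8 kv cache; quant_policy=4 requests int4.
# eager_mode stays False to keep graph mode; the new --eager-mode CLI flag flips it.
pipe = pipeline('internlm/internlm2_5-7b-chat',
                backend_config=PytorchEngineConfig(quant_policy=8,
                                                   eager_mode=False))
print(pipe(['Hi, please introduce yourself']))
```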
### 💥 Improvements
- [Feature] Add argument to disable FastAPI docs by @mouweng in #2540
- add check for device with cap 7.x by @grimoire in #2535
- Add tool role for langchain usage by @AllentDan in #2558
- Fix llama3.2-1b inference error by handling tie_word_embedding by @grimoire in #2568
- Add a workaround for saving internvl2 with latest transformers by @AllentDan in #2583
- optimize paged attention on triton3 by @grimoire in #2553
- refactor for multi backends in dlinfer by @CyCle1024 in #2619
- Copy sglang/bench_serving.py to lmdeploy as serving benchmark script by @lvhan028 in #2620
- Add barrier to prevent TP nccl kernel waiting. by @grimoire in #2607
- [ascend] refactor fused_moe on ascend platform by @yao-fengchen in #2613
- [ascend] support paged_prefill_attn when batch > 1 by @yao-fengchen in #2612
- Raise an error for the wrong chat template by @AllentDan in #2618
- refine pre-post-process by @jinminxi104 in #2632
- small block_m for sm7.x by @grimoire in #2626
- update check for triton by @grimoire in #2641
- Support llama3.2 LLM models in turbomind engine by @lvhan028 in #2596
- Check whether device support bfloat16 by @lvhan028 in #2653
- Add warning message about `do_sample` to alert the BC change by @lvhan028 in #2654 (see the sketch after this list)
- update ascend dockerfile by @CyCle1024 in #2661
- fix supported model list in ascend graph mode by @jinminxi104 in #2669
- remove dlinfer version by @CyCle1024 in #2672
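
The `do_sample` warning concerns generation now defaulting to greedy decoding unless sampling is explicitly requested. A minimal sketch, assuming `GenerationConfig` exposes `do_sample` alongside the usual sampling parameters (model path illustrative):

```python
from lmdeploy import pipeline, GenerationConfig

pipe = pipeline('internlm/internlm2_5-7b-chat')
# Opt in to sampling explicitly; without do_sample=True decoding is greedy.
gen_config = GenerationConfig(do_sample=True, temperature=0.8, top_p=0.95)
print(pipe(['Hi, please introduce yourself'], gen_config=gen_config))
```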
### 🐞 Bug fixes
- set outlines<0.1.0 by @AllentDan in #2559
- fix: make exit_flag verification for ascend more general by @CyCle1024 in #2588
- set capture mode thread_local by @grimoire in #2560
- Add distributed context in pytorch engine to support torchrun by @grimoire in #2615
- Fix error in python3.8. by @Reinerzhou in #2646
- Align UT with triton fill_kv_cache_quant kernel by @AllentDan in #2644
- Fix missing device_type when checking is_bf16_supported on ascend platform by @lvhan028 in #2663
- fix syntax in Dockerfile_aarch64_ascend by @CyCle1024 in #2664
- Set history_cross_kv_seqlens to 0 by default by @AllentDan in #2666
- fix build error in ascend dockerfile by @CyCle1024 in #2667
- bugfix: llava-hf/llava-interleave-qwen-7b-hf (#2497) by @deepindeed2022 in #2657
- fix inference mode error for qwen2-vl by @irexyc in #2668
### 📚 Documentations
- Add instruction for downloading models from openmind hub by @cookieyyds in #2577
- Fix spacing in ascend user guide by @Superskyyy in #2601
- Update get_started tutorial about deploying on ascend platform by @jinminxi104 in #2655
- Update ascend get_started tutorial about installing nnal by @jinminxi104 in #2662
### 🌐 Other
- [ci] add oc infer test in stable test by @zhulinJulia24 in #2523
- update copyright by @lvhan028 in #2579
- [Doc]: Lock sphinx version by @RunningLeon in #2594
- [ci] use local requirements for test workflow by @zhulinJulia24 in #2569
- [ci] add pytorch kvint testcase into function regression by @zhulinJulia24 in #2584
- [ci] Refactor dailytest workflow by @zhulinJulia24 in #2617
- [ci] fix restful script by @zhulinJulia24 in #2635
- [ci] add internlm2_5_7b_batch_1 into evaluation testcase by @zhulinJulia24 in #2631
- match torch and torch_vision version by @grimoire in #2649
- Bump version to v0.6.2 by @lvhan028 in #2659
## New Contributors
- @mouweng made their first contribution in #2540
- @cookieyyds made their first contribution in #2563
- @Superskyyy made their first contribution in #2601
- @Reinerzhou made their first contribution in #2636
- @deepindeed2022 made their first contribution in #2657
**Full Changelog**: v0.6.1...v0.6.2