# LMDeploy Release V0.0.6
## Highlights
- Support Qwen-7B with dynamic NTK scaling and logN scaling in turbomind (see the sketch after this list)
- Support tensor parallelism for W4A16
- Add OpenAI-like RESTful API
- Support Llama-2 70B 4-bit quantization
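
Dynamic NTK scaling enlarges RoPE's base frequency as the sequence grows past the training context, and logN scaling damps attention for out-of-window positions. Below is a minimal sketch of both rules, following the commonly used Hugging Face-style formulation; the helper names are illustrative and do not reflect turbomind's internals:

```python
import math

def dynamic_ntk_base(base: float, dim: int, seq_len: int, train_len: int) -> float:
    """Enlarge the RoPE base once the sequence outgrows the training
    context so rotary frequencies stretch smoothly (dynamic NTK).
    Illustrative formula; turbomind's exact rule may differ."""
    if seq_len <= train_len:
        return base
    return base * (seq_len / train_len) ** (dim / (dim - 2))

def logn_scale(pos: int, train_len: int) -> float:
    """Qwen-style logN scaling: multiply query vectors by
    log_train_len(pos) for positions beyond the training window."""
    return math.log(pos) / math.log(train_len) if pos > train_len else 1.0
```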
## What's Changed
### 🚀 Features
- Profiling tool for Hugging Face and DeepSpeed models by @wangruohui in #161
- Support windows platform by @irexyc in #209
- Qwen-7B, dynamic NTK scaling and logN scaling support in turbomind by @lzhangzz in #230
- Add RESTful API by @AllentDan in #223 (usage example after this list)
- Support context decoding with DP in pytorch by @wangruohui in #193
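
The new RESTful API follows an OpenAI-like schema. A minimal client sketch, assuming the server is running on localhost:23333 and exposes an OpenAI-style `/v1/chat/completions` route; the port, path, and payload fields here are assumptions, so consult the API docs from #223 for the exact schema:

```python
import requests

# Hypothetical endpoint and payload, shaped after OpenAI's chat API.
resp = requests.post(
    "http://localhost:23333/v1/chat/completions",
    json={
        "model": "llama",
        "messages": [{"role": "user", "content": "Hello!"}],
    },
    timeout=60,
)
print(resp.json())  # expected to contain the model's reply
```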
### 💥 Improvements
- Support TP for W4A16 by @lzhangzz in #262 (sharding sketch after this list)
- Pass chat template args, including meta_prompt, to model (7785142) by @AllentDan in #225
- Enable the Gradio server to call inference services through the RESTful API by @AllentDan in #287
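
For W4A16, tensor parallelism must split the packed 4-bit weights together with their per-group quantization scales. A column-parallel sharding sketch under an assumed layout (`qweight` packing eight 4-bit values per int32 along the output dimension, `scales` per group and output channel); names and layout are illustrative, not turbomind's internal format:

```python
import torch

def shard_w4a16(qweight: torch.Tensor, scales: torch.Tensor,
                tp: int, rank: int):
    """Split a quantized linear layer column-wise across `tp` ranks.
    qweight: (in_features, out_features // 8) int32, 8 x 4-bit per word
    scales:  (in_features // group_size, out_features) fp16
    """
    cols = scales.shape[1]
    assert cols % tp == 0, "output dim must be divisible by tp"
    step = cols // tp
    assert step % 8 == 0, "per-rank width must cover whole int32 words"
    # Each packed int32 column holds 8 output channels, so the packed
    # slice width is step // 8.
    return (qweight[:, rank * (step // 8):(rank + 1) * (step // 8)],
            scales[:, rank * step:(rank + 1) * step])
```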
### 🐞 Bug fixes
- Adjust dependency of gradio server by @AllentDan in #236
- Implement `movmatrix` using warp shuffling for CUDA < 11.8 by @lzhangzz in #267
- Add 'accelerate' to requirement list by @lvhan028 in #261
- Fix building with CUDA 11.3 by @lzhangzz in #280
- Pad tok_embedding and output weights to make their shape divisible by TP by @lvhan028 in #285 (see the padding sketch after this list)
- Fix Llama-2 70B & Qwen quantization error by @pppppM in #273
- Import turbomind in gradio server only when it is needed by @AllentDan in #303
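
#285 zero-pads the embedding and output weights so the vocabulary dimension divides evenly across TP ranks. A minimal sketch of that padding arithmetic (the helper name is hypothetical):

```python
import torch

def pad_vocab_for_tp(weight: torch.Tensor, tp: int) -> torch.Tensor:
    """Zero-pad a (vocab_size, hidden_dim) weight so that vocab_size
    becomes divisible by the tensor-parallel degree `tp`."""
    vocab_size, hidden_dim = weight.shape
    pad = (tp - vocab_size % tp) % tp
    if pad == 0:
        return weight
    return torch.cat([weight, weight.new_zeros(pad, hidden_dim)], dim=0)

# e.g. a 32001-row vocab with tp=2 -> padded by 1 row to 32002
```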
### 📚 Documentation
- Remove specified version in user guide by @lvhan028 in #241
- docs(quantization): update description by @tpoisonooo in #253 and #272
- Check-in FAQ by @lvhan028 in #256
- Add readthedocs by @RunningLeon in #208
### 🌐 Other
- Update workflow for building docker image by @RunningLeon in #282
- Change to github-hosted runner for building docker image by @RunningLeon in #291
## Known issues
- 4-bit Qwen-7B model inference fails. #307 is addressing this issue.
Full Changelog: v0.0.5...v0.0.6