LMDeploy Release V0.0.6

Released by @lvhan028 on 25 Aug 13:30 · cfabbbd

Highlights

  • Support Qwen-7B with dynamic NTK scaling and logN scaling in turbomind (see the sketch after this list)
  • Support tensor parallelism for W4A16
  • Add OpenAI-like RESTful API
  • Support Llama-2 70B 4-bit quantization
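
The first highlight combines dynamic NTK scaling with logN attention scaling so Qwen-7B can run beyond its training context. The sketch below uses the commonly published formulations of these two techniques (dynamic-NTK rescaling of the RoPE base and Qwen-style logN query scaling); turbomind's CUDA kernels may differ in detail, so the helper names and defaults are assumptions for illustration only.

```python
import math

def dynamic_ntk_base(base: float, dim: int, seq_len: int, train_len: int,
                     scaling_factor: float = 1.0) -> float:
    """Rescale the RoPE base once the context grows past the training length.

    Uses the widely adopted dynamic-NTK formula; not turbomind's actual code.
    """
    if seq_len <= train_len:
        return base
    return base * (scaling_factor * seq_len / train_len
                   - (scaling_factor - 1)) ** (dim / (dim - 2))

def logn_scale(position: int, train_len: int) -> float:
    """logN scaling: queries at positions beyond the training length are
    scaled by log_{train_len}(position); earlier positions are left unscaled."""
    return max(1.0, math.log(position) / math.log(train_len))

# Qwen-7B-like settings: 2048-token training context, 128-dim rotary heads.
print(dynamic_ntk_base(10000.0, dim=128, seq_len=8192, train_len=2048))  # ~40890
print(logn_scale(position=8192, train_len=2048))                         # ~1.18
```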

What's Changed

🚀 Features

💥 Improvements

🐞 Bug fixes

  • Adjust dependency of gradio server by @AllentDan in #236
  • Implement movmatrix using warp shuffling for CUDA < 11.8 by @lzhangzz in #267
  • Add 'accelerate' to requirement list by @lvhan028 in #261
  • Fix building with CUDA 11.3 by @lzhangzz in #280
  • Pad tok_embedding and output weights to make their shape divisible by TP by @lvhan028 in #285 (see the padding sketch after this list)
  • Fix llama2 70b & qwen quantization error by @pppppM in #273
  • Import turbomind in gradio server only when it is needed by @AllentDan in #303
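
The padding fix (#285) comes down to extending the vocab dimension of the token-embedding and output weights so they split evenly across tensor-parallel ranks. A minimal sketch, assuming zero-padding along the vocab axis; the actual axis and strategy in lmdeploy may differ.

```python
import numpy as np

def pad_vocab_for_tp(weight: np.ndarray, tp: int) -> np.ndarray:
    """Zero-pad the vocab (row) dimension so the weight splits evenly across
    `tp` ranks. Illustrative only; not lmdeploy's actual implementation."""
    vocab, hidden = weight.shape
    remainder = vocab % tp
    if remainder == 0:
        return weight
    pad = np.zeros((tp - remainder, hidden), dtype=weight.dtype)
    return np.concatenate([weight, pad], axis=0)

# Example: a 32000-row embedding does not split across 3 GPUs until one
# padded row brings it to 32001 rows.
emb = np.random.randn(32000, 4096).astype(np.float16)
assert pad_vocab_for_tp(emb, tp=3).shape[0] % 3 == 0
```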

📚 Documentations

🌐 Other

Known issues

  • Inference with the 4-bit Qwen-7B model fails. #307 is addressing this issue.

Full Changelog: v0.0.5...v0.0.6