Releases · InternLM/lmdeploy
LMDeploy Release V0.1.0a0
What's Changed
🚀 Features
- Add extra_requires to reduce dependencies by @RunningLeon in #580
- TurboMind 2 by @lzhangzz in #590
- Support loading hf model directly by @irexyc in #685
💥 Improvements
- Fix Tokenizer encode by @AllentDan in #645
- Optimize for throughput by @lzhangzz in #701
- Replace mmengine with mmengine-lite by @zhouzaida in #715
🐞 Bug fixes
- Fix init of batch state by @lzhangzz in #682
- fix turbomind stream canceling by @grimoire in #686
- [Fix] Fix load_checkpoint_in_model bug by @HIT-cwh in #690
- Fix wrong eos_id and bos_id obtained through grpc api by @lvhan028 in #644
- Fix cache/output length calculation by @lzhangzz in #738
- [Fix] Skip empty batch by @lzhangzz in #747
📚 Documentations
- [Docs] Update Supported Matrix by @pppppM in #679
- [Docs] Update KV8 Docs by @pppppM in #681
- [Doc] Update restful api doc by @AllentDan in #662
- Check-in user guide about turbomind config by @lvhan028 in #680
New Contributors
- @zhouzaida made their first contribution in #715
Full Changelog: v0.0.14...v0.1.0a0
LMDeploy Release V0.0.14
What's Changed
💥 Improvements
- Improve api_server and webui usage by @AllentDan in #544
- fix: gradio gr.Button.update deprecated after 4.0.0 by @hscspring in #637
- add cli to list the supported model names by @RunningLeon in #639
- Refactor model conversion by @irexyc in #296
- [Enhance] internlm message to prompt by @Harold-lkk in #499
- update turbomind session_len with model.session_len by @AllentDan in #634
- Manage session id using random int for gradio local mode by @aisensiy in #553
- Add UltraCM and WizardLM chat templates by @AllentDan in #599
- Add check env sub command by @RunningLeon in #654
🐞 Bug fixes
- [Fix] Qwen's quantization results are abnormal & Baichuan cannot be quantized by @pppppM in #605
- FIX: fix stop_session func bug by @yunzhongyan0 in #578
- fix benchmark serving computation mistake by @AllentDan in #630
- fix Tokenizer load error when the path of the model being converted is not writable by @irexyc in #669
- fix tokenizer_info when convert the model by @irexyc in #661
New Contributors
- @hscspring made their first contribution in #637
- @yunzhongyan0 made their first contribution in #578
Full Changelog: v0.0.13...v0.0.14
LMDeploy Release V0.0.13
What's Changed
🚀 Features
- Add more user-friendly CLI by @RunningLeon in #541
💥 Improvements
- support inference of a batch of prompts by @AllentDan in #467
Full Changelog: v0.0.12...v0.0.13
LMDeploy Release V0.0.12
What's Changed
🚀 Features
- add solar chat template by @AllentDan in #576 and #587
💥 Improvements
- change `model_format` to `qwen` when `model_name` starts with `qwen` by @lvhan028 in #575
- robust incremental decode for leading space by @AllentDan in #581
🐞 Bug fixes
- avoid splitting Chinese characters during decoding by @AllentDan in #566 (see the byte-level sketch after this list)
- Revert "[Docs] Simplify
build.md
" by @pppppM in #586 - Fix crash and remove
sys_instruct
fromchat.py
andclient.py
by @irexyc in #591
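For context on #566: a CJK character occupies three UTF-8 bytes, and a byte-level BPE tokenizer may split those bytes across two tokens, so decoding token-by-token can surface a bare replacement character. A minimal illustration of the failure mode in plain Python (not LMDeploy's code):

```python
# '你' is three UTF-8 bytes (e4 bd a0); if a token boundary falls inside
# them, decoding the first chunk alone yields U+FFFD, not the character.
partial = b"\xe4\xbd".decode("utf-8", errors="replace")
assert partial == "\ufffd"  # incomplete character -> replacement char
# The fix: hold text back until the decoded output no longer ends in U+FFFD.
```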
Full Changelog: v0.0.11...v0.0.12
LMDeploy Release V0.0.11
What's Changed
💥 Improvements
- make IPv6 compatible, safe run for coroutine interrupting by @AllentDan in #487
- support deploy qwen-14b-chat by @irexyc in #482
- add tp hint for deployment by @irexyc in #555
- Move `tokenizer.py` to the folder of lmdeploy by @grimoire in #543
🐞 Bug fixes
- Change `shared_instance` type from `weak_ptr` to `shared_ptr` by @lvhan028 in #507
- [Fix] Set the default value of `step` to 0 by @lvhan028 in #532
- [bug] fix mismatched shape for decoder output tensor by @akhoroshev in #517
- Fix typing of openai protocol. by @mokeyish in #554
📚 Documentations
- Fix typo in `docs/en/pytorch.md` by @shahrukhx01 in #539
- [Doc] update huggingface internlm-chat-7b model url by @AllentDan in #546
- [doc] Update benchmark command in w4a16.md by @del-zhenwu in #500
New Contributors
- @shahrukhx01 made their first contribution in #539
- @mokeyish made their first contribution in #554
Full Changelog: v0.0.10...v0.0.11
LMDeploy Release V0.0.10
What's Changed
💥 Improvements
- [feature] Graceful termination of background threads in LlamaV2 by @akhoroshev in #458
- expose stop words and filter eoa by @AllentDan in #352
🐞 Bug fixes
- Fix side effect brought by supporting codellama: `sequence_start` is always true when calling `model.get_prompt` by @lvhan028 in #466
- Miss meta instruction of internlm-chat model by @lvhan028 in #470
- [bug] Fix race condition by @akhoroshev in #460
- Fix compatibility issues with Pydantic 2 by @aisensiy in #465
- fix benchmark serving cannot use Qwen tokenizer by @AllentDan in #443
- Fix memory leak by @lvhan028 in #488
📚 Documentations
- Fix typo in README.md by @eltociear in #462
New Contributors
- @eltociear made their first contribution in #462
- @akhoroshev made their first contribution in #458
- @aisensiy made their first contribution in #465
Full Changelog: v0.0.9...v0.0.10
LMDeploy Release V0.0.9
Highlight
- Support InternLM 20B, including FP16, W4A16, and W4KV8
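As a rough note on the modes above (my arithmetic, not from the release notes): W4A16 stores weights in 4-bit while keeping activations in FP16, and W4KV8 additionally quantizes the KV cache to 8-bit. The weight-memory saving for a 20B model is easy to estimate:

```python
# Back-of-the-envelope weight memory for a 20B-parameter model:
# bytes = params * bits / 8
fp16_gib  = 20e9 * 16 / 8 / 2**30   # ~37 GiB at FP16
w4a16_gib = 20e9 * 4 / 8 / 2**30    # ~9 GiB with 4-bit weights
print(f"FP16 ~ {fp16_gib:.0f} GiB, W4A16 ~ {w4a16_gib:.0f} GiB")
```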
What's Changed
💥 Improvements
- Reduce gil switching by @irexyc in #407
- Profile token generation with more settings by @AllentDan in #364
🐞 Bug fixes
- Fix disk space limit for building docker image by @RunningLeon in #404
- more general pypi ci by @irexyc in #412
- Fix build.md by @pangsg in #411
- Fix memory leak by @irexyc in #415
- Fix token count bug by @AllentDan in #416
- [Fix] Support actual seqlen in flash-attention2 by @grimoire in #418
- [Fix] output[-1] when output is empty by @wangruohui in #405
🌐 Other
- rename readthedocs config file by @RunningLeon in #429
- bump version to v0.0.9 by @lvhan028 in #428
Full Changelog: v0.0.8...v0.0.9
LMDeploy Release V0.0.8
Highlight
- Support Baichuan2-7B-Base and Baichuan2-7B-Chat
- Support all features of Code Llama: code completion, infilling, chat / instruct, and python specialist
What's Changed
🚀 Features
- Support baichuan2-chat chat template by @wangruohui in #378
- Support codellama by @lvhan028 in #359
🐞 Bug fixes
- [Fix] when `stream` is False, continuous batching doesn't work by @sleepwalker2017 in #346
- [Fix] Set max dynamic smem size for decoder MHA to support context length > 8k by @lvhan028 in #377
- Fix exceed session len core dump for chat and generate by @AllentDan in #366
- [Fix] update puyu model by @Harold-lkk in #399
📚 Documentations
- [Docs] Fix quantization docs link by @LZHgrla in #367
- [Docs] Simplify `build.md` by @pppppM in #370
- [Docs] Update lmdeploy logo by @lvhan028 in #372
New Contributors
- @sleepwalker2017 made their first contribution in #346
Full Changelog: v0.0.7...v0.0.8
LMDeploy Release V0.0.7
Highlights
- Flash attention 2 is supported, boosting context decoding speed by approximately 45%
- Token_id decoding has been optimized for better efficiency
- The gemm-tuned script has been packed into the PyPI package
What's Changed
💥 Improvements
- add llama_gemm to wheel by @irexyc in #320
- Decode generated token_ids incrementally by @AllentDan in #309
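For background on #309: rather than re-emitting the full decoded text at every step, the streamer tracks how much text has already been surfaced and yields only the new suffix, skipping steps where the last token ends mid-character. A simplified sketch of the idea using a stand-in Hugging Face tokenizer (the real implementation also avoids re-decoding the whole prefix by tracking read offsets):

```python
from transformers import AutoTokenizer  # any HF tokenizer works as a stand-in

tokenizer = AutoTokenizer.from_pretrained("gpt2")

def stream_decode(token_ids):
    """Yield only the newly produced text for each additional token id."""
    emitted = 0
    for end in range(1, len(token_ids) + 1):
        text = tokenizer.decode(token_ids[:end], skip_special_tokens=True)
        # A trailing U+FFFD means the last id ends mid-character; wait for more.
        if text.endswith("\ufffd"):
            continue
        yield text[emitted:]
        emitted = len(text)
```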
🐞 Bug fixes
- Fix turbomind import error on windows by @irexyc in #316
- Fix profile_serving hung issue by @lvhan028 in #344
📚 Documentations
- Fix readthedocs building by @RunningLeon in #321
- fix(kvint8): update doc by @tpoisonooo in #315
- Update FAQ for restful api by @AllentDan in #319
Full Changelog: v0.0.6...v0.0.7
LMDeploy Release V0.0.6
Highlights
- Support Qwen-7B with dynamic NTK scaling and logN scaling in turbomind (sketched after this list)
- Support tensor parallelism for W4A16
- Add OpenAI-like RESTful API
- Support Llama-2 70B 4-bit quantization
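For reference, the two long-context tricks in the first highlight can be written down compactly. This is a hedged sketch following the published Qwen-7B recipe, not LMDeploy's kernel code; `train_len`, `dim`, and the simple alpha schedule are assumptions:

```python
import math

def dynamic_ntk_base(base, seq_len, train_len=2048, dim=128):
    """Grow the RoPE base once the context exceeds the training length."""
    if seq_len <= train_len:
        return base
    alpha = seq_len / train_len  # simplest dynamic schedule; Qwen rounds it up
    return base * alpha ** (dim / (dim - 2))

def logn_scale(pos, train_len=2048):
    """logN scaling: scale the query at position `pos` by log_train_len(pos)."""
    return max(1.0, math.log(pos) / math.log(train_len))
```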
What's Changed
🚀 Features
- Profiling tool for huggingface and deepspeed models by @wangruohui in #161
- Support windows platform by @irexyc in #209
- Qwen-7B, dynamic NTK scaling and logN scaling support in turbomind by @lzhangzz in #230
- Add Restful API by @AllentDan in #223 (see the example call after this list)
- Support context decoding with DP in pytorch by @wangruohui in #193
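A hedged example of calling the new RESTful API from #223, as referenced in the feature list above. The route and payload assume an OpenAI-style chat-completions endpoint on the default port 23333; the exact paths in this early version may differ:

```python
import requests

resp = requests.post(
    "http://localhost:23333/v1/chat/completions",  # assumed OpenAI-style route
    json={
        "model": "internlm-chat-7b",
        "messages": [{"role": "user", "content": "Hello!"}],
    },
    timeout=60,
)
print(resp.json())
```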
💥 Improvements
- Support TP for W4A16 by @lzhangzz in #262
- Pass chat template args including meta_prompt to model (7785142) by @AllentDan in #225
- Enable the Gradio server to call inference services through the RESTful API by @AllentDan in #287
🐞 Bug fixes
- Adjust dependency of gradio server by @AllentDan in #236
- Implement `movmatrix` using warp shuffling for CUDA < 11.8 by @lzhangzz in #267
- Add 'accelerate' to requirement list by @lvhan028 in #261
- Fix building with CUDA 11.3 by @lzhangzz in #280
- Pad tok_embedding and output weights to make their shape divisible by TP by @lvhan028 in #285
- Fix llama2 70b & qwen quantization error by @pppppM in #273
- Import turbomind in gradio server only when it is needed by @AllentDan in #303
📚 Documentations
- Remove specified version in user guide by @lvhan028 in #241
- docs(quantization): update description by @tpoisonooo in #253 and #272
- Check-in FAQ by @lvhan028 in #256
- add readthedocs by @RunningLeon in #208
🌐 Other
- Update workflow for building docker image by @RunningLeon in #282
- Change to github-hosted runner for building docker image by @RunningLeon in #291
Known issues
- 4-bit Qwen-7b model inference failed. #307 is addressing this issue.
Full Changelog: v0.0.5...v0.0.6