# LMDeploy Release v0.6.0

## Highlights
- Optimize W4A16 quantized model inference by implementing GEMM in TurboMind Engine
- Add GPTQ-INT4 inference
- Support CUDA architectures SM70 and above, i.e., V100 and later GPUs
- Refactor PytorchEngine
- Employ CUDA graphs to boost inference performance by about 30%
- Support more models on the Huawei Ascend platform
- Upgrade `GenerationConfig`
  - Support `min_p` sampling
  - Add `do_sample=False` as the default option
  - Remove `EngineGenerationConfig` and merge it into `GenerationConfig`
- Support guided decoding
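The new `min_p` option keeps only tokens whose probability is at least `min_p` times that of the most likely token, then renormalizes before sampling. A minimal illustrative sketch of the filtering step (plain Python for clarity, not TurboMind's actual kernel):

```python
def min_p_filter(probs, min_p):
    """Zero out tokens below min_p * max(probs); illustrative only."""
    threshold = min_p * max(probs)
    return [p if p >= threshold else 0.0 for p in probs]

probs = [0.5, 0.3, 0.15, 0.05]
filtered = min_p_filter(probs, min_p=0.2)  # tokens below 0.1 are dropped
total = sum(filtered)
normalized = [p / total for p in filtered]  # renormalize before sampling
```

Unlike `top_p`, the cutoff scales with the model's confidence: a peaked distribution prunes aggressively, a flat one keeps more candidates.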
- Distinguish between the served model name and the name of the model's chat template
  Before:
  ```shell
  lmdeploy serve api_server /the/path/of/your/awesome/model \
      --model-name customized_chat_template.json
  ```
  After:
  ```shell
  lmdeploy serve api_server /the/path/of/your/awesome/model \
      --model-name "the served model name" \
      --chat-template customized_chat_template.json
  ```
## Breaking Changes
- TurboMind model converter. Please re-convert the models if you use this feature
- `EngineGenerationConfig` is removed. Please use `GenerationConfig` instead
- Chat template. Please use `--chat-template` to specify it
## What's Changed

### 🚀 Features
- support vlm custom image process parameters in openai input format by @irexyc in #2245
- New GEMM kernels for weight-only quantization by @lzhangzz in #2090
- Fix hidden size and support mistral nemo by @AllentDan in #2215
- Support custom logits processors by @AllentDan in #2329
- support openbmb/MiniCPM-V-2_6 by @irexyc in #2351
- Support phi3.5 for pytorch engine by @RunningLeon in #2361
- Add auto_gptq to lmdeploy lite by @AllentDan in #2372
- build(ascend): add Dockerfile for ascend aarch64 910B by @CyCle1024 in #2278
- Support guided decoding for pytorch backend by @AllentDan in #1856
- support min_p sampling parameter by @irexyc in #2420
- Refactor pytorch engine by @grimoire in #2104
- refactor pytorch engine(ascend) by @yao-fengchen in #2440
### 💥 Improvements
- Remove deprecated arguments from API and clarify model_name and chat_template_name by @lvhan028 in #1931
- Fix duplicated session_id when pipeline is used by multithreads by @irexyc in #2134
- remove eviction param by @grimoire in #2285
- Remove QoS serving by @AllentDan in #2294
- Support send tool_calls back to internlm2 by @AllentDan in #2147
- Add stream options to control usage by @AllentDan in #2313
- add device type for pytorch engine in cli by @RunningLeon in #2321
- Update error status_code to raise error in openai client by @AllentDan in #2333
- Change to use device instead of device-type in cli by @RunningLeon in #2337
- Add GEMM test utils by @lzhangzz in #2342
- Add environment variable to control SILU fusion by @lzhangzz in #2343
- Use single thread per model instance by @lzhangzz in #2339
- add cache to speed up docker building by @RunningLeon in #2344
- add max_prefill_token_num argument in CLI by @lvhan028 in #2345
- torch engine optimize prefill for long context by @grimoire in #1962
- Refactor turbomind (1/N) by @lzhangzz in #2352
- feat(server): enable `seed` parameter for openai compatible server. by @DearPlanet in #2353
- support do_sample parameter by @irexyc in #2375
- refactor TurbomindModelConfig by @lvhan028 in #2364
- import dlinfer before imageencoding by @jinminxi104 in #2413
- ignore *.pth when download model from model hub by @lvhan028 in #2426
- inplace logits process as default by @grimoire in #2427
- handle invalid images by @irexyc in #2312
- Split token_embs and lm_head weights by @irexyc in #2252
- build: update ascend dockerfile by @CyCle1024 in #2421
- build nccl in dockerfile for cuda11.8 by @RunningLeon in #2433
- automatically set max_batch_size according to the device when it is not specified by @lvhan028 in #2434
- rename the ascend dockerfile by @lvhan028 in #2403
- refactor ascend kernels by @yao-fengchen in #2355
### 🐞 Bug fixes
- enable run vlm with pytorch engine in gradio by @RunningLeon in #2256
- fix side-effect: failed to update tm model config with tm engine config by @lvhan028 in #2275
- Fix internvl2 template and update docs by @irexyc in #2292
- fix the issue missing dependencies in the Dockerfile and pip by @ColorfulDick in #2240
- Fix the way to get "quantization_config" from model's configuration by @lvhan028 in #2325
- fix(ascend): fix import error of pt engine in cli by @CyCle1024 in #2328
- Default rope_scaling_factor of TurbomindEngineConfig to None by @lvhan028 in #2358
- Fix the logic of update engine_config to TurbomindModelConfig for both tm model and hf model by @lvhan028 in #2362
- fix cache position for pytorch engine by @RunningLeon in #2388
- Fix /v1/completions batch order wrong by @AllentDan in #2395
- Fix some issues encountered by modelscope and community by @irexyc in #2428
- fix llama3 rotary in pytorch engine by @grimoire in #2444
- fix tensors on different devices when deploying MiniCPM-V-2_6 with tensor parallelism by @irexyc in #2454
- fix MultinomialSampling operator builder by @grimoire in #2460
- Fix initialization of runtime_min_p by @irexyc in #2461
- fix Windows compile error by @zhyncs in #2303
- fix: follow up #2303 by @zhyncs in #2307
### 📚 Documentation
- Reorganize the user guide and update the get_started section by @lvhan028 in #2038
- cancel support baichuan2 7b awq in pytorch engine by @grimoire in #2246
- Add user guide about slora serving by @AllentDan in #2084
- Reorganize the table of content of get_started by @lvhan028 in #2378
- fix inaccessible get_started user guide by @lvhan028 in #2410
- add Ascend get_started by @jinminxi104 in #2417
### 🌐 Other
- test prtest image update by @zhulinJulia24 in #2192
- Update python support version by @wuhongsheng in #2290
- [ci] benchmark react by @zhulinJulia24 in #2183
- bump version to v0.6.0a0 by @lvhan028 in #2371
- [ci] add daily test's coverage report by @zhulinJulia24 in #2401
- update actions/download-artifact to v4 to fix security issue by @lvhan028 in #2419
- bump version to v0.6.0 by @lvhan028 in #2445
## New Contributors
- @wuhongsheng made their first contribution in #2290
- @ColorfulDick made their first contribution in #2240
- @DearPlanet made their first contribution in #2353
- @jinminxi104 made their first contribution in #2413
Full Changelog: v0.5.3...v0.6.0