- Support Qwen3 VL
- Integrate FlashMLA
- Integrate DeepGEMM
- Async scheduling
- Use the FlashInfer sampling kernel
- Release gLLM to PyPI
- Optimize input data creation (Optimize Input data creation #24; Optimize input_data creation #75; Optimize get_slot_mapping #76; see the slot-mapping sketch after this list)
- CUDA graph: [1/N] create input buffer for model runner #134; [2/N] output buffer #135; [3/N] delay memory manager/KV cache init #136; [4/N] add profile run #137; [5/N] support CUDA graph #141; related: Separate sample operation #139, Simplify PP data transmission #140 (see the capture sketch after this list)
- Profile run at startup ([4/N] CUDA graph: add profile run #137)
- Integrate Token Throttling into vLLM ([RFC][PP] vllm-project/vllm#20298)
- Support VLM (Support for qwen2_5_vl #108)
- Support Deepseek V2/3 (Support Deepseek V3 #130; Support mla for deepseek V2/3 #104; Add support for Deepseek V2/3 #103)
- Quantization (Support fp8 moe #128; Fix fp8 moe #129; can support fp8 model? #91; Add support for quantization method fp8 (qwen3) #94)
- EP (Support Expert Parallelism #79)
- Support MoE models (Support MoE models #52)
- Preempt sequences when the KV cache is exhausted (see the preemption sketch after this list)
- Chunked prefill (Implement Chunked prefill 🙌 #23; see the chunked-prefill sketch after this list)
- TP (Support TP #72)
- PP (Introduce PP to gLLM #15; Change PP communication #16; Optimize PP #17)
- Add multi-node support (Add Multi node support #40)
- Tune MoE kernel configuration (Support TP #72)
- Abort requests (Support aborting requests #68)
- Upgrade PyTorch version (Upgrade torch to 2.7.0 and flashattention #71)
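
For the slot-mapping item above, here is a minimal sketch of what a vectorized `get_slot_mapping` can look like in a paged KV cache: each logical token position is mapped to a physical cache slot through the sequence's block table. The function name matches the issue title, but the signature and tensor layout are assumptions for illustration, not gLLM's actual code.

```python
import torch

def get_slot_mapping(block_table: torch.Tensor,
                     positions: torch.Tensor,
                     block_size: int) -> torch.Tensor:
    """Map logical token positions to physical KV-cache slots.

    block_table: (num_blocks,) physical block ids for one sequence.
    positions:   (num_tokens,) logical positions of the new tokens.
    Returns (num_tokens,) flat slot indices, computed as
    block_table[pos // block_size] * block_size + pos % block_size.
    """
    block_idx = positions // block_size   # which logical block
    block_off = positions % block_size    # offset inside that block
    return block_table[block_idx] * block_size + block_off

if __name__ == "__main__":
    # A sequence occupying physical blocks 7 and 3, block_size = 4.
    table = torch.tensor([7, 3])
    pos = torch.arange(6)                      # tokens 0..5
    print(get_slot_mapping(table, pos, 4))     # tensor([28, 29, 30, 31, 12, 13])
```

Doing this as one batched tensor expression, rather than a Python loop over tokens, is the usual way such a function gets optimized.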
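For the CUDA graph series above, a minimal PyTorch sketch of the capture pattern the [1/N] to [5/N] steps describe: static input/output buffers, warm-up ("profile") runs before capture, then capture and replay. The `model` here is a hypothetical stand-in for gLLM's model runner; the real decode step is what would be captured. Requires a CUDA device.

```python
import torch

# Hypothetical stand-in for the model runner's forward pass.
model = torch.nn.Linear(512, 512).cuda()

# [1/N]-[2/N]: a static input buffer the graph reads; the output buffer
# is whatever the captured forward produces.
static_in = torch.zeros(8, 512, device="cuda")

# [4/N]: warm-up runs on a side stream so allocations settle before capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        model(static_in)
torch.cuda.current_stream().wait_stream(s)

# [5/N]: capture one step into a CUDA graph.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_out = model(static_in)

# Replay: copy fresh inputs into the static buffer, then launch the whole
# captured kernel sequence with a single call.
static_in.copy_(torch.randn(8, 512, device="cuda"))
graph.replay()
print(static_out[0, :4])  # results land in the captured output buffer
```

The reason for [3/N] (delaying memory manager/KV cache init) is visible here: buffer addresses are baked into the graph at capture time, so everything the graph touches must be allocated, and stay fixed, before capture.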
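For the preemption item above, a toy sketch of recompute-style preemption: when the KV cache has no free blocks for a new allocation, a running sequence is evicted, its blocks are freed, and it is re-queued to be prefilled again later. Class and method names are illustrative assumptions, not gLLM's actual scheduler API.

```python
from collections import deque

class Scheduler:
    """Toy scheduler illustrating preemption when the KV cache is full."""

    def __init__(self, num_free_blocks: int):
        self.num_free_blocks = num_free_blocks
        self.running: list[str] = []         # seq ids currently decoding
        self.waiting: deque[str] = deque()   # seq ids to (re)prefill
        self.blocks_of: dict[str, int] = {}  # blocks held per sequence

    def allocate(self, seq_id: str, n_blocks: int) -> None:
        # Preempt until the request fits or nothing is left to evict.
        while self.num_free_blocks < n_blocks and self.running:
            victim = self.running.pop()          # newest running sequence
            self.num_free_blocks += self.blocks_of.pop(victim)
            self.waiting.appendleft(victim)      # recompute it later
        if self.num_free_blocks < n_blocks:
            raise RuntimeError("request larger than the whole KV cache")
        self.num_free_blocks -= n_blocks
        self.blocks_of[seq_id] = self.blocks_of.get(seq_id, 0) + n_blocks
        if seq_id not in self.running:
            self.running.append(seq_id)

sched = Scheduler(num_free_blocks=4)
sched.allocate("a", 3)
sched.allocate("b", 3)                       # forces preemption of "a"
print(sched.running, list(sched.waiting))    # ['b'] ['a']
```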
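And for the chunked-prefill item, the core idea in a few lines: split a long prompt into fixed-size chunks so each scheduler step prefills at most a token budget's worth, letting prefill interleave with ongoing decodes instead of monopolizing an iteration. The helper below is hypothetical, not gLLM's API.

```python
def chunk_prefill(prompt_len: int, chunk_size: int) -> list[tuple[int, int]]:
    """Split a prompt into (start, end) chunks of at most chunk_size tokens,
    one chunk per scheduler step."""
    return [(s, min(s + chunk_size, prompt_len))
            for s in range(0, prompt_len, chunk_size)]

# A 10-token prompt under a 4-token budget prefills over three steps.
print(chunk_prefill(10, 4))  # [(0, 4), (4, 8), (8, 10)]
```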