MoERL-Vision is the first project to apply reward-based reinforcement learning (GRPO) to finetune a Mixture-of-Experts (MoE) vision-language model, DeepSeek-VL2, for multimodal reasoning tasks. We focus on optimizing training efficiency via SGLang-accelerated rollouts while maintaining the model's reasoning capabilities.
Pipeline: Vision-Language Input → DeepSeek-VL2 MoE → GRPO Reward Optimization → Reasoning Output
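As a rough illustration of the GRPO reward-optimization step in this pipeline, the snippet below shows how group-relative advantages are computed from per-rollout rewards. This is a minimal sketch of the standard GRPO normalization; the variable and function names are ours, not from this codebase:

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """Normalize one prompt group's rollout rewards into advantages.

    GRPO replaces a learned value baseline with group statistics:
    each rollout's reward is centered and scaled by its group's mean/std.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: rewards for 4 rollouts sampled from the same image+question prompt.
rewards = torch.tensor([1.0, 0.0, 1.0, 0.5])  # e.g., accuracy + format rewards
print(grpo_advantages(rewards))
```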
- Enforcing Image Captions Significantly Improves Vision-Language Model Performance
- Motivation: Compared to LLMs, VLMs incorporate visual components, enabling them to leverage image information during reasoning.
- Implementation: Modify the prompts and the reinforcement learning reward feedback to require the model to output an image caption before its reasoning output (a reward sketch follows this list). For example:
  <caption>A right-angled triangle is shown where one leg forms an obtuse angle at vertex A while another side runs parallel to its hypotenuse. Within this diagram, three points labeled P, E, and Q have been marked such that they correspondingly intersect lines AD and EC along segments DB and BE.</caption> <think>xxxxxxxxxxxxxxxxxxxxxxx...</think> <answer>xxxxxxxxxxxxxxxxxxxxxxxx...</answer>
- Results: With the Qwen-7B-Instruct model and the GRPO algorithm, enforcing image captions led to significant performance improvements.
- Visualization: In the figure below, the blue curve shows the accuracy when using only <think> and <answer> tags, while the purple curve shows the accuracy when <caption> is also enforced.
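The caption-enforcing reward described above could be implemented along the following lines. This is a minimal sketch in the style of trl/Open-R1 reward functions; the regex, function name, and reward values are illustrative assumptions, not the project's actual implementation:

```python
import re

# Hypothetical format reward: score 1.0 only if the completion contains a
# non-empty <caption>, then <think>, then <answer> block, in that order.
PATTERN = re.compile(
    r"^<caption>.+?</caption>\s*<think>.+?</think>\s*<answer>.+?</answer>$",
    re.DOTALL,
)

def caption_format_reward(completions: list[str]) -> list[float]:
    """Return 1.0 for completions that follow the enforced caption format."""
    return [1.0 if PATTERN.match(c.strip()) else 0.0 for c in completions]

# Example usage:
good = "<caption>A triangle ABC.</caption> <think>Angles sum to 180.</think> <answer>60</answer>"
bad = "<think>No caption here.</think> <answer>42</answer>"
print(caption_format_reward([good, bad]))  # [1.0, 0.0]
```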
-
- First RL-Tuned MoE Vision-Language Model
- Pioneering reinforcement learning adaptation of DeepSeek-VL2-MoE (8x experts) on complex vision-language datasets (e.g., ScienceQA, VCR).
- SGLang-Optimized Rollouts
- Achieve 1.7× faster trajectory sampling by integrating SGLang with DeepSeek-VL2's official codebase (see the sketch below).
- Embedded evaluation loop in the trl framework
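For the SGLang-accelerated rollouts, trajectory sampling could be offloaded to SGLang's offline Engine roughly as sketched below. The model path, prompt, and sampling parameters are placeholders, and the real integration with the GRPO trainer is more involved:

```python
import sglang as sgl

# Hypothetical model path; the project targets DeepSeek-VL2.
llm = sgl.Engine(model_path="deepseek-ai/deepseek-vl2")

# One group of rollouts per prompt, as GRPO compares rollouts within a group.
prompts = ["Describe the figure, then reason and answer: what is angle A?"] * 4
sampling_params = {"temperature": 1.0, "max_new_tokens": 512}

# Batched offline generation; each output dict contains the sampled text.
outputs = llm.generate(prompts, sampling_params)
for out in outputs:
    print(out["text"][:80])
```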
Code will be released soon.
```bash
cd MoERL-Vision
pip install -r requirements.txt  # Requires CUDA 12.x and NVIDIA GPUs

# Install SGLang for accelerated rollouts
pip install "sglang[all]"
```
Our codebase builds upon R1-Multimodal-Journey and integrates open-source contributions from vLLM, Open-R1, and trl. We also extend our gratitude to DeepSeek-R1 and Qwen2.5-VL for their open-source techniques and base models, which have enabled us to further our exploration.
Core Contributors (equal contribution, listed alphabetically)
Bai Bizhe
Thanks to Professor Wenqi Shao and Qiaosheng Zhang for their help.