
Commit cd7c8a8

doc: update developer guide regarding mllms (#6138)
Signed-off-by: Xinyuan Tong <[email protected]>
Co-authored-by: XinyuanTong <[email protected]>
Co-authored-by: Xinyuan Tong <[email protected]>
1 parent 3e350a9 commit cd7c8a8

File tree

4 files changed: +84 -58 lines changed


docs/index.rst

Lines changed: 1 addition & 1 deletion

@@ -38,7 +38,7 @@ The core features include:
     :caption: Supported Models

     supported_models/generative_models.md
-    supported_models/vision_language_models.md
+    supported_models/multimodal_language_models.md
     supported_models/embedding_models.md
     supported_models/reward_models.md
     supported_models/support_new_models.md

docs/supported_models/multimodal_language_models.md

Lines changed: 28 additions & 0 deletions

@@ -0,0 +1,28 @@

# Multimodal Language Models

These models accept multimodal inputs (e.g., images and text) and generate text output. They augment language models with multimodal encoders.

## Example Launch Command

```shell
# The model path below is only an example; substitute any supported HF identifier or local path.
python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.2-11B-Vision-Instruct \
  --host 0.0.0.0 \
  --port 30000
```

## Supporting Metrics

| Model Family (Variants) | Example HuggingFace Identifier | Chat Template | Description |
|---|---|---|---|
| **Qwen-VL** (Qwen2 series) | `Qwen/Qwen2.5-VL-7B-Instruct` | `qwen2-vl` | Alibaba’s vision-language extension of Qwen; for example, Qwen2.5-VL (7B and larger variants) can analyze and converse about image content. |
| **DeepSeek-VL2** | `deepseek-ai/deepseek-vl2` | `deepseek-vl2` | Vision-language variant of DeepSeek (with a dedicated image processor), enabling advanced multimodal reasoning on image and text inputs. |
| **Janus-Pro** (1B, 7B) | `deepseek-ai/Janus-Pro-7B` | `janus-pro` | DeepSeek’s open-source multimodal model capable of both image understanding and generation. Janus-Pro employs a decoupled architecture for separate visual encoding paths, enhancing performance in both tasks. |
| **MiniCPM-V / MiniCPM-o** | `openbmb/MiniCPM-V-2_6` | `minicpmv` | MiniCPM-V (2.6, ~8B) supports image inputs, and MiniCPM-o adds audio/video; these multimodal LLMs are optimized for end-side deployment on mobile/edge devices. |
| **Llama 3.2 Vision** (11B) | `meta-llama/Llama-3.2-11B-Vision-Instruct` | `llama_3_vision` | Vision-enabled variant of Llama 3 (11B) that accepts image inputs for visual question answering and other multimodal tasks. |
| **LLaVA** (v1.5 & v1.6) | *e.g.* `liuhaotian/llava-v1.5-13b` | `vicuna_v1.1` | Open vision-chat models that add an image encoder to LLaMA/Vicuna (e.g. LLaMA2 13B) for following multimodal instruction prompts. |
| **LLaVA-NeXT** (8B, 72B) | `lmms-lab/llava-next-72b` | `chatml-llava` | Improved LLaVA models (with an 8B Llama3 version and a 72B version) offering enhanced visual instruction-following and accuracy on multimodal benchmarks. |
| **LLaVA-OneVision** | `lmms-lab/llava-onevision-qwen2-7b-ov` | `chatml-llava` | Enhanced LLaVA variant integrating Qwen as the backbone; supports multiple images (and even video frames) as inputs via an OpenAI Vision API-compatible format. |
| **Gemma 3 (Multimodal)** | `google/gemma-3-4b-it` | `gemma-it` | Gemma 3’s larger models (4B, 12B, 27B) accept images (each image encoded as 256 tokens) alongside text in a combined 128K-token context. |
| **Kimi-VL** (A3B) | `moonshotai/Kimi-VL-A3B-Instruct` | `kimi-vl` | Kimi-VL is a multimodal model that can understand and generate text from images. |
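
As a quick sanity check of the launch command above, the server can be queried through its OpenAI-compatible API. The snippet below is an illustrative sketch, not part of the committed file: it assumes the server above is running locally on port 30000, that the `openai` Python package is installed, and that the image URL is a placeholder.

```python
# Illustrative client for the server launched above (assumptions: localhost:30000,
# `openai` package installed, placeholder image URL).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    # "default" is commonly accepted; the served model path should also work here.
    model="default",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
                {"type": "text", "text": "Describe this image in one sentence."},
            ],
        }
    ],
    temperature=0,
)
print(response.choices[0].message.content)
```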

docs/supported_models/support_new_models.md

Lines changed: 55 additions & 29 deletions

@@ -1,40 +1,59 @@
  # How to Support New Models

- This document explains how to add support for new language models and vision-language models (VLMs) in SGLang. It also covers how to test new models and register external implementations.
+ This document explains how to add support for new language models and multimodal large language models (MLLMs) in SGLang. It also covers how to test new models and register external implementations.

  ## How to Support a new Language Model

  To support a new model in SGLang, you only need to add a single file under the [SGLang Models Directory](https://github.com/sgl-project/sglang/tree/main/python/sglang/srt/models). You can learn from existing model implementations and create a new file for your model. For most models, you should be able to find a similar model to start with (e.g., starting from Llama). Also refer to how to [port a Model from vLLM to SGLang](#port-a-model-from-vllm-to-sglang).

- ## How to Support a new Vision-Language model
+ ## How to Support a new Multimodal Large Language Model

- To support a new vision-language model (vLM) in SGLang, there are several key components in addition to the standard LLM support:
+ To support a new multimodal large language model (MLLM) in SGLang, there are several key components in addition to the standard LLM support (a rough skeleton is sketched after this list):

  1. **Register your new model as multimodal**:
-    Extend `is_multimodal_model` in [model_config.py](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/configs/model_config.py) to return `True` for your model.
+    Extend `is_multimodal_model` in [model_config.py](https://github.com/sgl-project/sglang/blob/0ab3f437aba729b348a683ab32b35b214456efc7/python/sglang/srt/configs/model_config.py#L561) to return `True` for your model.

- 2. **Process Images**:
-    Define a new `Processor` class that inherits from `BaseProcessor` and register this processor as your model’s dedicated processor. See [multimodal_processor.py](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/managers/multimodal_processor.py) for more details.
+ 2. **Register a new chat template**:
+    See [conversation.py](https://github.com/sgl-project/sglang/blob/86a779dbe9e815c02f71ea82574608f6eae016b5/python/sglang/srt/conversation.py).

- 3. **Handle Image Tokens**:
-    Implement a `pad_input_ids` function for your new model. In this function, image tokens in the prompt should be expanded and replaced with image-hashes so that SGLang can recognize different images when using `RadixAttention`.
+ 3. **Multimodal Data Processor**:
+    Define a new `Processor` class that inherits from `BaseMultimodalProcessor` and register this processor as your model’s dedicated processor. See [multimodal_processor.py](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/managers/multimodal_processor.py) for more details.

- 4. **Replace Vision Attention**:
-    Replace the multi-headed `Attention` of ViT with SGLang’s `VisionAttention`.
+ 4. **Handle Multimodal Tokens**:
+    Implement a `pad_input_ids` function for your new model. In this function, multimodal tokens in the prompt should be expanded (if necessary) and padded with multimodal-data hashes so that SGLang can recognize different multimodal data with `RadixAttention`.

- You can refer to [Qwen2VL](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/qwen2_vl.py) or other vLM implementations. These models demonstrate how to correctly handle both multimodal and textual inputs.
+ 5. **Adapt to Vision Attention**:
+    Adapt the multi-headed `Attention` of the ViT to SGLang’s `VisionAttention`.

- You should test the new vLM locally against Hugging Face models. See the [`mmmu`](https://github.com/sgl-project/sglang/tree/main/benchmark/mmmu) benchmark for an example.
+ You can refer to [Qwen2VL](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/qwen2_vl.py) or other MLLM implementations. These models demonstrate how to correctly handle both multimodal and textual inputs.
+
+ You should test the new MLLM locally against Hugging Face models. See the [`mmmu`](https://github.com/sgl-project/sglang/tree/main/benchmark/mmmu) benchmark for an example.
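
To make the checklist above more concrete, here is a rough, hypothetical skeleton of such a model file. The class name, import paths, and method signatures are assumptions for illustration only; copy the exact interfaces from an existing implementation such as `qwen2_vl.py`.

```python
# Hypothetical skeleton of a new MLLM under python/sglang/srt/models/ (illustration only;
# mirror an existing model such as qwen2_vl.py for the real signatures).
import torch
from torch import nn

# Assumed import paths; verify them against the SGLang version you are working on.
from sglang.srt.layers.attention.vision import VisionAttention  # step 5: ViT attention
from sglang.srt.model_executor.forward_batch_info import ForwardBatch


class MyNewVLMForConditionalGeneration(nn.Module):
    """Toy stand-in for a new multimodal model (the class name is hypothetical)."""

    def __init__(self, config, quant_config=None) -> None:
        super().__init__()
        self.config = config
        # A real model builds its vision tower (using VisionAttention) and its
        # language backbone here; both are omitted in this sketch.

    def pad_input_ids(self, input_ids: list, mm_inputs) -> list:
        # Step 4: expand the multimodal placeholder tokens and pad them with
        # per-item data hashes so RadixAttention can distinguish different
        # images/videos when matching cached prefixes. (Signature assumed.)
        return input_ids

    def forward(
        self,
        input_ids: torch.Tensor,
        positions: torch.Tensor,
        forward_batch: ForwardBatch,
    ) -> torch.Tensor:
        # Encode multimodal embeddings, merge them with text embeddings, then
        # run the language backbone. Omitted in this sketch.
        raise NotImplementedError


# SGLang discovers model classes in this module through EntryClass.
EntryClass = MyNewVLMForConditionalGeneration
```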

  ## Test the Correctness

  ### Interactive Debugging

  For interactive debugging, compare the outputs of Hugging Face/Transformers and SGLang. The following two commands should give the same text output and very similar prefill logits:

  - Get the reference output:
    ```bash
-   python3 scripts/playground/reference_hf.py --model-path [new model] --model-type {text,vlm}
+   python3 scripts/playground/reference_hf.py --model-path [new model] --model-type {text,mllm}
    ```
  - Get the SGLang output:
    ```bash

@@ -43,7 +62,10 @@ For interactive debugging, compare the outputs of Hugging Face/Transformers and

  ### Add the Model to the Test Suite

  To ensure the new model is well maintained, add it to the test suite by including it in the `ALL_OTHER_MODELS` list in the [test_generation_models.py](https://github.com/sgl-project/sglang/blob/main/test/srt/models/test_generation_models.py) file, then test the new model on your local machine and report the results on demonstrative benchmarks (GSM8K, MMLU, MMMU, MMMU-Pro, etc.) in your PR.

  This is the command to test a new model on your local machine:

@@ -53,26 +75,29 @@ ONLY_RUN=Qwen/Qwen2-1.5B python3 -m unittest test_generation_models.TestGenerati

  ## Port a Model from vLLM to SGLang

  The [vLLM Models Directory](https://github.com/vllm-project/vllm/tree/main/vllm/model_executor/models) is a valuable resource, as vLLM covers many models. SGLang reuses vLLM’s interface and some layers, making it easier to port models from vLLM to SGLang.

  To port a model from vLLM to SGLang:

  - Compare these two files for guidance:
    - [SGLang Llama Implementation](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/llama.py)
    - [vLLM Llama Implementation](https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llama.py)
  - The major differences include (a sketch of the attention swap follows this list):
    - **Replace vLLM’s `Attention` with `RadixAttention`** (ensure you pass `layer_id` to `RadixAttention`).
    - **Replace vLLM’s `LogitsProcessor` with SGLang’s `LogitsProcessor`.**
    - **Replace the multi-headed `Attention` of ViT with SGLang’s `VisionAttention`.**
    - **Replace other vLLM layers** (such as `RMSNorm`, `SiluAndMul`) with SGLang layers.
    - **Remove `Sample`.**
    - **Change the `forward()` functions** and add a `forward_batch()` method.
    - **Add `EntryClass`** at the end.
    - **Ensure that the new implementation uses only SGLang components** and does not rely on any vLLM components.
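
As referenced above, here is a small sketch of the attention swap from the first bullet. Constructor and forward arguments follow existing SGLang models such as `llama.py` but are assumptions here; verify against `sglang/srt/layers/radix_attention.py` for your version.

```python
# Sketch of the vLLM -> SGLang attention swap inside a decoder layer.
# Argument names and order are assumed from existing SGLang models (e.g. llama.py);
# check sglang/srt/layers/radix_attention.py before relying on them.
import torch
from torch import nn

from sglang.srt.layers.radix_attention import RadixAttention
from sglang.srt.model_executor.forward_batch_info import ForwardBatch


class MyAttention(nn.Module):
    def __init__(self, hidden_size: int, num_heads: int, num_kv_heads: int, layer_id: int):
        super().__init__()
        head_dim = hidden_size // num_heads
        # The vLLM equivalent would be: Attention(num_heads, head_dim, scaling, num_kv_heads=...).
        # RadixAttention additionally needs layer_id so it can address the radix KV cache.
        self.attn = RadixAttention(
            num_heads,
            head_dim,
            head_dim**-0.5,  # scaling
            num_kv_heads=num_kv_heads,
            layer_id=layer_id,
        )

    def forward(
        self, q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, forward_batch: ForwardBatch
    ) -> torch.Tensor:
        # forward_batch carries SGLang's batching and KV-cache metadata.
        return self.attn(q, k, v, forward_batch)
```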

  ## Registering an External Model Implementation

  In addition to the methods above, you can register your new model with the `ModelRegistry` before launching the server. This allows you to integrate your model without modifying the source code.

  For example:

@@ -101,4 +126,5 @@ launch_server(server_args)

  ---

- By following these guidelines, you can add support for new language models and vision-language models in SGLang and ensure they are thoroughly tested and easily integrated into the system.
+ By following these guidelines, you can add support for new language models and multimodal large language models in SGLang and ensure they are thoroughly tested and easily integrated into the system.
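
The `For example:` code itself falls outside the hunks shown above. Purely for orientation (this is not the file's own example; the import paths, the registry access pattern, and all model names are assumptions), external registration before launching the server can look roughly like this:

```python
# Rough sketch of registering an external model implementation before launching the
# server. Import paths and the registry access pattern are assumptions; the committed
# documentation contains the authoritative example.
from sglang.srt.models.registry import ModelRegistry
from sglang.srt.server_args import ServerArgs

# Hypothetical external implementation living outside the SGLang source tree.
from my_package.my_model import MyNewVLMForConditionalGeneration

# Map the HF architecture name to the external class before the server starts.
ModelRegistry.models["MyNewVLMForConditionalGeneration"] = MyNewVLMForConditionalGeneration

server_args = ServerArgs(model_path="my-org/my-new-vlm")  # placeholder model path
# launch_server(server_args)  # the launch_server import location varies across SGLang versions
```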

docs/supported_models/vision_language_models.md

Lines changed: 0 additions & 28 deletions
This file was deleted.
