| Model Family (Variants) | Example HuggingFace Identifier | Chat Template | Description |
|---|---|---|---|
|**Qwen-VL** (Qwen2 series) |`Qwen/Qwen2.5-VL-7B-Instruct`|`qwen2-vl`| Alibaba’s vision-language extension of Qwen; for example, Qwen2.5-VL (7B and larger variants) can analyze and converse about image content. |
|**DeepSeek-VL2**|`deepseek-ai/deepseek-vl2`|`deepseek-vl2`| Vision-language variant of DeepSeek (with a dedicated image processor), enabling advanced multimodal reasoning on image and text inputs. |
|**Janus-Pro** (1B, 7B) |`deepseek-ai/Janus-Pro-7B`|`janus-pro`| DeepSeek’s open-source multimodal model capable of both image understanding and generation. Janus-Pro employs a decoupled architecture for separate visual encoding paths, enhancing performance in both tasks. |
|**MiniCPM-V / MiniCPM-o**|`openbmb/MiniCPM-V-2_6`|`minicpmv`| MiniCPM-V (2.6, ~8B) supports image inputs, and MiniCPM-o adds audio/video; these multimodal LLMs are optimized for end-side deployment on mobile/edge devices. |
|**Llama 3.2 Vision** (11B) |`meta-llama/Llama-3.2-11B-Vision-Instruct`|`llama_3_vision`| Vision-enabled variant of Llama 3 (11B) that accepts image inputs for visual question answering and other multimodal tasks. |
|**LLaVA** (v1.5 & v1.6) |*e.g.* `liuhaotian/llava-v1.5-13b`|`vicuna_v1.1`| Open vision-chat models that add an image encoder to LLaMA/Vicuna (e.g. LLaMA2 13B) for following multimodal instruction prompts. |
|**LLaVA-NeXT** (8B, 72B) |`lmms-lab/llava-next-72b`|`chatml-llava`| Improved LLaVA models (with an 8B Llama3 version and a 72B version) offering enhanced visual instruction-following and accuracy on multimodal benchmarks. |
|**LLaVA-OneVision**|`lmms-lab/llava-onevision-qwen2-7b-ov`|`chatml-llava`| Enhanced LLaVA variant integrating Qwen as the backbone; supports multiple images (and even video frames) as inputs via an OpenAI Vision API-compatible format. |
|**Gemma 3 (Multimodal)**|`google/gemma-3-4b-it`|`gemma-it`| Gemma 3’s larger models (4B, 12B, 27B) accept images (each image encoded as 256 tokens) alongside text in a combined 128K-token context. |
|**Kimi-VL** (A3B) |`moonshotai/Kimi-VL-A3B-Instruct`|`kimi-vl`| Kimi-VL is a multimodal model that can understand and generate text from images. |
This document explains how to add support for new language models and multimodal large language models (MLLMs) in SGLang. It also covers how to test new models and register external implementations.

## How to Support a new Language Model

To support a new model in SGLang, you only need to add a single file under the [SGLang Models Directory](https://github.com/sgl-project/sglang/tree/main/python/sglang/srt/models). You can learn from existing model implementations and create a new file for your model. For most models, you should be able to find a similar model to start with (e.g., starting from Llama). Also refer to how to [port a Model from vLLM to SGLang](#port-a-model-from-vllm-to-sglang).
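
As a rough orientation, a new model file typically ends up looking like the sketch below. This is a minimal, hypothetical skeleton loosely modeled on the existing Llama implementation; the class name, constructor arguments, and import path are assumptions and may differ in your SGLang version.

```python
# python/sglang/srt/models/my_new_model.py -- hypothetical minimal skeleton.
import torch
from torch import nn

from sglang.srt.model_executor.forward_batch_info import ForwardBatch  # path may vary by version


class MyNewModelForCausalLM(nn.Module):
    def __init__(self, config, quant_config=None):
        super().__init__()
        self.config = config
        # Build embeddings, decoder layers (using SGLang layers such as
        # RadixAttention), and the LM head here.

    def forward(
        self,
        input_ids: torch.Tensor,
        positions: torch.Tensor,
        forward_batch: ForwardBatch,
    ) -> torch.Tensor:
        # Run the transformer over the batch and return the logits output.
        ...

    def load_weights(self, weights):
        # Map HuggingFace checkpoint tensor names onto this module's parameters.
        ...


# SGLang discovers the model implementation through this symbol at the end of the file.
EntryClass = MyNewModelForCausalLM
```
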
## How to Support a new Multimodal Large Language Model

To support a new multimodal large language model (MLLM) in SGLang, there are several key components in addition to the standard LLM support:

1. **Register your new model as multimodal**:
   Extend `is_multimodal_model` in [model_config.py](https://github.com/sgl-project/sglang/blob/0ab3f437aba729b348a683ab32b35b214456efc7/python/sglang/srt/configs/model_config.py#L561) to return `True` for your model.

2. **Register a new chat template**:
   See [conversation.py](https://github.com/sgl-project/sglang/blob/86a779dbe9e815c02f71ea82574608f6eae016b5/python/sglang/srt/conversation.py) for how existing templates are defined and registered (a sketch follows this list).

3. **Multimodal Data Processor**:
   Define a new `Processor` class that inherits from `BaseMultimodalProcessor` and register this processor as your model’s dedicated processor. See [multimodal_processor.py](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/managers/multimodal_processor.py) for more details.

4. **Handle Multimodal Tokens**:
   Implement a `pad_input_ids` function for your new model. In this function, multimodal tokens in the prompt should be expanded (if necessary) and padded with multimodal-data hashes so that SGLang can recognize different multimodal data with `RadixAttention` (see the sketch after this list).

5. **Adapt to Vision Attention**:
   Adapt the multi-headed `Attention` of the ViT to SGLang’s `VisionAttention`.
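
For step 2, the snippet below is a rough sketch of what registering a template can look like. It assumes the FastChat-style `Conversation`, `SeparatorStyle`, and `register_conv_template` helpers in `conversation.py`; the template name and field values are placeholders, and field names may differ between SGLang versions.

```python
from sglang.srt.conversation import Conversation, SeparatorStyle, register_conv_template

# Hedged sketch: the exact Conversation fields follow the FastChat-style dataclass
# that conversation.py is adapted from and may differ in your SGLang version.
register_conv_template(
    Conversation(
        name="my-mllm",  # placeholder template name for the new model
        system_message="You are a helpful assistant.",
        roles=("user", "assistant"),
        sep_style=SeparatorStyle.ADD_COLON_SINGLE,
        sep="\n",
    )
)
```
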
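
For step 4, the standalone function below illustrates the idea behind `pad_input_ids`. It is not SGLang’s actual interface (the real implementation receives SGLang-specific input structures); it only shows how placeholder tokens can be expanded and filled with hash-derived values so the prefix cache can tell different images apart.

```python
from typing import List


def pad_input_ids(
    input_ids: List[int],
    image_token_id: int,      # placeholder token id used in the prompt (hypothetical)
    image_hashes: List[int],  # one hash per image, in prompt order
    tokens_per_image: int,    # number of positions each image occupies
) -> List[int]:
    """Expand each image placeholder and fill it with hash-derived ids so that
    RadixAttention's prefix cache treats different images as different prefixes."""
    padded: List[int] = []
    image_idx = 0
    for token in input_ids:
        if token == image_token_id and image_idx < len(image_hashes):
            pad_value = image_hashes[image_idx] % (1 << 30)  # keep it in token-id range
            padded.extend([pad_value] * tokens_per_image)
            image_idx += 1
        else:
            padded.append(token)
    return padded
```
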

You can refer to [Qwen2VL](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/qwen2_vl.py) or other MLLM implementations. These models demonstrate how to correctly handle both multimodal and textual inputs.

You should test the new MLLM locally against Hugging Face models. See the [`mmmu`](https://github.com/sgl-project/sglang/tree/main/benchmark/mmmu) benchmark for an example.
## Test the Correctness
### Interactive Debugging

For interactive debugging, compare the outputs of Hugging Face/Transformers and SGLang. The following two commands should give the same text output and very similar prefill logits:
### Add the Model to the Test Suite

To ensure the new model is well maintained, add it to the test suite by including it in the `ALL_OTHER_MODELS` list in the [test_generation_models.py](https://github.com/sgl-project/sglang/blob/main/test/srt/models/test_generation_models.py) file. Test the new model on your local machine and report the results on demonstrative benchmarks (GSM8K, MMLU, MMMU, MMMU-Pro, etc.) in your PR.
This is the command to test a new model on your local machine:

## Port a Model from vLLM to SGLang

The [vLLM Models Directory](https://github.com/vllm-project/vllm/tree/main/vllm/model_executor/models) is a valuable resource, as vLLM covers many models. SGLang reuses vLLM’s interface and some layers, making it easier to port models from vLLM to SGLang. To port a model from vLLM to SGLang:

- **Replace vLLM’s `Attention` with `RadixAttention`** (ensure you pass `layer_id` to `RadixAttention`; see the sketch after this list).
- **Replace vLLM’s `LogitsProcessor` with SGLang’s `LogitsProcessor`.**
- **Replace the multi-headed `Attention` of ViT with SGLang’s `VisionAttention`.**
- **Replace other vLLM layers** (such as `RMSNorm`, `SiluAndMul`) with SGLang layers.
- **Remove `Sample`.**
- **Change the `forward()` functions** and add a `forward_batch()` method.
- **Add `EntryClass`** at the end.
- **Ensure that the new implementation uses only SGLang components** and does not rely on any vLLM components.
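
As an illustration of the first of these items, a ported attention module might follow the hedged sketch below. The constructor arguments mirror SGLang’s Llama implementation but may differ slightly between versions, and the module name is a placeholder.

```python
from torch import nn

from sglang.srt.layers.radix_attention import RadixAttention
from sglang.srt.model_executor.forward_batch_info import ForwardBatch  # path may vary by version


class MyAttention(nn.Module):  # hypothetical decoder-layer attention module
    def __init__(self, num_heads: int, head_dim: int, num_kv_heads: int, layer_id: int):
        super().__init__()
        # layer_id must be passed so each layer addresses its own slot in the KV cache.
        self.attn = RadixAttention(
            num_heads,
            head_dim,
            head_dim**-0.5,  # attention scaling factor
            num_kv_heads=num_kv_heads,
            layer_id=layer_id,
        )

    def forward(self, q, k, v, forward_batch: ForwardBatch):
        # Where vLLM's Attention consumed kv-cache/metadata arguments,
        # RadixAttention takes SGLang's ForwardBatch instead.
        return self.attn(q, k, v, forward_batch)
```
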
## Registering an External Model Implementation

In addition to the methods above, you can register your new model with the `ModelRegistry` before launching the server. This allows you to integrate your model without modifying the source code.
For example:
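
The snippet below is a minimal sketch of this flow; the `ModelRegistry.models` mapping, import paths, and the model/class names are assumptions that may vary with your SGLang version.

```python
from sglang.srt.entrypoints.http_server import launch_server
from sglang.srt.models.registry import ModelRegistry
from sglang.srt.server_args import ServerArgs

# Hypothetical external implementation living outside the sglang package.
from my_package.my_model import MyNewModelForCausalLM

# Key by the architecture name that appears in the checkpoint's config.json.
ModelRegistry.models["MyNewModelForCausalLM"] = MyNewModelForCausalLM

server_args = ServerArgs(model_path="my-org/my-new-model")  # hypothetical model path
launch_server(server_args)
```
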
---

By following these guidelines, you can add support for new language models and multimodal large language models in SGLang and ensure they are thoroughly tested and easily integrated into the system.