[model] Add MiniCPM-V 4.6 model support #137
Code Review
This pull request introduces support for the MiniCPM-V-4.6 (minicpmv4_6) model, including its configuration parsing, model registration, and bridge/vision implementations. The feedback points out potential dimension mismatch bugs when num_beams > 1 during inference, where pixel_values and pixel_values_videos are sliced to [:1] but their corresponding target_sizes and target_sizes_videos are not. Applying consistent slicing to the target sizes is recommended to prevent runtime errors.
    vision_output = self.model_cls.get_image_features(
        self, pixel_values[:1].to(dtype=self.vision_tower.dtype), target_sizes)
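When num_beams > 1 (for example, when running inference/generation with beam search or a batch size > 1), pixel_values is sliced to pixel_values[:1] (i.e., a batch size of 1), but target_sizes is not sliced accordingly. This can cause get_image_features to raise an error or produce incorrect results because its input dimensions no longer match. I suggest also slicing target_sizes with [:1] so that it stays consistent with pixel_values[:1].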
Suggested change:

    - vision_output = self.model_cls.get_image_features(
    -     self, pixel_values[:1].to(dtype=self.vision_tower.dtype), target_sizes)
    + vision_output = self.model_cls.get_image_features(
    +     self, pixel_values[:1].to(dtype=self.vision_tower.dtype), target_sizes[:1] if target_sizes is not None else None)
    vision_output = self.model_cls.get_video_features(
        self, pixel_values_videos[:1].to(dtype=self.vision_tower.dtype), target_sizes_videos)
As with the image path above, when num_beams > 1, pixel_values_videos is sliced to pixel_values_videos[:1], but target_sizes_videos is not sliced accordingly. This can cause get_video_features to raise an error because its input dimensions no longer match. I suggest also slicing target_sizes_videos with [:1] to keep the dimensions consistent.
Suggested change:

    - vision_output = self.model_cls.get_video_features(
    -     self, pixel_values_videos[:1].to(dtype=self.vision_tower.dtype), target_sizes_videos)
    + vision_output = self.model_cls.get_video_features(
    +     self, pixel_values_videos[:1].to(dtype=self.vision_tower.dtype), target_sizes_videos[:1] if target_sizes_videos is not None else None)
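The guard proposed in the two suggestions above can be sketched as a small helper. This is only an illustration: `first_item` is a hypothetical name, the inputs are plain Python lists standing in for tensors, and whether `target_sizes` actually carries one row per beam is not established in this thread.

```python
def first_item(batch):
    """Return a length-1 slice of `batch`, or None when the input is absent."""
    return batch[:1] if batch is not None else None


# Hypothetical stand-in data: three identical beam copies of one image.
pixel_values = [["img"], ["img"], ["img"]]
target_sizes = [(2, 2), (2, 2), (2, 2)]

sliced_pixels = first_item(pixel_values)
sliced_sizes = first_item(target_sizes)

# Both inputs now have the same leading length.
assert len(sliced_pixels) == len(sliced_sizes) == 1
# A missing optional input passes through unchanged.
assert first_item(None) is None
```

The helper only makes the two leading dimensions agree; it does not decide whether slicing `target_sizes` is the right behavior for the model.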
This is exactly how it is written in the transformers source; see https://github.com/huggingface/transformers/blob/main/src/transformers/models/minicpmv4_6/modeling_minicpmv4_6.py#L740

    if pixel_values is not None and self.config.image_token_id is not None:
        # Pixels are always `1` in first dim due to NaViT packing, and we don't
        # want to waste compute processing the same image `num_beams` times. Hack until
        # @raushan adds support for encoding images once same waay as in enc-dec models
        num_beams = pixel_values.shape[0]
        vision_output = self.get_image_features(pixel_values[:1], target_sizes, downsample_mode=downsample_mode)
        image_features = (
            torch.cat(vision_output.pooler_output, dim=0)
            .to(device=inputs_embeds.device, dtype=inputs_embeds.dtype)
            .repeat(num_beams, 1)
        )
        mask = self.get_placeholder_mask(input_ids, inputs_embeds, image_features, self.config.image_token_id)
        inputs_embeds = inputs_embeds.masked_scatter(mask, image_features)
    if pixel_values_videos is not None and self.config.video_token_id is not None:
        num_beams = pixel_values_videos.shape[0]
        vision_output = self.get_video_features(
            pixel_values_videos[:1], target_sizes_videos, downsample_mode=downsample_mode
        )
        video_features = (
            torch.cat(vision_output.pooler_output, dim=0)
            .to(device=inputs_embeds.device, dtype=inputs_embeds.dtype)
            .repeat(num_beams, 1)
        )
        mask = self.get_placeholder_mask(input_ids, inputs_embeds, video_features, self.config.video_token_id)
        inputs_embeds = inputs_embeds.masked_scatter(mask, video_features)
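To make the "encode once, repeat per beam, scatter into placeholders" pattern in the quoted code easier to follow, here is a dependency-free sketch that uses nested lists in place of tensors. Everything in it is an assumption for illustration: `encode_once`, `scatter_features`, and `IMAGE_TOKEN_ID` are hypothetical names, and the "encoder" just sums each patch. It is not the transformers implementation.

```python
IMAGE_TOKEN_ID = 99  # hypothetical placeholder token id


def encode_once(patches):
    """Hypothetical stand-in for the vision encoder: one 2-d feature per patch."""
    return [[float(sum(p))] * 2 for p in patches]


def scatter_features(input_ids, embeds, features, token_id):
    """Fill placeholder positions with features in row-major order,
    which is how masked_scatter consumes its source under a boolean mask."""
    n_slots = sum(ids.count(token_id) for ids in input_ids)
    if n_slots != len(features):
        raise ValueError(f"{n_slots} placeholders but {len(features)} features")
    source = iter(features)
    return [
        [next(source) if tok == token_id else vec for tok, vec in zip(ids, row)]
        for ids, row in zip(input_ids, embeds)
    ]


num_beams = 3
patches = [[1, 2], [3, 4]]                    # one packed image, two patches
pixel_values = [patches] * num_beams          # the same image copied per beam
input_ids = [[5, IMAGE_TOKEN_ID, IMAGE_TOKEN_ID, 6]] * num_beams
embeds = [[[0.0, 0.0] for _ in ids] for ids in input_ids]

features = encode_once(pixel_values[0])       # encode only the first copy
repeated = features * num_beams               # analogue of .repeat(num_beams, 1)
merged = scatter_features(input_ids, embeds, repeated, IMAGE_TOKEN_ID)

# Every beam ends up with the same image features in its placeholder slots.
assert merged[0] == merged[1] == merged[2]
```

The count check in `scatter_features` plays the role that a placeholder-mask validation would: if the repeated features did not match the number of placeholder tokens across all beams, the scatter would be rejected rather than silently misaligned.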
please run:

thanks!
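This PR adds support for the MiniCPM-V 4.6 multimodal model to Megatron-Core, including bridge implementations for the vision encoder and the vision-language projector (merger), along with the corresponding configuration handling and test validation.

Main changes

1. Model bridge implementation

Added the MiniCPMV46Vit and MiniCPMV46Bridge classes:

- MiniCPMV46Vit: inherits from the HuggingFaceVit base class
  - module_mapping = {'model.vision_tower': 'vision_tower', 'model.merger': 'merger'}, aligned with the attribute names in the HF source to simplify automatic weight conversion
  - prepare_model(): instantiates MiniCPMV4_6VisionModel and MiniCPMV4_6Merger from HF transformers and keeps the dtype consistent
  - get_inputs_embeds(): handles the text-only, image, and video input modes in one place, injecting visual features into the text embeddings via masked_scatter
  - The patch_hf_config() context manager ensures that HF source methods run in the correct configuration context
- MiniCPMV46Bridge: inherits from Qwen3NextGDNBridgeMixin (model.language_model.layers, embed_tokens.weight, norm.weight, etc.)

2. Configuration handling

In parser.py, added for minicpmv4_6:

- qk_layernorm support (consistent with Qwen3.5 behavior)
- layernorm_zero_centered_gamma=True, attention_output_gate=True, experimental_attention_variant='gated_delta_net'

3. Registration and imports

- Qwen3NextLoader as the loader, registered to ModelType.minicpmv4_6 and mapped to the HF model type ['minicpmv4_6']
- Added the module import in gpts/__init__.py

4. Tests

Added the test_minicpmv4_6() test function, which verifies the numerical equivalence of the HF model and the Megatron-Core model. All differences are far below the 0.1 threshold, which confirms that the implementation is correct.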
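As a rough sketch of what a numerical-equivalence check against a 0.1 threshold can look like, the snippet below compares two nested lists element by element. It is not the PR's `test_minicpmv4_6()`; `max_abs_diff` is a hypothetical helper, and the values are made up for illustration rather than taken from the PR's results.

```python
def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two equally shaped nested lists."""
    if isinstance(a, (list, tuple)):
        if len(a) != len(b):
            raise ValueError("shape mismatch")
        return max((max_abs_diff(x, y) for x, y in zip(a, b)), default=0.0)
    return abs(a - b)


THRESHOLD = 0.1  # the threshold quoted in the description above

# Made-up outputs for illustration only; not measured values from this PR.
hf_logits = [[0.10, -1.25], [2.00, 0.50]]
mcore_logits = [[0.11, -1.24], [1.98, 0.52]]

diff = max_abs_diff(hf_logits, mcore_logits)
assert diff < THRESHOLD
```

A maximum-absolute-difference check is only one possible criterion; a real test might also compare mean differences or use relative tolerances.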
The detailed test result log is as follows: