[model] Add MiniCPM-V 4.6 model support #137

Merged: Jintao-Huang merged 1 commit into modelscope:main from randydl:dev on Jun 28, 2026

randydl (Contributor) commented Jun 26, 2026

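This PR adds support for the MiniCPM-V 4.6 multimodal model to Megatron-Core, including bridge implementations for the vision encoder and the vision-language projector (merger), along with the corresponding config handling and test validation.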

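Main changes

1. Model bridge implementation

Adds the MiniCPMV46Vit and MiniCPMV46Bridge classes:

  • MiniCPMV46Vit: inherits from the HuggingFaceVit base class

    • Defines module_mapping = {'model.vision_tower': 'vision_tower', 'model.merger': 'merger'}, matching the attribute names in the HF source so that weights can be converted automatically
    • prepare_model(): instantiates MiniCPMV4_6VisionModel and MiniCPMV4_6Merger from HF transformers, keeping the dtype consistent
    • get_inputs_embeds(): handles the three input modes (text-only, image, video) in one place and injects the visual features into the text embeddings via masked_scatter
    • Uses the patch_hf_config() context manager so that HF source methods run under the correct config context
  • MiniCPMV46Bridge: inherits from Qwen3NextGDNBridgeMixin

    • Reuses the Qwen3.5 weight conversion logic
    • Sets the correct HF state-dict prefixes (model.language_model.layers, embed_tokens.weight, norm.weight, etc.)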
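
As a minimal sketch of what a mapping of this shape expresses, the snippet below renames HF parameter keys by prefix. The helper `remap_key`, the `visual` target prefix (borrowed from the `freeze_parameters` line in the log below), and the example parameter names are assumptions for illustration only; this is not the bridge's actual conversion code.

```python
from typing import Dict, Optional

# Mapping quoted in the PR description: HF attribute path -> local module name.
MODULE_MAPPING: Dict[str, str] = {
    "model.vision_tower": "vision_tower",
    "model.merger": "merger",
}


def remap_key(hf_key: str, mapping: Dict[str, str],
              target_prefix: str = "visual") -> Optional[str]:
    """Hypothetical helper: rewrite an HF parameter name using the mapping.

    Returns None when the key does not belong to any mapped module.
    """
    for hf_prefix, local_name in mapping.items():
        if hf_key == hf_prefix or hf_key.startswith(hf_prefix + "."):
            suffix = hf_key[len(hf_prefix):]  # keeps the leading "." if any
            return f"{target_prefix}.{local_name}{suffix}"
    return None


# Illustrative (made-up) parameter names:
vit_key = remap_key("model.vision_tower.embeddings.weight", MODULE_MAPPING)
llm_key = remap_key("model.language_model.norm.weight", MODULE_MAPPING)
```

Keys under the language model return None here because they are handled by the bridge's own prefix settings rather than by module_mapping.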

2. Config handling

In parser.py, adds the following for minicpmv4_6:

  • qk_layernorm support (consistent with Qwen3.5 behavior)
  • GDN config: layernorm_zero_centered_gamma=True, attention_output_gate=True, experimental_attention_variant='gated_delta_net'
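
The flags above could be applied along the following lines. This is a sketch only: the function name `apply_minicpmv4_6_overrides` and the plain-dict config are hypothetical, and setting `qk_layernorm=True` is an assumption about what "qk_layernorm support" means; the real parser.py may be structured differently.

```python
from typing import Any, Dict


def apply_minicpmv4_6_overrides(hf_model_type: str,
                                config: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: return a copy of `config` with the flags
    listed in the PR description set for minicpmv4_6."""
    out = dict(config)  # do not mutate the caller's dict
    if hf_model_type == "minicpmv4_6":
        out.update(
            qk_layernorm=True,  # assumption: enabled, as for Qwen3.5
            layernorm_zero_centered_gamma=True,
            attention_output_gate=True,
            experimental_attention_variant="gated_delta_net",
        )
    return out


cfg = apply_minicpmv4_6_overrides("minicpmv4_6", {"num_layers": 24})
```

Other model types pass through unchanged, so the override stays local to this model.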

3. Registration and imports

  • Uses Qwen3NextLoader as the loader, registered under ModelType.minicpmv4_6 and mapped to the HF model type ['minicpmv4_6']
  • Adds the module import in gpts/__init__.py

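4. Tests

Adds the test_minicpmv4_6() test function, which verifies numerical equivalence between the HF model and the Megatron-Core model:

| Metric | Value |
| --- | --- |
| mean_diff | 0.00198 |
| max_diff | 0.0957 |
| mean_diff (with loss) | 0.00162 |
| max_diff (with loss) | 0.0861 |
| token_diff | 0 |

Every difference is below the 0.1 threshold (mean_diff (with loss), the value the log asks to check, is 0.00162; the largest value, max_diff, is 0.0957) and token_diff is 0, which validates the implementation.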
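
For readers unfamiliar with these metrics, here is one plausible way to compute them, written in plain Python over nested lists. It assumes mean_diff and max_diff are the mean and maximum absolute elementwise differences between the two models' logits, and that token_diff counts positions whose argmax token differs. The function names and toy values are illustrative and are not taken from the test code.

```python
from typing import List, Sequence, Tuple


def logit_diffs(hf_logits: Sequence[Sequence[float]],
                mg_logits: Sequence[Sequence[float]]) -> Tuple[float, float]:
    """Return (mean_diff, max_diff) of absolute elementwise differences."""
    diffs: List[float] = [
        abs(a - b)
        for hf_row, mg_row in zip(hf_logits, mg_logits)
        for a, b in zip(hf_row, mg_row)
    ]
    return sum(diffs) / len(diffs), max(diffs)


def token_diff(hf_logits: Sequence[Sequence[float]],
               mg_logits: Sequence[Sequence[float]]) -> int:
    """Count positions where the two models pick a different argmax token."""
    def argmax(row: Sequence[float]) -> int:
        return max(range(len(row)), key=row.__getitem__)
    return sum(argmax(a) != argmax(b) for a, b in zip(hf_logits, mg_logits))


# Toy logits: 2 positions, vocabulary of 3 (made-up values).
hf = [[0.1, 2.0, -1.0], [0.5, 0.4, 3.0]]
mg = [[0.1, 2.05, -1.0], [0.5, 0.4, 2.9]]
mean_d, max_d = logit_diffs(hf, mg)
n_token_diff = token_diff(hf, mg)
```

A token_diff of 0 means both models select the same token at every position even though the raw logits differ slightly.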

The detailed test log follows:

[INFO:swift] Conv3d patched successfully
[INFO:swift] Successfully registered `/nas_user/app.e0016372/projects/ms-swift/swift/dataset/data/dataset_info.json`.
[INFO:swift] rank: 0, local_rank: 0, world_size: 1, local_world_size: 1
[transformers] `torch_dtype` is deprecated! Use `dtype` instead!
[INFO:swift] Setting args.lazy_tokenize: True
[INFO:swift] args.output_dir: `/nas_user/app.e0016372/projects/mcore-bridge/MiniCPM-V-4.6-mcore`
[INFO:swift] args: ExportArguments(use_ray=False, ray_exp_name=None, device_groups=None, model='/nas_train/app.e0016372/models/openbmb/MiniCPM-V-4.6', model_type='minicpmv4_6', model_revision=None, task_type='causal_lm', torch_dtype=torch.bfloat16, attn_impl=None, experts_impl=None, new_special_tokens=[], num_labels=None, problem_type=None, rope_scaling=None, device_map=None, max_memory={}, max_model_len=None, local_repo_path=None, init_strategy=None, template='minicpmv4_6', system=None, max_length=262144, truncation_strategy='delete', max_pixels=None, agent_template=None, norm_bbox=None, use_chat_template=True, padding_side='right', padding_free=False, loss_scale='default+ignore_empty_think', sequence_parallel_size=1, is_binary_loss_scale=None, template_backend='swift', response_prefix=None, enable_thinking=None, preserve_thinking=None, add_non_thinking_prefix=True, disable_ignore_empty_think=False, dataset=[], val_dataset=[], cached_dataset=[], cached_val_dataset=[], split_dataset_ratio=0.0, data_seed=42, dataset_num_proc=1, load_from_cache_file=False, dataset_shuffle=True, val_dataset_shuffle=False, streaming=False, interleave_prob=None, stopping_strategy='first_exhausted', shuffle_buffer_size=1000, download_mode='reuse_dataset_if_exists', columns={}, strict=False, remove_unused_columns=True, disable_auto_column_mapping=False, model_name=None, model_author=None, custom_dataset_info=[], quant_method=None, quant_bits=None, hqq_axis=None, bnb_4bit_compute_dtype=torch.bfloat16, bnb_4bit_quant_type='nf4', bnb_4bit_use_double_quant=True, bnb_4bit_quant_storage=None, max_new_tokens=None, temperature=None, top_k=None, top_p=None, repetition_penalty=None, num_beams=1, stream=False, stop_words=[], logprobs=False, top_logprobs=None, structured_outputs_regex=None, tuner_backend='peft', tuner_type='lora', adapters=[], external_plugins=[], custom_register_path=[], seed=42, model_kwargs={}, enable_npu_model_patch=True, load_args=True, load_data_args=False, packing=False, 
packing_length=None, packing_num_proc=1, lazy_tokenize=True, use_hf=False, hub_token=None, ddp_timeout=18000000, ddp_backend=None, ignore_args_error=False, use_swift_lora=False, merge_lora=False, safe_serialization=True, max_shard_size='5GB', output_dir='/nas_user/app.e0016372/projects/mcore-bridge/MiniCPM-V-4.6-mcore', quant_n_samples=256, quant_batch_size=1, group_size=128, to_cached_dataset=False, template_mode='train', to_ollama=False, to_mcore=True, to_hf=False, mcore_model=None, mcore_adapter=None, thread_count=None, test_convert_precision=True, test_convert_dtype=torch.float32, push_to_hub=False, hub_model_id=None, hub_private_repo=False, commit_message='update files', to_peft_format=False, exist_ok=True)
[INFO:swift] Global seed set to 42
[INFO:swift] Start time of running main: 2026-06-26 16:18:57.172033
[INFO:swift] swift.__version__: 4.4.0.dev0
[INFO:swift] mcore_bridge.__version__: 1.6.0.dev0
[INFO:swift] megatron.core.__version__: 0.16.2
[INFO:mcore_bridge] Setting USE_MCORE_GDN: True. You can adjust this hyperparameter through the environment variable: `USE_MCORE_GDN`.
[INFO:swift] Patch tp_plan.
[INFO:swift] model_kwargs: {'device_map': 'auto', 'dtype': torch.bfloat16, 'experts_implementation': 'eager'}
Loading weights: 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 779/779 [00:00<00:00, 1184.42it/s]
[INFO:swift] default_system: None
[INFO:swift] max_length: 262144
[INFO:swift] response_prefix: None
[INFO:swift] agent_template: None
[INFO:swift] norm_bbox: norm1000
[INFO:swift] Setting ROOT_IMAGE_DIR: None. You can adjust this hyperparameter through the environment variable: `ROOT_IMAGE_DIR`.
[INFO:swift] Setting downsample_mode: 16x. You can adjust this hyperparameter through the environment variable: `DOWNSAMPLE_MODE`.
[INFO:swift] Setting max_slice_nums: 9. You can adjust this hyperparameter through the environment variable: `MAX_SLICE_NUMS`.
[INFO:swift] Setting video_max_slice_nums: 1. You can adjust this hyperparameter through the environment variable: `VIDEO_MAX_SLICE_NUMS`.
[INFO:swift] Setting max_num_frames: 128. You can adjust this hyperparameter through the environment variable: `MAX_NUM_FRAMES`.
[INFO:swift] Setting stack_frames: 1. You can adjust this hyperparameter through the environment variable: `STACK_FRAMES`.
[INFO:swift] Setting torch_dtype: torch.bfloat16
[INFO:swift] freeze_parameters: ['visual.vision_tower', 'visual.merger']
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[INFO:swift] TP: 1, PP: 1, VPP: None, CP: 1, EP: 1, ETP: 1
[INFO:swift] Setting random seeds to 42.
/nas_user/app.e0016372/projects/Megatron-LM/megatron/core/transformer/transformer_config.py:1705: UserWarning: full scope is deprecated. Use empty cuda_graph_scope to capture the whole layer.
  warnings.warn(
[INFO:swift] Megatron model created successfully.
Loading: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 24/24 [00:00<00:00, 176.16it/s]
[INFO:swift] Successfully transferred HF model weights to MG model.
[INFO:swift] n_parameter: 689
[INFO:swift] total_sum: 19327673.06329918
[INFO:swift] zero_count: 0
[INFO:swift] n_parameter: 779
[INFO:swift] total_sum: 19327673.104818344
[INFO:swift] zero_count: 0
You shouldn't move a model that is dispatched using accelerate hooks.
/nas_user/app.e0016372/miniforge3/envs/swift/lib/python3.12/site-packages/torch/_dynamo/variables/functions.py:1692: UserWarning: Dynamo detected a call to a `functools.lru_cache`-wrapped function. Dynamo ignores the cache wrapper and directly traces the wrapped function. Silent incorrectness is only a *potential* risk, not something we have observed. Enable TORCH_LOGS="+dynamo" for a DEBUG stack trace.
  torch._dynamo.utils.warn_once(msg)
/nas_user/app.e0016372/miniforge3/envs/swift/lib/python3.12/site-packages/torch/_dynamo/variables/functions.py:1598: UserWarning: Dynamo does not know how to trace the builtin `transformer_engine_torch.pybind11_detail_function_record_v1_system_libstdcpp_gxx_abi_1xxx_use_cxx11_abi_1.rmsnorm_fwd.` This function is either a Python builtin (e.g. _warnings.warn) or a third-party C/C++ Python extension (perhaps created with pybind).
If it is a Python builtin, please file an issue on GitHub so the PyTorch team can add support for it and see the next case for a workaround.
If it is a third-party C/C++ Python extension, please either wrap it into a PyTorch-understood custom operator (see https://pytorch.org/tutorials/advanced/custom_ops_landing_page.html for more details) or, if it is traceable, use `torch.compiler.allow_in_graph`.
  torch._dynamo.utils.warn_once(explanation + "\n" + "\n".join(hints))
token_mean_diff: tensor([[0.0006, 0.0017, 0.0010, 0.0018, 0.0024, 0.0025, 0.0029, 0.0020, 0.0012,
         0.0010, 0.0014, 0.0011, 0.0011, 0.0013, 0.0012, 0.0011, 0.0011, 0.0013,
         0.0012, 0.0016, 0.0012, 0.0011, 0.0016, 0.0016, 0.0014, 0.0016, 0.0014,
         0.0011, 0.0023, 0.0016, 0.0013, 0.0011, 0.0011, 0.0011, 0.0012, 0.0012,
         0.0012, 0.0011, 0.0019, 0.0016, 0.0013, 0.0016, 0.0015, 0.0014, 0.0017,
         0.0015, 0.0013, 0.0014, 0.0013, 0.0020, 0.0013, 0.0015, 0.0014, 0.0013,
         0.0022, 0.0017, 0.0013, 0.0016, 0.0012, 0.0014, 0.0013, 0.0017, 0.0015,
         0.0016, 0.0017, 0.0015, 0.0019, 0.0011, 0.0011, 0.0015, 0.0015, 0.0037,
         0.0030, 0.0026, 0.0019, 0.0015, 0.0014, 0.0021, 0.0013, 0.0015, 0.0020,
         0.0016, 0.0016, 0.0019, 0.0012, 0.0016, 0.0018, 0.0014, 0.0014, 0.0013,
         0.0013, 0.0027, 0.0012, 0.0016, 0.0016, 0.0021, 0.0044, 0.0017, 0.0021,
         0.0020, 0.0014, 0.0029, 0.0019, 0.0022, 0.0031, 0.0017, 0.0027, 0.0030,
         0.0018, 0.0027, 0.0015, 0.0017, 0.0023, 0.0022, 0.0019, 0.0019, 0.0030,
         0.0046, 0.0016, 0.0016, 0.0067, 0.0018, 0.0017, 0.0022, 0.0018, 0.0019,
         0.0015, 0.0020, 0.0015, 0.0018, 0.0014, 0.0021, 0.0046, 0.0018, 0.0016,
         0.0011, 0.0023, 0.0026, 0.0043, 0.0018, 0.0022, 0.0016, 0.0018, 0.0017,
         0.0019, 0.0025, 0.0019, 0.0020, 0.0015, 0.0022, 0.0015, 0.0013, 0.0017,
         0.0014, 0.0025, 0.0018, 0.0016, 0.0014, 0.0018, 0.0019, 0.0015, 0.0014,
         0.0017, 0.0013, 0.0017, 0.0056, 0.0017, 0.0021, 0.0021, 0.0017, 0.0017,
         0.0026, 0.0030, 0.0017, 0.0014, 0.0017, 0.0030, 0.0015, 0.0018, 0.0021,
         0.0019, 0.0026, 0.0053, 0.0021, 0.0037, 0.0023, 0.0028, 0.0029, 0.0015,
         0.0015, 0.0022, 0.0018, 0.0015, 0.0017, 0.0010, 0.0017, 0.0014, 0.0011,
         0.0020, 0.0030, 0.0026, 0.0015, 0.0019, 0.0017, 0.0026, 0.0027, 0.0024,
         0.0017, 0.0021, 0.0021, 0.0018, 0.0020, 0.0016, 0.0028, 0.0013, 0.0027,
         0.0016, 0.0021, 0.0029, 0.0024, 0.0018, 0.0017, 0.0014, 0.0019, 0.0015,
         0.0013, 0.0025, 0.0015, 0.0015, 0.0014, 0.0018, 0.0016, 0.0019, 0.0031,
         0.0027, 0.0018, 0.0032, 0.0018, 0.0015, 0.0025, 0.0019, 0.0030, 0.0018,
         0.0026, 0.0018, 0.0021, 0.0026, 0.0023, 0.0022, 0.0042, 0.0041, 0.0015,
         0.0014, 0.0035, 0.0022, 0.0014, 0.0015, 0.0020, 0.0015, 0.0017, 0.0017,
         0.0017, 0.0026, 0.0043, 0.0027, 0.0017, 0.0012, 0.0017, 0.0042, 0.0049,
         0.0035, 0.0032, 0.0023, 0.0023, 0.0021, 0.0023, 0.0036, 0.0036, 0.0012,
         0.0021, 0.0016, 0.0016, 0.0019, 0.0033, 0.0017, 0.0023, 0.0020, 0.0014,
         0.0015, 0.0018, 0.0019, 0.0088, 0.0020, 0.0017, 0.0016, 0.0016, 0.0046,
         0.0024, 0.0032, 0.0019, 0.0046, 0.0028, 0.0015, 0.0033, 0.0044, 0.0015,
         0.0018, 0.0024, 0.0023, 0.0021, 0.0030, 0.0020, 0.0019, 0.0023, 0.0022,
         0.0020, 0.0037, 0.0031, 0.0028, 0.0048, 0.0019, 0.0016, 0.0023, 0.0022,
         0.0021, 0.0016, 0.0029, 0.0013, 0.0021, 0.0049, 0.0018, 0.0015, 0.0026,
         0.0017, 0.0015, 0.0018, 0.0028, 0.0013, 0.0015, 0.0014, 0.0011, 0.0024,
         0.0019, 0.0024, 0.0042, 0.0047, 0.0030, 0.0057, 0.0042, 0.0052, 0.0029,
         0.0017, 0.0017, 0.0017, 0.0014, 0.0022, 0.0013, 0.0014, 0.0018, 0.0018,
         0.0013, 0.0014, 0.0017, 0.0015, 0.0011, 0.0014, 0.0016, 0.0014, 0.0013,
         0.0014, 0.0013, 0.0014, 0.0014, 0.0012, 0.0013, 0.0023, 0.0018, 0.0020,
         0.0016, 0.0014, 0.0015, 0.0020, 0.0016, 0.0015, 0.0015, 0.0011, 0.0011,
         0.0013, 0.0015, 0.0017, 0.0024, 0.0013, 0.0013, 0.0012, 0.0014, 0.0013,
         0.0014, 0.0013, 0.0013, 0.0015, 0.0014, 0.0014, 0.0015, 0.0020, 0.0015,
         0.0017, 0.0012, 0.0016, 0.0013, 0.0017, 0.0013, 0.0012, 0.0015, 0.0013,
         0.0014, 0.0018, 0.0021, 0.0012, 0.0012, 0.0012, 0.0014, 0.0015, 0.0019,
         0.0018, 0.0013, 0.0023, 0.0019, 0.0016, 0.0016, 0.0016, 0.0019, 0.0015,
         0.0014, 0.0013, 0.0012, 0.0019, 0.0026, 0.0014, 0.0015, 0.0016, 0.0019,
         0.0017, 0.0014, 0.0018, 0.0016, 0.0019, 0.0015, 0.0018, 0.0013, 0.0022,
         0.0017, 0.0013, 0.0026, 0.0011, 0.0015, 0.0022, 0.0011, 0.0012, 0.0010,
         0.0011, 0.0015, 0.0025, 0.0013, 0.0016, 0.0027, 0.0060, 0.0054]],
       device='cuda:0')
mean_diff: 0.0019767824560403824, max_diff: 0.09569358825683594
mean_diff (with loss): 0.0016160453669726849, max_diff (with loss): 0.08607101440429688 (Please check that mean_diff (with loss) is less than 0.1).
hf_tokens: [248090, 3710, 95772, 1510, 1892, 313, 248080, 2099, 81402, 69942, 74035, 32420, 23469, 97857, 14791, 69942, 23469, 74035, 106531, 12261, 15392, 17676, 21168, 3160, 23469, 73535, 76828, 6121, 128066, 17676, 69942, 46596, 39269, 2330, 6213, 2330, 6105, 21168, 69942, 69942, 3221, 61909, 18563, 115665, 74035, 98339, 6220, 24443, 18268, 142878, 9781, 9781, 62763, 15392, 39269, 9781, 76828, 39269, 74035, 21520, 22099, 65283, 18268, 3160, 7658, 24443, 74035, 65283, 466, 27583, 198, 198, 2099, 13181, 3160, 65283, 37741, 54166, 5983, 15325, 27390, 14791, 13643, 23469, 119704, 23469, 74035, 103828, 74035, 6121, 112464, 101751, 17676, 112648, 76828, 12261, 5215, 69942, 2047, 128066, 12317, 74035, 101742, 123204, 49530, 31059, 14791, 15392, 919, 10558, 12317, 5474, 12261, 112464, 74035, 65283, 21520, 40589, 128066, 15392, 98444, 7099, 118147, 95779, 74035, 61909, 2330, 6105, 2050, 20002, 15708, 21870, 59097, 12261, 6105, 30615, 3349, 198, 1245, 95872, 69942, 5983, 3160, 9781, 66445, 74973, 14791, 3444, 74035, 74035, 96165, 17676, 95988, 96348, 55232, 12317, 9485, 99490, 38981, 919, 95988, 680, 98339, 6121, 123204, 109729, 142878, 12317, 98622, 69942, 69942, 74035, 112401, 127883, 18039, 919, 2438, 112648, 2438, 15392, 6225, 38981, 9140, 15325, 96237, 14170, 98444, 2330, 2330, 21172, 6105, 98500, 2094, 6213, 30615, 6668, 6105, 30615, 66445, 466, 18268, 65283, 1076, 198, 1919, 248082, 100865, 27390, 9781, 369, 127829, 96738, 71702, 20002, 112464, 74035, 30, 26524, 99521, 3460, 9781, 9781, 69942, 17699, 4022, 147390, 65283, 12317, 39317, 17676, 2407, 11051, 11051, 61978, 104427, 62763, 6326, 9781, 69942, 27390, 123204, 39508, 0, 10970, 39317, 101751, 69942, 4661, 123204, 18833, 74035, 65283, 321, 96877, 109566, 466, 466, 14, 52051, 6213, 40589, 57407, 17188, 369, 88768, 919, 6121, 6213, 51343, 1141, 198, 1245, 95872, 4275, 54322, 96738, 12261, 3460, 95140, 65493, 18563, 61909, 71702, 142878, 98339, 74035, 96348, 75581, 9781, 9781, 12317, 95940, 3221, 3000, 63068, 
69942, 47160, 2407, 6977, 62763, 5077, 579, 321, 112464, 6213, 116044, 43314, 30, 11362, 11362, 112401, 18563, 321, 4022, 4022, 15392, 14791, 3221, 69942, 25586, 35132, 28520, 9781, 466, 5474, 37737, 7099, 104145, 8153, 14791, 39269, 98500, 8153, 12454, 41567, 198, 198, 1919, 279, 74035, 321, 271, 198, 760, 10092, 271, 760, 198, 13962, 271, 760, 2099, 4774, 264, 18268, 5072, 314, 264, 2526, 440, 3349, 11, 13, 561, 74035, 682, 3349, 6311, 321, 17024, 22162, 440, 12103, 3565, 52850, 383, 7695, 2094, 383, 1141, 3460, 321, 23469, 13, 11116, 3349, 513, 3349, 11, 6105, 11, 264, 264, 6105, 6105, 37747, 421, 25791, 680, 2272, 279, 8153, 61909, 13, 1070, 13, 561, 74035, 579, 23469, 369, 2526, 321, 17676, 11, 321, 1141, 682, 34036, 11, 34036, 39317, 382, 421, 494, 1141, 3008, 314, 1141, 3460, 13, 561, 23469, 369, 54166, 11, 864, 6326, 310, 279, 74035, 579, 39269, 11, 6866, 424, 279, 39742, 1406, 314, 279, 2099, 13, 561, 7830, 20158, 369, 799, 314, 3799, 22967, 321, 54322, 11, 248046, 248046, 248069]
mg_tokens: [248090, 3710, 95772, 1510, 1892, 313, 248080, 2099, 81402, 69942, 74035, 32420, 23469, 97857, 14791, 69942, 23469, 74035, 106531, 12261, 15392, 17676, 21168, 3160, 23469, 73535, 76828, 6121, 128066, 17676, 69942, 46596, 39269, 2330, 6213, 2330, 6105, 21168, 69942, 69942, 3221, 61909, 18563, 115665, 74035, 98339, 6220, 24443, 18268, 142878, 9781, 9781, 62763, 15392, 39269, 9781, 76828, 39269, 74035, 21520, 22099, 65283, 18268, 3160, 7658, 24443, 74035, 65283, 466, 27583, 198, 198, 2099, 13181, 3160, 65283, 37741, 54166, 5983, 15325, 27390, 14791, 13643, 23469, 119704, 23469, 74035, 103828, 74035, 6121, 112464, 101751, 17676, 112648, 76828, 12261, 5215, 69942, 2047, 128066, 12317, 74035, 101742, 123204, 49530, 31059, 14791, 15392, 919, 10558, 12317, 5474, 12261, 112464, 74035, 65283, 21520, 40589, 128066, 15392, 98444, 7099, 118147, 95779, 74035, 61909, 2330, 6105, 2050, 20002, 15708, 21870, 59097, 12261, 6105, 30615, 3349, 198, 1245, 95872, 69942, 5983, 3160, 9781, 66445, 74973, 14791, 3444, 74035, 74035, 96165, 17676, 95988, 96348, 55232, 12317, 9485, 99490, 38981, 919, 95988, 680, 98339, 6121, 123204, 109729, 142878, 12317, 98622, 69942, 69942, 74035, 112401, 127883, 18039, 919, 2438, 112648, 2438, 15392, 6225, 38981, 9140, 15325, 96237, 14170, 98444, 2330, 2330, 21172, 6105, 98500, 2094, 6213, 30615, 6668, 6105, 30615, 66445, 466, 18268, 65283, 1076, 198, 1919, 248082, 100865, 27390, 9781, 369, 127829, 96738, 71702, 20002, 112464, 74035, 30, 26524, 99521, 3460, 9781, 9781, 69942, 17699, 4022, 147390, 65283, 12317, 39317, 17676, 2407, 11051, 11051, 61978, 104427, 62763, 6326, 9781, 69942, 27390, 123204, 39508, 0, 10970, 39317, 101751, 69942, 4661, 123204, 18833, 74035, 65283, 321, 96877, 109566, 466, 466, 14, 52051, 6213, 40589, 57407, 17188, 369, 88768, 919, 6121, 6213, 51343, 1141, 198, 1245, 95872, 4275, 54322, 96738, 12261, 3460, 95140, 65493, 18563, 61909, 71702, 142878, 98339, 74035, 96348, 75581, 9781, 9781, 12317, 95940, 3221, 3000, 63068, 
69942, 47160, 2407, 6977, 62763, 5077, 579, 321, 112464, 6213, 116044, 43314, 30, 11362, 11362, 112401, 18563, 321, 4022, 4022, 15392, 14791, 3221, 69942, 25586, 35132, 28520, 9781, 466, 5474, 37737, 7099, 104145, 8153, 14791, 39269, 98500, 8153, 12454, 41567, 198, 198, 1919, 279, 74035, 321, 271, 198, 760, 10092, 271, 760, 198, 13962, 271, 760, 2099, 4774, 264, 18268, 5072, 314, 264, 2526, 440, 3349, 11, 13, 561, 74035, 682, 3349, 6311, 321, 17024, 22162, 440, 12103, 3565, 52850, 383, 7695, 2094, 383, 1141, 3460, 321, 23469, 13, 11116, 3349, 513, 3349, 11, 6105, 11, 264, 264, 6105, 6105, 37747, 421, 25791, 680, 2272, 279, 8153, 61909, 13, 1070, 13, 561, 74035, 579, 23469, 369, 2526, 321, 17676, 11, 321, 1141, 682, 34036, 11, 34036, 39317, 382, 421, 494, 1141, 3008, 314, 1141, 3460, 13, 561, 23469, 369, 54166, 11, 864, 6326, 310, 279, 74035, 579, 39269, 11, 6866, 424, 279, 39742, 1406, 314, 279, 2099, 13, 561, 7830, 20158, 369, 799, 314, 3799, 22967, 321, 54322, 11, 248046, 248046, 248069]
token_diff: 0
token_diff (with loss): 0
[INFO:swift] End time of running main: 2026-06-26 16:19:27.791893
[rank0]:[W626 16:19:30.546778686 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())

gemini-code-assist (Bot) left a comment

Code Review

This pull request introduces support for the MiniCPM-V-4.6 (minicpmv4_6) model, including its configuration parsing, model registration, and bridge/vision implementations. The feedback points out potential dimension mismatch bugs when num_beams > 1 during inference, where pixel_values and pixel_values_videos are sliced to [:1] but their corresponding target_sizes and target_sizes_videos are not. Applying consistent slicing to the target sizes is recommended to prevent runtime errors.

Comment on lines +54 to +55
vision_output = self.model_cls.get_image_features(
self, pixel_values[:1].to(dtype=self.vision_tower.dtype), target_sizes)

Priority: high

When num_beams > 1 (for example, when running inference or generation with beam search or a batch size > 1), pixel_values is sliced to pixel_values[:1] (batch size 1), but target_sizes is not sliced accordingly. This can cause get_image_features to raise an error or produce wrong results because of mismatched input dimensions. Consider slicing target_sizes with [:1] as well, so that it stays consistent with pixel_values[:1].

Suggested change
- vision_output = self.model_cls.get_image_features(
-     self, pixel_values[:1].to(dtype=self.vision_tower.dtype), target_sizes)
+ vision_output = self.model_cls.get_image_features(
+     self, pixel_values[:1].to(dtype=self.vision_tower.dtype), target_sizes[:1] if target_sizes is not None else None)

Comment on lines +67 to +68
vision_output = self.model_cls.get_video_features(
self, pixel_values_videos[:1].to(dtype=self.vision_tower.dtype), target_sizes_videos)

Priority: high

As with the image handling above, when num_beams > 1, pixel_values_videos is sliced to pixel_values_videos[:1], but target_sizes_videos is not sliced accordingly. This can cause get_video_features to raise an error because of mismatched input dimensions. Consider slicing target_sizes_videos with [:1] as well to keep the dimensions consistent.

Suggested change
- vision_output = self.model_cls.get_video_features(
-     self, pixel_values_videos[:1].to(dtype=self.vision_tower.dtype), target_sizes_videos)
+ vision_output = self.model_cls.get_video_features(
+     self, pixel_values_videos[:1].to(dtype=self.vision_tower.dtype), target_sizes_videos[:1] if target_sizes_videos is not None else None)

randydl (Contributor, Author) commented Jun 26, 2026

This is exactly how the transformers source code does it; see https://github.com/huggingface/transformers/blob/main/src/transformers/models/minicpmv4_6/modeling_minicpmv4_6.py#L740

if pixel_values is not None and self.config.image_token_id is not None:
    # Pixels are always `1` in first dim due to NaViT packing, and we don't
    # want to waste compute processing the same image `num_beams` times. Hack until
    # @raushan adds support for encoding images once same waay as in enc-dec models
    num_beams = pixel_values.shape[0]
    vision_output = self.get_image_features(pixel_values[:1], target_sizes, downsample_mode=downsample_mode)
    image_features = (
        torch.cat(vision_output.pooler_output, dim=0)
        .to(device=inputs_embeds.device, dtype=inputs_embeds.dtype)
        .repeat(num_beams, 1)
    )
    mask = self.get_placeholder_mask(input_ids, inputs_embeds, image_features, self.config.image_token_id)
    inputs_embeds = inputs_embeds.masked_scatter(mask, image_features)

if pixel_values_videos is not None and self.config.video_token_id is not None:
    num_beams = pixel_values_videos.shape[0]
    vision_output = self.get_video_features(
        pixel_values_videos[:1], target_sizes_videos, downsample_mode=downsample_mode
    )
    video_features = (
        torch.cat(vision_output.pooler_output, dim=0)
        .to(device=inputs_embeds.device, dtype=inputs_embeds.dtype)
        .repeat(num_beams, 1)
    )
    mask = self.get_placeholder_mask(input_ids, inputs_embeds, video_features, self.config.video_token_id)
    inputs_embeds = inputs_embeds.masked_scatter(mask, video_features)
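
The pattern in the quoted code (encode the image once, repeat the features per beam, scatter them into the placeholder positions) can be illustrated without torch. The sketch below uses plain lists; `scatter_features`, `inject_image`, and the fake encoder are hypothetical stand-ins rather than transformers APIs, and the list repeat only mimics what `.repeat(num_beams, 1)` does to a 2-D tensor.

```python
from typing import Callable, List

Vector = List[float]


def scatter_features(input_ids: List[List[int]], embeds: List[List[Vector]],
                     features: List[Vector], placeholder_id: int) -> List[List[Vector]]:
    """List-based analogue of masked_scatter: fill placeholder positions,
    in row-major order, with consecutive rows from `features`."""
    rows = iter(features)
    return [
        [next(rows) if tok == placeholder_id else emb
         for tok, emb in zip(ids_row, emb_row)]
        for ids_row, emb_row in zip(input_ids, embeds)
    ]


def inject_image(input_ids: List[List[int]], embeds: List[List[Vector]],
                 pixel_values: list, encode: Callable[[list], List[Vector]],
                 placeholder_id: int) -> List[List[Vector]]:
    num_beams = len(pixel_values)        # leading dim holds the beam copies
    features = encode(pixel_values[:1])  # run the vision encoder only once
    features = features * num_beams      # tile rows, one copy per beam
    return scatter_features(input_ids, embeds, features, placeholder_id)


IMG = 9  # made-up placeholder token id
ids = [[1, IMG, IMG, 2], [1, IMG, IMG, 2]]  # two beams, same prompt
embeds = [[[0.0, 0.0] for _ in row] for row in ids]
encode_calls: List[int] = []


def fake_encode(pv: list) -> List[Vector]:
    encode_calls.append(len(pv))  # record the batch size the encoder saw
    return [[1.0, 1.0], [2.0, 2.0]]


out = inject_image(ids, embeds, ["img", "img"], fake_encode, IMG)
```

In this toy run the encoder sees a batch of one, yet both beams receive the same two feature rows at their placeholder positions.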

Jintao-Huang (Collaborator) commented

Please run:

pip install pre-commit
pre-commit run --all-files

Jintao-Huang merged commit 72b3df1 into modelscope:main on Jun 28, 2026
1 check failed

Jintao-Huang (Collaborator) commented

Thanks!
