Merged

75 commits
784c3ca
update
Jintao-Huang Oct 17, 2025
5294a1e
Merge branch 'main' into support_mcore_bridge
Jintao-Huang Oct 22, 2025
9545030
update
Jintao-Huang Oct 23, 2025
29a487d
update
Jintao-Huang Oct 23, 2025
879819e
Merge branch 'main' into support_mcore_bridge
Jintao-Huang Oct 23, 2025
4be6ae1
update
Jintao-Huang Oct 24, 2025
4e6e41f
update
Jintao-Huang Oct 24, 2025
bd14154
Merge branch 'main' into support_mcore_bridge
Jintao-Huang Oct 24, 2025
4900c15
update
Jintao-Huang Oct 24, 2025
302fbf4
updae
Jintao-Huang Oct 24, 2025
33a4ef0
update
Jintao-Huang Oct 24, 2025
8e8fdb1
update
Jintao-Huang Oct 24, 2025
385f5bc
update
Jintao-Huang Oct 26, 2025
4e2cee2
update
Jintao-Huang Oct 26, 2025
8bec008
update
Jintao-Huang Oct 27, 2025
d036165
update
Jintao-Huang Oct 28, 2025
5d7233d
update
Jintao-Huang Oct 28, 2025
7647fb9
Merge branch 'main' into support_mcore_bridge
Jintao-Huang Oct 28, 2025
b8c1746
update
Jintao-Huang Oct 28, 2025
43830cb
update
Jintao-Huang Oct 28, 2025
9a968b5
update
Jintao-Huang Oct 28, 2025
15e5d07
support pp tp
Jintao-Huang Oct 28, 2025
856c52e
update
Jintao-Huang Oct 28, 2025
8344a92
update
Jintao-Huang Oct 28, 2025
c294248
update
Jintao-Huang Oct 28, 2025
51b6411
update
Jintao-Huang Oct 29, 2025
20d9fb3
update
Jintao-Huang Oct 29, 2025
74b7456
update
Jintao-Huang Oct 29, 2025
296a16f
support vpp
Jintao-Huang Oct 29, 2025
aa15b97
update
Jintao-Huang Oct 29, 2025
333d42e
support lora
Jintao-Huang Oct 30, 2025
e26587d
support lora
Jintao-Huang Oct 30, 2025
793fbb2
Merge branch 'main' into support_mcore_bridge
Jintao-Huang Oct 30, 2025
26a46ea
fix lora
Jintao-Huang Oct 30, 2025
763d091
update
Jintao-Huang Oct 30, 2025
c456e2b
update
Jintao-Huang Oct 31, 2025
c2241a9
update
Jintao-Huang Nov 1, 2025
c5064d4
update
Jintao-Huang Nov 1, 2025
649e2fe
update
Jintao-Huang Nov 1, 2025
8c90464
update
Jintao-Huang Nov 2, 2025
ec6d5be
update
Jintao-Huang Nov 2, 2025
ea276a1
update
Jintao-Huang Nov 2, 2025
0295ec3
Merge branch 'main' into support_mcore_bridge
Jintao-Huang Nov 2, 2025
bc48b6c
update
Jintao-Huang Nov 2, 2025
8ab3d50
Merge remote-tracking branch 'refs/remotes/origin/support_mcore_bridg…
Jintao-Huang Nov 2, 2025
c882414
update
Jintao-Huang Nov 2, 2025
e761b03
update
Jintao-Huang Nov 2, 2025
209abf2
update
Jintao-Huang Nov 2, 2025
cda6c72
Merge remote-tracking branch 'refs/remotes/origin/support_mcore_bridg…
Jintao-Huang Nov 2, 2025
38be0f0
Merge branch 'main' into support_mcore_bridge
Jintao-Huang Nov 2, 2025
325f621
update
Jintao-Huang Nov 3, 2025
8e833a9
update
Jintao-Huang Nov 3, 2025
7c2139f
Merge remote-tracking branch 'refs/remotes/origin/support_mcore_bridg…
Jintao-Huang Nov 3, 2025
884711f
update
Jintao-Huang Nov 3, 2025
c86c695
Merge branch 'main' into support_mcore_bridge
Jintao-Huang Nov 3, 2025
927304e
fix
Jintao-Huang Nov 3, 2025
088fbef
Merge remote-tracking branch 'refs/remotes/origin/support_mcore_bridg…
Jintao-Huang Nov 3, 2025
b7ecd34
update
Jintao-Huang Nov 3, 2025
d48f8e7
update
Jintao-Huang Nov 3, 2025
f56b94d
update
Jintao-Huang Nov 3, 2025
bec6965
Merge remote-tracking branch 'refs/remotes/origin/support_mcore_bridg…
Jintao-Huang Nov 3, 2025
9a7a21d
update
Jintao-Huang Nov 3, 2025
50be186
fix
Jintao-Huang Nov 3, 2025
6d393a9
update
Jintao-Huang Nov 4, 2025
c8c0174
Merge remote-tracking branch 'refs/remotes/origin/support_mcore_bridg…
Jintao-Huang Nov 4, 2025
9cd0e92
update
Jintao-Huang Nov 4, 2025
d5362b6
update
Jintao-Huang Nov 4, 2025
6ac9d0f
update
Jintao-Huang Nov 4, 2025
cdd3417
fix
Jintao-Huang Nov 4, 2025
5d62d8e
update
Jintao-Huang Nov 4, 2025
991ed4a
fix
Jintao-Huang Nov 4, 2025
5d2ae4c
update
Jintao-Huang Nov 4, 2025
075b8df
update
Jintao-Huang Nov 4, 2025
0dc3621
update
Jintao-Huang Nov 4, 2025
1e1bda9
update
Jintao-Huang Nov 4, 2025
1 change: 1 addition & 0 deletions README.md
@@ -75,6 +75,7 @@ You can contact us and communicate with us by adding our group:


## 🎉 News
- 🎁 2025.11.04: Support for [Mcore-Bridge](docs/source_en/Megatron-SWIFT/Mcore-Bridge.md), making Megatron training as simple and easy to use as transformers.
- 🎁 2025.10.28: Ray is now supported; see the documentation [here](docs/source_en/Instruction/Ray.md).
- 🎁 2025.10.28: Support [using yaml](examples/yaml) to configure command-line parameters.
- 🎁 2025.09.29: Support padding_free for embedding/reranker/seq_cls tasks, use `--padding_free true --task_type embedding/reranker/generative_reranker/seq_cls` to begin!
1 change: 1 addition & 0 deletions README_CN.md
@@ -71,6 +71,7 @@
- **Model quantization**: Supports quantized export with AWQ, GPTQ, FP8, and BNB; exported models support inference acceleration with vLLM/SGLang/LmDeploy as well as continued training.

## 🎉 News
- 🎁 2025.11.04: Support for [Mcore-Bridge](docs/source/Megatron-SWIFT/Mcore-Bridge.md), making Megatron training as simple and easy to use as transformers.
- 🎁 2025.10.28: Ray is now [supported](docs/source/Instruction/ray的支持.md).
- 🎁 2025.10.28: Support for [using yaml](examples/yaml) to configure command-line parameters.
- 🎁 2025.09.29: Support padding_free for embedding/reranker/seq_cls tasks; use `--padding_free true --task_type embedding/reranker/generative_reranker/seq_cls` to start training!
2 changes: 2 additions & 0 deletions docs/source/Instruction/命令行参数.md
@@ -701,6 +701,7 @@ App arguments inherit from [部署参数](#部署参数) and [Web-UI参数](#Web-UI参数)
- mcore_adapters: List of adapter paths for mcore-format models; defaults to an empty list.
- thread_count: Number of model shards when `--to_mcore true` is set. Defaults to None, in which case it is set automatically based on model size so that the largest shard is smaller than 10GB.
- 🔥test_convert_precision: Test the precision error when converting weights between HF and Megatron formats. Defaults to False.
- test_convert_dtype: The dtype used for the conversion precision test; defaults to 'float32'.
- 🔥push_to_hub: Whether to push to the hub; defaults to False. See the example [here](https://github.com/modelscope/ms-swift/blob/main/examples/export/push_to_hub.sh).
- hub_model_id: The model_id to push to; defaults to None.
- hub_private_repo: Whether the repo is private; defaults to False.
@@ -764,6 +765,7 @@ In addition to the model-specific arguments of qwen2_5_vl and qwen2_audio, qwen2_5_omni also
- SPATIAL_MERGE_SIZE: Defaults to 2.
- IMAGE_MIN_TOKEN_NUM: Defaults to `4`, the minimum number of image tokens per image.
- 🔥IMAGE_MAX_TOKEN_NUM: Defaults to `16384`, the maximum number of image tokens per image (used to avoid OOM).
- Tip: the equivalent maximum image pixel count is `IMAGE_MAX_TOKEN_NUM * 32 * 32`.
- VIDEO_MIN_TOKEN_NUM: Defaults to `128`, the minimum number of video tokens per frame.
- 🔥VIDEO_MAX_TOKEN_NUM: Defaults to `768`, the maximum number of video tokens per frame (used to avoid OOM).
- MAX_RATIO: Defaults to 200.
275 changes: 275 additions & 0 deletions docs/source/Megatron-SWIFT/Mcore-Bridge.md
@@ -0,0 +1,275 @@
# Mcore Bridge

Megatron is known for its excellent training speed and rich set of parallelism techniques, but that also comes with a high barrier to entry. Mcore-Bridge was created to make Megatron training as simple and easy to use as transformers. With Mcore-Bridge, users can:
1. Load model weights in safetensors format directly and train efficiently with Megatron, then save the trained weights directly as safetensors with no extra conversion step.
2. Convert LoRA incremental weights in both directions.
3. Synchronize `Megatron->vLLM` weights for algorithms such as GRPO/GKD.
4. Convert ultra-large-scale models across multiple machines.

Mcore-Bridge is compatible with Dense, MoE, and multimodal architectures. After training, the converted model can be deployed directly with mainstream inference frameworks such as transformers, vLLM, and SGLang.

## Seamless Training
Mcore-Bridge currently supports parallelism techniques such as TP/PP/EP/ETP/VPP and all model architectures supported by Megatron-SWIFT; see the [supported models documentation](../Instruction/支持的模型和数据集.md). The following sections demonstrate Mcore-Bridge's seamless training capability, covering Dense models and MoE models respectively.

### Dense Models
The following is a training example for the multimodal model Qwen3-VL:
```shell
# 2 * 76GiB
PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
NPROC_PER_NODE=2 \
IMAGE_MAX_TOKEN_NUM=1024 \
VIDEO_MAX_TOKEN_NUM=128 \
FPS_MAX_FRAMES=16 \
CUDA_VISIBLE_DEVICES=0,1 \
megatron sft \
--model Qwen/Qwen3-VL-8B-Instruct \
--load_safetensors true \
--save_safetensors true \
--dataset 'AI-ModelScope/LaTeX_OCR:human_handwrite#5000' \
--load_from_cache_file true \
--tensor_model_parallel_size 2 \
--sequence_parallel true \
--packing true \
--freeze_llm false \
--freeze_vit true \
--freeze_aligner true \
--split_dataset_ratio 0.01 \
--micro_batch_size 1 \
--global_batch_size 4 \
--recompute_granularity full \
--recompute_method uniform \
--recompute_num_layers 1 \
--finetune true \
--cross_entropy_loss_fusion true \
--lr 1e-5 \
--lr_warmup_fraction 0.05 \
--min_lr 1e-6 \
--max_epochs 1 \
--save megatron_output/Qwen3-VL-8B-Instruct \
--save_interval 200 \
--vit_gradient_checkpointing false \
--max_length 2048 \
--num_workers 4 \
--no_save_optim true \
--no_save_rng true \
--dataset_num_proc 8
```

Then we run inference on the validation split:
```shell
PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
IMAGE_MAX_TOKEN_NUM=1024 \
VIDEO_MAX_TOKEN_NUM=128 \
FPS_MAX_FRAMES=16 \
CUDA_VISIBLE_DEVICES=0 \
swift infer \
--model megatron_output/Qwen3-VL-8B-Instruct/vx-xxx/checkpoint-xxx \
--load_data_args true \
--stream true
```

### MoE Models
The following is a CoT training example for the text-only model Qwen3-Moe:

```shell
# 8 * 76GiB, 3s/it
PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
NPROC_PER_NODE=8 \
megatron sft \
--model Qwen/Qwen3-30B-A3B-Instruct-2507 \
--load_safetensors true \
--save_safetensors true \
--dataset 'swift/Chinese-Qwen3-235B-Thinking-2507-Distill-data-110k-SFT#20000' \
--load_from_cache_file true \
--split_dataset_ratio 0.01 \
--moe_permute_fusion true \
--pipeline_model_parallel_size 2 \
--decoder_first_pipeline_num_layers 25 \
--tensor_model_parallel_size 4 \
--expert_model_parallel_size 4 \
--moe_grouped_gemm true \
--moe_shared_expert_overlap true \
--moe_aux_loss_coeff 1e-6 \
--micro_batch_size 1 \
--global_batch_size 4 \
--recompute_granularity full \
--recompute_method uniform \
--recompute_num_layers 1 \
--max_epochs 1 \
--finetune true \
--cross_entropy_loss_fusion true \
--lr 1e-5 \
--lr_warmup_fraction 0.05 \
--min_lr 1e-6 \
--save megatron_output/Qwen3-30B-A3B-Instruct-2507 \
--eval_interval 500 \
--save_interval 500 \
--max_length 8192 \
--packing true \
--num_workers 8 \
--dataset_num_proc 8 \
--no_save_optim true \
--no_save_rng true \
--sequence_parallel true \
--moe_expert_capacity_factor 2 \
--attention_backend flash
```

Run inference with the trained weights:
```shell
CUDA_VISIBLE_DEVICES=0 \
swift infer \
--model megatron_output/Qwen3-30B-A3B-Instruct-2507/vx-xxx/checkpoint-xxx \
--stream true \
--max_new_tokens 1024
```

## LoRA Export

In addition to importing and exporting full parameters, Mcore-Bridge also supports importing and exporting LoRA incremental weights on their own.

The following is a self-cognition LoRA training example for the text-only model Qwen3-Moe:
- If you want to export the merged weights rather than the LoRA incremental weights, set `--merge_lora true`.
- Note: Because the transformers and Megatron model structures are not always identical (for example, the expert part of Qwen3-VL-Moe in transformers is implemented as Parameters rather than Linear layers), some models cannot be converted (Qwen3-VL-Moe is still convertible if LoRA is trained only on linear_proj and linear_qkv). Most models do support LoRA conversion, for example: Qwen3-Moe, Qwen3-Omni-Moe, GLM4.5-V, etc.
```shell
# 50GiB
PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
NPROC_PER_NODE=2 \
CUDA_VISIBLE_DEVICES=0,1 \
megatron sft \
--model Qwen/Qwen3-30B-A3B-Instruct-2507 \
--load_safetensors true \
--save_safetensors true \
--merge_lora false \
--dataset 'swift/Chinese-Qwen3-235B-2507-Distill-data-110k-SFT#2000' \
'swift/self-cognition#1000' \
--load_from_cache_file true \
--train_type lora \
--lora_rank 8 \
--lora_alpha 32 \
--target_modules all-linear \
--split_dataset_ratio 0.01 \
--moe_permute_fusion true \
--expert_model_parallel_size 2 \
--moe_grouped_gemm true \
--moe_shared_expert_overlap true \
--moe_aux_loss_coeff 1e-3 \
--micro_batch_size 8 \
--global_batch_size 16 \
--recompute_granularity full \
--recompute_method uniform \
--recompute_num_layers 1 \
--max_epochs 1 \
--finetune true \
--cross_entropy_loss_fusion true \
--lr 1e-4 \
--lr_warmup_fraction 0.05 \
--min_lr 1e-5 \
--save megatron_output/Qwen3-30B-A3B-Instruct-2507 \
--eval_interval 200 \
--save_interval 200 \
--max_length 2048 \
--num_workers 8 \
--dataset_num_proc 8 \
--no_save_optim true \
--no_save_rng true \
--sequence_parallel true \
--moe_expert_capacity_factor 2 \
--attention_backend flash \
--model_author swift \
--model_name swift-robot
```

Run inference with the exported LoRA weights:
```shell
CUDA_VISIBLE_DEVICES=0 \
swift infer \
--model Qwen/Qwen3-30B-A3B-Instruct-2507 \
--adapters megatron_output/Qwen3-30B-A3B-Instruct-2507/vx-xxx/checkpoint-xxx \
--stream true
```

## Export and Conversion Precision Testing

In addition to converting and saving safetensors during training, Mcore-Bridge provides the `megatron export` command for standalone weight export. `megatron export` can also test the conversion precision while converting weights, which is very helpful for verifying correctness when integrating a new model. In general, models already integrated into Megatron-SWIFT do not exhibit precision misalignment, so you can safely set `--test_convert_precision false`.

Full-parameter weights:
```shell
# safetensors -> torch_dist
CUDA_VISIBLE_DEVICES=0,1,2,3 \
NPROC_PER_NODE=4 \
megatron export \
--model Qwen/Qwen3-30B-A3B-Instruct-2507 \
--save Qwen3-30B-A3B-Instruct-2507-mcore \
--to_mcore true \
--tensor_model_parallel_size 2 \
--expert_model_parallel_size 2 \
--pipeline_model_parallel_size 2 \
--test_convert_precision true
```

```shell
# torch_dist -> safetensors
CUDA_VISIBLE_DEVICES=0,1,2,3 \
NPROC_PER_NODE=4 \
megatron export \
--load Qwen3-30B-A3B-Instruct-2507-mcore \
--save Qwen3-30B-A3B-Instruct-2507-hf \
--to_hf true \
--tensor_model_parallel_size 2 \
--expert_model_parallel_size 2 \
--pipeline_model_parallel_size 2 \
--test_convert_precision true
```
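Conceptually, `--test_convert_precision` compares the converted weights against the originals after upcasting to `test_convert_dtype` and reports the numerical drift. The following pure-Python sketch — with hypothetical helper names, not the actual Megatron-SWIFT implementation — illustrates the comparison:

```python
def max_abs_error(weights_a, weights_b):
    """Largest element-wise absolute difference between two flat weight lists."""
    return max(abs(a - b) for a, b in zip(weights_a, weights_b))

# Toy stand-ins for one converted parameter pair; a real test would iterate
# over every parameter of both checkpoints after upcasting to test_convert_dtype.
hf_weights = [0.10, -0.25, 0.33]
mcore_weights = [0.10, -0.25, 0.33000001]

assert max_abs_error(hf_weights, mcore_weights) < 1e-6
```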

LoRA weights:
```shell
# torch_dist -> safetensors
# If you need to merge the LoRA weights and test precision alignment after merging, simply set `--merge_lora true`
# You can also replace `--model safetensors-path` with `--load torch-dist-path`; the two forms are equivalent, and mcore-bridge handles them automatically.
CUDA_VISIBLE_DEVICES=0,1,2,3 \
NPROC_PER_NODE=4 \
megatron export \
--model Qwen/Qwen3-30B-A3B-Instruct-2507 \
--adapter_load megatron_output/Qwen3-30B-A3B-Instruct-2507/vx-xxx \
--save megatron_output/Qwen3-30B-A3B-Instruct-2507/vx-xxx-lora \
--merge_lora false \
--to_hf true \
--tensor_model_parallel_size 2 \
--expert_model_parallel_size 2 \
--pipeline_model_parallel_size 2 \
--test_convert_precision true
```

```shell
# safetensors -> torch_dist
CUDA_VISIBLE_DEVICES=0,1,2,3 \
NPROC_PER_NODE=4 \
megatron export \
--model Qwen/Qwen3-30B-A3B-Instruct-2507 \
--adapters megatron_output/Qwen3-30B-A3B-Instruct-2507/vx-xxx-lora \
--save megatron_output/Qwen3-30B-A3B-Instruct-2507/vx-xxx-mcore \
--merge_lora false \
--to_mcore true \
--tensor_model_parallel_size 2 \
--expert_model_parallel_size 2 \
--pipeline_model_parallel_size 2 \
--test_convert_precision true
```

Merge-LoRA:
```shell
# torch_dist -> torch_dist
CUDA_VISIBLE_DEVICES=0,1,2,3 \
NPROC_PER_NODE=4 \
megatron export \
--model Qwen/Qwen3-30B-A3B-Instruct-2507 \
--adapter_load megatron_output/Qwen3-30B-A3B-Instruct-2507/vx-xxx \
--save megatron_output/Qwen3-30B-A3B-Instruct-2507/vx-xxx-merged \
--merge_lora true \
--to_mcore true \
--tensor_model_parallel_size 2 \
--expert_model_parallel_size 2 \
--pipeline_model_parallel_size 2
```
19 changes: 19 additions & 0 deletions docs/source/Megatron-SWIFT/命令行参数.md
@@ -232,6 +232,7 @@ LoRA training:
- 🔥target_modules: Suffixes of the LoRA modules to target; for example, you can set `--target_modules linear_qkv linear_proj`. Defaults to `['all-linear']`, which targets all linear layers.
- Note: 'all-linear' behaves differently for LLMs and multimodal LLMs. For an LLM, it automatically finds all linear layers except lm_head and attaches the tuner; **for a multimodal LLM, it attaches the tuner only to the LLM part by default, and this behavior can be controlled with `freeze_llm`, `freeze_vit`, and `freeze_aligner`**.
- Note: To target all routers, additionally set `--target_modules all-router ...`, for example: `--target_modules all-router all-linear`.
- The Linear layer suffixes differ between transformers and Megatron: in Megatron, `linear_proj` corresponds to `o_proj`, `linear_qkv` to the concatenation of `q_proj, k_proj, v_proj`, `linear_fc1` to the concatenation of `gate_proj` and `up_proj`, and `linear_fc2` to `down_proj`.
- 🔥target_regex: Regex for selecting LoRA modules; defaults to `None`. If provided, the target_modules argument is ignored.
- 🔥modules_to_save: After the tuner is attached, additionally selects original model modules to train and save. Defaults to `[]`. For example, `--modules_to_save word_embeddings output_layer` unfreezes the `word_embeddings` and `output_layer` layers during LoRA training, and the weights of both parts are saved.
- 🔥lora_rank: Defaults to `8`.
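The suffix correspondence between Megatron and transformers linear layers can be sketched as a plain lookup table (illustrative only — not code from the library):

```python
# Megatron linear-layer suffix -> corresponding transformers projection name(s).
# Illustrative mapping based on the note above; not the library's actual code.
MEGATRON_TO_HF = {
    "linear_proj": ["o_proj"],
    "linear_qkv": ["q_proj", "k_proj", "v_proj"],  # fused QKV projection
    "linear_fc1": ["gate_proj", "up_proj"],        # fused gate/up projection
    "linear_fc2": ["down_proj"],
}

def hf_equivalents(megatron_suffix):
    """Return the transformers suffixes fused into one Megatron linear layer."""
    return MEGATRON_TO_HF[megatron_suffix]

assert hf_equivalents("linear_qkv") == ["q_proj", "k_proj", "v_proj"]
```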
@@ -263,6 +264,13 @@ LoRA training:
**RM arguments**:
- center_rewards_coefficient: Coefficient used to incentivize the reward model to output zero-mean rewards; see this [paper](https://huggingface.co/papers/2312.09244). Recommended value: 0.01.

**Mcore-Bridge arguments**
- 🔥load_safetensors: Defaults to False; whether to load weights directly from safetensors.
- 🔥save_safetensors: Defaults to False; whether to save weights directly as safetensors. Note: if this is set to True, optimizer state, RNG state, and other content needed for checkpoint resumption will not be saved.
- model: The model_id or model_path of the safetensors weights. Defaults to None.
- adapters: The adapter_id or adapter_path of LoRA incremental weights in safetensors format. Defaults to `[]`.
- merge_lora: Whether to save the merged weights. Defaults to None; if `save_safetensors` is True, this defaults to `True`, otherwise to False. That is, by default LoRA is merged when saving in safetensors format and not merged when saving in torch_dist format.
- max_shard_size: Maximum file size for safetensors shards; defaults to '5GB'.
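The `merge_lora` default described above reduces to a simple rule; a minimal sketch of the documented behavior (pseudologic, not the actual source):

```python
def resolve_merge_lora(merge_lora, save_safetensors):
    """Default: merge LoRA when exporting safetensors; keep adapters separate for torch_dist."""
    if merge_lora is None:
        return bool(save_safetensors)
    return merge_lora

assert resolve_merge_lora(None, True) is True    # safetensors export -> merge
assert resolve_merge_lora(None, False) is False  # torch_dist export -> keep adapters
assert resolve_merge_lora(False, True) is False  # an explicit flag always wins
```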

## Training Arguments

@@ -299,3 +307,14 @@ Megatron training arguments inherit from the Megatron arguments and basic arguments (**shared with ms-swift
- 🔥rlhf_type: Defaults to 'dpo'. Currently 'dpo', 'kto', and 'rm' are supported.
- loss_scale: Overrides loss_scale in the [basic arguments](../Instruction/命令行参数.md). Defaults to 'last_round'.
- calculate_per_token_loss: Overrides the Megatron argument; defaults to False.


## Export Arguments
This section describes the arguments of `megatron export` (requires "ms-swift>=3.10"). To use the `swift export` command instead, see the [ms-swift command-line arguments documentation](../Instruction/命令行参数.md#导出参数). Compared with `swift export`, `megatron export` supports distributed and multi-node export. The Megatron export arguments inherit from the Megatron arguments and basic arguments.
- 🔥to_mcore: Convert HF-format weights to Megatron format. Defaults to False.
- 🔥to_hf: Convert Megatron-format weights to HF format. Defaults to False.
- 🔥merge_lora: Defaults to None; if `to_hf` is True, this defaults to `True`, otherwise to False. That is, by default LoRA is merged when saving in safetensors format and not merged when saving in torch_dist format. The merged weights are saved in the `--save` directory.
- Note: Because the transformers and Megatron model structures are not always identical (for example, the expert part of Qwen3-VL-Moe in transformers is implemented as Parameters rather than Linear layers), some models cannot be converted (Qwen3-VL-Moe is still convertible if LoRA is trained only on linear_proj and linear_qkv). Most models do support LoRA conversion, for example: Qwen3-Moe, Qwen3-Omni-Moe, GLM4.5-V, etc.
- 🔥test_convert_precision: Test the precision error when converting weights between HF and Megatron formats. Defaults to False.
- test_convert_dtype: The dtype used for the conversion precision test; defaults to 'float32'.
- exist_ok: If `args.save` already exists, do not raise an exception; overwrite it instead. Defaults to False.
1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -41,6 +41,7 @@ Swift DOCUMENTATION
Megatron-SWIFT/命令行参数.md
Megatron-SWIFT/LoRA训练.md
Megatron-SWIFT/多模态模型.md
Megatron-SWIFT/Mcore-Bridge.md

.. toctree::
:maxdepth: 2
2 changes: 2 additions & 0 deletions docs/source_en/Instruction/Command-line-parameters.md
@@ -719,6 +719,7 @@ Export Arguments include the [basic arguments](#base-arguments) and [merge argum
- mcore_adapters: List of paths to mcore format model adapters, default is empty list.
- thread_count: The number of model slices when `--to_mcore true` is set. Defaults to None, and is automatically configured based on the model size, ensuring that the largest slice is less than 10GB.
- 🔥test_convert_precision: Test the precision error when converting weights between HF and Megatron formats. Default is False.
- test_convert_dtype: The dtype used for conversion precision testing, defaults to 'float32'.
- 🔥push_to_hub: Whether to push to the hub, with the default being False. Examples can be found [here](https://github.com/modelscope/ms-swift/blob/main/examples/export/push_to_hub.sh).
- hub_model_id: Model ID for pushing, default is None.
- hub_private_repo: Whether it is a private repo, default is False.
@@ -786,6 +787,7 @@ The parameter meanings are the same as in the `qwen_vl_utils>=0.0.14` library
- SPATIAL_MERGE_SIZE: default 2.
- IMAGE_MIN_TOKEN_NUM: default `4`, denotes the minimum number of image tokens per image.
- 🔥IMAGE_MAX_TOKEN_NUM: default `16384`, denotes the maximum number of image tokens per image. (used to avoid OOM)
- Note: The equivalent maximum image pixel count is `IMAGE_MAX_TOKEN_NUM * 32 * 32`.
- VIDEO_MIN_TOKEN_NUM: default `128`, denotes the minimum number of video tokens per frame.
- 🔥VIDEO_MAX_TOKEN_NUM: default `768`, denotes the maximum number of video tokens per frame. (used to avoid OOM)
- MAX_RATIO: default 200.
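As a quick check of the pixel note above, the token budget translates to a pixel budget as follows (a hypothetical helper; `32` is the effective patch edge implied by the note):

```python
def max_image_pixels(image_max_token_num, patch_size=32):
    """Equivalent maximum pixel budget: each image token covers a patch_size x patch_size area."""
    return image_max_token_num * patch_size * patch_size

assert max_image_pixels(16384) == 16_777_216  # default budget, ~16.8 megapixels
```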