[Open Source Internship] Develop a language modeling application case based on MindSpore NLP #26

4everWZ · 2025-12-12T13:21:31Z

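Hands-on language model fine-tuning with MindSpore MindNLP (DistilGPT-2 Causal LM)

This notebook shows how to fine-tune the DistilGPT-2 model and run text-generation inference with MindSpore 2.7.0 and MindNLP (Mindhf) 0.5.1, following the official HuggingFace language_modeling workflow.

Unlike the official example, this project makes a lightweight adaptation of the Trainer for MindSpore's static-graph and dynamic-graph modes, so that it keeps the convenience of the HF-style API while running stably on Ascend/GPU.

Main contents

Dataset loading:
- Load the Wikitext-2-raw-v1 dataset from a HuggingFace Datasets mirror with MindNLP's load_dataset interface.
Model and tokenizer:
- Load the lightweight pretrained DistilGPT-2 model with AutoTokenizer and AutoModelForCausalLM.
- Handle the padding issue specific to the GPT-2 family by setting pad_token to eos_token.
Data processing and adaptation:
- Build Causal LM training samples (input_ids and labels shifted in step with each other).
- Key adaptation: implement an MSMapDataset class that wraps MindSpore's streaming Dataset in a map-style interface and, together with passthrough_collator, resolves compatibility problems with the Trainer's internal DataLoader.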
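
As a rough illustration of the two data-preparation steps above, the sketch below uses plain Python only. `group_texts` follows the chunking pattern from the HuggingFace language modeling example (concatenate, cut into `block_size` pieces, copy `input_ids` into `labels`); whether the notebook uses exactly this function is an assumption. `MapStyleDataset` and the body of `passthrough_collator` are hypothetical stand-ins for the PR's MSMapDataset and collator, written against a generic Python iterable rather than a real MindSpore Dataset.

```python
from itertools import chain
from typing import Any, Dict, Iterable, List


def group_texts(tokenized: Dict[str, List[List[int]]], block_size: int) -> Dict[str, List[List[int]]]:
    """Concatenate tokenized rows and cut them into fixed-size blocks."""
    concatenated = {k: list(chain.from_iterable(v)) for k, v in tokenized.items()}
    total = len(concatenated["input_ids"])
    total = (total // block_size) * block_size  # drop the ragged tail
    result = {
        k: [seq[i:i + block_size] for i in range(0, total, block_size)]
        for k, seq in concatenated.items()
    }
    # Labels are a plain copy of input_ids here; the one-position shift is
    # assumed to happen inside the causal LM head when the loss is computed.
    result["labels"] = [block[:] for block in result["input_ids"]]
    return result


class MapStyleDataset:
    """Hypothetical stand-in for MSMapDataset: materialises an iterable of
    samples so that it supports len() and integer indexing."""

    def __init__(self, samples: Iterable[Dict[str, Any]]):
        self._samples = list(samples)

    def __len__(self) -> int:
        return len(self._samples)

    def __getitem__(self, index: int) -> Dict[str, Any]:
        return self._samples[index]


def passthrough_collator(batch: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Assumed behaviour: hand the items through unchanged."""
    return batch


blocks = group_texts({"input_ids": [[1, 2, 3], [4, 5, 6, 7]]}, block_size=3)
dataset = MapStyleDataset(
    {"input_ids": ids, "labels": labels}
    for ids, labels in zip(blocks["input_ids"], blocks["labels"])
)
```

Copying `input_ids` into `labels` relies on the model shifting the labels by one position when it computes the loss, which is how GPT-2-style causal LM heads behave in the transformers-style API.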
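Model fine-tuning (custom Trainer):
- Rather than writing a low-level native training loop, the notebook subclasses the Trainer and implements NoJitTrainer.
- Core logic: override training_step to remove the default dependency on JIT compilation and update gradients with an explicit .backward() call. This resolves compatibility errors that value_and_grad raises under some complex control flow, while keeping the convenient TrainingArguments configuration (logging_steps, gradient_accumulation, and so on).
Inference and generation:
- Use the model.generate interface with temperature sampling and top-p decoding for high-quality automatic text continuation.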
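
To make the decoding step concrete, here is a minimal pure-Python sketch of temperature scaling followed by top-p (nucleus) filtering for a single next-token choice. It illustrates the sampling rule only and is not the code inside model.generate; the function name `top_p_sample` and its parameters are chosen for this example.

```python
import math
import random
from typing import List, Optional


def top_p_sample(
    logits: List[float],
    temperature: float = 1.0,
    top_p: float = 0.9,
    rng: Optional[random.Random] = None,
) -> int:
    """Sample one token id using temperature scaling and nucleus filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    rng = rng or random.Random()

    # Temperature-scaled softmax, shifted by the maximum for numerical stability.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of most likely tokens whose cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # random.choices treats the weights as relative, so no renormalisation is needed.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


token_id = top_p_sample([5.0, 1.0, 0.0, -2.0], temperature=1.0, top_p=0.5,
                        rng=random.Random(0))
```

A lower temperature sharpens the distribution before filtering, and a smaller top_p shrinks the candidate set; in the call above the first token alone already exceeds the 0.5 mass threshold, so it is the only candidate.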

Copilot

Pull request overview

This PR introduces a comprehensive Jupyter notebook tutorial demonstrating language model fine-tuning using MindSpore 2.7.0 and MindNLP 0.5.1. The tutorial adapts HuggingFace's language modeling example to the MindSpore ecosystem, showcasing GPT-2 fine-tuning on the Wikitext-2 dataset.

Key changes:

- Complete implementation of GPT-2 Causal Language Model fine-tuning workflow
- Custom training loop using MindSpore's native APIs with MindNLP Trainer integration
- Text generation capabilities demonstrating the fine-tuned model's performance


nlp/language_modeling/language_modeling.ipynb

llm/distilgpt2/finetune_distilgpt2_language_modeling.ipynb

4everWZ · 2025-12-14T13:24:05Z

The model has already been fine-tuned on Ascend hardware, and the PR has been updated.
I will squash the commits once the mentors have reviewed it.

moyu026 · 2025-12-16T06:23:49Z

It runs with mindspore 2.7.0 and mindnlp 0.5.1, but the diffusers version has to be changed: pip install diffusers==0.35.2

4everWZ · 2025-12-16T10:29:42Z

> It runs with mindspore 2.7.0 and mindnlp 0.5.1, but the diffusers version has to be changed: pip install diffusers==0.35.2

Yes, the diffusers version problem is mentioned in an issue on mindnlp.
mindspore-lab/mindnlp#2306

moyu026 · 2025-12-18T01:17:35Z

Please revise the format again following this template: https://github.com/BQBBLZ/applications/blob/master/ai_x/train_resnet_classification.ipynb

4everWZ · 2025-12-19T09:25:44Z

> Please revise the format again following this template: https://github.com/BQBBLZ/applications/blob/master/ai_x/train_resnet_classification.ipynb

Hello, the format has been revised. Please take a look.

BQBBLZ · 2025-12-29T07:32:22Z

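The intermediate run outputs in the notebook also need to be removed, for the following reasons:

Different people running the notebook produce different execution_count values and outputs, which easily leads to Git conflicts after committing. Notebook outputs (such as images and large blocks of text) also inflate the file size. Before committing a notebook, it is recommended to use nbstripout to strip the dynamic data produced at run time and keep only the code and Markdown text structure, so the file stays clean.

Steps:
After forking and cloning the repository, run:
pip install nbstripout
nbstripout --install
After this, Git automatically calls nbstripout to clean the run-time data in .ipynb files on every subsequent commit.

The full procedure is also described at: https://github.com/mindspore-lab/applications/wiki/Contributing-Guidelines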
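
For readers who want to see what "run-time data" means in practice, the sketch below clears the `outputs` and `execution_count` fields of code cells in an nbformat-4 notebook dictionary using only the standard library. It is an illustration of the idea, not nbstripout's implementation, and the tool itself handles more cases; the function name `strip_notebook_outputs` and the sample notebook are made up for this example.

```python
import json
from typing import Any, Dict


def strip_notebook_outputs(notebook: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of an nbformat-4 notebook dict without run-time data."""
    cleaned = json.loads(json.dumps(notebook))  # simple deep copy of JSON data
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []            # drop printed text, images, tracebacks
            cell["execution_count"] = None  # drop the per-run counter
    return cleaned


example = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": 7,
            "source": ["print('hi')"],
            "outputs": [{"output_type": "stream", "name": "stdout", "text": ["hi\n"]}],
        },
    ],
}
stripped = strip_notebook_outputs(example)
```

Only code cells are touched; Markdown cells and the cell sources stay as they are, which is what keeps diffs between contributors small.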

xing-yiren · 2025-12-30T16:57:40Z

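Please change the name MindNLP in the file to the full name MindSpore NLP.
Please put this case under the llm directory, rename the file after the model name (for the model name, follow the naming used in the huggingface transformers library), and update the directory listing accordingly. Thanks.
Please also rename the ipynb file following the convention in the wiki.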

4everWZ · 2025-12-31T08:23:06Z

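> The intermediate run outputs in the notebook also need to be removed, for the following reasons:
>
> Different people running the notebook produce different execution_count values and outputs, which easily leads to Git conflicts after committing. Notebook outputs (such as images and large blocks of text) also inflate the file size. Before committing a notebook, it is recommended to use nbstripout to strip the dynamic data produced at run time and keep only the code and Markdown text structure, so the file stays clean.
>
> Steps:
> After forking and cloning the repository, run:
> pip install nbstripout
> nbstripout --install
> After this, Git automatically calls nbstripout to clean the run-time data in .ipynb files on every subsequent commit.
>
> The full procedure is also described at: https://github.com/mindspore-lab/applications/wiki/Contributing-Guidelines

The outputs have been removed.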

4everWZ · 2025-12-31T08:27:46Z

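> Please change the name MindNLP in the file to the full name MindSpore NLP.
>
> Please put this case under the llm directory, rename the file after the model name (for the model name, follow the naming used in the huggingface transformers library), and update the directory listing accordingly. Thanks.
>
> Please also rename the ipynb file following the convention in the wiki.

The parts of the markdown that do not affect imports or installation have been changed to MindSpore NLP, and the file has been moved to llm/distilgpt2/finetune_distilgpt2_language_modeling.ipynb.
I have rebased onto head, so there is only a single commit.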

llm/distilgpt2/finetune_distilgpt2_language_modeling.ipynb

xing-yiren · 2025-12-31T08:56:28Z

That said, this deserves special praise: using AI to assist with code checking, and rebasing onto head to keep a single commit. Excellent!

llm/distilgpt2/finetune_distilgpt2_language_modeling.ipynb

xing-yiren · 2026-01-07T03:07:40Z

Please also update the README with the case information. Nothing else from my side. Thank you.

xing-yiren · 2026-01-07T03:08:17Z

@moyu026 Dengjin, please also verify that it runs without problems in dynamic graph mode.

moyu026 · 2026-01-07T07:27:21Z

It runs in dynamic graph mode.

xing-yiren · 2026-01-09T02:13:36Z

I see no remaining problems on my side. @wang-hua-2019, please give it a final check.

Copilot AI review requested due to automatic review settings December 12, 2025 13:21

Copilot started reviewing on behalf of 4everWZ December 12, 2025 13:21

Copilot AI reviewed Dec 12, 2025

4everWZ force-pushed the upstream branch from 46cd8a5 to ef6f1ab December 12, 2025 16:40

4everWZ force-pushed the upstream branch from de3e868 to 4f13fd4 December 19, 2025 09:21

xing-yiren requested a review from wang-hua-2019 December 30, 2025 17:06

4everWZ force-pushed the upstream branch 2 times, most recently from 244fa04 to d95f76e December 31, 2025 08:21

4everWZ force-pushed the upstream branch from d95f76e to 497163d December 31, 2025 08:26

xing-yiren reviewed Dec 31, 2025

llm/distilgpt2/finetune_distilgpt2_language_modeling.ipynb (outdated, resolved)

llm/distilgpt2/finetune_distilgpt2_language_modeling.ipynb (outdated, resolved)

4everWZ force-pushed the upstream branch from 7f8388d to 73b65e2 December 31, 2025 10:28

xing-yiren reviewed Jan 7, 2026

llm/distilgpt2/finetune_distilgpt2_language_modeling.ipynb (resolved)

llm/distilgpt2/finetune_distilgpt2_language_modeling.ipynb (outdated, resolved)

language_modeling fine-tuning in mindnlp (2fee6e4)

4everWZ force-pushed the upstream branch from 73b65e2 to 2fee6e4 January 7, 2026 08:54

implement llm/README.MD with DistilGPT2 (889e0d8)

4everWZ force-pushed the upstream branch from db4fa33 to 889e0d8 January 7, 2026 13:49

xing-yiren approved these changes Jan 9, 2026
