
How can we use te.Linear with weight parallel? #1532

Open · zigzagcai opened this issue Mar 4, 2025 · 3 comments
zigzagcai commented Mar 4, 2025

Hi developers,

Thanks for introducing such a great project that enables FP8 training.

In my training framework, we have a weight parallel implementation that does weight all-gather and gradient reduce-scatter like ZeRO3. In the forward pass, we all-gather the weight and then call the linear_forward_op (which is actually torch.nn.functional.linear).

But when I check the code of te.Linear, I see that there is a torch.autograd.Function named _Linear which handles the FP8 computation.

So I just wonder: how can we integrate te.Linear with our weight parallel implementation? From my understanding, the forward and backward ops used in our weight parallel implementation depend on torch.nn.functional.linear, which is not compatible with the ops used in te._Linear.
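
For illustration, a minimal sketch of that pattern (the helper name and arguments are hypothetical, not the framework's real API): gather the full weight from the ZeRO3 shards, call torch.nn.functional.linear, then free the gathered copy.

import torch
import torch.distributed as dist
import torch.nn.functional as F

def wp_linear_forward(x, weight_shard, full_weight_shape, group=None):
    # Hypothetical helper: all-gather the sharded weight, run the linear op, free the copy.
    world_size = dist.get_world_size(group)
    full = torch.empty(weight_shard.numel() * world_size,
                       dtype=weight_shard.dtype, device=weight_shard.device)
    dist.all_gather_into_tensor(full, weight_shard, group=group)
    out = F.linear(x, full.view(full_weight_shape))  # the "linear_forward_op"
    del full  # release the gathered weight
    return out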

Thanks in advance if anybody could provide some hints!

cc @ksivaman @timmoon10 @cyanguwa

timmoon10 (Collaborator) commented:

PyTorch FSDP gathers the module params before each forward and backward so that module implementations can just access them like normal. I wonder if your framework could use a similar approach, perhaps using PyTorch module hooks, e.g. all-gathering params in a pre-forward callback and deallocating them in a post-forward callback. Things get trickier with FP8 and MXFP8 support, since caching the FP8/MXFP8 weight is an important performance optimization.
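
For illustration, a minimal sketch of that hook-based approach, assuming a generic module with a .weight parameter and a flat per-rank shard (the helper name is hypothetical; te.Linear's FP8/MXFP8 weight caching is not handled here, and the backward pass would need the same treatment, e.g. via a full-backward pre-hook):

import torch
import torch.distributed as dist

def attach_weight_gather_hooks(module, weight_shard, full_weight_shape, group=None):
    world_size = dist.get_world_size(group)

    def pre_forward(mod, inputs):
        # All-gather the full weight right before the module's forward runs.
        full = torch.empty(weight_shard.numel() * world_size,
                           dtype=weight_shard.dtype, device=weight_shard.device)
        dist.all_gather_into_tensor(full, weight_shard, group=group)
        mod.weight.data = full.view(full_weight_shape)

    def post_forward(mod, inputs, output):
        # Deallocate the gathered copy after forward; only the local shard is kept.
        mod.weight.data = torch.empty(0, dtype=weight_shard.dtype, device=weight_shard.device)

    module.register_forward_pre_hook(pre_forward)
    module.register_forward_hook(post_forward)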

If you are just looking for more fine-grained access to our linear layer implementation, we do have some functional APIs:

TransformerEngine/transformer_engine/pytorch/ops/basic/basic_linear.py, line 335 in 2ad5da9: def _functional_forward(
TransformerEngine/transformer_engine/pytorch/ops/basic/basic_linear.py, line 539 in 2ad5da9: def _functional_backward(

These are experimental, though, and we can't make any guarantees on the stability of their APIs.
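
For illustration, calling these experimental APIs directly looks roughly like this (argument and return conventions follow the snippet later in this thread; tensor shapes are made up, and the API may change between TE versions):

import torch
from transformer_engine.pytorch.ops.basic.basic_linear import BasicLinear

x = torch.randn(16, 1024, device="cuda", dtype=torch.bfloat16)
w = torch.randn(4096, 1024, device="cuda", dtype=torch.bfloat16)

# Forward: the first return value is the output; the rest are auxiliary tensors.
out, *_ = BasicLinear._functional_forward(input=x, weight=w, bias=None)

# Backward: returns (grad_input, grad_weight) for the given grad_output.
grad_out = torch.randn_like(out)
dx, dw = BasicLinear._functional_backward(grad_output=grad_out, input=x, weight=w)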


zigzagcai commented Mar 14, 2025


Hi @timmoon10,

Thanks for your reply! I have tried your approach, where I switch the default linear fwd/bwd ops to TransformerEngine's BasicLinear._functional_forward and BasicLinear._functional_backward. But in the trace, I cannot find any FP8 GEMM kernels. It seems _functional_forward and _functional_backward still call BF16 GEMM kernels, not FP8 GEMMs.

from typing import Optional

import torch
from torch import nn
from torch.cuda.amp import custom_bwd, custom_fwd

from transformer_engine.pytorch.ops.basic.basic_linear import BasicLinear

# NOTE: WPCommunicator is the framework's weight-parallel communicator
# (all-gather / reduce-scatter helper); import it from your own codebase.


class WPFusedDenseFunc(torch.autograd.Function):
    "FusedDenseFunc for weight parallel, which is optimized based on the flash-attn implementation."

    @staticmethod
    @custom_fwd
    def forward(
        ctx,
        x: torch.Tensor,
        weight: torch.Tensor,
        bias: Optional[torch.Tensor],
        module: nn.Module,
        communicator: WPCommunicator,
        return_residual=False,
    ):
        ctx.compute_weight_gradient = weight.requires_grad
        ctx.return_residual = return_residual
        ctx.module = module
        ctx.communicator = communicator
        
        assert bias is None
        assert not return_residual

        if torch.is_autocast_enabled():
            x = x.to(dtype=torch.get_autocast_gpu_dtype())
        x = x.contiguous()

        total_weight = communicator.weight_hook(weight, module=module)
        total_bias = bias if bias is None else communicator.weight_hook(bias, module=module, is_bias=True)

        if torch.is_autocast_enabled():
            total_weight = total_weight.to(dtype=torch.get_autocast_gpu_dtype())
            if total_bias is not None:
                total_bias = total_bias.to(dtype=torch.get_autocast_gpu_dtype())

        total_weight = total_weight.contiguous()
        batch_shape, n = x.shape[:-1], x.shape[-1]
        batch_dim = batch_shape.numel()
        # https://github.com/pytorch/pytorch/blob/5b51849b48a7dbccd297286cc0110def4706f9e7/aten/src/ATen/native/cuda/Blas.cpp#L174
        if min(batch_dim, n, *total_weight.shape) > 65535 * 32:
            raise RuntimeError("fused_dense only supports matrix dims <= 2M")

        output, _, _ = BasicLinear._functional_forward(input=x, weight=total_weight, bias=total_bias)

        # release memory
        del total_weight
        del total_bias

        # parallel strategy-specific communication callback 2.
        # see more details in the communicator for different parallel strategies.
        # gather seq dim when head parallel_output is False
        if hasattr(communicator, "output_hook"):
            output, _ = communicator.output_hook(output, async_op=False)

        saved_x = None if ctx.compute_weight_gradient is False else x
        ctx.save_for_backward(saved_x, weight, bias)
        
        return output if not return_residual else (output, x)

    @staticmethod
    @custom_bwd
    def backward(ctx, grad_output, *args):
        module: nn.Module = ctx.module
        communicator: WPCommunicator = ctx.communicator
        x, weight, bias = ctx.saved_tensors

        # parallel strategy-specific communication callback 3.
        # see more details in the communicator for different parallel strategies.
        if hasattr(communicator, "grad_output_hook"):
            grad_output, _ = communicator.grad_output_hook(grad_output, async_op=False)

        grad_output = grad_output.contiguous()
        if ctx.return_residual:
            (grad_input,) = args
            grad_input = grad_input.contiguous()

        batch_shape = grad_output.shape[:-1]
        batch_dim = batch_shape.numel()
        grad_output = grad_output.reshape(batch_dim, grad_output.shape[-1])

        total_weight = communicator.weight_hook(weight, module=module)

        # compute weight grad
        if ctx.needs_input_grad[1]:
            assert ctx.compute_weight_gradient
            x = x.reshape(batch_dim, x.shape[-1])
            _, grad_weight = BasicLinear._functional_backward(grad_output=grad_output, input=x, weight=total_weight)
            grad_weight, grad_weight_sync = communicator.grad_hook(
                grad_weight, async_op=True, module=module, is_bias=False
            )
        else:
            grad_weight = None

        # bias is asserted to be None in forward, so this stays None in practice
        grad_bias = grad_output.sum(dim=0) if ctx.needs_input_grad[2] else None

        if ctx.needs_input_grad[0]:
            grad_input, _, _ = BasicLinear._functional_forward(input=grad_output, weight=total_weight.t())
            grad_input = grad_input.reshape(*batch_shape, grad_input.shape[-1])
        else:
            grad_input = None

        del total_weight

        if ctx.needs_input_grad[1]:
            grad_weight_sync.wait()

        # one gradient per forward input: x, weight, bias, module, communicator, return_residual
        return grad_input, grad_weight, grad_bias, None, None, None
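
For context, a hypothetical call site for the autograd function above (x, weight_shard, module, and communicator stand in for the framework's own objects):

out = WPFusedDenseFunc.apply(x, weight_shard, None, module, communicator, False)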

CPU trace: (profiler screenshot omitted)

CUDA trace: (profiler screenshot omitted)

Could you please share some insights about how to enable FP8 GEMM kernels with this internal API? @timmoon10
Thanks in advance!
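
For reference, FP8 execution for TE modules is normally requested by wrapping the compute in te.fp8_autocast. Whether the experimental functional API picks up that context automatically, or instead needs FP8 enabled through its own arguments, is an assumption to verify against the _functional_forward signature in your TE version; a minimal sketch:

import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

fp8_recipe = DelayedScaling(fp8_format=Format.HYBRID, amax_history_len=16, amax_compute_algo="max")

# Assumption: the functional op honors the active fp8_autocast context; if it does not,
# FP8 must be enabled through the op's own arguments instead.
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    output, *_ = BasicLinear._functional_forward(input=x, weight=total_weight, bias=None)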


zigzagcai commented Mar 17, 2025

The basic idea of our ZeRO3 weight parallel implementation:
In WPFusedDenseFunc (https://github.com/InternLM/InternEvo/blob/feat/refactor-impl/internlm/model/model_ops/modules/linear.py#L171-L315), we all-gather the weights in the forward pass, then all-gather the weights and reduce-scatter the gradients in the backward pass. We then apply this customized autograd function in https://github.com/InternLM/InternEvo/blob/feat/refactor-impl/internlm/model/model_ops/modules/linear.py#L532-L678.

So I just wonder: how can we integrate TE FP8 with our customized ZeRO3 weight parallel implementation?
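
For illustration, the reduce-scatter half of that scheme can be sketched like this (the helper name is hypothetical; it assumes the gradient's element count is divisible by the group size):

import torch
import torch.distributed as dist

def reduce_scatter_weight_grad(full_grad, group=None):
    # Sum the full weight gradient across ranks and keep only this rank's shard.
    world_size = dist.get_world_size(group)
    shard = torch.empty(full_grad.numel() // world_size,
                        dtype=full_grad.dtype, device=full_grad.device)
    dist.reduce_scatter_tensor(shard, full_grad.contiguous().flatten(), group=group)
    return shard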
