Fix PP clip_grad_norm #649
Conversation
It may not be easy / feasible, but can we find a way to
- first get the global total_norm
- then make a function call to (some variant of) torch.nn.utils.clip_grad_norm_ with the global total_norm

The idea is to avoid largely duplicating the code that already lives in pytorch.
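A minimal sketch of that flow, assuming a finite p-norm. The names `pp_clip_grad_norm` and `clip_with_precomputed_norm` are hypothetical, and the scaling step is written out by hand rather than relying on a specific torch.nn.utils variant that accepts a precomputed norm:

```python
import torch
import torch.distributed as dist

def clip_with_precomputed_norm(parameters, max_norm, total_norm):
    # Same scaling rule clip_grad_norm_ applies once the norm is known.
    clip_coef = max_norm / (total_norm + 1e-6)
    clip_coef_clamped = torch.clamp(clip_coef, max=1.0)
    for p in parameters:
        p.grad.detach().mul_(clip_coef_clamped.to(p.grad.device))
    return total_norm

def pp_clip_grad_norm(parameters, max_norm, pp_group, norm_type=2.0):
    parameters = [p for p in parameters if p.grad is not None]
    # Norm of the gradients owned by this pipeline stage.
    local_norm = torch.linalg.vector_norm(
        torch.stack([torch.linalg.vector_norm(p.grad, norm_type) for p in parameters]),
        norm_type,
    )
    # Combine per-stage norms into the global norm across the PP group.
    total_norm = local_norm**norm_type
    dist.all_reduce(total_norm, op=dist.ReduceOp.SUM, group=pp_group)
    total_norm = total_norm ** (1.0 / norm_type)
    return clip_with_precomputed_norm(parameters, max_norm, total_norm)
```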
Sorry, requesting changes since total_norm is computed w.r.t. parameters instead of gradients.
torchtitan/clip_grad_nrom.py
Outdated
total_norm **= norm_type
dist.all_reduce(total_norm, op=dist.ReduceOp.SUM, group=pp_mesh.get_group())
total_norm **= 1.0 / norm_type
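As a quick single-process sanity check of the identity this snippet relies on (the p-th power of the global norm is the sum of the per-stage p-th powers), with made-up tensors standing in for per-stage gradients:

```python
import torch

# Pretend each entry of `stage_grads` is the flattened gradient owned by one PP stage.
stage_grads = [torch.randn(4), torch.randn(3)]
p = 2.0

per_stage_powers = torch.stack([torch.linalg.vector_norm(g, p) ** p for g in stage_grads])
global_norm = torch.linalg.vector_norm(torch.cat(stage_grads), p)

assert torch.allclose(per_stage_powers.sum() ** (1.0 / p), global_norm)
```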
This is only correct for a p-norm where p is an int/float (e.g. this does not work for the inf-norm).
Intuitively, we should somehow allow total_norm to have a _NormPartial(norm_type) placement over the PP mesh, and then redistribute it to Replicate() to reuse DTensor's own logic. cc: @tianyu-l @wz337 @XilunWu is there any way to do this?
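A minimal sketch of the distinction, using an explicit collective rather than a `_NormPartial` placement (the helper name is hypothetical): finite p-norms SUM-reduce the per-stage p-th powers, while the inf-norm needs a MAX reduction.

```python
import torch
import torch.distributed as dist

def reduce_norm_over_pp(local_norm: torch.Tensor, norm_type: float, pp_group) -> torch.Tensor:
    if norm_type == float("inf"):
        # inf-norm: the global max is the max of per-stage maxes.
        total_norm = local_norm.clone()
        dist.all_reduce(total_norm, op=dist.ReduceOp.MAX, group=pp_group)
    else:
        # finite p-norm: sum the p-th powers across stages, then take the p-th root.
        total_norm = local_norm**norm_type
        dist.all_reduce(total_norm, op=dist.ReduceOp.SUM, group=pp_group)
        total_norm = total_norm ** (1.0 / norm_type)
    return total_norm
```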
@awgu in torchtitan, PP is done by first manually splitting the Llama model and then wrapping it in the PP wrapper. If my understanding is correct, none of these steps converts model weights into DTensors with a PP mesh placement.
For now, I will just add an if-statement to check for the inf-norm.
PP is done by first manually splitting the Llama model and then wrapping it in the PP wrapper. If my understanding is correct, none of these steps converts model weights into DTensors with a PP mesh placement.

This is true; however, after splitting the Llama model we also apply the SPMD parallelisms (TP and DP), which use DTensor for parameters. So by the time we are clipping norms, the parameters would be DTensors. (If we use PP only with no TP/DP, then there would be no DTensor parameters.)
@wconstab You are right about SPMD (TP and/or DP), but the DTensor in that case has its placement on the TP/DP mesh, not on the PP mesh. We still need to do some form of reduce, either explicitly (as in my code) or by wrapping total_norm in a DTensor on the PP mesh (maybe via DTensor.redistribute?). A sketch of both options is below.
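Not the PR's actual code, just a rough illustration of the two options under discussion. The helper name is made up, DTensor placement import paths differ between PyTorch versions, and `Partial()` is assumed to default to a sum reduction; only the finite p-norm case is handled.

```python
import torch.distributed as dist
from torch.distributed._tensor import DTensor, Partial, Replicate

def reduce_total_norm_over_pp(total_norm, norm_type, pp_mesh, use_dtensor=False):
    powered = total_norm**norm_type  # assumes a finite p-norm
    if use_dtensor:
        # Option (b): mark the powered norm as pending-sum on the PP mesh and let
        # DTensor's redistribute perform the reduction.
        dt = DTensor.from_local(powered, pp_mesh, [Partial()], run_check=False)
        powered = dt.redistribute(pp_mesh, [Replicate()]).to_local()
    else:
        # Option (a): explicit all-reduce over the PP process group.
        dist.all_reduce(powered, op=dist.ReduceOp.SUM, group=pp_mesh.get_group())
    return powered ** (1.0 / norm_type)
```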
if pp_mesh is not None:
    if isinstance(total_norm, DTensor):
        # will reach here if PP + other parallelism is used.
        # If only using PP, total_norm will be a local tensor
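A hedged sketch of how the separated checks could compose (names and structure are illustrative, not the exact torchtitan implementation): materialize a DTensor norm produced by the TP/DP meshes, then fold in the other PP stages, with the inf-norm handled by a MAX reduction.

```python
import torch.distributed as dist
from torch.distributed._tensor import DTensor

def _reduce_total_norm(total_norm, norm_type, pp_mesh):
    if pp_mesh is not None:
        if isinstance(total_norm, DTensor):
            # Reached when PP is combined with TP/DP: the SPMD meshes already
            # produced a DTensor norm, so materialize it on this PP stage first.
            total_norm = total_norm.full_tensor()
        if norm_type == float("inf"):
            dist.all_reduce(total_norm, op=dist.ReduceOp.MAX, group=pp_mesh.get_group())
        else:
            total_norm = total_norm**norm_type
            dist.all_reduce(total_norm, op=dist.ReduceOp.SUM, group=pp_mesh.get_group())
            total_norm = total_norm ** (1.0 / norm_type)
    return total_norm
```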
@wconstab the if-statements are now separated. Please let me know if it makes sense to you.
It looks good, thanks!
Looks good to me!
torchtitan/utils.py
Outdated
# if total_norm is a DTensor, the placements must be `torch.distributed._tensor.ops.math_ops._NormPartial`
# we can simply reduce the DTensor to get the total norm in this tensor's process group
# and then convert it to a local tensor
total_norm = total_norm.redistribute(
nit: use total_norm.full_tensor() instead?
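For reference, a minimal form of that suggestion (assuming `total_norm` may be a DTensor produced by the TP/DP meshes): `full_tensor()` redistributes every mesh dimension to `Replicate()` and returns the local tensor in one call.

```python
from torch.distributed._tensor import DTensor

if isinstance(total_norm, DTensor):
    # One call replaces redistribute(placements=[Replicate(), ...]) + to_local().
    total_norm = total_norm.full_tensor()
```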
@H-Huang Good suggestion. Updated!
@awgu please let me know if the update makes sense, and feel free to resolve the change request if you believe it is good enough.
sounds good!
This PR fixes the pipeline parallel incomplete gradient norm issue. Refer to #596 for more details.