Describe the bug
The attention gradient dQ calculation in TE is wrong.
When training GPT using Megatron with --transformer-impl transformer_engine on the same parameters and data, the single-card dQ and the tensor-parallel dQ do not align. The relative error of dQ is as large as 0.5 for a 24-layer GPT model and can reach 0.9 for a 128-layer GPT model.
When checking against the correct --transformer-impl local baseline implementation, both the TE single-card and the TE tensor-parallel dQ have a relative error of 1.1 compared with the local baseline's dQ. All runs use the same parameters and data.
The calling path for the Transformer Engine implementation is DotProductAttention.forward -> FusedAttnFunc.apply -> FusedAttnFunc.backward -> fused_attn_bwd.
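For concreteness, the relative errors above can be computed with a small helper along these lines (a minimal sketch; the file names are hypothetical placeholders for dQ tensors dumped from the three runs):

```python
import torch

def relative_error(a: torch.Tensor, b: torch.Tensor) -> float:
    # Relative error in the Frobenius norm: ||a - b|| / ||b||
    a, b = a.float(), b.float()
    return ((a - b).norm() / b.norm()).item()

# Hypothetical dQ dumps saved with torch.save() during the backward pass of the
# single-card TE run, the TP=2 TE run, and the --transformer-impl local run.
# (For the TP run, the per-rank shards are assumed to have been gathered before saving.)
dq_te_single = torch.load("dq_te_single.pt")
dq_te_tp2 = torch.load("dq_te_tp2.pt")
dq_local = torch.load("dq_local.pt")

print("TE single-card vs TE TP=2:", relative_error(dq_te_single, dq_te_tp2))
print("TE single-card vs local:  ", relative_error(dq_te_single, dq_local))
```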
To Reproduce
Run the following script for TE single-card training. The TP version just adds the torchrun-related distributed arguments and sets --tensor-model-parallel-size 2. The local baseline just sets --transformer-impl local.
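To extract dQ from each run for comparison, one option is to register a gradient hook on the query tensor before it enters the attention call. A minimal sketch of the mechanism (the placement inside Megatron's attention code is an assumption, and the dump path is hypothetical):

```python
import torch

captured = {}

def capture_dq(name: str):
    # Returns a hook that records the gradient arriving at the query tensor (dQ).
    def hook(grad: torch.Tensor):
        captured[name] = grad.detach().float().cpu()
        return grad
    return hook

# Toy stand-in for the model: in the real run, the hook would be registered on
# the query tensor just before the attention call (e.g., before
# DotProductAttention.forward), which is an assumed insertion point.
q = torch.randn(4, 8, 64, requires_grad=True)
q.register_hook(capture_dq("layer_0"))
loss = (q * 2.0).sum()  # placeholder for attention + the rest of the forward pass
loss.backward()

torch.save(captured["layer_0"], "dq_te_single.pt")  # hypothetical dump file
```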
Expected behavior
The attention gradient (dQ) in the TE implementation should match the single-card result when TP is enabled, and should also match the local implementation.
In bf16, the relative error of a correctly computed tensor should be on the order of 1e-2, but the current implementation produces results whose relative error exceeds 1 when compared with the correct results.
Stack trace/logs
If applicable, add the stack trace or logs from the time of the error.
Environment (please complete the following information):
11996c9f (pip install "transformer_engine[pytorch]")
Proposed fix
See the function call path above.
Additional context
Please fix this soon. This is a serious bug and has been present for a long time.