
[BUG] Wrong attention gradient in Transformer Engine #1615

Open
i-love-megatron opened this issue Mar 26, 2025 · 4 comments
Labels: bug Something isn't working

@i-love-megatron

Describe the bug

The attention gradient dQ calculation in TE is wrong.

When training GPT with Megatron using --transformer-impl transformer_engine, with the same parameters and data, the single-card dQ and the tensor-parallel dQ do not match. The relative error of dQ is as large as 0.5 for a 24-layer GPT model and can reach 0.9 for a 128-layer GPT model.

When checking against the correct --transformer-impl local baseline implementation, the dQ from both the TE single-card run and the TE tensor-parallel run has a relative error of about 1.1 compared with the baseline local dQ. All runs use the same parameters and data.
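
For reference, the relative error reported above can be computed along these lines. This is a minimal sketch assuming dQ was dumped to a file from each run; the file names, the dump mechanism, and the ||a - b|| / ||b|| metric are illustrative assumptions, not part of the original report.

import torch

# Hypothetical dumps of dQ from two runs (see the hook sketch below for one way
# to capture them); the file names are placeholders.
dq_ref = torch.load("dq_local.pt").float()  # --transformer-impl local baseline
dq_te = torch.load("dq_te.pt").float()      # --transformer-impl transformer_engine

# One common definition of relative error: ||a - b|| / ||b||
rel_err = (dq_te - dq_ref).norm() / dq_ref.norm()
print(f"relative error of dQ: {rel_err.item():.3f}")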

The call path for the Transformer Engine implementation is DotProductAttention.forward -> FusedAttnFunc.apply -> FusedAttnFunc.backward -> fused_attn_bwd.
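
One way to capture dQ for such a comparison is a tensor hook on the query. The following is a minimal, self-contained sketch that uses plain PyTorch scaled_dot_product_attention as a stand-in for the fused TE kernel; the toy shapes and output file name are illustrative assumptions.

import torch
import torch.nn.functional as F

captured = {}
# Toy shapes [batch, heads, seq, head_dim]; not the repro configuration.
q = torch.randn(2, 8, 128, 64, requires_grad=True)
k = torch.randn(2, 8, 128, 64)
v = torch.randn(2, 8, 128, 64)

# The hook fires during backward and receives dQ, the gradient w.r.t. the query.
def save_dq(grad):
    captured["dQ"] = grad.detach().clone()

q.register_hook(save_dq)

F.scaled_dot_product_attention(q, k, v).sum().backward()
torch.save(captured["dQ"], "dq_dump.pt")  # compare this dump across runs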

To Reproduce

Run the following script for TE single-card training. The TP version adds the torchrun distributed arguments and sets --tensor-model-parallel-size 2; the local baseline instead sets --transformer-impl local.

ARGS="
    --tensor-model-parallel-size 1 \
    --pipeline-model-parallel-size 1 \
    --context-parallel-size 1 \
    --num-layers 24 \
    --hidden-size 1024 \
    --num-attention-heads 16 \
    --seq-length 1024 \
    --max-position-embeddings 1024 \
    --attention-dropout 0.0 \
    --hidden-dropout 0.0 \
    --attention-softmax-in-fp32 \
    --bf16 \
    --clip-grad 1.0 \
    --micro-batch-size 4 \
    --global-batch-size 16 \
    --lr 0.00015 \
    --min-lr 1.0e-5 \
    --train-iters 1 \
    --lr-warmup-fraction 0.01 \
    --lr-decay-iters 320000 \
    --lr-decay-style cosine \
    --weight-decay 1e-2 \
    --use-mcore-models \
    --no-gradient-accumulation-fusion \
    --transformer-impl transformer_engine \
    --data-path /workspace/dataset/wikitext_text_document \
    --vocab-file /workspace/dataset/gpt2-vocab.json \
    --merge-file /workspace/dataset/gpt2-merges.txt \
    --split 949,50,1 \
    --log-interval 100 \
    --save-interval 10000 \
    --eval-interval 1000 \
    --eval-iters 0 \
    --save /workspace/checkpoints \
    --load /workspace/checkpoints
"

torchrun \
    /home/ubuntu/repos/Megatron-LM/pretrain_gpt.py \
    $ARGS

Expected behavior

The attention gradient (dQ) in the TE implementation should match the single-card result when TP is enabled, and both should match the local implementation.

In bf16, the relative error of a correctly computed tensor should be on the order of 1e-2, but the current implementation produces results whose relative error exceeds 1 when compared with the correct results.
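
As a rough, standalone illustration of that expected bf16 error scale, one can compare a bf16 attention backward against an fp32 reference using plain PyTorch scaled_dot_product_attention; the toy shapes are assumptions and this is not the Megatron model.

import torch
import torch.nn.functional as F

torch.manual_seed(0)
# Toy shapes [batch, heads, seq, head_dim]
q32 = torch.randn(4, 16, 128, 64, requires_grad=True)
k32 = torch.randn(4, 16, 128, 64)
v32 = torch.randn(4, 16, 128, 64)

# fp32 reference gradient of the query
F.scaled_dot_product_attention(q32, k32, v32).sum().backward()
dq_ref = q32.grad.clone()

# The same computation in bf16
q16 = q32.detach().bfloat16().requires_grad_()
k16, v16 = k32.bfloat16(), v32.bfloat16()
F.scaled_dot_product_attention(q16, k16, v16).sum().backward()

# Typically on the order of 1e-2 or smaller for bf16; a value above 1 points to a
# real bug rather than precision loss.
rel_err = (q16.grad.float() - dq_ref).norm() / dq_ref.norm()
print(f"bf16 vs fp32 dQ relative error: {rel_err.item():.4f}")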

Stack trace/logs

If applicable, add the stack trace or logs from the time of the error.

Environment (please complete the following information):

  • Megatron-LM commit ID: 11996c9f
  • PyTorch version: 2.5.1+cu124
  • CUDA version: 12.4
  • NCCL version: 2.21.5
  • TransformerEngine: both 1.13.0 and 2.1.0, installed via pip install "transformer_engine[pytorch]", have this problem
  • cuDNN: 9.8.0.87-1

Proposed fix
See the function call path above.

Additional context
Please fix this soon. This is a serious bug and has been present for a long time.

@i-love-megatron i-love-megatron added the bug Something isn't working label Mar 26, 2025
@i-love-megatron
Author

@ksivaman

@ptrendx
Member

ptrendx commented Mar 28, 2025

@cyanguwa Could you take a look?

@cyanguwa
Collaborator

Which GPU architecture is this?

@i-love-megatron
Author

I was using A6000 GPUs.
