Describe the bug
The attention gradient dQ calculation in TE is wrong.
When training GPT using Megatron with --transformer-impl transformer_engine on the same parameters and data, the single-card dQ and the tensor-parallel dQ do not align. The relative error of dQ is as large as 0.5 for a 24-layer GPT model and can reach 0.9 for a 128-layer GPT model.
When checking against the correct --transformer-impl local baseline implementation, both the TE single-card and the TE tensor-parallel dQ have a relative error of 1.1 compared with the local baseline's dQ. All runs use the same parameters and data.
The calling path for the Transformer Engine implementation is DotProductAttention.forward -> FusedAttnFunc.apply -> FusedAttnFunc.backward -> fused_attn_bwd.
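For concreteness, the relative errors above can be computed with a small helper along these lines (a minimal sketch; the file names are hypothetical placeholders for dQ tensors dumped from the three runs):

```python
import torch

def relative_error(a: torch.Tensor, b: torch.Tensor) -> float:
    # Relative error in the Frobenius norm: ||a - b|| / ||b||
    a, b = a.float(), b.float()
    return ((a - b).norm() / b.norm()).item()

# Hypothetical dQ dumps saved with torch.save() during the backward pass of the
# single-card TE run, the TP=2 TE run, and the --transformer-impl local run.
# (For the TP run, the per-rank shards are assumed to have been gathered before saving.)
dq_te_single = torch.load("dq_te_single.pt")
dq_te_tp2 = torch.load("dq_te_tp2.pt")
dq_local = torch.load("dq_local.pt")

print("TE single-card vs TE TP=2:", relative_error(dq_te_single, dq_te_tp2))
print("TE single-card vs local:  ", relative_error(dq_te_single, dq_local))
```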
To Reproduce
Run the following script for TE single-card training. The TP version just adds the torchrun-related distributed arguments and sets --tensor-model-parallel-size 2. The local baseline just sets --transformer-impl local.
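To extract dQ from each run for comparison, one option is to register a gradient hook on the query tensor before it enters the attention call. A minimal sketch of the mechanism (the placement inside Megatron's attention code is an assumption, and the dump path is hypothetical):

```python
import torch

captured = {}

def capture_dq(name: str):
    # Returns a hook that records the gradient arriving at the query tensor (dQ).
    def hook(grad: torch.Tensor):
        captured[name] = grad.detach().float().cpu()
        return grad
    return hook

# Toy stand-in for the model: in the real run, the hook would be registered on
# the query tensor just before the attention call (e.g., before
# DotProductAttention.forward), which is an assumed insertion point.
q = torch.randn(4, 8, 64, requires_grad=True)
q.register_hook(capture_dq("layer_0"))
loss = (q * 2.0).sum()  # placeholder for attention + the rest of the forward pass
loss.backward()

torch.save(captured["layer_0"], "dq_te_single.pt")  # hypothetical dump file
```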
Expected behavior
The attention gradient (dQ) in the TE implementation should match the single-card result when TP is enabled, and should also match the local implementation.
In bf16, the relative error of a correctly computed tensor should be on the order of 1e-2, but the current implementation produces results whose relative error exceeds 1 when compared with the correct results.
Stack trace/logs
If applicable, add the stack trace or logs from the time of the error.
Environment (please complete the following information):
11996c9f (pip install "transformer_engine[pytorch]")
Proposed fix
See the function call path above.
Additional context
Please fix this soon. This is a serious bug and has been present for a long time.