Is your feature request related to a problem? Please describe.
Ulysses SP combined with ring attention (referred to here as hierarchical CP) gives good performance in SFT/RL training, but it currently does not support qkv_format 'thd' for packed sequences. Sequence packing is also an important way to improve throughput, so it would be valuable to use both together. A minimal configuration sketch follows the traceback and environment details below.
[rank0]: File "/opt/conda/lib/python3.10/site-packages/transformer_engine/pytorch/attention/dot_product_attention/backends.py", line 659, in forward
[rank0]: output = attn_forward_func_with_cp(
[rank0]: File "/opt/conda/lib/python3.10/site-packages/transformer_engine/pytorch/attention/dot_product_attention/context_parallel.py", line 3619, in attn_forward_func_with_cp
[rank0]: out = AttnFuncWithCPAndKVP2P.apply(*args)
[rank0]: File "/opt/conda/lib/python3.10/site-packages/torch/autograd/function.py", line 575, in apply
[rank0]: return super().apply(*args, **kwargs) # type: ignore[misc]
[rank0]: File "/opt/conda/lib/python3.10/site-packages/transformer_engine/pytorch/attention/dot_product_attention/context_parallel.py", line 469, in forward
[rank0]: qkv_format != "thd"
[rank0]: AssertionError: thd format is not supported with hierarchical CP implementation yet!
Platform: H800
PyTorch: 2.7
Megatron-LM: branch core_r0.13.0
Transformer Engine: 2.4.0
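
For reference, here is a rough sketch of how this combination is reached. This is not a verified repro: the argument names and shapes follow my reading of transformer_engine 2.4.0 and may be inexact, and the process-group handles are placeholders for whatever the distributed training setup provides.

```python
# Sketch only: assumes a distributed run where intra_node_group, inter_node_group,
# and cp_global_ranks already exist from the CP setup (placeholders here).
import torch
import transformer_engine.pytorch as te

attn = te.DotProductAttention(
    num_attention_heads=32,
    kv_channels=128,
    qkv_format="thd",  # packed sequences, no padding
)

# Hierarchical CP = Ulysses (all-to-all) inside a node + ring (p2p) across nodes.
# Passing a list of two groups with cp_comm_type="a2a+p2p" selects the
# hierarchical implementation, which is where the THD assertion fires.
attn.set_context_parallel_group(
    [intra_node_group, inter_node_group],  # placeholder process groups
    cp_global_ranks,                       # placeholder global rank list
    torch.cuda.Stream(),
    cp_comm_type="a2a+p2p",
)

# Two packed sequences of 6 and 10 tokens -> 16 total tokens in THD layout.
cu_seqlens = torch.tensor([0, 6, 16], dtype=torch.int32, device="cuda")
q = torch.randn(16, 32, 128, dtype=torch.bfloat16, device="cuda")
k, v = torch.randn_like(q), torch.randn_like(q)

out = attn(
    q, k, v,
    cu_seqlens_q=cu_seqlens, cu_seqlens_kv=cu_seqlens,
    max_seqlen_q=10, max_seqlen_kv=10,
)
# -> AssertionError: thd format is not supported with hierarchical CP implementation yet!
```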
Describe the solution you'd like
I don't have a concrete solution in mind yet. The assertion message suggests THD is already handled by the non-hierarchical CP path, so the request is to extend that support to the hierarchical (a2a+p2p) implementation.
Describe alternatives you've considered
Disabling sequence packing would probably avoid the error, but it affects loss convergence. A sketch of what the unpacked alternative looks like follows below.
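
For comparison, an illustrative sketch (my own example, not from the training code) of the same two sequences with packing disabled: everything is padded to max_seqlen in a dense layout, and the padded slots are wasted compute, which is the throughput cost of giving up packing.

```python
import torch

# Two sequences of lengths 6 and 10, padded to max_seqlen=10 instead of packed.
# 4 of the 20 token slots are padding, i.e. wasted compute and memory.
max_seqlen, batch, heads, dim = 10, 2, 32, 128
q_padded = torch.zeros(max_seqlen, batch, heads, dim, dtype=torch.bfloat16, device="cuda")

# Padding mask marking which positions are padding (True = masked out).
padding_mask = torch.tensor(
    [[False] * 6 + [True] * 4,   # sequence 0: 6 real tokens, 4 pad
     [False] * 10],              # sequence 1: 10 real tokens, no pad
    device="cuda",
)
```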
Additional context
Add any other context or screenshots about the feature request here.