
thd format is not supported with hierarchical CP implementation yet #2208

@stormchasingg

Description

Is your feature request related to a problem? Please describe.
Ulysses SP + ring attention gives good performance in SFT/RL training; that combination is what is called hierarchical CP here. However, it currently doesn't support qkv_format 'thd' for sequence packing. Packing sequences is also a way to get good throughput. (A rough repro sketch follows the environment details below.)

[rank0]:   File "/opt/conda/lib/python3.10/site-packages/transformer_engine/pytorch/attention/dot_product_attention/backends.py", line 659, in forward
[rank0]:     output = attn_forward_func_with_cp(
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/transformer_engine/pytorch/attention/dot_product_attention/context_parallel.py", line 3619, in attn_forward_func_with_cp
[rank0]:     out = AttnFuncWithCPAndKVP2P.apply(*args)
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/torch/autograd/function.py", line 575, in apply
[rank0]:     return super().apply(*args, **kwargs)  # type: ignore[misc]
[rank0]:   File "/opt/conda/lib/python3.10/site-packages/transformer_engine/pytorch/attention/dot_product_attention/context_parallel.py", line 469, in forward
[rank0]:     qkv_format != "thd"
[rank0]: AssertionError: thd format is not supported with hierarchical CP implementation yet!

Platform: H800
PyTorch: 2.7
Megatron-LM branch: core_r0.13.0
Transformer Engine: 2.4.0
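
For reference, here is a rough repro sketch of the configuration that hits this assertion. It assumes that `cp_comm_type="a2a+p2p"` with a list of two process groups is how TE selects the hierarchical (Ulysses all-to-all + ring p2p) CP path; the group layout, shapes, and sequence lengths are illustrative only, not the exact Megatron-LM call path.

```python
# Hedged sketch: assumes 4 GPUs on one node, launched with
#   torchrun --nproc_per_node=4 repro_thd_hierarchical_cp.py
import torch
import torch.distributed as dist
import transformer_engine.pytorch as te

dist.init_process_group(backend="nccl")
rank, world = dist.get_rank(), dist.get_world_size()
torch.cuda.set_device(rank)

# Split the 4 CP ranks into a 2-way inner all-to-all (Ulysses) dimension and a
# 2-way outer ring (p2p) dimension. new_group must be created on every rank.
a2a_size = 2
a2a_group = p2p_group = None
for i in range(world // a2a_size):
    ranks = list(range(i * a2a_size, (i + 1) * a2a_size))
    g = dist.new_group(ranks)
    if rank in ranks:
        a2a_group = g
for i in range(a2a_size):
    ranks = list(range(i, world, a2a_size))
    g = dist.new_group(ranks)
    if rank in ranks:
        p2p_group = g

num_heads, head_dim = 16, 128
attn = te.DotProductAttention(
    num_attention_heads=num_heads,
    kv_channels=head_dim,
    qkv_format="thd",                # packed sequences
    attn_mask_type="padding_causal",
)
attn.set_context_parallel_group(
    [a2a_group, p2p_group],          # list of two groups -> hierarchical CP
    list(range(world)),
    torch.cuda.Stream(),
    cp_comm_type="a2a+p2p",
)

# Per-rank shard of a packed batch: three sequences, no padding tokens.
tokens = 1024
q = torch.randn(tokens, num_heads, head_dim, device="cuda", dtype=torch.bfloat16)
k, v = torch.randn_like(q), torch.randn_like(q)
cu_seqlens = torch.tensor([0, 256, 640, 1024], dtype=torch.int32, device="cuda")
max_seqlen = int((cu_seqlens[1:] - cu_seqlens[:-1]).max())

# Fails inside AttnFuncWithCPAndKVP2P.forward with:
#   AssertionError: thd format is not supported with hierarchical CP implementation yet!
out = attn(q, k, v, cu_seqlens_q=cu_seqlens, cu_seqlens_kv=cu_seqlens,
           max_seqlen_q=max_seqlen, max_seqlen_kv=max_seqlen)
```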
Describe the solution you'd like

I don't have a clear solution in mind for now.

Describe alternatives you've considered

Disabling sequence packing might avoid the error, but it would affect loss convergence.
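
To make the trade-off concrete, a small illustration of the difference between the packed 'thd' layout and a padded 'bshd' layout (shapes are illustrative only, not tied to the actual Megatron-LM dataloader):

```python
import torch

num_heads, head_dim = 16, 128
seq_lens = [5, 3, 8]

# Packed ("thd"): sequences concatenated along one token dimension,
# boundaries carried in cu_seqlens; no padding tokens are computed.
total_tokens = sum(seq_lens)                                   # 16
cu_seqlens = torch.tensor([0, 5, 8, 16], dtype=torch.int32)
q_thd = torch.randn(total_tokens, num_heads, head_dim)         # [t, h, d]

# Padded ("bshd"): every sequence padded to the longest in the batch,
# so 3 * 8 = 24 token slots are processed instead of 16.
q_bshd = torch.zeros(len(seq_lens), max(seq_lens), num_heads, head_dim)  # [b, s, h, d]
```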

