In MoE model architectures, especially when the model size is quite large, we found that throughput is limited by communication (all-gather / reduce-scatter / all-to-all). All-gather and reduce-scatter are mainly used in ZeRO-3 or FSDP, while all-to-all is mainly used in expert parallelism. The communication volume is quite large and ultimately becomes the bottleneck.
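For context, the expert-parallel dispatch we care about looks roughly like the sketch below (not TE or Megatron code; shapes and the process group are hypothetical). Today the payload is BF16, which is the traffic we would like to shrink with FP8:

```python
import torch
import torch.distributed as dist

def dispatch_tokens(tokens_for_each_rank: torch.Tensor, ep_group) -> torch.Tensor:
    # tokens_for_each_rank: [ep_world_size * tokens_per_rank, hidden], bf16.
    # The all-to-all moves (hidden * 2 bytes) per token in each direction;
    # sending FP8 payloads instead would roughly halve this traffic.
    out = torch.empty_like(tokens_for_each_rank)
    dist.all_to_all_single(out, tokens_for_each_rank, group=ep_group)
    return out
```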
We found that another FP8 library, torchao, has FP8 all-gather communication enabled, but I cannot find a similar FP8 communication API in TE.
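For reference, this is roughly the pattern torchao documents for FP8 all-gather with FSDP2 (a sketch only; the exact flag name and import paths are assumptions and may differ across torchao/PyTorch versions):

```python
import torch
from torchao.float8 import Float8LinearConfig, convert_to_float8_training
from torch.distributed.fsdp import fully_shard  # FSDP2; location varies by PyTorch version

model = build_model()  # hypothetical: your MoE model

# Swap nn.Linear -> Float8Linear. With this flag, the sharded weights are cast to FP8
# before the FSDP all-gather, so the collective itself moves FP8 bytes instead of BF16.
convert_to_float8_training(
    model,
    config=Float8LinearConfig(enable_fsdp_float8_all_gather=True),
)

fully_shard(model)
```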
So, does TransformerEngine support FP8 communication such as all-gather / reduce-scatter or all-to-all?