[moe training] Add TP support for routed experts #2473
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2473
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure, 1 Pending as of commit b9e58fe with merge base 01f7352.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
from torch.nn import functional as F

# this feature requires CUDA and SM89+
if not torch.cuda.is_available() or torch.cuda.get_device_capability() < (8, 9):
is the bf16 group gemm working on A100? If yes, I would vote for adding an emulation mode and running this test in emulation mode. We do this for float8 and MX training.
I will look into this and add emulation if the bf16 grouped gemm builds on A100.
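A minimal sketch of what such an emulation path could look like, assuming the grouped GEMM is replaced by a plain per-group bf16 matmul loop when emulation is enabled; the function name, argument layout, and offset convention here are illustrative assumptions, not the actual torchao API.

import torch

def grouped_mm_reference(A: torch.Tensor, B: torch.Tensor, offs: torch.Tensor) -> torch.Tensor:
    # Emulated grouped GEMM: loop over token groups and use plain matmul.
    # A: (total_tokens, K) activations, groups delimited by `offs`
    # B: (num_experts, K, N) per-expert weights
    # offs: (num_experts,) cumulative end offsets of each token group
    out = torch.empty(A.shape[0], B.shape[-1], dtype=A.dtype, device=A.device)
    start = 0
    for expert_idx in range(B.shape[0]):
        end = int(offs[expert_idx])
        out[start:end] = A[start:end] @ B[expert_idx]
        start = end
    return out

In an emulation mode along these lines, the SM89 guard above could be skipped and this reference path exercised instead, at the cost of not covering the fused kernel itself.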
from torchao.prototype.moe_training.tensor import ScaledGroupedMMTensor


def _validate_model_conversion(
nit: why do we need recursion to check this?
It's just a generic way of checking that all target FQNs were converted properly and that all non-target FQNs were correctly left unconverted. It can easily be applied when we extend tests to other MoE models beyond the torchtitan llama4 one I started with.
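A rough sketch of the kind of recursive check being described, assuming target FQNs are matched by substring and that converted parameters hold ScaledGroupedMMTensor data; the exact helper in the PR may differ.

import torch.nn as nn
from torchao.prototype.moe_training.tensor import ScaledGroupedMMTensor

def _validate_model_conversion(module: nn.Module, target_fqns: list[str], fqn: str = "") -> None:
    # Walk the module tree recursively so the check works for any MoE model layout.
    for child_name, child in module.named_children():
        child_fqn = f"{fqn}.{child_name}" if fqn else child_name
        _validate_model_conversion(child, target_fqns, child_fqn)

    is_target = any(t in fqn for t in target_fqns)
    for param_name, param in module.named_parameters(recurse=False):
        is_converted = isinstance(param.data, ScaledGroupedMMTensor)
        if is_target:
            assert is_converted, f"{fqn}.{param_name} should have been converted"
        else:
            assert not is_converted, f"{fqn}.{param_name} should not have been converted"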
Force-pushed from 105689f to bb9626e
Force-pushed from cb1eae9 to 7fed93e
Force-pushed from e92f92d to 04a3d2f
We can remove this assertion; TP support for float8 rowwise MoE training was added in this PR stack: pytorch/ao#2473
Stack

Summary
- offs is now optional, to handle the shared_expert case where num_experts=1 (no group offsets needed since there's only 1 token group); see the sketch after the test plan.

Test plan
./test/prototype/moe_training/test_tp.sh
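A hedged sketch of how an optional offs argument could be handled in a scaled grouped matmul wrapper; the function name, signature, and offset convention are illustrative assumptions rather than the exact torchao API. When offs is None (the shared_expert case with a single token group), all tokens form one group and a plain matmul path suffices.

from typing import Optional
import torch

def scaled_grouped_mm_sketch(
    A: torch.Tensor,                        # (total_tokens, K) activations
    B: torch.Tensor,                        # (num_experts, K, N) expert weights
    offs: Optional[torch.Tensor] = None,    # cumulative group end offsets, or None
) -> torch.Tensor:
    if offs is None:
        # shared_expert case: num_experts == 1, all tokens form a single group,
        # so no group offsets are needed and a plain matmul suffices.
        assert B.shape[0] == 1, "offs may only be omitted when there is a single expert"
        return A @ B[0]
    # routed experts case: dispatch each token group to its expert's weights.
    out = torch.empty(A.shape[0], B.shape[-1], dtype=A.dtype, device=A.device)
    start = 0
    for expert_idx in range(B.shape[0]):
        end = int(offs[expert_idx])
        out[start:end] = A[start:end] @ B[expert_idx]
        start = end
    return out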