[WIP] [moe training] Add tests for 3D parallel (FSDP + TP + EP) for routed experts #2481

danielvegamyhre · 2025-07-02T21:51:54Z

WIP because it requires pytorch/pytorch#157216 to land first

Stack

Summary

The test applies 3D parallelism to routed experts, with FSDP=2, TP=2, and EP=2.
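For readers unfamiliar with how these degrees map onto ranks, here is a minimal pure-Python sketch (not the test code itself). It assumes a world size of 4 laid out as a row-major 2D mesh with dims (dp_shard=2, tp=2), and that, as the dp2ep name suggests, the EP group reuses ranks from the data-parallel dimension. The PR does not state the world size or mesh layout, and the helper names `mesh_coords` and `groups_along` are hypothetical.

```python
from itertools import product


def mesh_coords(world_size, dims):
    """Map each flat rank to its coordinate on a row-major device mesh.

    `dims` is an ordered mapping of dimension name -> size (hypothetical names).
    """
    names = list(dims)
    sizes = [dims[name] for name in names]
    total = 1
    for size in sizes:
        total *= size
    if total != world_size:
        raise ValueError(f"mesh of size {total} does not match world size {world_size}")
    # itertools.product varies the last dimension fastest, i.e. row-major order.
    return {
        rank: dict(zip(names, idx))
        for rank, idx in enumerate(product(*(range(size) for size in sizes)))
    }


def groups_along(coords, dim):
    """Return the groups of ranks that differ only in their `dim` coordinate."""
    groups = {}
    for rank, coord in coords.items():
        key = tuple(value for name, value in coord.items() if name != dim)
        groups.setdefault(key, []).append(rank)
    return sorted(groups.values())


# ASSUMPTION: world size 4 with (dp_shard, tp) = (2, 2); not stated in the PR.
coords = mesh_coords(4, {"dp_shard": 2, "tp": 2})
tp_groups = groups_along(coords, "tp")        # [[0, 1], [2, 3]]
dp_groups = groups_along(coords, "dp_shard")  # [[0, 2], [1, 3]]

# ASSUMPTION: under a dp2ep-style setup with EP=2, the EP groups coincide
# with the dp_shard groups above, since EP borrows the data-parallel ranks.
ep_groups = dp_groups
```

Under these assumptions each rank sits in one TP group and one data-parallel/EP group, which is why three parallelism degrees of 2 do not necessarily require 8 devices.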

Test plan

./test/prototype/moe_training/test_fsdp_tp_ep.sh

pytorch-bot · 2025-07-02T21:51:58Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2481

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit cfef1fb with merge base 2defe30:

BROKEN TRUNK - The following job failed but was also present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

Run TorchAO Experimental Tests / test-mps-ops (macos-m1-stable) (gh) (trunk failure)
Process completed with exit code 127.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

danielvegamyhre · 2025-07-02T23:59:14Z

test/prototype/moe_training/test_fsdp_tp_ep.py

+    )
+
+    # apply TP and EP
+    apply_moe_ep_tp(


cc @tianyu-l I would appreciate your feedback on the parallelism configuration for this test.

Note:

This all uses your PR for local testing: dp2ep Expert Parallel (torchtitan#1324).

The target model just imports the Llama4 MoE layer directly from torchtitan, not the full transformer.

apply_moe_ep_tp is a modified version of the torchtitan function: it removes the for-loop over transformer blocks (since there is no transformer) but is otherwise identical.

I'm attempting to test something like a single-host dp2ep configuration.

Is this correct/useful or do you recommend any adjustments?

facebook-github-bot added the CLA Signed label Jul 2, 2025

danielvegamyhre added the topic: not user facing label Jul 2, 2025

danielvegamyhre mentioned this pull request Jul 2, 2025

[moe training] Add 2D parallel (FSDP2 + TP) tests for routed experts #2475

Merged

danielvegamyhre marked this pull request as draft July 2, 2025 21:59

danielvegamyhre changed the title ~~[WIP] [moe training] Add tests for 3D parallel (FSDP + TP + EP)~~ [WIP] [moe training] Add tests for 3D parallel (FSDP + TP + EP) for routed experts Jul 2, 2025

danielvegamyhre changed the base branch from main to fsdp-tp July 2, 2025 22:06

danielvegamyhre force-pushed the ep branch 2 times, most recently from 15ce7b6 to f87bf68 Compare July 2, 2025 22:54

danielvegamyhre changed the base branch from fsdp-tp to main July 2, 2025 22:54

danielvegamyhre force-pushed the ep branch from f87bf68 to 9ab4b45 Compare July 2, 2025 23:52

danielvegamyhre force-pushed the ep branch from 9ab4b45 to 3d26da2 Compare July 3, 2025 02:47

add tests for ep

cfef1fb

danielvegamyhre force-pushed the ep branch from 3d26da2 to cfef1fb Compare July 3, 2025 02:48
