
[WIP] [moe training] Add tests for 3D parallel (FSDP + TP + EP) for routed experts #2481


Draft: wants to merge 1 commit into base: main

Conversation

@danielvegamyhre (Contributor) commented Jul 2, 2025


pytorch-bot bot commented Jul 2, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2481

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit cfef1fb with merge base 2defe30:

BROKEN TRUNK - The following job failed but was also present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the `CLA Signed` label Jul 2, 2025
@danielvegamyhre added the `topic: not user facing` label Jul 2, 2025
@danielvegamyhre marked this pull request as draft July 2, 2025 21:59
@danielvegamyhre changed the title from [WIP] [moe training] Add tests for 3D parallel (FSDP + TP + EP) to [WIP] [moe training] Add tests for 3D parallel (FSDP + TP + EP) for routed experts Jul 2, 2025
@danielvegamyhre changed the base branch from main to fsdp-tp July 2, 2025 22:06
@danielvegamyhre force-pushed the ep branch 2 times, most recently from 15ce7b6 to f87bf68 July 2, 2025 22:54
@danielvegamyhre changed the base branch from fsdp-tp to main July 2, 2025 22:54
)

# apply TP and EP
apply_moe_ep_tp(
@danielvegamyhre (Contributor, Author) commented Jul 2, 2025
cc @tianyu-l I would appreciate your feedback on the parallelism configuration for this test.

Note:

  1. This is all built on your PR, dp2ep Expert Parallel (torchtitan#1324), which I'm using for local testing.
  2. The target model is just importing the Llama4 MoE layer from torchtitan directly, not the full transformer.
  3. apply_moe_ep_tp is a modified version of the torchtitan function, which just removes the for-loop over transformer blocks (since there is no transformer), but is otherwise identical.
  4. I'm attempting to test something like a single-host dp2ep configuration.

Is this correct/useful, or do you recommend any adjustments? (An illustrative sketch of such a mesh layout is included below.)
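To make the intended layout concrete, here is a minimal sketch (not the code from this PR) of how a single-host FSDP + TP + EP test might construct its device mesh and wire up the MoE layer. The mesh sizes, dim names, the `torchrun` invocation and `test_moe_3d.py` filename, and the commented-out `apply_moe_ep_tp` keyword arguments are illustrative assumptions; the real helper's signature follows the torchtitan function this PR adapts, and the MoE-layer import is left as a placeholder.

```python
# Illustrative sketch only; assumes launch via `torchrun --nproc_per_node=8 test_moe_3d.py`
# on a single 8-GPU host. Mesh sizes and dim names are example values, not the PR's config.
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import fully_shard  # FSDP2 per-parameter sharding

# 8 GPUs split as dp_shard=2 x ep=2 x tp=2 (example split only).
world_mesh = init_device_mesh(
    "cuda",
    (2, 2, 2),
    mesh_dim_names=("dp_shard", "ep", "tp"),
)

# model = <Llama4 MoE layer imported from torchtitan>  # placeholder, import path omitted

# 1) Apply TP + EP to the routed experts via the (modified) torchtitan helper.
#    Hypothetical call; the exact signature should match the function this PR adapts.
# apply_moe_ep_tp(model, tp_mesh=world_mesh["tp"], ep_mesh=world_mesh["ep"])

# 2) Shard the remaining data-parallel dimension with FSDP2.
# fully_shard(model, mesh=world_mesh["dp_shard"])
```

Note that in the torchtitan dp2ep scheme the expert-parallel ranks are borrowed from the data-parallel dimension, so the real mesh construction in the test is more involved than the three independent dims shown here; the sketch only illustrates the general FSDP + TP + EP structure.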
