[WIP] [moe training] Add tests for 3D parallel (FSDP + TP + EP) for routed experts #2481
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2481
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (1 Unrelated Failure)
As of commit cfef1fb with merge base 2defe30:
BROKEN TRUNK - The following job failed but was present on the merge base:
👉 Rebase onto the `viable/strict` branch to avoid these failures
This comment was automatically generated by Dr. CI and updates every 15 minutes.
)

# apply TP and EP
apply_moe_ep_tp(
cc @tianyu-l I would appreciate your feedback on the parallelism configuration for this test.
Note:
- This is all using your PR for local testing: dp2ep Expert Parallel torchtitan#1324
- The target model is just importing the Llama4 MoE layer from torchtitan directly, not the full transformer.
- `apply_moe_ep_tp` is a modified version of the torchtitan function, which just removes the for-loop over transformer blocks (since there is no transformer), but is otherwise identical.
- I'm attempting to test something like a single-host dp2ep configuration.
Is this correct/useful or do you recommend any adjustments?
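To make the question concrete, here is a rough sketch of the rank layout I have in mind. It is plain Python and does not call torch or torchtitan. The degrees (8 ranks, `dp_shard=4`, `tp=2`), the helper name `mesh_groups`, and the choice that EP reuses the whole `dp_shard` dimension are all assumptions for illustration, not what the test or torchtitan#1324 necessarily does.

```python
from itertools import product


def mesh_groups(shape, names):
    """Hypothetical helper: list the rank groups along each dimension
    of a row-major mesh (last dimension varies fastest)."""
    sizes = list(shape)
    # Row-major strides: the innermost dimension has stride 1.
    strides = []
    acc = 1
    for s in reversed(sizes):
        strides.append(acc)
        acc *= s
    strides.reverse()

    groups = {}
    for d, name in enumerate(names):
        # Fix every other coordinate, then walk along dimension d.
        other = [range(s) for i, s in enumerate(sizes) if i != d]
        dim_groups = []
        for coords in product(*other):
            base = list(coords)
            base.insert(d, 0)
            start = sum(c * st for c, st in zip(base, strides))
            dim_groups.append([start + k * strides[d] for k in range(sizes[d])])
        groups[name] = dim_groups
    return groups


# Assumed single-host layout: 8 ranks, dp_shard outermost, tp innermost.
groups = mesh_groups((4, 2), ("dp_shard", "tp"))

# Assumption: with dp2ep and ep == dp_shard, EP groups coincide with
# the dp_shard groups, so experts are split across those ranks.
ep_groups = groups["dp_shard"]

for name, dim_groups in groups.items():
    print(name, dim_groups)
print("ep (assumed)", ep_groups)
```

Under these assumptions, TP groups are adjacent rank pairs and each EP group strides across the TP pairs. If the intended configuration uses an EP degree smaller than `dp_shard`, the grouping would differ from this sketch.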
WIP because it requires pytorch/pytorch#157216 to land first
Stack
Summary
Test plan
./test/prototype/moe_training/test_fsdp_tp_ep.sh