[moe training] Cast to mixed precision policy param dtype in fsdp_pre_all_gather hook #2455
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2455
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure, 5 Pending as of commit bb9626e with merge base ac14d92. NEW FAILURE: the following job has failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Force-pushed from 8df3fbb to fd933ea
```python
out._data.copy_(data)
return

# For training step 0, out=None, so we need to return a new ScaledGroupedMMTensor.
```
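For context, here is a minimal sketch of how the surrounding `fsdp_post_all_gather` hook might look as a whole, assuming the FSDP2 tensor-subclass extension signature; everything beyond the quoted lines above is an approximation, not the exact implementation:

```python
import torch

class ScaledGroupedMMTensor(torch.Tensor):
    ...  # subclass machinery elided

    def fsdp_post_all_gather(
        self,
        all_gather_outputs: tuple[torch.Tensor, ...],
        metadata,
        param_dtype: torch.dtype,
        *,
        out=None,
    ):
        (data,) = all_gather_outputs
        if out is not None:
            # Steps >= 1: FSDP passes the output tensor allocated on a prior
            # step, so copy the all-gathered data into it in place.
            out._data.copy_(data)
            return
        # For training step 0, out=None, so we need to return a new
        # ScaledGroupedMMTensor, plus the inner tensors for FSDP to manage.
        return ScaledGroupedMMTensor(data), (data,)
```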
do we have a test for this?
We have a test for float8 MoE + FSDP training. We don't have a test verifying which code branch is followed in this fsdp_post_all_gather hook at training step 0 vs 1, but I think the FSDP test alone is sufficient. Let me know if you have other thoughts.
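If branch-level coverage ever becomes desirable, one option is a small spy wrapper around the hook. This is a hypothetical sketch; the branch labels and the surrounding training loop are made up for illustration:

```python
# Hypothetical sketch: record which fsdp_post_all_gather branch runs per step.
branches = []
orig_hook = ScaledGroupedMMTensor.fsdp_post_all_gather

def spy(self, all_gather_outputs, metadata, param_dtype, *, out=None):
    branches.append("copy_into_out" if out is not None else "new_tensor")
    return orig_hook(self, all_gather_outputs, metadata, param_dtype, out=out)

ScaledGroupedMMTensor.fsdp_post_all_gather = spy
# ...run two FSDP training steps...
# Expected: branches starts with "new_tensor" (step 0), then "copy_into_out".
```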
MPS test failures are unrelated to this change.
Stack
Summary
- Cast to the mixed precision policy param dtype in the fsdp_pre_all_gather hook (sketched below).
- Handle the case in fsdp_post_all_gather where out != None (see code comments for details).
- Remove the dtype param from the ScaledGroupedMM tensor, as it is no longer needed when doing the casting in pre all gather rather than post all gather.
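As a rough illustration of the first bullet, the pre-all-gather cast might look like the following. This is a minimal sketch assuming the FSDP2 extension point's (mesh, outer_size, outer_stride, module, mp_policy) signature, not the verbatim implementation:

```python
import torch

class ScaledGroupedMMTensor(torch.Tensor):
    ...  # subclass machinery elided

    def fsdp_pre_all_gather(self, mesh, outer_size, outer_stride, module, mp_policy):
        # Cast the inner data to the mixed precision policy's param_dtype
        # before the all-gather, so the collective runs in the low-precision
        # dtype and no separate dtype attribute is needed afterwards.
        all_gather_inputs = (self._data.to(mp_policy.param_dtype),)
        all_gather_metadata = ()
        return all_gather_inputs, all_gather_metadata
```

Casting before the all-gather means the communication itself happens in the lower-precision dtype, which also removes the need to carry a dtype param on the tensor subclass for a post-all-gather cast.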
Test plan

```bash
./test/prototype/moe_training/test_fsdp.sh
NGPU=2 CONFIG_FILE="./torchtitan/experiments/llama4/train_configs/debug_model.toml" ./run_train.sh --training.steps=10 --model.converters="float8" --float8.recipe_name="rowwise" --float8.moe_fqns_prototype="experts"
```