Conversation

@nipung90 (Contributor) commented Nov 5, 2025

Summary:
This diff enables the static logging functionality to collect data for:

  1. plan() - Logs the inputs and outputs of the planner to help with user-issue debugging.
  2. ShardEstimators - Logs the inputs and outputs of the ShardEstimators, including the bandwidth inputs, so we can verify that the planner is generating expected values and help debug OOMs.
  3. TrainingPipeline - Logs the class type, which indicates which pipeline the training job used. The training pipeline has implications for memory usage, making it an important data point when investigating OOMs.
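The kind of input/output instrumentation described above can be illustrated with a minimal sketch. Note this is a hypothetical decorator, not torchrec's actual logging code; the `plan` function below is a stand-in for the real planner entry point.

```python
import functools
import logging

logger = logging.getLogger("static_logger")


def log_inputs_outputs(fn):
    """Hypothetical sketch of static logging: record a callable's
    inputs and outputs, as the diff describes doing for plan() and
    the ShardEstimators. Not torchrec's real implementation."""

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        # Record the inputs before the call...
        logger.info("%s called with args=%r kwargs=%r", fn.__name__, args, kwargs)
        result = fn(*args, **kwargs)
        # ...and the outputs after it, so both sides can be inspected later.
        logger.info("%s returned %r", fn.__name__, result)
        return result

    return wrapper


@log_inputs_outputs
def plan(topology, modules):
    # Stand-in for the planner entry point; the real plan() would
    # produce a sharding plan from the topology and modules.
    return {"topology": topology, "modules": modules}
```

Logging the TrainingPipeline *class type* at construction time (item 3) would follow the same pattern, with `type(self).__name__` recorded in the pipeline's `__init__`.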

Reviewed By: kausv

Differential Revision: D86317910

@meta-codesync bot (Contributor) commented Nov 5, 2025

@nipung90 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D86317910.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Nov 5, 2025
nipung90 added a commit to nipung90/torchrec that referenced this pull request Nov 6, 2025
…ipeline class constructors (meta-pytorch#3521)

nipung90 added a commit to nipung90/torchrec that referenced this pull request Nov 6, 2025
…ipeline class constructors (meta-pytorch#3521)

Labels

CLA Signed, fb-exported, meta-exported
