fix(torchrun): Omit empty arguments and correct nproc_per_node type #661

szaher · 2025-10-03T12:10:06Z

The command generation logic is updated to dynamically build the torchrun command, excluding arguments that are empty or None. This prevents them from overriding environment variables, ensuring that torchrun can
correctly inherit its configuration. An exception is made for integer arguments where 0 is a valid value.

Additionally, the nproc_per_node argument type has been changed from int to str to support special values
accepted by PyTorch, such as 'auto', 'gpu', and 'cpu'.

Reference: https://github.com/pytorch/pytorch/blob/main/torch/distributed/run.py#L77-L88
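For readers following along, the behaviour described above can be sketched roughly as follows. This is a minimal illustration and not the code in the PR: `build_torchrun_command` and the sample `args` dictionary are hypothetical names, and a plain dict stands in for the project's config model.

```python
def build_torchrun_command(torchrun_args: dict) -> list:
    """Build a torchrun argv list, leaving out unset values (illustrative only)."""
    command = ["torchrun"]
    for key, value in torchrun_args.items():
        # Skip None and empty strings so that no flag is emitted for them
        # and torchrun can pick the setting up from its environment instead.
        if value is None or (isinstance(value, str) and value == ""):
            continue
        # Integers are always kept, so a legitimate 0 (e.g. node_rank=0) survives.
        command.append(f"--{key}={value}")
    return command


# Hypothetical sample values, chosen to exercise each branch.
args = {
    "nproc_per_node": "gpu",
    "nnodes": 1,
    "node_rank": 0,
    "rdzv_id": None,
    "rdzv_endpoint": "",
}
print(build_torchrun_command(args))
# ['torchrun', '--nproc_per_node=gpu', '--nnodes=1', '--node_rank=0']
```

Note that `node_rank=0` is kept while the unset `rdzv_id` and the empty `rdzv_endpoint` are dropped.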


RobotSail · 2025-10-03T12:32:24Z

src/instructlab/training/config.py

    """

-    nproc_per_node: int
+    nproc_per_node: str


Did you mean to make this change?

RobotSail · 2025-10-03T12:33:13Z

src/instructlab/training/main_ds.py

+    # build args for this file. Ignore empty or unset values except int values
+    for key, value in train_args.model_dump(exclude_none=True).items():
+        # avoid ignoring int attrs with value = 0
+        if not isinstance(value, int) and (not value or value == ""):


How would this handle booleans?

I have updated this one to only check for string types.
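To make the boolean question concrete: in Python `bool` is a subclass of `int`, so the original condition never skips `False`, and it also skips falsy non-string values such as `0.0` or an empty list. The sketch below compares the original condition with a string-only check. Both helper names are hypothetical, and the string-only version is only one plausible reading of "only check for string types", not the PR's actual code.

```python
def original_skip(value) -> bool:
    """The condition from the diff above, wrapped in a function for comparison."""
    return not isinstance(value, int) and (not value or value == "")


def string_only_skip(value) -> bool:
    """Hypothetical alternative: skip only None and empty strings."""
    return value is None or (isinstance(value, str) and value == "")


# bool is a subclass of int, so False passes the isinstance(value, int) guard.
print(isinstance(False, int))  # True

for sample in (False, 0, 0.0, "", []):
    print(repr(sample), original_skip(sample), string_only_skip(sample))
```

With the original condition, `False` is not skipped (it would be emitted as `--flag=False`), while `0.0` and `[]` are silently dropped. The string-only check leaves every non-string value in place.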

RobotSail · 2025-10-03T12:36:34Z

src/instructlab/training/main_ds.py

+        # avoid ignoring int attrs with value = 0
+        if not isinstance(value, int) and (not value or value == ""):
+            continue
+        command.append(f"--{key}={value}")


Have you verified that all of our CLI arguments are perfectly 1:1 with the variable names we're using here?

I have updated this one to process only the torchrun args and leave the script's args alone, since they are not mapped 1:1.
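One way to picture the approach described in this reply is an allowlist of torchrun field names, combined with the underscore-to-hyphen conversion mentioned in a later commit on this PR. This is a rough sketch under assumptions: `TORCHRUN_KEYS`, `torchrun_flags`, and the `model_path` script argument are hypothetical, and the real field set lives in the project's `TorchrunArgs` model.

```python
# Hypothetical allowlist of torchrun options; not taken from the PR.
TORCHRUN_KEYS = {"nproc_per_node", "nnodes", "node_rank", "rdzv_id", "rdzv_endpoint"}


def torchrun_flags(all_args: dict) -> list:
    """Return CLI flags for torchrun options only, ignoring script arguments."""
    flags = []
    for key, value in all_args.items():
        if key not in TORCHRUN_KEYS:
            # Script arguments are handled elsewhere because their CLI names
            # do not map 1:1 to the variable names.
            continue
        if value is None or value == "":
            continue
        # Emit the hyphenated spelling, e.g. --nproc-per-node.
        flags.append(f"--{key.replace('_', '-')}={value}")
    return flags


mixed = {"nproc_per_node": 4, "node_rank": 0, "rdzv_endpoint": "", "model_path": "/tmp/model"}
print(torchrun_flags(mixed))
# ['--nproc-per-node=4', '--node-rank=0']
```

Keeping the script arguments out of the dynamic loop means a mismatch between a variable name and its CLI flag cannot slip through unnoticed.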

RobotSail

Thank you for the PR, @szaher. I think the changes here are reasonable, and I have a few questions about the implementation.

Signed-off-by: Saad Zaher <[email protected]>

RobotSail · 2025-10-03T16:14:40Z

src/instructlab/training/config.py

+    # this will tell the model construct to ignore
+    # extra arguments that aren't part of this model
+    class Config:
+        extra = "ignore"


@szaher Do you know when this would be the case? If our goal here is to dynamically build the torchrun command using the defined interface, this seems like it now opens the floor up for users to pass invalid arguments through torchrun. This means that any incorrect interface usage wouldn't be detected until runtime.

In fact, this drops any additionally provided arguments and keeps only the torchrun ones:

    torchrun_defaults = {
        'nnodes': 1,
        'node_rank': 0,
        'rdzv_id': 0,
        'rdzv_endpoint': '',
        'nproc_per_node': 2,
        "fake_arg": "what"
    }
    y = TorchrunArgs(**torchrun_defaults)
    print(y)

    TorchrunArgs(nproc_per_node=2, nnodes=1, node_rank=0, rdzv_id=0, rdzv_endpoint='')

I see, that's fine then.

Signed-off-by: Saad Zaher <[email protected]>

…torch-env-vars

Signed-off-by: Saad Zaher <[email protected]>

mergify bot added the ci-failure label Oct 3, 2025

RobotSail reviewed Oct 3, 2025


szaher mentioned this pull request Oct 3, 2025

feat(traininghub): Use torchrun environment variables for default configuration Red-Hat-AI-Innovation-Team/training_hub#13

Open

only dynamically add torchrun args & change rdzv_id type to str

7b4e82c

Signed-off-by: Saad Zaher <[email protected]>

mergify bot removed the ci-failure label Oct 3, 2025

fix smoke tests

b53bfaf

Signed-off-by: Saad Zaher <[email protected]>

mergify bot added the testing Relates to testing label Oct 3, 2025

Enable both dtypes str, int for nproc_per_node, rdzv_id

9b2b8c9

Signed-off-by: Saad Zaher <[email protected]>

RobotSail reviewed Oct 3, 2025


szaher added 7 commits October 6, 2025 10:22

Use python3.11 style for pydantic model

8186074

Signed-off-by: Saad Zaher <[email protected]>

Remove non-required dependencies

8070933

Signed-off-by: Saad Zaher <[email protected]>

add all torchrun args and validate them

2a4ec07

Signed-off-by: Saad Zaher <[email protected]>

Merge branch 'pytorch-env-vars' of github.com:szaher/training into py…

9c6892e

…torch-env-vars

update datatypes only

854325a

Signed-off-by: Saad Zaher <[email protected]>

replace _ with - when passing torchrun args

682043b

Signed-off-by: Saad Zaher <[email protected]>

make nproc_per_node to only accept gpu or int

169387e

Signed-off-by: Saad Zaher <[email protected]>
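The final commit above narrows `nproc_per_node` to either `gpu` or an integer. The diff for that commit is not shown here, so the following is only a sketch of what such a check could look like. The function name `normalize_nproc_per_node` is hypothetical, and the positive-integer requirement and acceptance of numeric strings are assumptions made for illustration.

```python
from typing import Union


def normalize_nproc_per_node(value: Union[int, str]) -> Union[int, str]:
    """Accept 'gpu' or a positive integer; reject everything else (illustrative)."""
    # bool is a subclass of int, so rule it out explicitly.
    if isinstance(value, bool):
        raise ValueError(f"invalid nproc_per_node: {value!r}")
    if isinstance(value, int):
        if value < 1:
            raise ValueError(f"nproc_per_node must be >= 1, got {value}")
        return value
    if isinstance(value, str):
        text = value.strip().lower()
        if text == "gpu":
            return "gpu"
        # Assumption: numeric strings such as "4" are treated as integers.
        if text.isdecimal() and int(text) >= 1:
            return int(text)
    raise ValueError(f"invalid nproc_per_node: {value!r}")


print(normalize_nproc_per_node("gpu"))  # gpu
print(normalize_nproc_per_node("4"))    # 4
```

Under these assumptions, values such as `'auto'` or `'cpu'` raise a `ValueError` at validation time rather than failing later inside torchrun.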
