Commit e9db365

Enforce that numRocGdr must be 0 unless numProcs > 1 (#63)
1 parent 57eb87a commit e9db365

3 files changed: +14 -3 lines changed

tools/pytorchjob-generator/chart/README.md

Lines changed: 1 addition & 1 deletion
@@ -50,7 +50,7 @@ customize the Jobs generated by the tool.
 | Key | Type | Default | Description |
 |-----|------|---------|-------------|
 | roceGdrResName | string | nvidia.com/roce_gdr | RoCE GDR resource name (can vary by cluster configuration) |
-| numRoceGdr | integer | `0` | number of nvidia.com/roce_grd resources (0 means disabled; >0 means enable GDR over RoCE). |
+| numRoceGdr | integer | `0` | number of nvidia.com/roce_grd resources (0 means disabled; >0 means enable GDR over RoCE). Must be 0 unless numPods > 1. |
 | topologyFileConfigMap | string | `nil` | Name of configmap containining /var/run/nvidia-topologyd/virtualTopology.xml for the system e.g. nvidia-topo-gdr |
 | ncclGdrEnvConfigMap | string | `nil` | Name of configmap containing NCCL networking environment variables for the system e.g. nccl-netwk-env-vars |
 | multiNicNetworkName | string | `nil` | Name of multi-NIC network, if one is available. Note: when GDR over RoCE is used/available, the RoCE multi-nic network instance should be specified here instead of the TCP multi-nic network instance. Existing instance names can be listed with `oc get multinicnetwork`. |

tools/pytorchjob-generator/chart/values.schema.json

Lines changed: 12 additions & 1 deletion
@@ -79,7 +79,7 @@
       { "type": "null" },
       { "type": "string" }
     ]},
-    "numRoceGdr": { "type": "integer" },
+    "numRoceGdr": { "type": "integer", "minimum": 0, "maximum": 2 },
     "topologyFileConfigMap": { "oneOf": [
       { "type": "null" },
       { "$ref": "#/$defs/rfc1123Label" }
@@ -134,6 +134,17 @@
     "deletionOnFailureGracePeriodDuration" : { "$ref": "#/$defs/duration" }
   },
 
+  "if": {
+    "properties": {
+      "numPods": { "const": 1 }
+    }
+  },
+  "then": {
+    "properties": {
+      "numRoceGdr": { "const": 0 }
+    }
+  },
+
   "$defs": {
     "rfc1123Label": {
       "type": "string",
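
The `if`/`then` rule and the new `minimum`/`maximum` bounds in values.schema.json can be read as a small predicate over the chart values. The following is a minimal Python sketch of that reading, not the chart's actual validation path (Helm evaluates the JSON Schema itself). The function name `check_num_roce_gdr` and the error strings are hypothetical. It assumes standard JSON Schema semantics, where a `properties` check passes vacuously for a key that is absent, so the `if` branch also matches when `numPods` is not set at all.

```python
def check_num_roce_gdr(values):
    """Return a list of violation messages; an empty list means the values pass.

    Hypothetical helper that mirrors the schema rules for numRoceGdr only.
    """
    errors = []
    has_gdr = "numRoceGdr" in values
    gdr = values.get("numRoceGdr")

    if has_gdr:
        # "type": "integer" (bool is excluded explicitly because Python
        # treats True/False as ints, while JSON Schema does not).
        if isinstance(gdr, bool) or not isinstance(gdr, int):
            return ["numRoceGdr must be an integer"]
        # "minimum": 0, "maximum": 2
        if not 0 <= gdr <= 2:
            errors.append("numRoceGdr must be between 0 and 2")

    # "if": numPods is 1. An absent numPods also satisfies the "if" clause,
    # because "properties" only constrains keys that are present.
    if "numPods" in values:
        pods = values["numPods"]
        if_matches = not isinstance(pods, bool) and pods == 1
    else:
        if_matches = True

    # "then": numRoceGdr, when present, must be exactly 0.
    if if_matches and has_gdr and gdr != 0:
        errors.append("numRoceGdr must be 0 unless numPods > 1")

    return errors


if __name__ == "__main__":
    print(check_num_roce_gdr({"numPods": 1, "numRoceGdr": 0}))  # passes
    print(check_num_roce_gdr({"numPods": 2, "numRoceGdr": 2}))  # passes
    print(check_num_roce_gdr({"numPods": 1, "numRoceGdr": 1}))  # one violation
```

Note that the rule only compares `numPods` against the constant 1, so it relies on `numPods` being constrained to positive integers elsewhere in the schema for "not 1" to mean "greater than 1".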

tools/pytorchjob-generator/chart/values.yaml

Lines changed: 1 addition & 1 deletion
@@ -157,7 +157,7 @@ volumes:
 # @section -- Advanced Options
 roceGdrResName: # <optional, default="">
 
-# -- (integer) number of nvidia.com/roce_grd resources (0 means disabled; >0 means enable GDR over RoCE).
+# -- (integer) number of nvidia.com/roce_grd resources (0 means disabled; >0 means enable GDR over RoCE). Must be 0 unless numPods > 1.
 # @section -- Advanced Options
 numRoceGdr: 0
 