This procedure will rearrange NIDs for specified compute nodes to create a numerically (NID) and lexicographically (xname) contiguous block of NIDs at the specified start point.
It is recommended that the system be taken down for maintenance while performing this procedure.
This procedure should only be performed if absolutely required. Some reasons for needing to perform this procedure include:
- Compute nodes were added to SLS with incorrect NID numbering, missing node entries, and/or extra node entries.
- Compute nodes were permanently moved, removed, or re-provisioned and there is a desire to remove NID numbering gaps.
The example in this procedure removes NID gaps, caused by incorrect numbering in SLS, from two cabinets of compute nodes.
- Prerequisites
- Defragment NID script functionality and limitations
- NID defragmentation procedure
- Troubleshooting
- Hardware State Manager's (HSM) inventory data is up to date (Hardware has not changed since the previous discovery).
- Chassis-level entries exist in the System Layout Service (SLS) and are correct for all Mountain and Hill chassis.
- All compute nodes are powered off.
In the process of defragmenting NIDs, the defragment_nids.py script will:
- Update the NID numbers for compute node entries in HSM.
- Update/create SLS compute node entries with correct NID numbering and aliases.
- Remove node entries from HSM that were not previously removed due to missed blade swap procedure steps. This includes associated entries under /State/Components, /Inventory/ComponentEndpoints, /Inventory/Hardware, and /Inventory/EthernetInterfaces.
- Remove SLS node entries with conflicting NIDs that are not in HSM.
- Correct the Class designation of compute nodes in HSM.
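The renumbering itself can be pictured with a short sketch. This is not the code of defragment_nids.py; it only assumes, based on the description above, that nodes are ordered by a plain string sort of their xnames and given consecutive NIDs from the starting NID, with an alias of the form nid plus the zero-padded NID as seen in the example SLS entries. The function name defragment and the sample data are hypothetical, and the real script may order multi-digit slot or cabinet numbers differently than a plain string sort does.

```python
def defragment(nodes, start_nid):
    """Assign consecutive NIDs in sorted xname order, starting at start_nid.

    nodes maps xname -> current NID (hypothetical input shape).
    """
    plan = []
    for offset, xname in enumerate(sorted(nodes)):
        new_nid = start_nid + offset
        plan.append({
            "ID": xname,
            "OldNID": nodes[xname],
            "NewNID": new_nid,
            # Alias format inferred from the example SLS entries (nid001000).
            "Alias": f"nid{new_nid:06d}",
        })
    return plan


# Illustrative data only: a gap at 1009 and 1011, plus one low-numbered node.
current = {
    "x3000c0s1b1n0": 1,
    "x1000c0s3b0n0": 1012,
    "x1000c0s2b1n0": 1010,
    "x1000c0s2b0n0": 1008,
}

plan = defragment(current, 1008)
for change in plan:
    print(change["ID"], change["OldNID"], "->", change["NewNID"], change["Alias"])
```

Nodes whose old and new NIDs are equal still appear in the plan, which mirrors how the example report lists unchanged nodes.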
Limitations of the defragment_nids.py script:
- Only affects compute nodes.
- HSM node entries are only removed if the "leftover" nodes are of a different model than the existing nodes in the same slot.
- SLS node entries that do not exist in HSM are only removed if their NID falls within the specified NID block.
- (ncn-mw#) Choose the starting NID for the NID block (e.g., 1000).
  export NID_START=1000
- (ncn-mw#) Choose the components to include in the NID block (e.g., x1000, x3000).
  This can be specified at cabinet (x#), chassis (x#c#), slot (x#c#s#), or even node level (x#c#s#b#n#). This list always gets expanded to include all compute nodes contained by the specified parent components.
  export INCLUDE_LIST=x1000,x3000
- Run defragment_nids.py.
  NOTE: Administrators can do a dryrun of defragment_nids.py to print out a report of what will happen without affecting the system's NID numbering by specifying --dryrun.
  /usr/share/doc/csm/scripts/operations/node_management/defragment_nids.py --start ${NID_START} --include ${INCLUDE_LIST} | jq .
Example (summarized) output:
{
  "Description": "NID Defragmentation Report",
  "StartingNID": 1000,
  "Include": [ "x1000", "x3000" ],
  "HSMChanges": [
    { "ID": "x1000c0s0b0n0", "OldNID": 1000, "NewNID": 1000 },
    { "ID": "x1000c0s0b0n1", "OldNID": 1001, "NewNID": 1001 },
    { "ID": "x1000c0s0b1n0", "OldNID": 1002, "NewNID": 1002 },
    ...
    { "ID": "x1000c0s2b1n0", "OldNID": 1010, "NewNID": 1009 },
    { "ID": "x1000c0s3b0n0", "OldNID": 1012, "NewNID": 1010 },
    { "ID": "x1000c0s3b0n1", "OldNID": 1013, "NewNID": 1011 },
    { "ID": "x3000c0s1b1n0", "OldNID": 1, "NewNID": 1012 },
    ...
    { "ID": "x3000c0s6b0n0", "OldNID": 20, "NewNID": 1020 }
  ],
  "SLSEntries": [
    { "Xname": "x1000c0s0b0n0", "Class": "Hill", "ExtraProperties": { "Aliases": [ "nid001000" ], "NID": 1000, "Role": "Compute" } },
    { "Xname": "x1000c0s0b0n1", "Class": "Hill", "ExtraProperties": { "Aliases": [ "nid001001" ], "NID": 1001, "Role": "Compute" } },
    { "Xname": "x1000c0s0b1n0", "Class": "Hill", "ExtraProperties": { "Aliases": [ "nid001002" ], "NID": 1002, "Role": "Compute" } },
    ...
    { "Xname": "x1000c0s2b1n0", "Class": "Hill", "ExtraProperties": { "Aliases": [ "nid001009" ], "NID": 1009, "Role": "Compute" } },
    { "Xname": "x1000c0s3b0n0", "Class": "Hill", "ExtraProperties": { "Aliases": [ "nid001010" ], "NID": 1010, "Role": "Compute" } },
    { "Xname": "x1000c0s3b0n1", "Class": "Hill", "ExtraProperties": { "Aliases": [ "nid001011" ], "NID": 1011, "Role": "Compute" } },
    { "Xname": "x3000c0s1b1n0", "Class": "River", "ExtraProperties": { "Aliases": [ "nid001012" ], "NID": 1012, "Role": "Compute" } },
    ...
    { "Xname": "x3000c0s6b0n0", "Class": "River", "ExtraProperties": { "Aliases": [ "nid001020" ], "NID": 1020, "Role": "Compute" } }
  ],
  "NodesRemovedFromHSM": [],
  "NodesRemovedFromSLS": [ "x1000c0s2b0n1", "x1000c0s2b1n1" ],
  "Errors": []
}
Example output if --output text is specified:
NID Defragmentation Report
=================
Starting NID: 1000
Include: ['x1000', 'x3000']
=================
HSM Changes:
x1000c0s0b0n0 1000 -> 1000
x1000c0s0b0n1 1001 -> 1001
x1000c0s0b1n0 1002 -> 1002
x1000c0s0b1n1 1003 -> 1003
x1000c0s1b0n0 1004 -> 1004
x1000c0s1b0n1 1005 -> 1005
x1000c0s1b1n0 1006 -> 1006
x1000c0s1b1n1 1007 -> 1007
x1000c0s2b0n0 1008 -> 1008
x1000c0s2b1n0 1010 -> 1009
x1000c0s3b0n0 1012 -> 1010
x1000c0s3b0n1 1013 -> 1011
x3000c0s1b1n0 1 -> 1012
x3000c0s1b2n0 2 -> 1013
x3000c0s1b3n0 3 -> 1014
x3000c0s1b4n0 4 -> 1015
x3000c0s3b1n0 11 -> 1016
x3000c0s3b2n0 12 -> 1017
x3000c0s3b3n0 13 -> 1018
x3000c0s3b4n0 14 -> 1019
x3000c0s6b0n0 20 -> 1020
SLS Entries:
{"Xname": "x1000c0s0b0n0", "Class": "Hill", "ExtraProperties": {"Aliases": ["nid001000"], "NID": 1000, "Role": "Compute"}}
{"Xname": "x1000c0s0b0n1", "Class": "Hill", "ExtraProperties": {"Aliases": ["nid001001"], "NID": 1001, "Role": "Compute"}}
{"Xname": "x1000c0s0b1n0", "Class": "Hill", "ExtraProperties": {"Aliases": ["nid001002"], "NID": 1002, "Role": "Compute"}}
{"Xname": "x1000c0s0b1n1", "Class": "Hill", "ExtraProperties": {"Aliases": ["nid001003"], "NID": 1003, "Role": "Compute"}}
{"Xname": "x1000c0s1b0n0", "Class": "Hill", "ExtraProperties": {"Aliases": ["nid001004"], "NID": 1004, "Role": "Compute"}}
{"Xname": "x1000c0s1b0n1", "Class": "Hill", "ExtraProperties": {"Aliases": ["nid001005"], "NID": 1005, "Role": "Compute"}}
{"Xname": "x1000c0s1b1n0", "Class": "Hill", "ExtraProperties": {"Aliases": ["nid001006"], "NID": 1006, "Role": "Compute"}}
{"Xname": "x1000c0s1b1n1", "Class": "Hill", "ExtraProperties": {"Aliases": ["nid001007"], "NID": 1007, "Role": "Compute"}}
{"Xname": "x1000c0s2b0n0", "Class": "Hill", "ExtraProperties": {"Aliases": ["nid001008"], "NID": 1008, "Role": "Compute"}}
{"Xname": "x1000c0s2b1n0", "Class": "Hill", "ExtraProperties": {"Aliases": ["nid001009"], "NID": 1009, "Role": "Compute"}}
{"Xname": "x1000c0s3b0n0", "Class": "Hill", "ExtraProperties": {"Aliases": ["nid001010"], "NID": 1010, "Role": "Compute"}}
{"Xname": "x1000c0s3b0n1", "Class": "Hill", "ExtraProperties": {"Aliases": ["nid001011"], "NID": 1011, "Role": "Compute"}}
{"Xname": "x3000c0s1b1n0", "Class": "River", "ExtraProperties": {"Aliases": ["nid001012"], "NID": 1012, "Role": "Compute"}}
{"Xname": "x3000c0s1b2n0", "Class": "River", "ExtraProperties": {"Aliases": ["nid001013"], "NID": 1013, "Role": "Compute"}}
{"Xname": "x3000c0s1b3n0", "Class": "River", "ExtraProperties": {"Aliases": ["nid001014"], "NID": 1014, "Role": "Compute"}}
{"Xname": "x3000c0s1b4n0", "Class": "River", "ExtraProperties": {"Aliases": ["nid001015"], "NID": 1015, "Role": "Compute"}}
{"Xname": "x3000c0s3b1n0", "Class": "River", "ExtraProperties": {"Aliases": ["nid001016"], "NID": 1016, "Role": "Compute"}}
{"Xname": "x3000c0s3b2n0", "Class": "River", "ExtraProperties": {"Aliases": ["nid001017"], "NID": 1017, "Role": "Compute"}}
{"Xname": "x3000c0s3b3n0", "Class": "River", "ExtraProperties": {"Aliases": ["nid001018"], "NID": 1018, "Role": "Compute"}}
{"Xname": "x3000c0s3b4n0", "Class": "River", "ExtraProperties": {"Aliases": ["nid001019"], "NID": 1019, "Role": "Compute"}}
{"Xname": "x3000c0s6b0n0", "Class": "River", "ExtraProperties": {"Aliases": ["nid001020"], "NID": 1020, "Role": "Compute"}}
Nodes Removed From SLS:
x1000c0s2b0n1,x1000c0s2b1n1
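When a report covers many nodes, it can help to reduce the JSON form to only the entries that matter before committing to the change. The following is a minimal sketch, assuming the report has been saved or captured as JSON with the key names shown in the example output (HSMChanges, NodesRemovedFromHSM, NodesRemovedFromSLS, Errors). The function name summarize_report and the sample data are hypothetical and not part of defragment_nids.py.

```python
import json


def summarize_report(report):
    """Keep only NID changes that actually differ, plus removals and errors."""
    changed = [
        (c["ID"], c["OldNID"], c["NewNID"])
        for c in report.get("HSMChanges", [])
        if c["OldNID"] != c["NewNID"]
    ]
    return {
        "changed": changed,
        "removed_from_hsm": report.get("NodesRemovedFromHSM", []),
        "removed_from_sls": report.get("NodesRemovedFromSLS", []),
        "errors": report.get("Errors", []),
    }


# Shortened sample built from entries in the example report above.
sample_report = json.loads("""
{
  "StartingNID": 1000,
  "HSMChanges": [
    {"ID": "x1000c0s2b0n0", "OldNID": 1008, "NewNID": 1008},
    {"ID": "x1000c0s2b1n0", "OldNID": 1010, "NewNID": 1009},
    {"ID": "x3000c0s6b0n0", "OldNID": 20, "NewNID": 1020}
  ],
  "NodesRemovedFromHSM": [],
  "NodesRemovedFromSLS": ["x1000c0s2b0n1", "x1000c0s2b1n1"],
  "Errors": []
}
""")

summary = summarize_report(sample_report)
for xname, old, new in summary["changed"]:
    print(f"{xname}: {old} -> {new}")
print("Removed from SLS:", ", ".join(summary["removed_from_sls"]))
```

In practice the input would be the saved output of a --dryrun invocation rather than an inline string.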
DVS node maps on NCN worker nodes and gateway nodes have entries of compute nodes that include their NIDs. Because of that, the NID defragmentation process will impact the NCN worker and gateway nodes.
Carry out the Procedure To Perform After CSM Defragmentation of Compute Node Identifiers documented in publication HPE Cray Supercomputing User Services Software Administration Guide: CSM on HPE Cray EX Systems (1.0.0 Rev A) (S-8063).
The defragment_nids.py script checks for HSM discovery errors on the specified nodes before proceeding. It will return an error if any are found. For example:
{
"Message": "Discovery errors detected.",
"Severity": "Error",
"IDs": ["x1000c0s1b1", "x3000c0s6b0"]
}
To continue with the NID defragmentation, an administrator must first debug any discovery errors so that all specified components have a discovery status of DiscoverOK in HSM.
See Troubleshoot Issues with Redfish Endpoint Discovery for debugging discovery issues.
Alternatively, if these issues are known and will not affect the desired resulting NID numbering, the --ignore-discovery-errors option may be specified with defragment_nids.py to continue through these errors.
Warning: Continuing through discovery errors may result in incorrect NID numbering if HSM's inventory data for those nodes is missing or incorrect.
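To see which endpoints would trigger this error before running the script, the discovery status of each endpoint can be checked against DiscoverOK. The sketch below works on an already-fetched list of endpoint records. It assumes each record carries its status under DiscoveryInfo.LastDiscoveryStatus, which is how HSM Redfish endpoint records are commonly laid out, but the field names should be verified against the HSM version in use. The function name and the sample records, including the non-OK status strings, are illustrative only.

```python
def endpoints_not_ok(endpoints):
    """Return the IDs of endpoints whose last discovery status is not DiscoverOK."""
    return sorted(
        ep["ID"]
        for ep in endpoints
        # Assumed field layout; a missing status is treated as not OK.
        if ep.get("DiscoveryInfo", {}).get("LastDiscoveryStatus") != "DiscoverOK"
    )


# Illustrative records; the IDs match the example error message above.
sample_endpoints = [
    {"ID": "x1000c0s0b0", "DiscoveryInfo": {"LastDiscoveryStatus": "DiscoverOK"}},
    {"ID": "x3000c0s6b0", "DiscoveryInfo": {"LastDiscoveryStatus": "HTTPsGetFailed"}},
    {"ID": "x1000c0s1b1", "DiscoveryInfo": {"LastDiscoveryStatus": "ChildVerificationFailed"}},
]

failing = endpoints_not_ok(sample_endpoints)
print("Endpoints needing attention:", ", ".join(failing))
```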
The defragment_nids.py script checks for nodes with NIDs that fall within the specified NID block that are not specified in the include list. An example of this error is:
{
"Message": "There is an unexpected node NID in the requested NID range, 1000-1100",
"Severity": "Error",
"IDs": ["x3001c0s0b0n0", "x3001c0s0b0n1"]
}
These might be NCNs and UANs, or compute nodes that were not covered by the specified include list. Here are some scenarios and how to fix them:
- Compute nodes in cabinets x1000 and x1002 were specified in the include list, so the new NID block is 1000-1100, but the compute nodes in cabinet x1001 have NIDs 1090-1140. This would create a conflict, so defragment_nids.py will return an error. This can be fixed by doing one of the following:
  - Change the starting NID for the new NID block (e.g., 1200).
  - Include x1001 in the include list to include it in the new NID block.
  - Run defragment_nids.py to first move the compute nodes in x1001 to another NID block, then rerun defragment_nids.py for the compute nodes in cabinets x1000 and x1002.
- Compute nodes in cabinet x1000 were specified in the include list, so the new NID block is 1000-1100, but x1000c1b0n0 is a UAN that was given the NID 1000. This would create a conflict, so defragment_nids.py will return an error. This can be fixed by doing one of the following:
  - Change the starting NID for the new NID block (e.g., 1001).
  - Manually change the NID of the UAN in HSM and SLS, then rerun defragment_nids.py for the nodes in x1000.
- Compute nodes in cabinet x1000 were specified in the include list, so the new NID block is 1000-1100, but x3000c0b0n0 is an NCN that was given the NID 1000. This would create a conflict, so defragment_nids.py will return an error. Changing the NID of an NCN is not recommended. The best course of action is to change the starting NID for the new NID block.
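The conflict check behind this error can be sketched as follows. This is only an illustration of the idea, not the script's implementation: it assumes the include list is expanded by xname prefix matching (taking care that x1000 does not match x10000), that only nodes with the Compute role are moved, and that the block boundaries are known up front. The helper names, the input shape, and the sample nodes are hypothetical.

```python
def is_contained(xname, parent):
    """True if xname equals parent or is a descendant of it."""
    if not xname.startswith(parent):
        return False
    rest = xname[len(parent):]
    # A trailing digit means a different component, e.g. x1000 vs. x10000.
    return rest == "" or not rest[0].isdigit()


def find_conflicts(nodes, include_list, block_start, block_end):
    """List nodes that hold a NID inside the block but will not be renumbered."""
    conflicts = []
    for xname, info in nodes.items():
        moved = info["Role"] == "Compute" and any(
            is_contained(xname, parent) for parent in include_list
        )
        if not moved and block_start <= info["NID"] <= block_end:
            conflicts.append(xname)
    return sorted(conflicts)


# Illustrative inventory: xname -> NID and role.
nodes = {
    "x1000c0s0b0n0": {"NID": 1000, "Role": "Compute"},
    "x1002c0s0b0n0": {"NID": 1050, "Role": "Compute"},
    "x1001c0s0b0n0": {"NID": 1090, "Role": "Compute"},      # not included, in block
    "x1001c0s0b0n1": {"NID": 1140, "Role": "Compute"},      # not included, outside block
    "x3000c0s5b0n0": {"NID": 1001, "Role": "Application"},  # UAN inside block
}

conflicts = find_conflicts(nodes, ["x1000", "x1002"], 1000, 1100)
print("Conflicting nodes:", ", ".join(conflicts))
```

Raising the starting NID so that the block no longer overlaps the conflicting NIDs, or adding the conflicting compute nodes' cabinet to the include list, empties the conflict list in this sketch, which corresponds to the fixes listed above.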