Added steps to use automation scripts in HMS areas.

Added steps to replace manual install steps with automation scripts. Kept the manual steps intact for "just in case the automation script fails" information and next steps.

operations/system_configuration_service/Configure_BMC_and_Controller_Parameters_with_scsd.md (+61 −24)

@@ -35,6 +35,66 @@ The NTP server and syslog server for BMCs in the liquid-cooled cabinet are typic

## Details

Setting the SSH keys for mountain controllers is done by running the *set_ssh_keys.py* script:

```
Usage: set_ssh_keys.py [options]

--debug=level   Set debug level
--dryrun        Gather all info but don't set anything in HW.
--exclude=list  Comma-separated list of target patterns to exclude.
                Each item in the list is matched on the front
                of each target XName and excluded if there is a match.
                Example: x1000,x3000c0,x9000c1s0
                This will exclude all BMCs in cabinet x1000,
                all BMCs at or below x3000c0, and all BMCs
                below x9000c1s0.
                NOTE: --include and --exclude are mutually exclusive.
--include=list  Comma-separated list of target patterns to include.
                Each item in the list is matched on the front
                of each target XName and included if there is a match.
                NOTE: --include and --exclude are mutually exclusive.
--sshkey=key    SSH key to set on BMCs. If none is specified, will use
                the root account's public RSA key.
```
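
The `--include`/`--exclude` semantics above are plain prefix matches against each target XName. Here is a minimal sketch of that filtering logic, assuming a simple list of discovered BMC XNames; the function name and structure are illustrative, not the script's actual internals:

```python
def filter_targets(xnames, include=None, exclude=None):
    """Filter BMC XNames by prefix, mirroring --include/--exclude.

    include/exclude are lists of XName prefixes, e.g. ["x1000", "x3000c0"].
    The two options are mutually exclusive, per the script's usage text.
    """
    if include and exclude:
        raise ValueError("--include and --exclude are mutually exclusive")
    if include:
        # Keep a target if any pattern matches the front of its XName.
        return [x for x in xnames if any(x.startswith(p) for p in include)]
    if exclude:
        # Drop a target if any pattern matches the front of its XName.
        return [x for x in xnames if not any(x.startswith(p) for p in exclude)]
    return list(xnames)

# The example from the usage text: exclude cabinet x1000, everything at or
# below x3000c0, and everything below x9000c1s0.
bmcs = ["x1000c0s0b0", "x3000c0s1b0", "x9000c1s0b1", "x9000c2s0b0"]
print(filter_targets(bmcs, exclude=["x1000", "x3000c0", "x9000c1s0"]))
# -> ['x9000c2s0b0']
```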

If run with no command-line arguments, the script sets SSH keys on all discovered mountain controllers using the root account's public RSA key. Using an alternate key requires the `--sshkey=key` argument:

```bash
# set_ssh_keys.py --sshkey="AAAbbCcDddd...."
```
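
When `--sshkey` is omitted, the script falls back to the root account's public RSA key. A hedged sketch of that fallback, assuming the conventional key location (`/root/.ssh/id_rsa.pub`) and the usual `ssh-rsa <material> <comment>` file layout; the real script may locate the key differently:

```python
from pathlib import Path

def resolve_ssh_key(sshkey_arg=None):
    """Return the key given on the command line, else root's public RSA key."""
    if sshkey_arg:
        return sshkey_arg
    # Assumption: the default key lives at the conventional root location.
    pub = Path("/root/.ssh/id_rsa.pub").read_text().strip()
    # Key files are "ssh-rsa <key material> <comment>"; the --sshkey example
    # above suggests only the raw key material is passed to the BMCs.
    return pub.split()[1]
```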
After the script runs, verify that it worked:
1. Test access to a node controller in the liquid-cooled cabinet.

    SSH into the node controller for the host xname. For example, if the host xname is x1000c1s0b0n0, the node controller xname would be x1000c1s0b0 (see the sketch after these steps for a general way to derive it).

    If the node controller is not powered up, this SSH attempt will fail.

    ```bash
    ncn-w001# ssh x1000c1s0b0
    x1000c1s0b0:>
    ```

    Notice that the command prompt includes the hostname for this node controller.

1. The logs from power actions for node 0 and node 1 on this node controller are in /var/log.

    ```bash
    x1000c1s0b0:> cd /var/log
    x1000c1s0b0:> ls -l powerfault_*
    -rw-r--r-- 1 root root  306 May 10 15:32 powerfault_dn.Node0
    -rw-r--r-- 1 root root  306 May 10 15:32 powerfault_dn.Node1
    -rw-r--r-- 1 root root 5781 May 10 15:36 powerfault_up.Node0
    -rw-r--r-- 1 root root 5781 May 10 15:36 powerfault_up.Node1
    ```
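
As noted in step 1, the node controller (BMC) xname is simply the host xname with its trailing node component removed. A small illustrative helper (the function name is hypothetical):

```python
import re

def node_bmc_xname(node_xname: str) -> str:
    """Derive a node controller xname from a node xname.

    Example: x1000c1s0b0n0 -> x1000c1s0b0 (strip the trailing nN part).
    """
    m = re.fullmatch(r"(x\d+c\d+s\d+b\d+)n\d+", node_xname)
    if not m:
        raise ValueError(f"not a node xname: {node_xname}")
    return m.group(1)

assert node_bmc_xname("x1000c1s0b0n0") == "x1000c1s0b0"
```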
## Manual SSH Key Setting Process

If this script fails for any reason, SSH keys can be set manually using the following process:

1. Save the public SSH key for the root user.
@@ -70,28 +130,5 @@ The admin must be authenticated to the Cray CLI before proceeding.

Check the output to verify all hardware has been set with the correct keys. Passwordless SSH to the root user should now function as expected.


Examine the output. If one or more failures occur, investigate the cause of each failure. See the [interpreting_hms_health_check_results](../troubleshooting/interpreting_hms_health_check_results.md) documentation for more information.

<a name="hms-aruba-fixup"></a>
### 2.2 Aruba Switch SNMP Fixup

Systems with Aruba leaf switches sometimes have issues with a known SNMP bug which prevents HSM discovery from discovering all HW. At this stage of the installation process, a script can be run to detect if this issue is currently affecting the system, and if so, correct it.

Refer to [Air cooled hardware is not getting properly discovered with Aruba leaf switches](../troubleshooting/known_issues/discovery_aruba_snmp_issue.md) for details.

<a name="hms-smd-discovery-validation"></a>
### 2.3 Hardware State Manager Discovery Validation

By this point in the installation process, the Hardware State Manager (HSM) should have done its discovery of the system.
@@ -523,7 +548,8 @@ The foundational information for this discovery is from the System Layout Servic

comparison needs to be done to see that what is specified in SLS (focusing on BMC components and Redfish endpoints) is present in HSM.

To perform this comparison, execute the `verify_hsm_discovery.py` script on a Kubernetes master or worker NCN. The result is pass/fail (the script returns 0 on success and non-zero on failure):
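
For illustration only, here is a stripped-down sketch of the kind of comparison the script performs: collect BMC xnames from SLS, collect the Redfish endpoints HSM discovered, and return non-zero if anything is missing. The gateway URL, endpoint paths, and token handling are assumptions; the real script is considerably more thorough:

```python
import sys
import requests

API = "https://api-gw-service-nmn.local/apis"  # assumed API gateway base URL

def get(path, token):
    # Fetch a JSON payload from the assumed API gateway.
    r = requests.get(API + path, headers={"Authorization": f"Bearer {token}"})
    r.raise_for_status()
    return r.json()

def main(token):
    # BMC entries that SLS says should exist...
    sls = {h["Xname"] for h in get("/sls/v1/hardware", token)
           if h.get("TypeString") in ("NodeBMC", "RouterBMC", "ChassisBMC")}
    # ...versus the Redfish endpoints HSM actually discovered.
    hsm = {e["ID"] for e in
           get("/smd/hsm/v2/Inventory/RedfishEndpoints", token)["RedfishEndpoints"]}
    missing = sls - hsm
    for xname in sorted(missing):
        print(f"FAIL: {xname} present in SLS but not discovered by HSM")
    return 1 if missing else 0

if __name__ == "__main__":
    sys.exit(main(sys.argv[1]))  # auth token passed as the first argument
```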

troubleshooting/known_issues/discovery_aruba_snmp_issue.md (+88 −2)

@@ -5,7 +5,93 @@
- Air cooled hardware is reported to not be present under State Components and Inventory Redfish Endpoints in Hardware State Manager by the hsm_discovery_verify.sh script.
- BMCs have IP addresses given out by DHCP, but in DNS their xname hostname does not resolve.


## Procedure to determine if you are affected by this known issue

Run the `arubafix.sh` script. Executed with no arguments, this script will check whether this problem currently exists on the system, and if so, fix it.

**NOTE: This script requires the admin to enter the Aruba switch management interface's admin password.** This script is thus not completely automatic.

```
Usage: arubafix.sh [-h] [-d] [-t]

   -h   Help text
   -d   Print debug info during execution.
   -t   Test mode, don't touch Aruba switches.
```

#### Multiple Runs Are Potentially Needed

This script needs to be run twice if the first run finds issues; if it finds no issues, one run is sufficient.

The first run checks for the issue and fixes any instances it finds; the second run verifies that the issue is fixed.

If two runs are needed, sufficient time must be allowed between runs for an HSM discovery job to run. This job runs every five minutes, so the admin should wait at least six minutes before the second run to be sure a discovery job has completed.
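
A hedged sketch of that two-run flow as a small wrapper. The exit-code convention assumed here (non-zero meaning "a fix was applied") is hypothetical; check the script's actual behavior before relying on it:

```python
import subprocess
import time

def run_arubafix(extra_args=()):
    # The script is interactive: it prompts for the Aruba admin password itself.
    return subprocess.run(["./arubafix.sh", *extra_args]).returncode

# First run: detect and fix. (Assumption: non-zero exit means a fix was applied.)
if run_arubafix() != 0:
    # HSM discovery runs every five minutes; wait six to be safe.
    print("Fix applied; waiting 6 minutes for an HSM discovery cycle...")
    time.sleep(6 * 60)
    # Second run: verify without touching the switches (-t is test mode).
    if run_arubafix(("-t",)) != 0:
        raise SystemExit("Issue still present after fix; investigate manually.")
```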
If there are any MAC addresses in the left column that are not in the right column, then it is likely the leaf switches in the system are being affected by the SNMP issue. Apply the workaround described in [the following procedure](../../install/aruba_snmp_known_issue_10_06_0010.md) to the Aruba leaf switches in the system.
If all of the MAC addresses in the left column are present in the right column, then you are not affected by this known issue.
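
The column comparison described above reduces to a set difference. A tiny sketch, assuming each input file holds one MAC address per line (both file names are hypothetical):

```python
def macs(path):
    """Read one normalized MAC address per line."""
    with open(path) as f:
        return {line.strip().lower() for line in f if line.strip()}

left = macs("switch_macs.txt")   # MACs seen on the leaf switch
right = macs("hsm_macs.txt")     # MACs known to HSM
missing = left - right
if missing:
    print("Likely affected by the SNMP issue; missing from HSM:")
    for m in sorted(missing):
        print(f"  {m}")
else:
    print("Not affected: all switch MACs are present in HSM.")
```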