Commit f157a72

Added steps to use automation scripts in HMS areas.

Added steps to replace manual install steps with automation scripts. The manual steps were kept intact as fallback information in case an automation script fails, and as next steps.

1 parent c686b99, commit f157a72

File tree: 4 files changed (+191, -46 lines)


.version

+1 -1

@@ -1 +1 @@
-1.12.2
+1.12.3

operations/system_configuration_service/Configure_BMC_and_Controller_Parameters_with_scsd.md

+61 -24
@@ -35,6 +35,66 @@ The NTP server and syslog server for BMCs in the liquid-cooled cabinet are typic
 
 ## Details
 
+Setting the SSH keys for mountain controllers is done by running the *set_ssh_keys.py* script:
+
+```
+Usage: set_ssh_keys.py [options]
+
+  --debug=level    Set debug level.
+  --dryrun         Gather all info but don't set anything in HW.
+  --exclude=list   Comma-separated list of target patterns to exclude.
+                   Each item in the list is matched on the front
+                   of each target XName and excluded if there is a match.
+                   Example: x1000,x3000c0,x9000c1s0
+                   This will exclude all BMCs in cabinet x1000,
+                   all BMCs at or below x3000c0, and all BMCs
+                   below x9000c1s0.
+                   NOTE: --include and --exclude are mutually exclusive.
+  --include=list   Comma-separated list of target patterns to include.
+                   Each item in the list is matched on the front
+                   of each target XName and included if there is a match.
+                   NOTE: --include and --exclude are mutually exclusive.
+  --sshkey=key     SSH key to set on BMCs. If none is specified, the
+                   root account's public key is used.
+```
+
+If run with no command line arguments, the script sets SSH keys on all discovered mountain controllers using the root account's public RSA key. To use an alternate key, supply the `--sshkey=key` argument:
+
+```bash
+# set_ssh_keys.py --sshkey="AAAbbCcDddd...."
+```
+
+After the script runs, verify that it worked:
+
+1. Test access to a node controller in the liquid-cooled cabinet.
+
+   SSH into the node controller for the host xname. For example, if the host xname is x1000c1s0b0n0, the
+   node controller xname would be x1000c1s0b0.
+
+   If the node controller is not powered up, this SSH attempt will fail.
+
+   ```bash
+   ncn-w001# ssh x1000c1s0b0
+   x1000c1s0b0:>
+   ```
+
+   Notice that the command prompt includes the hostname for this node controller.
+
+1. The logs from power actions for node 0 and node 1 on this node controller are in /var/log.
+
+   ```bash
+   x1000c1s0b0:> cd /var/log
+   x1000c1s0b0:> ls -l powerfault_*
+   -rw-r--r-- 1 root root  306 May 10 15:32 powerfault_dn.Node0
+   -rw-r--r-- 1 root root  306 May 10 15:32 powerfault_dn.Node1
+   -rw-r--r-- 1 root root 5781 May 10 15:36 powerfault_up.Node0
+   -rw-r--r-- 1 root root 5781 May 10 15:36 powerfault_up.Node1
+   ```
+
+## Manual SSH Key Setting Process
+
+If this script fails for any reason, SSH keys can be set manually using the following process:
+
 1. Save the public SSH key for the root user.
 
    ```bash
@@ -70,28 +130,5 @@ The admin must be authenticated to the Cray CLI before proceeding.
    Check the output to verify all hardware has been set with the correct keys. Passwordless SSH to the root
    user should now function as expected.
 
-1. Test access to a node controller in the liquid-cooled cabinet.
-
-   SSH into the node controller for the host xname. For example, if the host xname is x1000c1s0b0n0, the
-   node controller xname would be x1000c1s0b0.
-
-   If the node controller is not powered up, this SSH attempt will fail.
-
-   ```bash
-   ncn-w001# ssh x1000c1s0b0
-   x1000c1s0b0:>
-   ```
-
-   Notice that the command prompt includes the hostname for this node controller
-
-1. The logs from power actions for node 0 and node 1 on this node controller are in /var/log.
-
-   ```bash
-   x1000c1s0b0:> cd /var/log
-   x1000c1s0b0:> ls -l powerfault_*
-   -rw-r--r-- 1 root root 306 May 10 15:32 powerfault_dn.Node0
-   -rw-r--r-- 1 root root 306 May 10 15:32 powerfault_dn.Node1
-   -rw-r--r-- 1 root root 5781 May 10 15:36 powerfault_up.Node0
-   -rw-r--r-- 1 root root 5781 May 10 15:36 powerfault_up.Node1
-   ```
+1. Verify correct SSH operation as shown above.
 
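The `--include`/`--exclude` behavior described in the `set_ssh_keys.py` usage above is prefix matching on xnames. A minimal sketch of that filtering logic (`filter_targets` and the example data are hypothetical, not part of the script):

```python
def matches_prefix(xname: str, patterns: list[str]) -> bool:
    """Return True if any pattern matches the front of the xname."""
    return any(xname.startswith(p) for p in patterns)

def filter_targets(targets, include=None, exclude=None):
    # --include and --exclude are mutually exclusive, per the usage text.
    if include and exclude:
        raise ValueError("--include and --exclude are mutually exclusive")
    if include:
        return [t for t in targets if matches_prefix(t, include)]
    if exclude:
        return [t for t in targets if not matches_prefix(t, exclude)]
    return list(targets)

# Hypothetical BMC xnames, filtered with the exclude list from the example:
bmcs = ["x1000c0s0b0", "x3000c0s1b0", "x9000c1s0b0", "x9000c1s1b0"]
print(filter_targets(bmcs, exclude=["x1000", "x3000c0", "x9000c1s0"]))
# -> ['x9000c1s1b0']  (the only BMC whose xname matches none of the prefixes)
```

Note that `x9000c1s1b0` survives because prefix matching excludes only BMCs at or below the listed patterns.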

operations/validate_csm_health.md

+41 -19
@@ -29,8 +29,9 @@ The areas should be tested in the order they are listed on this page. Errors in
   - [1.8.1 Known Test Issues](#autogoss-issues)
 - [1.9 OPTIONAL Check of System Management Monitoring Tools](#optional-check-of-system-management-monitoring-tools)
 - [2. Hardware Management Services Health Checks](#hms-health-checks)
-  - [2.1 HMS Test Execution](#hms-test-execution)
-  - [2.2 Hardware State Manager Discovery Validation](#hms-smd-discovery-validation)
+  - [2.1 HMS CT Test Execution](#hms-test-execution)
+  - [2.2 Aruba Switch SNMP Fixup](#hms-aruba-fixup)
+  - [2.3 Hardware State Manager Discovery Validation](#hms-smd-discovery-validation)
     - [2.2.1 Interpreting results](#hms-smd-discovery-validation-interpreting-results)
     - [2.2.2 Known Issues](#hms-smd-discovery-validation-known-issues)
 - [3 Software Management Services Health Checks](#sms-health-checks)
@@ -495,26 +496,50 @@ Information to assist with troubleshooting some of the components mentioned in t
 Execute the HMS smoke and functional tests after the CSM install to confirm that the Hardware Management Services are running and operational.
 
 <a name="hms-test-execution"></a>
-### 2.1 HMS Test Execution
+### 2.1 HMS CT Test Execution
 
 These tests should be executed as root on at least one worker NCN and one master NCN (but **not** ncn-m001 if it is still the PIT node).
 
-Run the HMS smoke tests.
+Run the HMS CT smoke tests with the `run_hms_ct_tests.sh` script:
+
+```
+ncn# /opt/cray/csm/scripts/hms_verification/run_hms_ct_tests.sh
+```
+
+The script returns 0 if all CT tests ran successfully, and non-zero
+if not.
+
+#### Running CT Tests Manually
+
+To run the tests manually:
+
 ```
 ncn# /opt/cray/tests/ncn-resources/hms/hms-test/hms_run_ct_smoke_tests_ncn-resources.sh
 ```
 
 Examine the output. If one or more failures occur, investigate the cause of each failure. See the [interpreting_hms_health_check_results](../troubleshooting/interpreting_hms_health_check_results.md) documentation for more information.
 
 Otherwise, run the HMS functional tests.
+
 ```
 ncn# /opt/cray/tests/ncn-resources/hms/hms-test/hms_run_ct_functional_tests_ncn-resources.sh
 ```
 
 Examine the output. If one or more failures occur, investigate the cause of each failure. See the [interpreting_hms_health_check_results](../troubleshooting/interpreting_hms_health_check_results.md) documentation for more information.
 
+<a name="hms-aruba-fixup"></a>
+### 2.2 Aruba Switch SNMP Fixup
+
+Systems with Aruba leaf switches are sometimes affected by a known SNMP bug
+that prevents HSM discovery from discovering all hardware. At this stage of the
+installation process, a script can be run to detect whether this issue is
+currently affecting the system and, if so, correct it.
+
+Refer to [Air cooled hardware is not getting properly discovered with Aruba leaf switches](../troubleshooting/known_issues/discovery_aruba_snmp_issue.md) for
+details.
+
 <a name="hms-smd-discovery-validation"></a>
-### 2.2 Hardware State Manager Discovery Validation
+### 2.3 Hardware State Manager Discovery Validation
 
 By this point in the installation process, the Hardware State Manager (HSM) should
 have done its discovery of the system.
@@ -523,7 +548,8 @@ The foundational information for this discovery is from the System Layout Servic
 comparison needs to be done to see that what is specified in SLS (focusing on
 BMC components and Redfish endpoints) are present in HSM.
 
-Execute the `verify_hsm_discovery.py` script on a Kubernetes master or worker NCN:
+To perform this comparison, execute the `verify_hsm_discovery.py` script on a Kubernetes master or worker NCN. The result is pass/fail (the script returns 0 or non-zero):
+
 ```
 ncn# /opt/cray/csm/scripts/hms_verification/verify_hsm_discovery.py
 ```
@@ -737,7 +763,7 @@ Expected output is similar to the following:
         "path": "s3://boot-images/293b1e9c-2bc4-4225-b235-147d1d611eef/manifest.json",
         "type": "s3"
     },
-    "name": "cray-shasta-csm-sles15sp1-barebones.x86_64-shasta-PRODUCT_VERSION"
+    "name": "cray-shasta-csm-sles15sp1-barebones.x86_64-shasta-1.4"
 }
 ```

@@ -769,27 +795,23 @@ The session template below can be copied and used as the basis for the BOS Sessi
             "type": "s3"
         }
     },
+    "cfs": {
+        "configuration": "cos-integ-config-1.4.0"
+    },
     "enable_cfs": false,
-    "name": "shasta-PRODUCT_VERSION-csm-bare-bones-image"
+    "name": "shasta-1.4-csm-bare-bones-image"
 }
-```
 
 **NOTE**: Be sure to replace the values of the `etag` and `path` fields with the ones you noted earlier in the `cray ims images list` command.
 
-**NOTE**: The rootfs provider shown above references the `dvs` provider. DVS is not provided as part of the CSM
-distribution and is not expected to work until the COS product is installed and configured. As noted above, the
-barebones image is not expected to boot at this time. Work is being done to enable a fully functional and bootable
-barebones image in a future release of the CSM product. Until that work is complete, the use of the `dvs` rootfs
-provider is suggested.
 
-from Redeploy Pit Node section)
 2. Create the BOS session template using the following file as input:
 ```
-ncn# cray bos sessiontemplate create --file sessiontemplate.json --name shasta-PRODUCT_VERSION-csm-bare-bones-image
+ncn# cray bos sessiontemplate create --file sessiontemplate.json --name shasta-1.4-csm-bare-bones-image
 ```
 The expected output is:
 ```
-/sessionTemplate/shasta-PRODUCT_VERSION-csm-bare-bones-image
+/sessionTemplate/shasta-1.4-csm-bare-bones-image
 ```
 
 <a name="csm-node"></a>
@@ -838,14 +860,14 @@ ncn# export XNAME=x3000c0s17b2n0
 
 Create a BOS session to reboot the chosen node using the BOS session template that was created:
 ```bash
-ncn# cray bos session create --template-uuid shasta-PRODUCT_VERSION-csm-bare-bones-image --operation reboot --limit $XNAME
+ncn# cray bos session create --template-uuid shasta-1.4-csm-bare-bones-image --operation reboot --limit $XNAME
 ```
 
 Expected output looks similar to the following:
 ```
 limit = "x3000c0s17b2n0"
 operation = "reboot"
-templateUuid = "shasta-PRODUCT_VERSION-csm-bare-bones-image"
+templateUuid = "shasta-1.4-csm-bare-bones-image"
 [[links]]
 href = "/v1/session/8f2fc013-7817-4fe2-8e6f-c2136a5e3bd1"
 jobId = "boa-8f2fc013-7817-4fe2-8e6f-c2136a5e3bd1"
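The verification scripts in this file (`run_hms_ct_tests.sh`, `verify_hsm_discovery.py`) report pass/fail purely through their exit status: 0 on success, non-zero on failure. A minimal sketch of scripting against that convention, with `true`/`false` standing in for the real scripts (which require a live system); the `check_rc` helper is hypothetical:

```python
import subprocess

def check_rc(cmd):
    """Run a command and report PASS/FAIL from its exit status,
    mirroring the 0 / non-zero convention the verification scripts use."""
    rc = subprocess.run(cmd, capture_output=True).returncode
    status = "PASS" if rc == 0 else "FAIL (rc=%d)" % rc
    print("%s: %s" % (status, " ".join(cmd)))
    return rc

# Stand-ins for the real scripts; on a live system you would pass e.g.
#   ["/opt/cray/csm/scripts/hms_verification/run_hms_ct_tests.sh"]
check_rc(["true"])    # prints: PASS: true
check_rc(["false"])   # prints: FAIL (rc=1): false
```

The same pattern applies to `arubafix.sh` in the known-issue document below, which uses the identical exit-status convention.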

troubleshooting/known_issues/discovery_aruba_snmp_issue.md

+88 -2
@@ -5,7 +5,93 @@
 - Air cooled hardware is reported to not be present under State Components and Inventory Redfish Endpoints in Hardware State Manager by the hsm_discovery_verify.sh script.
 - BMCs have IP addresses given out by DHCP, but in DNS their xname hostname does not resolve.
 
-## Procedure to determine if you affected by this known issue:
+## Procedure to determine if you are affected by this known issue
+
+Run the `arubafix.sh` script. Executed with no arguments, this script
+checks whether this problem currently exists on the system and, if so, fixes it.
+
+**NOTE: This script requires the admin to enter the Aruba switch management interface's admin password.** The script is therefore not completely automatic.
+
+```
+Usage: arubafix.sh [-h] [-d] [-t]
+
+   -h   Help text
+   -d   Print debug info during execution.
+   -t   Test mode, don't touch Aruba switches.
+```
+
+#### Multiple Runs Are Potentially Needed
+
+This script needs to be run twice if the first run finds issues; if the first run finds no issues, one run is sufficient.
+
+The first run checks for the issue and fixes whatever it finds; the second run verifies that the issue is fixed.
+
+If two runs are needed, allow enough time between runs for an HSM discovery job to run. This job runs every five minutes, so the admin should wait at least 6 minutes to be sure a discovery job has run before the second execution.
+
+Example:
+
+```
+ncn# /opt/cray/csm/scripts/hms_verification/arubafix.sh
+
+==> Getting Aruba leaf switch info from SLS...
+
+==> Fetching switch hostnames...
+==> Looking for completed HMS discovery pod...
+
+==> Looking for undiscovered MAC addrs in discovery log...
+
+Found unknown/undiscovered MACs in discovery log.
+
+==> Looking for unknown/undiscovered MAC addrs in discovery log...
+
+==> Identifying undiscovered MAC mismatches...
+
+============================================
+= Aruba undiscovered MAC mismatches found! =
+= Performing switch SNMP resets.           =
+============================================
+
+==> Applying SNMP reset to Aruba switches...
+
+==> PASSWORD REQUIRED for Aruba access. Enter Password:
+
+Performing SNMP Reset on Aruba leaf switch: sw-leaf-001
+
+Aruba switch sw-leaf-001 SNMP reset succeeded.
+
+ncn#
+```
+
+Since this run found issues and fixed them, wait at least 6 minutes and run again to verify that the fixes corrected the issue.
+
+```
+ncn# /opt/cray/csm/scripts/hms_verification/arubafix.sh
+
+==> Getting Aruba leaf switch info from SLS...
+
+==> Fetching switch hostnames...
+==> Looking for completed HMS discovery pod...
+
+==> Looking for undiscovered MAC addrs in discovery log...
+
+Found unknown/undiscovered MACs in discovery log.
+
+==> Looking for unknown/undiscovered MAC addrs in discovery log...
+
+==> Identifying undiscovered MAC mismatches...
+
+============================
+= No Aruba MAC mismatches. =
+============================
+
+ncn#
+```
+
+The script returns 0 if all went well and non-zero if there was a problem, in which case the admin should examine the system manually.
+
+### Manual Procedure
+
 1. Determine the name of the last HSM discovery job that ran.
 ```bash
 ncn# HMS_DISCOVERY_POD=$(kubectl -n services get pods -l app=hms-discovery | tail -n 1 | awk '{ print $1 }')
@@ -46,4 +132,4 @@
 
 If there are any MAC addresses on the left column that are not on the right column, then it is likely the leaf switches in the system are being affected by the SNMP issue. Apply the workaround described in [the following procedure](../../install/aruba_snmp_known_issue_10_06_0010.md) to the Aruba leaf switches in the system.
 
-If all of the MAC addresses on the left column are present in the right column, then you are not affected by this known issue.
+If all of the MAC addresses on the left column are present in the right column, then you are not affected by this known issue.
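The manual check above reduces to a set comparison: every MAC address in the left column (seen on the switches) should appear in the right column (discovered by HSM). A minimal sketch of that comparison, using hypothetical example data (the helper and MACs are illustrative, not part of the documented scripts):

```python
def undiscovered_macs(switch_macs, hsm_macs):
    """Return MACs present on the switches but missing from HSM.

    Normalizes case so 'AA:BB:...' and 'aa:bb:...' compare equal.
    """
    hsm = {m.lower() for m in hsm_macs}
    return sorted(m for m in switch_macs if m.lower() not in hsm)

# Hypothetical example data:
switch = ["aa:bb:cc:00:00:01", "aa:bb:cc:00:00:02", "aa:bb:cc:00:00:03"]
hsm    = ["AA:BB:CC:00:00:01", "AA:BB:CC:00:00:03"]

missing = undiscovered_macs(switch, hsm)
print(missing)  # -> ['aa:bb:cc:00:00:02']
if missing:
    print("Likely affected by the SNMP issue; apply the workaround.")
else:
    print("Not affected by this known issue.")
```

An empty result corresponds to the "not affected" outcome described in the last paragraph above.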
