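Bare Metal Manager currently manages the health of managed hosts:
- The aggregate health is exposed as Machine::health on gRPC.
- Health report overrides are managed via ListHealthReportOverrides, InsertHealthReportOverride, and RemoveHealthReportOverride.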
We need a similar system to manage health for additional components that are managed by Bare Metal Manager:
- NvSwitches
- Powershelves
- Racks
For each of these components:
- Carbide should offer a similar-looking set of APIs, e.g. InsertRackHealthReportOverrides, InsertSwitchHealthReportOverrides, etc.
- The aggregate health should be visible to callers that load the component state (message Switch/Rack/etc).
- The aggregate health, health history, and health override tools should be available in the admin CLI and admin web UI.
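One way to keep the per-component APIs "similar looking" is to back them with a single override store keyed by component kind and ID. The following is a minimal sketch under that assumption, not the actual Carbide implementation: `ComponentKind`, `HealthAlert`, `HealthReport`, `HealthOverrideStore`, and its method names are all hypothetical, and keeping at most one override per report source is an assumed rule.

```python
from dataclasses import dataclass, field
from enum import Enum


class ComponentKind(Enum):
    """Component types from the list above (hypothetical enum)."""
    RACK = "rack"
    SWITCH = "switch"
    POWER_SHELF = "power_shelf"


@dataclass(frozen=True)
class HealthAlert:
    id: str                                   # illustrative alert name
    classifications: frozenset = frozenset()  # e.g. {"PreventAllocations"}


@dataclass
class HealthReport:
    source: str                               # who produced the report
    alerts: list = field(default_factory=list)


class HealthOverrideStore:
    """Hypothetical in-memory store shared by all component kinds.

    Assumption: at most one override per (component, source); inserting
    again from the same source replaces the previous override.
    """

    def __init__(self) -> None:
        self._overrides: dict = {}

    def insert_override(self, kind: ComponentKind, component_id: str,
                        report: HealthReport) -> None:
        per_component = self._overrides.setdefault((kind, component_id), {})
        per_component[report.source] = report

    def remove_override(self, kind: ComponentKind, component_id: str,
                        source: str) -> bool:
        """Return True if an override from `source` existed and was removed."""
        per_component = self._overrides.get((kind, component_id), {})
        return per_component.pop(source, None) is not None

    def list_overrides(self, kind: ComponentKind, component_id: str) -> list:
        return list(self._overrides.get((kind, component_id), {}).values())
```

With this shape, calls such as InsertRackHealthReportOverrides and InsertSwitchHealthReportOverrides could be thin wrappers that differ only in the `ComponentKind` they pass.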
In addition, we want the health of components within a rack to directly affect all machines in that rack. E.g. "if any NvSwitch in the rack is unhealthy with the PreventAllocations classification, then all hosts in the rack should show up as unhealthy with the same classification."
Implementing this mechanism will allow users to monitor only the health of hosts when deciding which hosts to use as instances.
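The rack-to-host propagation described above can be sketched as follows. This is an illustration only: the `Alert` and `Rack` types, the `effective_host_health` function, and the `PROPAGATED` set are hypothetical names, and the choice to propagate only alerts carrying PreventAllocations is an assumption drawn from the example in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    id: str                     # illustrative alert name
    target: str                 # component the alert was raised on
    classifications: frozenset  # e.g. {"PreventAllocations"}


@dataclass
class Rack:
    hosts: dict       # host id -> list of the host's own alerts
    components: dict  # switch / power shelf id -> list of alerts


# Assumption: only these classifications flow from rack components to hosts.
PROPAGATED = frozenset({"PreventAllocations"})


def effective_host_health(rack: Rack) -> dict:
    """Return each host's alerts plus alerts inherited from rack components."""
    inherited = []
    for component_id, alerts in rack.components.items():
        for alert in alerts:
            shared = alert.classifications & PROPAGATED
            if shared:
                # Keep the same classification and record the originating
                # component so the cause remains visible on the host.
                inherited.append(
                    Alert(id=alert.id, target=component_id, classifications=shared)
                )
    return {host_id: own + inherited for host_id, own in rack.hosts.items()}
```

Because every host inherits the propagated alerts, a caller that filters hosts on PreventAllocations would not need to inspect switches or power shelves separately.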