Health management APIs for switches, powershelves and racks #240

Description

@Matthias247

Bare Metal Manager currently manages the health for managed hosts:

We need a similar system to manage health for additional components that are managed by Bare Metal Manager:

  • NvSwitches
  • Powershelves
  • Racks

For each of these components:

  • Carbide should offer a similar-looking set of APIs, e.g. InsertRackHealthReportOverrides, InsertSwitchHealthReportOverrides, etc.
  • The aggregate health should be visible to callers that load the component state (message Switch/Rack/etc.)
  • The aggregate health, health history, and health override tools should be visible in the admin CLI and the admin web UI
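
To make the requested API shape more concrete, here is a minimal in-memory sketch of a per-component health store. Everything in it is an assumption for illustration: the `HealthStore`, `ComponentKind` and `Alert` names, the method names, and in particular the rule that the aggregate is the union of all reports and overrides. The issue does not specify how overrides combine with reports, and this is not Carbide's actual API.

```python
from dataclasses import dataclass, field
from enum import Enum


class ComponentKind(Enum):
    # Hypothetical enum covering the component types named in this issue.
    SWITCH = "switch"
    POWER_SHELF = "power_shelf"
    RACK = "rack"


@dataclass(frozen=True)
class Alert:
    id: str                                   # e.g. "FanFailure" (made-up alert name)
    classifications: frozenset = frozenset()  # e.g. frozenset({"PreventAllocations"})


@dataclass
class HealthStore:
    # Both tables map (kind, component_id) -> {source: [Alert, ...]}.
    reports: dict = field(default_factory=dict)
    overrides: dict = field(default_factory=dict)

    def record_report(self, kind, component_id, source, alerts):
        """Store the latest report from one monitoring source."""
        self.reports.setdefault((kind, component_id), {})[source] = list(alerts)

    def insert_override(self, kind, component_id, source, alerts):
        """Store an operator-supplied override, keyed by its source."""
        self.overrides.setdefault((kind, component_id), {})[source] = list(alerts)

    def remove_override(self, kind, component_id, source):
        """Drop an override; a missing override is not an error."""
        self.overrides.get((kind, component_id), {}).pop(source, None)

    def aggregate(self, kind, component_id):
        """Assumed rule: union of all alerts, merged by alert id."""
        key = (kind, component_id)
        merged = {}
        for table in (self.reports, self.overrides):
            for alerts in table.get(key, {}).values():
                for alert in alerts:
                    previous = merged.get(alert.id, frozenset())
                    merged[alert.id] = previous | alert.classifications
        return [Alert(alert_id, merged[alert_id]) for alert_id in sorted(merged)]


# Usage: one report plus one operator override on the same switch.
store = HealthStore()
store.record_report(ComponentKind.SWITCH, "switch-1", "monitor", [Alert("FanFailure")])
store.insert_override(
    ComponentKind.SWITCH, "switch-1", "operator",
    [Alert("Maintenance", frozenset({"PreventAllocations"}))],
)
switch_health = store.aggregate(ComponentKind.SWITCH, "switch-1")
```

Keying every operation by `(kind, component_id)` is one way to keep the switch, powershelf and rack variants symmetrical. A real implementation might instead expose separate typed RPCs per component, as the API names above suggest.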

In addition, we want the health of components within a rack to directly affect all machines in that rack. E.g. "if any NvSwitch in the rack is unhealthy with the PreventAllocations classification, then all hosts in the rack should show up as unhealthy with the same classification."

Implementing this mechanism will allow users to monitor only the health of hosts when deciding which hosts to use as instances.
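
The rack-to-host propagation described above could look roughly like the following sketch. The data model (`Alert`, plain dicts keyed by component or host id), the function names, and the choice to prefix inherited alert ids with the originating component are all assumptions made for illustration. Only the rule itself comes from this issue: a PreventAllocations alert on a rack component shows up on every host in the rack.

```python
from dataclasses import dataclass

PREVENT_ALLOCATIONS = "PreventAllocations"


@dataclass(frozen=True)
class Alert:
    id: str
    classifications: frozenset = frozenset()


def propagated_alerts(component_alerts, classification=PREVENT_ALLOCATIONS):
    """Collect alerts from rack components that carry `classification`.

    `component_alerts` maps a component id (switch, powershelf, ...) to its alerts.
    Each inherited alert is renamed to "<component id>/<alert id>" so that the
    host-level view still shows where the problem originated (an assumption).
    """
    inherited = []
    for component_id in sorted(component_alerts):
        for alert in component_alerts[component_id]:
            if classification in alert.classifications:
                inherited.append(
                    Alert(f"{component_id}/{alert.id}", frozenset({classification}))
                )
    return inherited


def effective_host_health(host_alerts, rack_component_alerts):
    """Return each host's own alerts plus the alerts inherited from the rack."""
    inherited = propagated_alerts(rack_component_alerts)
    return {host: list(alerts) + inherited for host, alerts in host_alerts.items()}


def allocatable_hosts(effective):
    """Hosts with no PreventAllocations alert, looking at host health only."""
    return sorted(
        host
        for host, alerts in effective.items()
        if not any(PREVENT_ALLOCATIONS in a.classifications for a in alerts)
    )


# Usage: one unhealthy NvSwitch blocks allocation on every host in the rack.
hosts = {"host-a": [], "host-b": []}
rack_components = {
    "nvswitch-1": [Alert("LinkDown", frozenset({PREVENT_ALLOCATIONS}))],
    "powershelf-1": [],
}
effective = effective_host_health(hosts, rack_components)
usable = allocatable_hosts(effective)
```

With this shape, a caller that picks instances only needs `allocatable_hosts`, which is the goal stated above. It never has to inspect switch or powershelf health directly.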

Metadata

Labels

  • feature: Feature (deprecated - use issue type, but it's needed for reporting now)
  • rack lifecycle: Issues that relate to managing the lifecycle of a full rack (compute, switches and powershelves)
  • roadmap: Roadmap item with program-level tracking

Projects

Status
Verify
