add alerting tenant + batching improvements oep

rancher · Feb 23, 2023 · 25f716d · 25f716d
1 parent bda0139

Showing 2 changed files with 317 additions and 0 deletions.
diff --git a/enhancements/alerting/2023-02-21-alerting-batching-tenant.md b/enhancements/alerting/2023-02-21-alerting-batching-tenant.md
@@ -0,0 +1,317 @@
+# Title
+
+Alerting CRUD batching configurations, aimed at multi-tenant use cases
+
+## Summary
+
+Currently, alerting configurations must be created one by one in the UI. The alerting plugin should support applying configurations in batches and setting default batch templates for clusters.
+
+## Use Case
+
+- End-users want to manually import a large number of configurations for Opni-Alarms
+- End-users want ways to batch connect conditions to endpoints
+- End-users want to set defaults for all their clusters or particular clusters
+
+## Benefits
+
+- Improved UX
+- Better management of Opni-Configurations
+
+## Impact
+
+- Rework of Alerting protocol buffers used to store user configs
+- Additional logic for opni routing interface
+
+## Acceptance criteria
+
+- [ ] Batch import raw configurations (Prometheus Rule Groups) to Opni-Alerting
+
+- [ ] Support batch routing using arbitrary alertmanager label matchers
+
+- [ ] Persist Tenant Configurations (`TC`) that specify batches of alarms
+  - [ ] Support a global `TC`
+  - [ ] Allow for `TC` to be scoped to particular label matchers or cluster ids
+  - [ ] Management watcher applies/deletes these defaults when clusters that match the labels are created, updated or deleted or when the `TC` changes
+
+**Important:** `TC` will initially support only `PrometheusQuery` AlertConditions
+
+- [ ] Alerting CLI for all alerting APIs for a more seamless batch update experience
+
+### UI/UX
+
+- [ ] Use v2 APIs
+
+  - [ ] when listing `Endpoint/Alarm` in table format, show routing labels (& severity for `Alarms`)
+
+- [ ] Import prometheus rule groups to opni alerting
+
+  - [ ] Button in Alarm page called `Import Prometheus`
+  - [ ] API returns a list
+  - [ ] Users select what clusters to apply these to
+  - [ ] UI will have to individually create the Alarms
+
+- [ ] Admin install page tenant template tab which lists, creates, updates, deletes `TC`
+
+![](../../images/alerting/tenant-tab.png)
+
+- [ ] Pasting raw prometheus rule group in tenant template tab invokes `ConvertPrometheusRuleGroup` to add them to `TC`
+
+## Implementation Details
+
+### Batch import configurations
+
+```proto
+
+message AlertConditionsList {
+    repeated AlertCondition items = 1;
+}
+
+service AlertingConditions {
+    //!! Does not create these rules, only converts them to the opni format
+    rpc ConvertPrometheusRuleGroup(RawPrometheusRuleGroup) returns (AlertConditionsList){}
+}
+
+message RawPrometheusRuleGroup{
+  bytes data = 1;
+}
+```
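The conversion step above can be sketched in Go. This is a minimal sketch under stated assumptions, not the plugin's implementation: `Rule`, `RuleGroup`, `AlertCondition` and `ConvertRuleGroup` are hypothetical stand-ins, the YAML in `RawPrometheusRuleGroup.data` is assumed to be already parsed into structs, and skipping recording rules is an assumed policy. As the proto comment requires, the function only converts and creates nothing.

```go
package main

import "fmt"

// Rule mirrors the fields of one entry in a Prometheus rule group.
type Rule struct {
	Record      string // set only for recording rules
	Alert       string // set only for alerting rules
	Expr        string
	For         string
	Labels      map[string]string
	Annotations map[string]string
}

// RuleGroup is a parsed Prometheus rule group.
type RuleGroup struct {
	Name  string
	Rules []Rule
}

// AlertCondition is a hypothetical, trimmed-down stand-in for the Opni message.
type AlertCondition struct {
	Name        string
	Query       string
	For         string
	Labels      map[string]string
	Annotations map[string]string
}

// ConvertRuleGroup converts alerting rules to conditions and skips recording
// rules (an assumption). It does not persist anything.
func ConvertRuleGroup(g RuleGroup) ([]AlertCondition, error) {
	var out []AlertCondition
	for i, r := range g.Rules {
		if r.Record != "" {
			continue // recording rules have no alarm equivalent in this sketch
		}
		if r.Alert == "" || r.Expr == "" {
			return nil, fmt.Errorf("group %q: rule %d needs both alert and expr", g.Name, i)
		}
		out = append(out, AlertCondition{
			Name:        r.Alert,
			Query:       r.Expr,
			For:         r.For,
			Labels:      r.Labels,
			Annotations: r.Annotations,
		})
	}
	return out, nil
}

func main() {
	group := RuleGroup{
		Name: "node-health",
		Rules: []Rule{
			{Record: "job:up:sum", Expr: "sum by (job) (up)"},
			{Alert: "NodeDown", Expr: "up == 0", For: "5m",
				Labels: map[string]string{"severity": "critical"}},
		},
	}
	conds, err := ConvertRuleGroup(group)
	if err != nil {
		fmt.Println("convert failed:", err)
		return
	}
	for _, c := range conds {
		fmt.Printf("%s: %s (for %s)\n", c.Name, c.Query, c.For)
	}
}
```

Returning the list without creating anything matches the UI flow in the acceptance criteria, where users first pick the clusters and the UI then creates each Alarm individually.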
+
+### Batch Routing
+
+- Requires a new version of the API, `alertingv2`
+- !! always generate OpenAPI spec for each service
+
+```proto
+service AlarmConfig{
+  // Only CRUD Alarms
+  rpc CreateAlarm(...)returns(...){}
+  rpc GetAlarm(...)returns(...){}
+  rpc ListAlarms(...)returns(...){}
+  rpc UpdateAlarm(...)returns(...){}
+  rpc DeleteAlarm(...)returns(...){}
+  rpc CloneAlarm(...)returns(...){}
+
+  //
+}
+
+service EndpointConfig{
+  // Only CRUD endpoints
+  rpc CreateEndpoint(...)returns(...){}
+  rpc GetEndpoint(...)returns(...){}
+  rpc ListEndpoint(...)returns(...){}
+  rpc UpdateEndpoint(...)returns(...){}
+  rpc DeleteEndpoint(...)returns(...){}
+  rpc ToggleNotifications(...) returns (...) {}
+}
+
+// see below
+service TenantConfig{
+  rpc CreateTenantConfig(...)returns(...){}
+  rpc GetTenantConfig(...)returns(...){}
+  rpc ListTenantConfig(...)returns(...){}
+  rpc UpdateTenantConfig(...)returns(...){}
+  rpc DeleteTenantConfig(...)returns(...){}
+}
+
+// metadata aggregator for routing
+service RoutingConfig{
+  rpc ListRoutingRelationships(...) returns(...){}
+  //...
+  // Eventually can look at explicitly setting tenant routing rules here
+  // for plain notifications, or enforce other rules we come up with
+}
+
+
+service NotificationService{
+  rpc TestAlertEndpoint(...) returns(...){}
+  rpc SendAlert(...) returns(...){}
+  rpc ResolveAlert(...) returns(...){}
+  rpc SendNotification(...) returns(...){}
+
+  rpc GetAlarmStatus(...) returns(...) {}
+  rpc ListAlarmsWithStatus(...) returns(...) {}
+  rpc ActivateSilence(...) returns(...)  {}
+  rpc DeactivateSilence(...) returns (...) {}
+  rpc Timeline(...) returns(...) {}
+}
+
+
+service RuntimeService{
+  rpc ConnectRemoteSyncer(...) returns (stream ...) {}
+}
+```
+
+- In the alerting gateway plugin, aim to physically separate the logic for
+  - config CRUD across services, into a `MetadataServer`
+  - runtime state & dependency management, into a `ManagementServer`
+
+```go
+type MetadataServer struct{
+  alertingv2.UnsafeAlarmConfig
+  alertingv2.UnsafeEndpointConfig
+  alertingv2.UnsafeRoutingConfig
+  alertingv2.UnsafeTenantConfig
+}
+```
+
+```go
+type ManagementServer struct {
+  alertingv2.UnsafeNotificationService
+  alertingv1.UnsafeRuntimeService
+}
+```
+
+### Protobuf changes
+
+- move `AlertCondition` messages to `Alarm` messages in v2 apis
+
+- `Alarm` should implement an equality comparer interface, which handles deduping equal alarms when multiple tenant configs
+  attempt to apply the same spec
+
+```go
+var _ util.EqualityComparer = (*Alarm)(nil)
+```
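The deduplication described above could look roughly like the following. It is a sketch only: the generic `EqualityComparer` shape, the trimmed-down `Alarm` fields and the `Dedupe` helper are all assumptions made for illustration, and the real `util.EqualityComparer` may be defined differently.

```go
package main

import "fmt"

// EqualityComparer is a hypothetical shape for util.EqualityComparer.
type EqualityComparer[T any] interface {
	Equal(other T) bool
}

// Alarm is a trimmed-down stand-in for the v2 message.
type Alarm struct {
	Name      string
	ClusterID string
	Query     string
}

// Equal treats two alarms as the same spec when the fields that define the
// condition match; which fields count is an assumption of this sketch.
func (a *Alarm) Equal(other *Alarm) bool {
	return a.Name == other.Name &&
		a.ClusterID == other.ClusterID &&
		a.Query == other.Query
}

var _ EqualityComparer[*Alarm] = (*Alarm)(nil)

// Dedupe keeps the first occurrence of each item and drops later equal ones.
// It is quadratic, which is acceptable for small batches.
func Dedupe[T EqualityComparer[T]](items []T) []T {
	var out []T
	for _, it := range items {
		dup := false
		for _, kept := range out {
			if kept.Equal(it) {
				dup = true
				break
			}
		}
		if !dup {
			out = append(out, it)
		}
	}
	return out
}

func main() {
	// Two tenant configs that both want the same alarm on cluster "c1".
	fromGlobal := &Alarm{Name: "NodeDown", ClusterID: "c1", Query: "up == 0"}
	fromScoped := &Alarm{Name: "NodeDown", ClusterID: "c1", Query: "up == 0"}
	other := &Alarm{Name: "HighCPU", ClusterID: "c1", Query: "cpu > 0.9"}

	unique := Dedupe([]*Alarm{fromGlobal, fromScoped, other})
	fmt.Println(len(unique))
}
```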
+
+```proto
+// v2
+message Alarm {
+  string id = 1;
+  string name = 2;
+  string description = 3;
+  core.Reference clusterId = 4;
+  // OpniSeverity is required by the Opni system
+  OpniSeverity severity = 5;
+  // hold implementation details
+  //
+  // - rate limiting configs
+  // - send-resolved=yes/no
+  // - last-updated
+  // - silence-info
+  map<string,string> properties = 6;
+
+  // hold message content variables from Alertmanager format
+  //
+  // Body & Description are now set here using specific Keys
+  //
+  // Users should be able to add their own custom annotations from the UI as well
+  map<string,string> annotations = 7;
+
+  // labels for custom routing
+  map<string,string> routingLabels = 8;
+
+  // condition spec
+  AlertTypeDetails alertType = 9;
+  core.ReferenceList attachedEndpoints = 10;
+}
+```
+
+- move `AlertEndpoint` messages to `Endpoint` messages in v2 apis
+
+```proto
+// v2
+message Endpoint {
+  string id = 1;
+  string name = 2;
+  string description = 3;
+  // Holds implementation details
+  //
+  // - holds receive-notification = on/off
+  // - holds last updated time
+  // - default/fallback rate limiting config when using label matchers
+  map<string,string> properties = 4;
+  // custom routing labels.
+  map<string,string> labels = 5;
+  oneof endpoint {
+    SlackEndpoint slack = 6;
+    EmailEndpoint email = 7;
+    PagerDutyEndpoint pagerDuty = 8;
+    WebhookEndpoint webhook = 9;
+  }
+}
+```
+
+### Alerting Storage Clientset migration enhancement
+
+Since we will be releasing a new version of the API, the storage clientset needs to migrate buckets from
+one object type to another, for example `alertingv1.AlertCondition` -> `alertingv2.Alarm`.
+
+- Have separate buckets for each relevant `alertingv2` object type
+- The alerting storage clientset interfaces should be changed to use `proto.Message`
+- When `Use` is invoked on start-up, the clientset should use reflection to check the contents of each bucket version, then migrate old buckets to new buckets
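A rough sketch of the start-up migration, assuming an in-memory bucket in place of the real storage backend. `AlertConditionV1`, `AlarmV2`, `Bucket` and `MigrateBucket` are hypothetical names, and plain structs stand in for the generated `proto.Message` types.

```go
package main

import (
	"fmt"
	"reflect"
)

// AlertConditionV1 and AlarmV2 stand in for the generated v1 and v2 types.
type AlertConditionV1 struct {
	ID   string
	Name string
}

type AlarmV2 struct {
	ID   string
	Name string
}

// Bucket is a hypothetical in-memory stand-in for a storage bucket.
type Bucket map[string]any

// MigrateBucket inspects the dynamic type of every stored object and returns
// a new bucket holding only v2 objects. Unknown types abort the migration so
// that no data is silently dropped.
func MigrateBucket(old Bucket) (Bucket, error) {
	v1Type := reflect.TypeOf(AlertConditionV1{})
	v2Type := reflect.TypeOf(AlarmV2{})

	migrated := Bucket{}
	for key, obj := range old {
		switch reflect.TypeOf(obj) {
		case v1Type:
			c := obj.(AlertConditionV1)
			migrated[key] = AlarmV2{ID: c.ID, Name: c.Name}
		case v2Type:
			migrated[key] = obj // already migrated
		default:
			return nil, fmt.Errorf("key %q: unsupported type %T", key, obj)
		}
	}
	return migrated, nil
}

func main() {
	old := Bucket{
		"a": AlertConditionV1{ID: "a", Name: "NodeDown"},
		"b": AlarmV2{ID: "b", Name: "HighCPU"},
	}
	migrated, err := MigrateBucket(old)
	if err != nil {
		fmt.Println("migration failed:", err)
		return
	}
	fmt.Println(len(migrated), "objects in the v2 bucket")
}
```

Writing into a fresh bucket rather than mutating the old one keeps the migration safe to re-run if start-up is interrupted.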
+
+### CRUD BatchTenantConfiguration API(s)
+
+```proto
+// Also referred to as BTC
+message BatchTenantConfiguration {
+    // opaque
+    string id = 1;
+    // name the user sets
+    string name = 2;
+    // manually selected clusters to apply this template to
+    core.ReferenceList clusters = 3;
+    // apply this template to any clusters matching these labels
+    map<string,string> clusterLabelMatchers = 4;
+    // alarms to apply by default to this
+    AlertConditionsList conditions = 5;
+}
+
+message BatchTenantConfigurationList{
+    repeated BatchTenantConfiguration items = 1;
+}
+
+
+service TenantConfig {
+    // Tenant template
+    rpc ListTenantConfig(google.protobuf.Empty) returns (BatchTenantConfigurationList){}
+
+    rpc CreateTenantConfig(BatchTenantConfiguration) returns (google.protobuf.Empty){}
+
+    rpc GetTenantConfig(core.Reference) returns (BatchTenantConfiguration){}
+
+    rpc UpdateTenantConfig(BatchTenantConfiguration) returns (google.protobuf.Empty){}
+
+    rpc DeleteTenantConfig(BatchTenantConfiguration) returns (google.protobuf.Empty){}
+}
+
+```
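How a management watcher might decide whether a tenant configuration applies to a cluster can be sketched as follows. The types and `AppliesTo` are hypothetical. Two further assumptions: a configuration with no clusters and no matchers is treated as the global one, and matchers are compared by exact equality rather than the full Alertmanager matcher syntax.

```go
package main

import "fmt"

// TenantConfig is a trimmed-down stand-in for BatchTenantConfiguration.
type TenantConfig struct {
	Name                 string
	ClusterIDs           []string
	ClusterLabelMatchers map[string]string
}

// Cluster is a hypothetical view of a managed cluster.
type Cluster struct {
	ID     string
	Labels map[string]string
}

// AppliesTo reports whether tc should be applied to c.
func AppliesTo(tc TenantConfig, c Cluster) bool {
	// Assumption: no scoping at all means the config is global.
	if len(tc.ClusterIDs) == 0 && len(tc.ClusterLabelMatchers) == 0 {
		return true
	}
	// Manually selected clusters always match.
	for _, id := range tc.ClusterIDs {
		if id == c.ID {
			return true
		}
	}
	if len(tc.ClusterLabelMatchers) == 0 {
		return false
	}
	// Every matcher must be satisfied by the cluster's labels.
	for k, want := range tc.ClusterLabelMatchers {
		if got, ok := c.Labels[k]; !ok || got != want {
			return false
		}
	}
	return true
}

func main() {
	prod := Cluster{ID: "c1", Labels: map[string]string{"env": "prod"}}
	dev := Cluster{ID: "c2", Labels: map[string]string{"env": "dev"}}

	tc := TenantConfig{Name: "prod-defaults", ClusterLabelMatchers: map[string]string{"env": "prod"}}
	fmt.Println(AppliesTo(tc, prod), AppliesTo(tc, dev))
}
```

A watcher would re-evaluate this predicate whenever a cluster is created, updated or deleted, or when the configuration itself changes, and apply or delete the alarms accordingly.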
+
+### UI/UX
+
+See [acceptance criteria](#acceptance-criteria)
+
+## Supporting documents
+
+- Generic Prometheus Rule group format : https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/
+- AlertManager label matchers format : https://github.com/prometheus/alertmanager/blob/fd0929ba9fc58737a9c91f24771862692fa72d17/pkg/labels/matcher.go
+
+## Risks and contingencies:
+
+N/A
+
+## Level of Effort:
+
+### Backend
+
+Approximately 5-6 weeks
+
+- 3 weeks v2 api
+
+  - 1 week APIs
+  - 1 week support batch routing
+  - 1 week cli
+
+- 1-2 days batch prometheus rule group (requires v2 -- annotations and label matchers should be imported as well)
+
+- 1 1/2 weeks CRUD Tenant configurations (requires v2 -- significant internal rework of v2 API will impact implementation details of tenant configs)
+
+  - 1 week API implementation
+  - 2-3 days testing
+
+### UI
+
+??? days
+
+## Resources:
+
+- git alerting staging branch
+- 1 Opni Upstream & 1 Opni Downstream cluster
diff --git a/images/alerting/tenant-tab.png b/images/alerting/tenant-tab.png