diff --git a/enhancements/alerting/2023-02-21-alerting-batching-tenant.md b/enhancements/alerting/2023-02-21-alerting-batching-tenant.md
new file mode 100644
index 0000000000..2f1def6960
--- /dev/null
+++ b/enhancements/alerting/2023-02-21-alerting-batching-tenant.md
@@ -0,0 +1,317 @@
+# Title
+
+Alerting CRUD batching configurations, aimed at multi-tenant use cases
+
+## Summary
+
+Currently, alerting configurations must be created one by one in the UI. The alerting plugin should support batch-applying configurations and setting default batch templates for clusters.
+
+## Use Case
+
+- End-users want to manually import a large number of configurations for Opni-Alarms
+- End-users want ways to batch connect conditions to endpoints
+- End-users want to set defaults for all their clusters or particular clusters
+
+## Benefits
+
+- Improved UX
+- Better management of Opni-Configurations
+
+## Impact
+
+- Rework of Alerting protocol buffers used to store user configs
+- Additional logic for opni routing interface
+
+## Acceptance criteria
+
+- [ ] Batch import raw configurations (Prometheus Rule Groups) to Opni-Alerting
+
+- [ ] Support batch routing using arbitrary alertmanager label matchers
+
+- [ ] Persist Tenant Configurations (`TC`) that specify batches of alarms
+  - [ ] Support a global `TC`
+  - [ ] Allow for `TC` to be scoped to particular label matchers or cluster ids
+  - [ ] Management watcher applies/deletes these defaults when clusters that match the labels are created, updated or deleted, or when the `TC` changes
+
+**Important:** `TC` will initially support only `PrometheusQuery` AlertConditions
+
+- [ ] Alerting CLI for all alerting APIs for a more seamless batch update experience
+
+### UI/UX
+
+- [ ] Use v2 APIs
+
+  - [ ] When listing `Endpoint/Alarm` in table format, show routing labels (& severity for `Alarms`)
+
+- [ ] Import Prometheus rule groups to Opni Alerting
+
+  - [ ] Button on the Alarm page called `Import Prometheus`
+  - [ ] API returns a list of converted alert conditions
+  - [ ] Users select which clusters to apply these to
+  - [ ] UI will have to individually create the Alarms
+
+- [ ] Admin install page gets a tenant template tab that lists, creates, updates, and deletes `TC`
+
+![](../../images/alerting/tenant-tab.png)
+
+- [ ] Pasting a raw Prometheus rule group in the tenant template tab invokes `ConvertPrometheusRuleGroup` to add them to `TC`
+
+## Implementation Details
+
+### Batch import configurations
+
+```proto
+
+message AlertConditionsList {
+  repeated AlertCondition items = 1;
+}
+
+service AlertingConditions {
+  // !! Does not create these rules; only converts them to the Opni format
+  rpc ConvertPrometheusRuleGroup(RawPrometheusRuleGroup) returns (AlertConditionsList){}
+}
+
+message RawPrometheusRuleGroup{
+  bytes data = 1;
+}
+```
+
+### Batch Routing
+
+- Requires a new version of the API, `alertingv2`
+- !! Always generate an OpenAPI spec for each service
+
+```proto
+service AlarmConfig{
+  // Only CRUD Alarms
+  rpc CreateAlarm(...)returns(...){}
+  rpc GetAlarm(...)returns(...){}
+  rpc ListAlarms(...)returns(...){}
+  rpc UpdateAlarm(...)returns(...){}
+  rpc DeleteAlarm(...)returns(...){}
+  rpc CloneAlarm(...)returns(...){}
+
+  //
+}
+
+service EndpointConfig{
+  // Only CRUD endpoints
+  rpc CreateEndpoint(...)returns(...){}
+  rpc GetEndpoint(...)returns(...){}
+  rpc ListEndpoint(...)returns(...){}
+  rpc UpdateEndpoint(...)returns(...){}
+  rpc DeleteEndpoint(...)returns(...){}
+  rpc ToggleNotifications(...) returns (...) {}
+}
+
+// see below
+service TenantConfig{
+  rpc CreateTenantConfig(...)returns(...){}
+  rpc GetTenantConfig(...)returns(...){}
+  rpc ListTenantConfig(...)returns(...){}
+  rpc UpdateTenantConfig(...)returns(...){}
+  rpc DeleteTenantConfig(...)returns(...){}
+}
+// metadata aggregator for routing
+service RoutingConfig{
+  rpc ListRoutingRelationships(...) returns(...){}
+  //...
+  // Eventually can look at explicitly setting tenant routing rules here
+  // for plain notifications, or enforce other rules we come up with
+}
+
+
+service NotificationService{
+  rpc TestAlertEndpoint(...) returns(...){}
+  rpc SendAlert(...) returns(...){}
+  rpc ResolveAlert(...) returns(...){}
+  rpc SendNotification(...) returns(...){}
+
+  rpc GetAlarmStatus(...) returns(...) {}
+  rpc ListAlarmsWithStatus(...) returns(...) {}
+  rpc ActivateSilence(...) returns(...) {}
+  rpc DeactivateSilence(...) returns (...) {}
+  rpc Timeline(...) returns(...) {}
+}
+
+
+service RuntimeService{
+  rpc ConnectRemoteSyncer(...) returns (stream ...) {}
+}
+```
+
+- Aim to physically separate, within the alerting gateway plugin, the logic for
+  - config CRUD across services into a `MetadataServer`
+  - runtime state & dependency management into a `ManagementServer`
+
+```go
+type MetadataServer struct{
+	alertingv2.UnsafeAlarmConfigServer
+	alertingv2.UnsafeEndpointConfigServer
+	alertingv2.UnsafeRoutingConfigServer
+	alertingv2.UnsafeTenantConfigServer
+}
+```
+
+```go
+type ManagementServer struct {
+	alertingv2.UnsafeNotificationServiceServer
+	alertingv1.UnsafeRuntimeServiceServer
+}
+```
+
+### Protobuf changes
+
+- Move `AlertCondition` messages to `Alarm` messages in the v2 APIs
+
+- `Alarm` should implement an equality comparer interface, which dedupes equal alarms when multiple tenant configs
+  attempt to apply the same spec
+
+```go
+var _ util.EqualityComparer = (*Alarm)(nil)
+```
+
+```proto
+// v2
+message Alarm {
+  string id = 1;
+  string name = 2;
+  string description = 3;
+  core.Reference clusterId = 4;
+  // OpniSeverity is required by the Opni system
+  OpniSeverity severity = 5;
+  // hold implementation details
+  //
+  // - rate limiting configs
+  // - send-resolved=yes/no
+  // - last-updated
+  // - silence-info
+  map<string, string> properties = 6;
+
+  // hold message content variables from Alertmanager format
+  //
+  // Body & Description are now set here using specific Keys
+  //
+  // Users should be able to add their own custom annotations from the UI as well
+  map<string, string> annotations = 7;
+
+  // labels for custom routing
+  map<string, string> routingLabels = 8;
+
+  // condition spec
+  AlertTypeDetails alertType = 9;
+  core.ReferenceList attachedEndpoints = 10;
+}
+```
+
+- Move `AlertEndpoint` messages to `Endpoint` messages in the v2 APIs
+
+```proto
+// v2
+message Endpoint {
+  string id = 1;
+  string name = 2;
+  string description = 3;
+  // Holds implementation details
+  //
+  // - holds receive-notification = on/off
+  // - holds last updated time
+  // - default/fallback rate limiting config when using label matchers
+  map<string, string> properties = 4;
+  // custom routing labels.
+  map<string, string> labels = 5;
+  oneof endpoint {
+    SlackEndpoint slack = 6;
+    EmailEndpoint email = 7;
+    PagerDutyEndpoint pagerDuty = 8;
+    WebhookEndpoint webhook = 9;
+  }
+}
+```
+
+### Alerting Storage Clientset migration enhancement
+
+Since we will be releasing a new version of the API, the storage clientset needs to migrate buckets from
+one object type to another, for example: `alertingv1.AlertCondition` -> `alertingv2.Alarm`
+
+- Have a separate bucket for each relevant `alertingv2` object type
+- The alerting storage clientset interfaces should be changed to use `proto.Message`
+- When `Use` is invoked on start-up, the clientset uses reflection to check the contents of each bucket version, then migrates old buckets to new buckets
+
+### CRUD BatchTenantConfiguration API(s)
+
+```proto
+// Also referred to as BTC
+message BatchTenantConfiguration {
+  // opaque
+  string id = 1;
+  // name the user sets
+  string name = 2;
+  // manually selected clusters to apply this template to
+  core.ReferenceList clusters = 3;
+  // apply this template to any clusters matching these labels
+  map<string, string> clusterLabelMatchers = 4;
+  // alarms to apply by default to this
+  AlertConditionsList conditions = 5;
+}
+
+message BatchTenantConfigurationList{
+  repeated BatchTenantConfiguration items = 1;
+}
+
+
+service TenantConfig {
+  // Tenant template
+  rpc ListTenantConfig(google.protobuf.Empty) returns (BatchTenantConfigurationList){}
+
+  rpc CreateTenantConfig(BatchTenantConfiguration) returns (google.protobuf.Empty){}
+
+  rpc GetTenantConfig(core.Reference) returns (BatchTenantConfiguration){}
+
+  rpc UpdateTenantConfig(BatchTenantConfiguration) returns (google.protobuf.Empty){}
+
+  rpc DeleteTenantConfig(BatchTenantConfiguration) returns (google.protobuf.Empty){}
+}
+
+```
+
+### UI/UX
+
+See [acceptance criteria](#acceptance-criteria)
+
+## Supporting documents
+
+- Generic Prometheus rule group format: https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/
+- AlertManager label matchers format: https://github.com/prometheus/alertmanager/blob/fd0929ba9fc58737a9c91f24771862692fa72d17/pkg/labels/matcher.go
+
+## Risks and contingencies:
+
+N/A
+
+## Level of Effort:
+
+### Backend
+
+Approximately 5-6 weeks
+
+- 3 weeks v2 API
+
+  - 1 week APIs
+  - 1 week support batch routing
+  - 1 week CLI
+
+- 1-2 days batch Prometheus rule group (requires v2 -- annotations and label matchers should be imported as well)
+
+- 1 1/2 weeks CRUD Tenant configurations (requires v2 -- significant internal rework of v2 API will impact implementation details of tenant configs)
+
+  - 1 week API implementation
+  - 2-3 days testing
+
+### UI
+
+??? days
+
+## Resources:
+
+- git alerting staging branch
+- 1 Opni Upstream & 1 Opni Downstream cluster
diff --git a/images/alerting/tenant-tab.png b/images/alerting/tenant-tab.png
new file mode 100644
index 0000000000..0f8139c561
Binary files /dev/null and b/images/alerting/tenant-tab.png differ