add alerting tenant + batching improvements oep
alexandreLamarre committed Feb 23, 2023
1 parent bda0139 commit 25f716d
Showing 2 changed files with 317 additions and 0 deletions.
317 changes: 317 additions & 0 deletions enhancements/alerting/2023-02-21-alerting-batching-tenant.md
@@ -0,0 +1,317 @@
# Title

Alerting CRUD batching configurations, aimed at multi-tenant use cases

## Summary

Currently, alerting configurations must be created one-by-one in the UI. The alerting plugin should allow for ways to batch apply configurations and allow setting default batch templates for clusters for ease of use.

## Use Case

- End-users want to manually import a large number of configurations for Opni-Alarms
- End-users want ways to batch connect conditions to endpoints
- End-users want to set defaults for all their clusters or particular clusters

## Benefits

- Improved UX
- Better management of Opni-Configurations

## Impact

- Rework of Alerting protocol buffers used to store user configs
- Additional logic for opni routing interface

## Acceptance criteria

- [ ] Batch import raw configurations (Prometheus Rule Groups) to Opni-Alerting

- [ ] Support batch routing using arbitrary alertmanager label matchers

- [ ] Persist Tenant Configurations (`TC`) that specify batches of alarms
  - [ ] Support a global `TC`
  - [ ] Allow for `TC` to be scoped to particular label matchers or cluster ids
  - [ ] Management watcher applies/deletes these defaults when clusters that match the labels are created, updated or deleted, or when the `TC` changes

**Important:** `TC` will initially support only `PrometheusQuery` AlertConditions

- [ ] Alerting CLI for all alerting APIs for a more seamless batch update experience

### UI/UX

- [ ] Use v2 APIs

- [ ] When listing `Endpoint/Alarm` in table format, show routing labels (& severity for `Alarms`)

- [ ] Import Prometheus rule groups to Opni-Alerting
  - [ ] Button in Alarm page called `Import Prometheus`
  - [ ] API returns a list
  - [ ] Users select what clusters to apply these to
  - [ ] UI will have to individually create the Alarms

- [ ] Admin install page tenant template tab which lists, creates, updates, deletes `TC`

![](../../images/alerting/tenant-tab.png)

- [ ] Pasting raw prometheus rule group in tenant template tab invokes `ConvertPrometheusRuleGroup` to add them to `TC`

## Implementation Details

### Batch import configurations

```proto
message AlertConditionsList {
  repeated AlertCondition items = 1;
}
service AlertingConditions {
  //!! Does not create these rules, only converts them to the opni format
  rpc ConvertPrometheusRuleGroup(RawPrometheusRuleGroup) returns (AlertConditionsList) {}
}
message RawPrometheusRuleGroup {
  bytes data = 1;
}
```
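
As a rough illustration of the "convert, do not create" contract above, the following Go sketch maps an already-decoded rule group to a list of conditions. The `RuleGroup`, `PrometheusRule` and `AlertCondition` structs and the `ConvertRuleGroup` function are hypothetical stand-ins, not the real Opni or Prometheus types, and skipping recording rules is an assumption.

```go
package main

import "fmt"

// PrometheusRule mirrors the fields of one rule in a Prometheus rule group.
// Hypothetical stand-in for whatever type the raw bytes are decoded into.
type PrometheusRule struct {
	Alert       string // set for alerting rules
	Record      string // set for recording rules
	Expr        string
	For         string
	Labels      map[string]string
	Annotations map[string]string
}

// RuleGroup is a hypothetical decoded rule group.
type RuleGroup struct {
	Name  string
	Rules []PrometheusRule
}

// AlertCondition is a simplified, hypothetical stand-in for the Opni message.
type AlertCondition struct {
	Name        string
	Query       string
	For         string
	Labels      map[string]string
	Annotations map[string]string
}

// ConvertRuleGroup converts alerting rules to conditions without persisting
// anything. Recording rules (no Alert name) are skipped; this is an assumption.
func ConvertRuleGroup(g RuleGroup) []AlertCondition {
	out := make([]AlertCondition, 0, len(g.Rules))
	for _, r := range g.Rules {
		if r.Alert == "" {
			continue
		}
		out = append(out, AlertCondition{
			Name:        r.Alert,
			Query:       r.Expr,
			For:         r.For,
			Labels:      r.Labels,
			Annotations: r.Annotations,
		})
	}
	return out
}

func main() {
	group := RuleGroup{
		Name: "node",
		Rules: []PrometheusRule{
			{Alert: "HighCPU", Expr: "cpu_usage > 0.9", For: "5m"},
			{Record: "job:http:rate5m", Expr: "sum(rate(http_requests_total[5m]))"},
		},
	}
	for _, c := range ConvertRuleGroup(group) {
		fmt.Printf("%s: %s\n", c.Name, c.Query)
	}
}
```

The caller (the UI, per the acceptance criteria) would then create each returned condition individually for the clusters the user selects.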

### Batch Routing

- Requires a new version of the API, `alertingv2`
- Important: always generate an OpenAPI spec for each service

```proto
service AlarmConfig {
  // Only CRUD Alarms
  rpc CreateAlarm(...) returns (...) {}
  rpc GetAlarm(...) returns (...) {}
  rpc ListAlarms(...) returns (...) {}
  rpc UpdateAlarm(...) returns (...) {}
  rpc DeleteAlarm(...) returns (...) {}
  rpc CloneAlarm(...) returns (...) {}
}
service EndpointConfig {
  // Only CRUD endpoints
  rpc CreateEndpoint(...) returns (...) {}
  rpc GetEndpoint(...) returns (...) {}
  rpc ListEndpoint(...) returns (...) {}
  rpc UpdateEndpoint(...) returns (...) {}
  rpc DeleteEndpoint(...) returns (...) {}
  rpc ToggleNotifications(...) returns (...) {}
}
// see below
service TenantConfig {
  rpc CreateTenantConfig(...) returns (...) {}
  rpc GetTenantConfig(...) returns (...) {}
  rpc ListTenantConfig(...) returns (...) {}
  rpc UpdateTenantConfig(...) returns (...) {}
  rpc DeleteTenantConfig(...) returns (...) {}
}
// metadata aggregator for routing
service RoutingConfig {
  rpc ListRoutingRelationships(...) returns (...) {}
  //...
  // Eventually can look at explicitly setting tenant routing rules here
  // for plain notifications, or enforce other rules we come up with
}
service NotificationService {
  rpc TestAlertEndpoint(...) returns (...) {}
  rpc SendAlert(...) returns (...) {}
  rpc ResolveAlert(...) returns (...) {}
  rpc SendNotification(...) returns (...) {}
  rpc GetAlarmStatus(...) returns (...) {}
  rpc ListAlarmsWithStatus(...) returns (...) {}
  rpc ActivateSilence(...) returns (...) {}
  rpc DeactivateSilence(...) returns (...) {}
  rpc Timeline(...) returns (...) {}
}
service RuntimeService {
  rpc ConnectRemoteSyncer(...) returns (stream ...) {}
}
```

- In the alerting gateway plugin, aim to physically separate the logic for:
  - config CRUD across services, encapsulated in a `MetadataServer`
  - runtime state & dependency management, encapsulated in a `ManagementServer`

```go
type MetadataServer struct {
	alertingv2.UnsafeAlarmConfig
	alertingv2.UnsafeEndpointConfig
	alertingv2.UnsafeRoutingConfig
	alertingv2.UnsafeTenantConfig
}
```

```go
type ManagementServer struct {
	alertingv2.UnsafeNotificationService
	alertingv1.UnsafeRuntimeService
}
```

### Protobuf changes

- Move `AlertCondition` messages to `Alarm` messages in v2 APIs

- `Alarm` should implement an equality comparer interface, so that equal alarms are deduplicated when multiple tenant configs attempt to apply the same spec

```go
var _ util.EqualityComparer = (*Alarm)(nil)
```
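
A minimal sketch of how such a comparer could drive deduplication is shown here. The shape of `EqualityComparer` (a single `Equal` method) is an assumption, since the source does not show the `util` interface, and the `Alarm` struct is a reduced, hypothetical version of the v2 message. Ignoring the `ID` field when comparing is also an assumption.

```go
package main

import "fmt"

// EqualityComparer is a hypothetical shape for util.EqualityComparer.
type EqualityComparer interface {
	Equal(other any) bool
}

// Alarm is a reduced, hypothetical version of the v2 Alarm message.
type Alarm struct {
	ID            string
	Name          string
	ClusterID     string
	Query         string
	RoutingLabels map[string]string
}

// Compile-time check, as in the proposal.
var _ EqualityComparer = (*Alarm)(nil)

// sameLabels reports whether two label maps hold identical key/value pairs.
// A nil map and an empty map are treated as equal.
func sameLabels(a, b map[string]string) bool {
	if len(a) != len(b) {
		return false
	}
	for k, v := range a {
		if w, ok := b[k]; !ok || w != v {
			return false
		}
	}
	return true
}

// Equal compares the alarm spec and ignores the ID (assumption: IDs are
// generated per creation, so two tenant configs yield different IDs).
func (a *Alarm) Equal(other any) bool {
	b, ok := other.(*Alarm)
	if !ok || a == nil || b == nil {
		return false
	}
	return a.Name == b.Name &&
		a.ClusterID == b.ClusterID &&
		a.Query == b.Query &&
		sameLabels(a.RoutingLabels, b.RoutingLabels)
}

// Dedupe keeps the first occurrence of each distinct alarm spec.
func Dedupe(alarms []*Alarm) []*Alarm {
	var out []*Alarm
	for _, a := range alarms {
		dup := false
		for _, kept := range out {
			if kept.Equal(a) {
				dup = true
				break
			}
		}
		if !dup {
			out = append(out, a)
		}
	}
	return out
}

func main() {
	fromGlobal := &Alarm{ID: "1", Name: "HighCPU", ClusterID: "c1", Query: "cpu > 0.9"}
	fromScoped := &Alarm{ID: "2", Name: "HighCPU", ClusterID: "c1", Query: "cpu > 0.9"}
	fmt.Println(len(Dedupe([]*Alarm{fromGlobal, fromScoped})))
}
```

The pairwise scan is quadratic, which is likely acceptable for the handful of alarms a tenant config applies per cluster; a hash of the compared fields would be the alternative for larger batches.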

```proto
// v2
message Alarm {
  string id = 1;
  string name = 2;
  string description = 3;
  core.Reference clusterId = 4;
  // OpniSeverity is required by the Opni system
  OpniSeverity severity = 5;
  // hold implementation details
  //
  // - rate limiting configs
  // - send-resolved=yes/no
  // - last-updated
  // - silence-info
  map<string,string> properties = 6;
  // hold message content variables from Alertmanager format
  //
  // Body & Description are now set here using specific Keys
  //
  // Users should be able to add their own custom annotations from the UI as well
  map<string,string> annotations = 7;
  // labels for custom routing
  map<string,string> routingLabels = 8;
  // condition spec
  AlertTypeDetails alertType = 9;
  core.ReferenceList attachedEndpoints = 10;
}
```

- Move `AlertEndpoint` messages to `Endpoint` messages in v2 APIs

```proto
// v2
message Endpoint {
  string id = 1;
  string name = 2;
  string description = 3;
  // Holds implementation details
  //
  // - holds receive-notification = on/off
  // - holds last updated time
  // - default/fallback rate limiting config when using label matchers
  map<string,string> properties = 4;
  // custom routing labels.
  map<string,string> labels = 5;
  oneof endpoint {
    SlackEndpoint slack = 6;
    EmailEndpoint email = 7;
    PagerDutyEndpoint pagerDuty = 8;
    WebhookEndpoint webhook = 9;
  }
}
```
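
One way the routing labels on `Alarm` and `Endpoint` could be combined into routing relationships is sketched below. The matching semantics are an assumption: an endpoint is taken to match an alarm when every endpoint label appears with the same value in the alarm's routing labels, and an endpoint without labels matches nothing. Real Alertmanager matchers also support `!=`, `=~` and `!~`, which this sketch does not cover. All type and function names are hypothetical.

```go
package main

import "fmt"

// Alarm and Endpoint are reduced, hypothetical versions of the v2 messages.
type Alarm struct {
	ID            string
	RoutingLabels map[string]string
}

type Endpoint struct {
	ID     string
	Labels map[string]string
}

// Matches reports whether every endpoint label is present with the same value
// in the alarm's routing labels. Equality-only matching is an assumption.
func Matches(endpointLabels, alarmLabels map[string]string) bool {
	if len(endpointLabels) == 0 {
		return false // assumption: unlabeled endpoints are never batch-routed
	}
	for k, v := range endpointLabels {
		if got, ok := alarmLabels[k]; !ok || got != v {
			return false
		}
	}
	return true
}

// RoutingRelationships returns, for each alarm ID, the IDs of the endpoints it
// routes to, in the order the endpoints were given.
func RoutingRelationships(alarms []Alarm, endpoints []Endpoint) map[string][]string {
	rel := make(map[string][]string)
	for _, a := range alarms {
		for _, e := range endpoints {
			if Matches(e.Labels, a.RoutingLabels) {
				rel[a.ID] = append(rel[a.ID], e.ID)
			}
		}
	}
	return rel
}

func main() {
	alarms := []Alarm{{ID: "high-cpu", RoutingLabels: map[string]string{"team": "infra", "env": "prod"}}}
	endpoints := []Endpoint{
		{ID: "slack-infra", Labels: map[string]string{"team": "infra"}},
		{ID: "email-web", Labels: map[string]string{"team": "web"}},
	}
	fmt.Println(RoutingRelationships(alarms, endpoints))
}
```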

### Alerting Storage Clientset migration enhancement

Since we will be releasing a new version of the API, the storage clientset needs to migrate buckets from
one object type to another, for example `alertingv1.AlertCondition` -> `alertingv2.Alarm`.

- Have a separate bucket for each relevant `alertingv2` object type
- Change the alerting storage clientset interfaces to use `proto.Message`
- When `Use` is invoked on start-up, the clientset should use reflection to check the contents of each bucket version, then migrate old buckets to new buckets
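
The migration step could look roughly like the sketch below. It is a simplification under stated assumptions: buckets are modelled as in-memory maps, the v1 and v2 objects are plain hypothetical structs rather than `proto.Message` values, and conversion is an explicit function rather than reflection. Leaving the old bucket in place after migrating is also an assumption.

```go
package main

import "fmt"

// AlertConditionV1 and AlarmV2 are hypothetical stand-ins for
// alertingv1.AlertCondition and alertingv2.Alarm.
type AlertConditionV1 struct {
	ID          string
	Name        string
	Description string
}

type AlarmV2 struct {
	ID          string
	Name        string
	Description string
}

// Store models one bucket per object type as an in-memory map keyed by ID.
type Store struct {
	ConditionsV1 map[string]AlertConditionV1
	AlarmsV2     map[string]AlarmV2
}

// Migrate copies every v1 condition that has no v2 counterpart into the v2
// bucket and returns how many objects were migrated. It is idempotent, so it
// can run on every start-up. The v1 bucket is left untouched (assumption).
func (s *Store) Migrate() int {
	if s.AlarmsV2 == nil {
		s.AlarmsV2 = make(map[string]AlarmV2)
	}
	migrated := 0
	for id, old := range s.ConditionsV1 {
		if _, exists := s.AlarmsV2[id]; exists {
			continue // already migrated, or created directly through v2
		}
		s.AlarmsV2[id] = AlarmV2{ID: old.ID, Name: old.Name, Description: old.Description}
		migrated++
	}
	return migrated
}

func main() {
	s := &Store{ConditionsV1: map[string]AlertConditionV1{
		"a": {ID: "a", Name: "HighCPU"},
		"b": {ID: "b", Name: "DiskFull"},
	}}
	fmt.Println(s.Migrate(), s.Migrate())
}
```

Skipping IDs that already exist in the v2 bucket keeps a restart from overwriting alarms that were edited after the first migration.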

### CRUD BatchTenantConfiguration API(s)

```proto
// Also referred to as BTC
message BatchTenantConfiguration {
// opaque
string id = 1;
// name the user sets
string name = 2;
// manually selected clusters to apply this template to
core.ReferenceList clusters = 3;
// apply this template to any clusters matching these labels
map<string,string> clusterLabelMatchers = 4;
// alarms to apply by default to this
AlertConditionList conditions = 5;
}
message BatchTenantConfigurationList{
repeated items BatchTenantConfiguration = 1;
}
service TenantConfigs {
// Tenant template
rpc ListTenantConfig(google.protobuf.Empty) returns (BatchTenantConfigurationList){}
rpc CreateTenantConfig(BatchTenantConfiguration) returns (google.protobuf.Empty){}
rpc GetTenantConfig(core.Reference) returns (BatchTenantConfiguration){}
rpc UpdateTenantConfig(BatchTenantConfiguration) returns google.protobuf.Empty{}
rpc DeleteTenantConfig(BatchTenantConfiguration) returns google.protobuf.Empty{}
}
```
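
To illustrate how a management watcher might decide which tenant configs apply to a cluster, here is a small Go sketch. The selection rules are assumptions: a config with no explicit clusters and no label matchers is treated as the global `TC`, an explicit cluster id always matches, and label matchers are compared by plain equality. The `Cluster` and `TenantConfig` structs and both functions are hypothetical.

```go
package main

import "fmt"

// Cluster and TenantConfig are reduced, hypothetical versions of the real types.
type Cluster struct {
	ID     string
	Labels map[string]string
}

type TenantConfig struct {
	Name                 string
	Clusters             []string
	ClusterLabelMatchers map[string]string
}

// AppliesTo reports whether the tenant config should be applied to the cluster.
func (tc TenantConfig) AppliesTo(c Cluster) bool {
	// Assumption: an unscoped config is the global TC.
	if len(tc.Clusters) == 0 && len(tc.ClusterLabelMatchers) == 0 {
		return true
	}
	// Manually selected clusters always match.
	for _, id := range tc.Clusters {
		if id == c.ID {
			return true
		}
	}
	if len(tc.ClusterLabelMatchers) == 0 {
		return false
	}
	// Every matcher must be satisfied by the cluster's labels.
	for k, v := range tc.ClusterLabelMatchers {
		if got, ok := c.Labels[k]; !ok || got != v {
			return false
		}
	}
	return true
}

// ConfigsFor returns the names of all tenant configs that apply to the
// cluster; a watcher would call this on cluster create/update and on TC change.
func ConfigsFor(c Cluster, tcs []TenantConfig) []string {
	var names []string
	for _, tc := range tcs {
		if tc.AppliesTo(c) {
			names = append(names, tc.Name)
		}
	}
	return names
}

func main() {
	tcs := []TenantConfig{
		{Name: "global"},
		{Name: "prod-only", ClusterLabelMatchers: map[string]string{"env": "prod"}},
		{Name: "pinned", Clusters: []string{"cluster-2"}},
	}
	c := Cluster{ID: "cluster-1", Labels: map[string]string{"env": "prod"}}
	fmt.Println(ConfigsFor(c, tcs))
}
```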

### UI/UX

See [acceptance criteria](#acceptance-criteria)

## Supporting documents

- Generic Prometheus Rule group format : https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/
- AlertManager label matchers format : https://github.com/prometheus/alertmanager/blob/fd0929ba9fc58737a9c91f24771862692fa72d17/pkg/labels/matcher.go

## Risks and contingencies:

N/A

## Level of Effort:

### Backend

Approximately 5-6 weeks

- 3 weeks v2 API
  - 1 week APIs
  - 1 week support batch routing
  - 1 week CLI

- 1-2 days batch prometheus rule group (requires v2 -- annotations and label matchers should be imported as well)

- 1 1/2 weeks CRUD Tenant configurations (requires v2 -- significant internal rework of v2 API will impact implementation details of tenant configs)
  - 1 week API implementation
  - 2-3 days testing

### UI

??? days

## Resources:

- git alerting staging branch
- 1 Opni Upstream & 1 Opni Downstream cluster
Binary file added images/alerting/tenant-tab.png