add alerting tenant + batching improvements oep
1 parent
bda0139
commit 25f716d
Showing 2 changed files with 317 additions and 0 deletions.
317 changes: 317 additions & 0 deletions
enhancements/alerting/2023-02-21-alerting-batching-tenant.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,317 @@ | ||
# Title | ||
|
||
Alerting CRUD batching configurations, aimed at multi-tenant use cases | ||
|
||
## Summary | ||
|
||
Currently, alerting configurations must be created one by one in the UI. The alerting plugin should support applying configurations in batches and setting default batch templates for clusters, for ease of use. | ||
|
||
## Use Case | ||
|
||
- End-users want to manually import a large number of configurations for Opni-Alarms | ||
- End-users want ways to batch connect conditions to endpoints | ||
- End-users want to set defaults for all their clusters or particular clusters | ||
|
||
## Benefits | ||
|
||
- Improved UX | ||
- Better management of Opni-Configurations | ||
|
||
## Impact | ||
|
||
- Rework of Alerting protocol buffers used to store user configs | ||
- Additional logic for opni routing interface | ||
|
||
## Acceptance criteria | ||
|
||
- [ ] Batch import raw configurations (Prometheus Rule Groups) to Opni-Alerting | ||
|
||
- [ ] Support batch routing using arbitrary alertmanager label matchers | ||
|
||
- [ ] Persist Tenant Configurations (`TC`) that specify batches of alarms | ||
- [ ] Support a global `TC` | ||
- [ ] Allow for `TC` to be scoped to particular label matchers or cluster ids | ||
- [ ] Management watcher applies/deletes these defaults when clusters that match the labels are created, updated or deleted or when the `TC` changes | ||
|
||
**Important:** `TC` will initially support only `PrometheusQuery` AlertConditions | ||
|
||
- [ ] Alerting CLI for all alerting APIs for a more seamless batch update experience | ||
|
||
### UI/UX | ||
|
||
- [ ] Use v2 APIs | ||
|
||
- [ ] When listing `Endpoint/Alarm` in table format, show routing labels (& severity for `Alarms`) | ||
|
||
- [ ] Import prometheus rule groups to opni alerting | ||
|
||
- [ ] Button in Alarm page called `Import Prometheus` | ||
- [ ] API returns a list | ||
- [ ] Users select what clusters to apply these to | ||
- [ ] UI will have to individually create the Alarms | ||
|
||
- [ ] Admin install page tenant template tab which lists, creates, updates, deletes `TC` | ||
|
||
 | ||
|
||
- [ ] Pasting raw prometheus rule group in tenant template tab invokes `ConvertPrometheusRuleGroup` to add them to `TC` | ||
|
||
## Implementation Details | ||
|
||
### Batch import configurations | ||
|
||
```proto | ||
message AlertConditionsList { | ||
repeated AlertCondition items = 1; | ||
} | ||
service AlertingConditions { | ||
//!! Does not create these rules, only converts them to the opni format | ||
rpc ConvertPrometheusRuleGroup(RawPrometheusRuleGroup) returns (AlertConditionsList){} | ||
} | ||
message RawPrometheusRuleGroup{ | ||
bytes data = 1; | ||
} | ||
``` | ||
|
||
### Batch Routing | ||
|
||
- Requires a new version of the API, `alertingv2` | ||
- !! always generate OpenAPI spec for each service | ||
|
||
```proto | ||
service AlarmConfig{ | ||
// Only CRUD Alarms | ||
rpc CreateAlarm(...)returns(...){} | ||
rpc GetAlarm(...)returns(...){} | ||
rpc ListAlarms(...)returns(...){} | ||
rpc UpdateAlarm(...)returns(...){} | ||
rpc DeleteAlarm(...)returns(...){} | ||
rpc CloneAlarm(...)returns(...){} | ||
// | ||
} | ||
service EndpointConfig{ | ||
// Only CRUD endpoints | ||
rpc CreateEndpoint(...)returns(...){} | ||
rpc GetEndpoint(...)returns(...){} | ||
rpc ListEndpoint(...)returns(...){} | ||
rpc UpdateEndpoint(...)returns(...){} | ||
rpc DeleteEndpoint(...)returns(...){} | ||
rpc ToggleNotifications(...) returns (...) {} | ||
} | ||
// see below | ||
service TenantConfig{ | ||
rpc CreateTenantConfig(...)returns(...){} | ||
rpc GetTenantConfig(...)returns(...){} | ||
rpc ListTenantConfig(...)returns(...){} | ||
rpc UpdateTenantConfig(...)returns(...){} | ||
rpc DeleteTenantConfig(...)returns(...){} | ||
} | ||
// metadata aggregator for routing | ||
service RoutingConfig{ | ||
rpc ListRoutingRelationships(...) returns(...){} | ||
//... | ||
// Eventually can look at explicitly setting tenant routing rules here | ||
// for plain notifications, or enforce other rules we come up with | ||
} | ||
service NotificationService{ | ||
rpc TestAlertEndpoint(...) returns(...){} | ||
rpc SendAlert(...) returns(...){} | ||
rpc ResolveAlert(...) returns(...){} | ||
rpc SendNotification(...) returns(...){} | ||
rpc GetAlarmStatus(...) returns(...) {} | ||
rpc ListAlarmsWithStatus(...) returns(...) {} | ||
rpc ActivateSilence(...) returns(...) {} | ||
rpc DeactivateSilence(...) returns (...) {} | ||
rpc Timeline(...) returns(...) {} | ||
} | ||
service RuntimeService{ | ||
rpc ConnectRemoteSyncer(...) returns (stream ...) {} | ||
} | ||
``` | ||
|
||
- Aim to physically separate the logic in the alerting gateway plugin, encapsulating | ||
- config CRUD across services in a `MetadataServer` | ||
- runtime states & dependency management in a `ManagementServer` | ||
|
||
```go | ||
type MetadataServer struct{ | ||
alertingv2.UnsafeAlarmConfig | ||
alertingv2.UnsafeEndpointConfig | ||
alertingv2.UnsafeRoutingConfig | ||
alertingv2.UnsafeTenantConfig | ||
} | ||
``` | ||
|
||
```go | ||
type ManagementServer struct { | ||
alertingv2.UnsafeNotificationService | ||
alertingv2.UnsafeRuntimeService | ||
} | ||
``` | ||
|
||
### Protobuf changes | ||
|
||
- move `AlertCondition` messages to `Alarm` messages in v2 apis | ||
|
||
- `Alarm` should implement an equality comparer interface, which handles deduping equal alarms when multiple tenant configs | ||
attempt to apply the same spec | ||
|
||
```go | ||
var _ util.EqualityComparer = (*Alarm)(nil) | ||
``` | ||
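The deduplication described above could look roughly like the following. This is a minimal sketch, not the proposal's implementation: `AlarmComparer` is a hypothetical stand-in for `util.EqualityComparer` (whose method set is not shown here), the `Alarm` struct is trimmed to a few fields for illustration, and `DedupeAlarms` is an assumed helper name.

```go
package main

// AlarmComparer is a hypothetical stand-in for util.EqualityComparer;
// the real interface's method set is not shown in this proposal.
type AlarmComparer interface {
	Equal(other *Alarm) bool
}

// Alarm is a trimmed-down, illustrative version of the v2 Alarm message.
type Alarm struct {
	ID            string
	Name          string
	ClusterID     string
	Severity      string
	RoutingLabels map[string]string
}

var _ AlarmComparer = (*Alarm)(nil)

// Equal compares the spec of two alarms and deliberately ignores the ID,
// so that two tenant configs applying the same spec are treated as equal.
func (a *Alarm) Equal(other *Alarm) bool {
	if a == nil || other == nil {
		return a == other
	}
	if a.Name != other.Name || a.ClusterID != other.ClusterID || a.Severity != other.Severity {
		return false
	}
	if len(a.RoutingLabels) != len(other.RoutingLabels) {
		return false
	}
	for k, v := range a.RoutingLabels {
		if ov, ok := other.RoutingLabels[k]; !ok || ov != v {
			return false
		}
	}
	return true
}

// DedupeAlarms keeps the first occurrence of each distinct alarm spec,
// preserving input order.
func DedupeAlarms(alarms []*Alarm) []*Alarm {
	out := make([]*Alarm, 0, len(alarms))
	for _, candidate := range alarms {
		duplicate := false
		for _, kept := range out {
			if kept.Equal(candidate) {
				duplicate = true
				break
			}
		}
		if !duplicate {
			out = append(out, candidate)
		}
	}
	return out
}
```

Which fields count toward equality is a design choice; the sketch assumes the ID and other server-assigned metadata should be excluded.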
|
||
```proto | ||
// v2 | ||
message Alarm { | ||
string id = 1; | ||
string name = 2; | ||
string description = 3; | ||
core.Reference clusterId = 4; | ||
// OpniSeverity is required by the Opni system | ||
OpniSeverity severity = 5; | ||
// hold implementation details | ||
// | ||
// - rate limiting configs | ||
// - send-resolved=yes/no | ||
// - last-updated | ||
// - silence-info | ||
map<string,string> properties = 6; | ||
// hold message content variables from Alertmanager format | ||
// | ||
// Body & Description are now set here using specific Keys | ||
// | ||
// Users should be able to add their own custom annotations from the UI as well | ||
map<string,string> annotations = 7; | ||
// labels for custom routing | ||
map<string,string> routingLabels = 8; | ||
// condition spec | ||
AlertTypeDetails alertType = 9; | ||
core.ReferenceList attachedEndpoints = 10; | ||
} | ||
``` | ||
|
||
- move `AlertEndpoint` messages to `Endpoint` messages in v2 apis | ||
|
||
```proto | ||
// v2 | ||
message Endpoint { | ||
string id = 1; | ||
string name = 2; | ||
string description = 3; | ||
// Holds implementation details | ||
// | ||
// - holds receive-notification = on/off | ||
// - holds last updated time | ||
// - default/fallback rate limiting config when using label matchers | ||
map<string,string> properties = 4; | ||
// custom routing labels. | ||
map<string,string> labels = 5; | ||
oneof endpoint { | ||
SlackEndpoint slack = 6; | ||
EmailEndpoint email = 7; | ||
PagerDutyEndpoint pagerDuty = 8; | ||
WebhookEndpoint webhook = 9; | ||
} | ||
} | ||
``` | ||
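To illustrate how routing labels might be used for batch routing, here is a hedged sketch of Alertmanager-style label matching (`=`, `!=`, `=~`, `!~`) applied to alarm routing labels. All names here (`Matcher`, `RoutedAlarm`, `SelectAlarms`) are hypothetical and only mirror the general shape of Alertmanager matchers: regex matchers are fully anchored and a missing label is treated as the empty string. How the v2 API actually wires matchers to `Endpoint` labels is not specified in this proposal.

```go
package main

import "regexp"

// MatchOp enumerates the four Alertmanager-style match operators.
type MatchOp int

const (
	MatchEqual     MatchOp = iota // =
	MatchNotEqual                 // !=
	MatchRegexp                   // =~
	MatchNotRegexp                // !~
)

// Matcher is a hypothetical, simplified label matcher.
type Matcher struct {
	Name  string
	Op    MatchOp
	Value string
	re    *regexp.Regexp
}

// NewMatcher builds a matcher; regex values are anchored on both ends.
func NewMatcher(op MatchOp, name, value string) (*Matcher, error) {
	m := &Matcher{Name: name, Op: op, Value: value}
	if op == MatchRegexp || op == MatchNotRegexp {
		re, err := regexp.Compile("^(?:" + value + ")$")
		if err != nil {
			return nil, err
		}
		m.re = re
	}
	return m, nil
}

// Matches reports whether a single label value satisfies the matcher.
func (m *Matcher) Matches(v string) bool {
	switch m.Op {
	case MatchEqual:
		return v == m.Value
	case MatchNotEqual:
		return v != m.Value
	case MatchRegexp:
		return m.re.MatchString(v)
	case MatchNotRegexp:
		return !m.re.MatchString(v)
	}
	return false
}

// RoutedAlarm is an illustrative pairing of an alarm ID with its routing labels.
type RoutedAlarm struct {
	ID            string
	RoutingLabels map[string]string
}

// SelectAlarms returns the IDs of alarms whose routing labels satisfy
// every matcher, in input order. A missing label is treated as "".
func SelectAlarms(alarms []RoutedAlarm, matchers []*Matcher) []string {
	var ids []string
	for _, a := range alarms {
		ok := true
		for _, m := range matchers {
			if !m.Matches(a.RoutingLabels[m.Name]) {
				ok = false
				break
			}
		}
		if ok {
			ids = append(ids, a.ID)
		}
	}
	return ids
}
```

An endpoint could then be attached to every alarm returned by `SelectAlarms` in a single operation, rather than one alarm at a time.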
|
||
### Alerting Storage Clientset migration enhancement | ||
|
||
Since we will be releasing a new version of the API the storage clientset needs to migrate buckets from | ||
one object type to another, for example: `alertingv1.AlertCondition` -> `alertingv2.Alarm` | ||
|
||
- Have separate buckets for each relevant `alertingv2` object type | ||
- The alerting storage client set interfaces should be changed to use `proto.Message` | ||
- When `Use` is invoked on start-up, the clientset uses reflection to check the contents of each bucket version, then migrates old buckets to new buckets | ||
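The start-up migration could be sketched as follows. This is an assumption-heavy illustration: it uses in-memory maps in place of real storage buckets, plain structs in place of `proto.Message`, and explicit per-version buckets rather than reflection. `Store`, `Migrate`, and the field mapping in `convert` are all hypothetical.

```go
package main

// AlertConditionV1 is an illustrative stand-in for alertingv1.AlertCondition.
type AlertConditionV1 struct {
	ID          string
	Name        string
	Description string
}

// AlarmV2 is an illustrative stand-in for alertingv2.Alarm.
type AlarmV2 struct {
	ID          string
	Name        string
	Description string
}

// Store models one bucket per object type, keyed by object ID.
type Store struct {
	V1Conditions map[string]AlertConditionV1
	V2Alarms     map[string]AlarmV2
}

// convert maps a v1 condition onto the v2 shape (hypothetical mapping).
func convert(c AlertConditionV1) AlarmV2 {
	return AlarmV2{ID: c.ID, Name: c.Name, Description: c.Description}
}

// Migrate moves every v1 condition into the v2 bucket and empties the
// v1 bucket. Entries already present in v2 are never overwritten, so
// running Migrate on every start-up is safe. It returns the number of
// objects newly written to the v2 bucket.
func (s *Store) Migrate() int {
	if s.V2Alarms == nil {
		s.V2Alarms = make(map[string]AlarmV2)
	}
	migrated := 0
	for id, cond := range s.V1Conditions {
		if _, exists := s.V2Alarms[id]; !exists {
			s.V2Alarms[id] = convert(cond)
			migrated++
		}
		// Deleting during range is permitted for Go maps.
		delete(s.V1Conditions, id)
	}
	return migrated
}
```

Making the migration idempotent means `Use` can call it unconditionally, without tracking whether a previous start-up already completed it.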
|
||
### CRUD BatchTenantConfiguration API(s) | ||
|
||
```proto | ||
// Also referred to as BTC | ||
message BatchTenantConfiguration { | ||
// opaque | ||
string id = 1; | ||
// name the user sets | ||
string name = 2; | ||
// manually selected clusters to apply this template to | ||
core.ReferenceList clusters = 3; | ||
// apply this template to any clusters matching these labels | ||
map<string,string> clusterLabelMatchers = 4; | ||
// alarms to apply by default to this | ||
AlertConditionsList conditions = 5; | ||
} | ||
message BatchTenantConfigurationList{ | ||
repeated BatchTenantConfiguration items = 1; | ||
} | ||
service TenantConfig { | ||
// Tenant template | ||
rpc ListTenantConfig(google.protobuf.Empty) returns (BatchTenantConfigurationList){} | ||
rpc CreateTenantConfig(BatchTenantConfiguration) returns (google.protobuf.Empty){} | ||
rpc GetTenantConfig(core.Reference) returns (BatchTenantConfiguration){} | ||
rpc UpdateTenantConfig(BatchTenantConfiguration) returns (google.protobuf.Empty){} | ||
rpc DeleteTenantConfig(BatchTenantConfiguration) returns (google.protobuf.Empty){} | ||
} | ||
``` | ||
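As a rough illustration of how the management watcher might decide which tenant configs apply to a cluster, consider the sketch below. It assumes a config applies when the cluster is listed explicitly or when every `clusterLabelMatchers` entry equals the corresponding cluster label, and that a config with neither selector is the global one. That representation of a global `TC` is an assumption, as are the names `TenantConfig`, `Cluster`, `AppliesTo`, and `ConfigsFor`.

```go
package main

// TenantConfig is a trimmed, illustrative version of BatchTenantConfiguration.
type TenantConfig struct {
	ID                   string
	Clusters             []string          // explicitly selected cluster IDs
	ClusterLabelMatchers map[string]string // label name -> required value
}

// Cluster is an illustrative cluster with its ID and labels.
type Cluster struct {
	ID     string
	Labels map[string]string
}

// AppliesTo reports whether the tenant config should be applied to c.
func (tc TenantConfig) AppliesTo(c Cluster) bool {
	// Explicit selection always wins.
	for _, id := range tc.Clusters {
		if id == c.ID {
			return true
		}
	}
	if len(tc.ClusterLabelMatchers) == 0 {
		// Assumption: no clusters and no matchers means a global config.
		return len(tc.Clusters) == 0
	}
	// Every matcher must be present on the cluster with an equal value.
	for name, want := range tc.ClusterLabelMatchers {
		if got, ok := c.Labels[name]; !ok || got != want {
			return false
		}
	}
	return true
}

// ConfigsFor returns the IDs of all configs that apply to c, in input order.
func ConfigsFor(c Cluster, configs []TenantConfig) []string {
	var ids []string
	for _, tc := range configs {
		if tc.AppliesTo(c) {
			ids = append(ids, tc.ID)
		}
	}
	return ids
}
```

A watcher could call `ConfigsFor` whenever a cluster is created or its labels change, then reconcile the resulting alarms against what is currently applied.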
|
||
### UI/UX | ||
|
||
See [acceptance criteria](#acceptance-criteria) | ||
|
||
## Supporting documents | ||
|
||
- Generic Prometheus Rule group format : https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/ | ||
- AlertManager label matchers format : https://github.com/prometheus/alertmanager/blob/fd0929ba9fc58737a9c91f24771862692fa72d17/pkg/labels/matcher.go | ||
|
||
## Risks and contingencies: | ||
|
||
N/A | ||
|
||
## Level of Effort: | ||
|
||
### Backend | ||
|
||
Approximately 5-6 weeks | ||
|
||
- 3 weeks v2 API | ||
|
||
- 1 week APIs | ||
- 1 week support batch routing | ||
- 1 week cli | ||
|
||
- 1-2 days batch prometheus rule group (requires v2 -- annotations and label matchers should be imported as well) | ||
|
||
- 1 1/2 weeks CRUD Tenant configurations (requires v2 -- significant internal rework of v2 API will impact implementation details of tenant configs) | ||
|
||
- 1 week API implementation | ||
- 2-3 days testing | ||
|
||
### UI | ||
|
||
??? days | ||
|
||
## Resources: | ||
|
||
- git alerting staging branch | ||
- 1 Opni Upstream & 1 Opni Downstream cluster |