diff --git a/config/samples/alerts/incoming-traffic-surge.yaml b/config/samples/alerts/incoming-traffic-surge.yaml new file mode 100644 index 000000000..641bb40d0 --- /dev/null +++ b/config/samples/alerts/incoming-traffic-surge.yaml @@ -0,0 +1,27 @@ +apiVersion: monitoring.openshift.io/v1 +kind: AlertingRule +metadata: + name: my-custom-rule + namespace: openshift-monitoring +spec: + groups: + - name: MyAlertsForNetObserv + rules: + - alert: IncomingTrafficSurge + annotations: + message: |- + NetObserv is detecting a surge of incoming traffic: current traffic to {{ $labels.DstK8S_Namespace }} has increased by more than 100% since yesterday. + summary: "Surge in incoming traffic" + netobserv_io_network_health: '{"namespaceLabels":["DstK8S_Namespace"],"threshold":"100","unit":"%","upperBound":"500"}' + expr: |- + (100 * + ( + (sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m])) by (DstK8S_Namespace) > 1000) + - sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m] offset 1d)) by (DstK8S_Namespace) + ) + / sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m] offset 1d)) by (DstK8S_Namespace)) + > 100 + for: 1m + labels: + netobserv: "true" + severity: warning diff --git a/config/samples/alerts/ingress-errors.yaml b/config/samples/alerts/ingress-errors.yaml index c50a19474..e4f8f8e61 100644 --- a/config/samples/alerts/ingress-errors.yaml +++ b/config/samples/alerts/ingress-errors.yaml @@ -11,8 +11,8 @@ spec: annotations: description: There are more than 10% of 5xx HTTP response codes returned from ingress traffic, to namespace {{ $labels.exported_namespace }}. - netobserv_io_network_health: '{"threshold":"10","unit":"%","namespaceLabels":["exported_namespace"]}' summary: Too many 5xx errors to namespace + netobserv_io_network_health: '{"threshold":"10","unit":"%","namespaceLabels":["exported_namespace"]}' expr: 100 * sum(rate(haproxy_server_http_responses_total{code="5xx"}[2m])) by (exported_namespace) / sum(rate(haproxy_server_http_responses_total[2m])) by (exported_namespace) > 10 for: 5m @@ -23,8 +23,8 @@ spec: annotations: description: There are more than 10% of 4xx HTTP response codes returned from ingress traffic, to namespace {{ $labels.exported_namespace }}. - netobserv_io_network_health: '{"threshold":"10","unit":"%","namespaceLabels":["exported_namespace"]}' summary: Too many 4xx errors to namespace + netobserv_io_network_health: '{"threshold":"10","unit":"%","namespaceLabels":["exported_namespace"]}' expr: 100 * sum(rate(haproxy_server_http_responses_total{code="4xx"}[2m])) by (exported_namespace) / sum(rate(haproxy_server_http_responses_total[2m])) by (exported_namespace) > 10 for: 5m diff --git a/docs/Alerts.md b/docs/Alerts.md index 09125090f..6e64eacee 100644 --- a/docs/Alerts.md +++ b/docs/Alerts.md @@ -13,11 +13,8 @@ By default, NetObserv creates some alerts, contextual to the enabled features. F Here is the list of alerts installed by default: -- `PacketDropsByDevice`: triggered on high percentage of packet drops from devices (`/proc/net/dev`): - - grouped by node, with "Warning" severity above 5% -- `PacketDropsByKernel`: triggered on high percentage of packet drops by the kernel; it requires the `PacketDrop` agent feature. 
2 variants installed by default:
-  - grouped by node, with "Info" severity above 5% and "Warning" above 10%
-  - grouped by namespace, with "Info" severity above 10% and "Warning" above 20%
+- `PacketDropsByDevice`: triggered on high percentage of packet drops from devices (`/proc/net/dev`).
+- `PacketDropsByKernel`: triggered on high percentage of packet drops by the kernel; it requires the `PacketDrop` agent feature.
 - `IPsecErrors`: triggered when NetObserv detects IPsec encryption errors; it requires the `IPSec` agent feature.
 - `NetpolDenied`: triggered when NetObserv detects traffic denied by network policies; it requires the `NetworkEvents` agent feature.
 - `LatencyHighTrend`: triggered when NetObserv detects an increase of TCP latency; it requires the `FlowRTT` agent feature.
@@ -70,57 +67,87 @@ Alert templates can be disabled in `spec.processor.metrics.disableAlerts`. This
 If a template is disabled _and_ overridden in `spec.processor.metrics.alerts`, the disable setting takes precedence: the alert rule will not be created.
 
-## Creating rules from scratch
+## Creating your own alerts that contribute to the Health dashboard
 
-This alerting API in NetObserv `FlowCollector` is simply a mapping to the Prometheus operator API, generating a `PrometheusRule` that you can see in the `netobserv` namespace (by default) by running:
+This alerting API in NetObserv `FlowCollector` is simply a mapping to the Prometheus operator API, generating a `PrometheusRule`.
+
+You can check the actual generated resource by running:
 
 ```bash
 kubectl get prometheusrules -n netobserv -oyaml
 ```
 
-The sections above explain how you can customize those opinionated alerts, but should you feel limited with this configuration API, you can go further and create your own `AlertingRule` resources. You'll just need to be familiar with PromQL (or to learn).
+While the above sections explain how you can customize those opinionated alerts, you are not limited to them: you can go further and create your own `AlertingRule` (or `PrometheusRule`) resources. You'll just need to be familiar with PromQL (or willing to learn it).
+
+[Click here](../config/samples/alerts) to see sample alerts that are not built into NetObserv.
+
+Let's take the [incoming-traffic-surge](../config/samples/alerts/incoming-traffic-surge.yaml) sample as an example. It raises an alert when the current ingress traffic is more than double the traffic from the day before.
+
+### Anatomy of the PromQL
 
-Here is an example to alert when the current ingress traffic exceeds by more than twice the traffic from the day before.
+Here's the PromQL:
+
+```
+(100 *
+  (
+    (sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m])) by (DstK8S_Namespace) > 1000)
+    - sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m] offset 1d)) by (DstK8S_Namespace)
+  )
+  / sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m] offset 1d)) by (DstK8S_Namespace))
+> 100
+```
+
+Let's break it down. The base query pattern is this:
+
+`sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m])) by (DstK8S_Namespace)`
+
+This is the bytes rate coming from "openshift-ingress" to any of your workload's namespaces, over the last 30 minutes. This metric is provided by NetObserv (note that depending on your FlowCollector configuration, you may need to use `netobserv_namespace_ingress_bytes_total` instead of `netobserv_workload_ingress_bytes_total`).
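+
+As a side note on metric availability: which of these metrics exists depends on the `spec.processor.metrics.includeList` setting of your `FlowCollector`. Here is a minimal sketch of that setting, assuming the usual metric names (`workload_ingress_bytes_total`, `namespace_ingress_bytes_total`); check your own resource to see what is actually enabled:
+
+```yaml
+apiVersion: flows.netobserv.io/v1beta2
+kind: FlowCollector
+metadata:
+  name: cluster
+spec:
+  processor:
+    metrics:
+      includeList:
+        # enables netobserv_workload_ingress_bytes_total (per-workload granularity)
+        - workload_ingress_bytes_total
+        # or the lower-cardinality variant, netobserv_namespace_ingress_bytes_total
+        - namespace_ingress_bytes_total
+```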
+
+Appending ` > 1000` to this query keeps only the rates observed greater than 1KBps, in order to eliminate the noise from low-bandwidth consumers. 1KBps still isn't a lot, so you may want to increase it. Note also that the bytes rate is relative to the sampling ratio defined in the `FlowCollector` agent configuration. If you have a sampling ratio of 1:100, consider that the actual traffic might be approximately 100 times higher than what is reported by the metrics. Alternatively, the metric `netobserv_agent_sampling_rate` can be used to normalize the byte rates, decoupling the PromQL from the sampling configuration.
+
+In the following parts of the PromQL, you can see `offset 1d`: this runs the same query, one day earlier. You can change that according to your needs; for instance, `offset 5h` will be five hours ago.
+
+This gives us the formula `100 * (<current rate> - <yesterday's rate>) / <yesterday's rate>`: it's the percentage of increase compared to yesterday. It can be negative if the bytes rate today is lower than yesterday's.
+
+Finally, the last part, `> 100`, eliminates increases that are lower than 100%, so that we don't get alerted on them.
+
+### Metadata
+
+Some metadata is required to work with Prometheus and AlertManager (not specific to NetObserv):
 
 ```yaml
-apiVersion: monitoring.openshift.io/v1
-kind: AlertingRule
-metadata:
-  name: netobserv-alerts
-  namespace: openshift-monitoring
-spec:
-  groups:
-  - name: NetObservAlerts
-    rules:
-    - alert: NetObservIncomingBandwidth
       annotations:
         message: |-
           NetObserv is detecting a surge of incoming traffic: current traffic to {{ $labels.DstK8S_Namespace }} has increased by more than 100% since yesterday.
         summary: "Surge in incoming traffic"
-      expr: |-
-        (100 *
-        (
-          (sum(rate(netobserv_namespace_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m])) by (DstK8S_Namespace) > 1000)
-          - sum(rate(netobserv_namespace_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m] offset 1d)) by (DstK8S_Namespace)
-        )
-        / sum(rate(netobserv_namespace_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m] offset 1d)) by (DstK8S_Namespace))
-        > 100
-      for: 1m
       labels:
-        app: netobserv
         severity: warning
 ```
 
-Let's break it down to understand the PromQL expression. The base query pattern is this:
+As you can see, you can leverage the output labels from the PromQL defined previously in the description. Here, since we've grouped the results per `DstK8S_Namespace`, we can use it in our text.
 
-`sum(rate(netobserv_namespace_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m])) by (DstK8S_Namespace)`
+The severity label should be "critical", "warning" or "info".
 
-This is the bytes rate coming from "openshift-ingress" to any of your workload's namespaces, over the last 30 minutes. Note that depending on your configuration, you may need to use `netobserv_workload_ingress_bytes_total` instead of `netobserv_namespace_ingress_bytes_total`.
+On top of that, in order to have the alert picked up in the Health dashboard, NetObserv needs other information:
 
-Appending ` > 1000` to this query keeps only the rates observed greater than 1KBps, in order to eliminate the noise from low-bandwidth consumers. 1KBps still isn't a lot, you may want to increase it. Note also that the bytes rate is relative to the sampling ratio defined in the `FlowCollector` agent configuration. If you have a sampling ratio of 1:100, consider that the actual traffic might be approximately 100 times higher than what is reported by the metrics.
+```yaml
+  annotations:
+    netobserv_io_network_health: '{"namespaceLabels":["DstK8S_Namespace"],"threshold":"100","unit":"%","upperBound":"500"}'
+  labels:
+    netobserv: "true"
+```
 
-In the following parts of the PromQL, you can see `offset 1d`: this is to run the same query, one day before. You can change that according to your needs, for instance `offset 5h` will be five hours ago.
+The label `netobserv: "true"` is required.
 
-Which gives us the formula `100 * (<current rate> - <yesterday's rate>) / <yesterday's rate>`: it's the percentage of increase compared to yesterday. It can be negative, if the bytes rate today is lower than yesterday.
+The annotation `netobserv_io_network_health` is optional; it gives you some control over how the alert renders in the Health page. It is a JSON string that consists of the following fields (see the sketch at the end of this section):
+- `namespaceLabels`: one or more labels that hold namespaces. When provided, the alert will show up under the "Namespaces" tab.
+- `nodeLabels`: one or more labels that hold node names. When provided, the alert will show up under the "Nodes" tab.
+- `threshold`: the alert threshold as a string, expected to match the one defined in the PromQL expression.
+- `unit`: the data unit, used only for display purposes.
+- `upperBound`: an upper bound value used to compute a score on a closed scale. It doesn't have to be the maximum of the metric values, but metric values above the upper bound are clamped.
+- `links`: a list of links to be displayed contextually to the alert. Each link consists of:
+  - `name`: the display name.
+  - `url`: the link URL.
+- `trafficLinkFilter`: an additional filter to inject into the URL for the Network Traffic page.
 
-Finally, the last part, `> 100`, eliminates increases that are lower than 100%, so that we don't get alerted by that.
+`namespaceLabels` and `nodeLabels` are mutually exclusive. If neither is provided, the alert will show up under the "Global" tab.
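+
+For illustration, here is a sketch of a hypothetical node-scoped rule combining these fields. The rule name, thresholds and runbook URL are placeholders, and the metric and label used in the expression (`netobserv_node_egress_bytes_total`, `SrcK8S_HostName`) assume that the corresponding node metrics are enabled in your `FlowCollector`; adapt them to your setup:
+
+```yaml
+apiVersion: monitoring.openshift.io/v1
+kind: AlertingRule
+metadata:
+  name: my-node-egress-rule
+  namespace: openshift-monitoring
+spec:
+  groups:
+  - name: MyNodeAlertsForNetObserv
    rules:
+    - alert: NodeEgressTrafficHigh
+      annotations:
+        message: |-
+          NetObserv is detecting sustained egress traffic above 500 MBps from node {{ $labels.SrcK8S_HostName }}.
+        summary: "High egress traffic from node"
+        # nodeLabels places the alert under the "Nodes" tab; threshold, unit and upperBound drive the health score; links adds a contextual link
+        netobserv_io_network_health: '{"nodeLabels":["SrcK8S_HostName"],"threshold":"500","unit":"MBps","upperBound":"1000","links":[{"name":"Capacity runbook","url":"https://example.com/runbooks/node-egress"}]}'
+      expr: |-
+        sum(rate(netobserv_node_egress_bytes_total[30m])) by (SrcK8S_HostName) / 1000000 > 500
+      for: 15m
+      labels:
+        netobserv: "true"
+        severity: warning
+```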