27 changes: 27 additions & 0 deletions config/samples/alerts/incoming-traffic-surge.yaml
@@ -0,0 +1,27 @@
apiVersion: monitoring.openshift.io/v1
kind: AlertingRule
metadata:
  name: my-custom-rule
  namespace: openshift-monitoring
spec:
  groups:
  - name: MyAlertsForNetObserv
    rules:
    - alert: IncomingTrafficSurge
      annotations:
        message: |-
          NetObserv is detecting a surge of incoming traffic: current traffic to {{ $labels.DstK8S_Namespace }} has increased by more than 100% since yesterday.
        summary: "Surge in incoming traffic"
        netobserv_io_network_health: '{"namespaceLabels":["DstK8S_Namespace"],"threshold":"100","unit":"%","upperBound":"500"}'
      expr: |-
        (100 *
          (
            (sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m])) by (DstK8S_Namespace) > 1000)
            - sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m] offset 1d)) by (DstK8S_Namespace)
          )
          / sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m] offset 1d)) by (DstK8S_Namespace))
        > 100
      for: 1m
      labels:
        netobserv: "true"
        severity: warning
4 changes: 2 additions & 2 deletions config/samples/alerts/ingress-errors.yaml
@@ -11,8 +11,8 @@ spec:
annotations:
description: There are more than 10% of 5xx HTTP response codes returned from ingress traffic, to namespace
{{ $labels.exported_namespace }}.
netobserv_io_network_health: '{"threshold":"10","unit":"%","namespaceLabels":["exported_namespace"]}'
summary: Too many 5xx errors to namespace
netobserv_io_network_health: '{"threshold":"10","unit":"%","namespaceLabels":["exported_namespace"]}'
expr: 100 * sum(rate(haproxy_server_http_responses_total{code="5xx"}[2m])) by (exported_namespace)
/ sum(rate(haproxy_server_http_responses_total[2m])) by (exported_namespace) > 10
for: 5m
@@ -23,8 +23,8 @@ spec:
annotations:
description: There are more than 10% of 4xx HTTP response codes returned from ingress traffic, to namespace
{{ $labels.exported_namespace }}.
netobserv_io_network_health: '{"threshold":"10","unit":"%","namespaceLabels":["exported_namespace"]}'
summary: Too many 4xx errors to namespace
netobserv_io_network_health: '{"threshold":"10","unit":"%","namespaceLabels":["exported_namespace"]}'
expr: 100 * sum(rate(haproxy_server_http_responses_total{code="4xx"}[2m])) by (exported_namespace)
/ sum(rate(haproxy_server_http_responses_total[2m])) by (exported_namespace) > 10
for: 5m
99 changes: 63 additions & 36 deletions docs/Alerts.md
@@ -13,11 +13,8 @@ By default, NetObserv creates some alerts, contextual to the enabled features. F

Here is the list of alerts installed by default:

- `PacketDropsByDevice`: triggered on high percentage of packet drops from devices (`/proc/net/dev`):
  - grouped by node, with "Warning" severity above 5%
- `PacketDropsByKernel`: triggered on high percentage of packet drops by the kernel; it requires the `PacketDrop` agent feature. 2 variants installed by default:
  - grouped by node, with "Info" severity above 5% and "Warning" above 10%
  - grouped by namespace, with "Info" severity above 10% and "Warning" above 20%
- `PacketDropsByDevice`: triggered on high percentage of packet drops from devices (`/proc/net/dev`).
- `PacketDropsByKernel`: triggered on high percentage of packet drops by the kernel; it requires the `PacketDrop` agent feature.
- `IPsecErrors`: triggered when NetObserv detects IPsec encryption errors; it requires the `IPSec` agent feature.
- `NetpolDenied`: triggered when NetObserv detects traffic denied by network policies; it requires the `NetworkEvents` agent feature.
- `LatencyHighTrend`: triggered when NetObserv detects an increase of TCP latency; it requires the `FlowRTT` agent feature.
@@ -70,57 +67,87 @@ Alert templates can be disabled in `spec.processor.metrics.disableAlerts`. This

If a template is disabled _and_ overridden in `spec.processor.metrics.alerts`, the disable setting takes precedence: the alert rule will not be created.
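
As a minimal sketch of that precedence, assuming the usual `FlowCollector` resource named `cluster` and API version `flows.netobserv.io/v1beta2` (check your installed CRD):

```yaml
apiVersion: flows.netobserv.io/v1beta2
kind: FlowCollector
metadata:
  name: cluster
spec:
  processor:
    metrics:
      # The PacketDropsByKernel template is disabled: no alert rule is generated for it,
      # even if the same template is also overridden in spec.processor.metrics.alerts.
      disableAlerts: [PacketDropsByKernel]
```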

## Creating rules from scratch
## Creating your own alerts that contribute to the Health dashboard

This alerting API in NetObserv `FlowCollector` is simply a mapping to the Prometheus operator API, generating a `PrometheusRule` that you can see in the `netobserv` namespace (by default) by running:
This alerting API in NetObserv `FlowCollector` is simply a mapping to the Prometheus operator API, generating a `PrometheusRule`.

You can check the actual generated resource by running:

```bash
kubectl get prometheusrules -n netobserv -oyaml
```

The sections above explain how you can customize those opinionated alerts, but should you feel limited with this configuration API, you can go further and create your own `AlertingRule` resources. You'll just need to be familiar with PromQL (or to learn).
While the above sections explain how you can customize those opinionated alerts, you are not limited to them: you can go further and create your own `AlertingRule` (or `PrometheusRule`) resources. You'll just need to be familiar with PromQL (or to learn it).

[Click here](../config/samples/alerts) to see sample alerts that are not built into NetObserv.

Let's take [incoming-traffic-surge](../config/samples/alerts/incoming-traffic-surge.yaml) as an example. It raises an alert when the current ingress traffic is more than double the traffic from the day before.

### Anatomy of the PromQL

Here is an example that alerts when the current ingress traffic is more than double the traffic from the day before.
Here's the PromQL:

```
(100 *
  (
    (sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m])) by (DstK8S_Namespace) > 1000)
    - sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m] offset 1d)) by (DstK8S_Namespace)
  )
  / sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m] offset 1d)) by (DstK8S_Namespace))
> 100
```

Let's break it down. The base query pattern is this:

`sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m])) by (DstK8S_Namespace)`

This is the bytes rate coming from "openshift-ingress" to any of your workload's namespaces, over the last 30 minutes. This metric is provided by NetObserv (note that depending on your FlowCollector configuration, you may need to use `netobserv_namespace_ingress_bytes_total` instead of `netobserv_workload_ingress_bytes_total`).
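
If you are unsure which of the two metrics is available, you can check (or adjust) the metrics included in your `FlowCollector`. Here is a sketch, assuming the standard `spec.processor.metrics.includeList` field:

```yaml
apiVersion: flows.netobserv.io/v1beta2
kind: FlowCollector
metadata:
  name: cluster
spec:
  processor:
    metrics:
      includeList:
        # Metric names are listed without the "netobserv_" prefix;
        # this enables netobserv_workload_ingress_bytes_total, used in the example above.
        - workload_ingress_bytes_total
```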

Appending ` > 1000` to this query keeps only the rates observed greater than 1KBps, in order to eliminate the noise from low-bandwidth consumers. 1KBps still isn't a lot, so you may want to increase it. Note also that the bytes rate is relative to the sampling ratio defined in the `FlowCollector` agent configuration: with a sampling ratio of 1:100, the actual traffic might be approximately 100 times higher than what the metrics report. Alternatively, the metric `netobserv_agent_sampling_rate` can be used to normalize the byte rates, decoupling the PromQL from the sampling configuration.
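
For instance, here is one possible way to normalize; a sketch, assuming `netobserv_agent_sampling_rate` exposes the configured sampling value and that all agents share the same setting:

```
sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m])) by (DstK8S_Namespace)
  * scalar(max(netobserv_agent_sampling_rate))
```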

In the following parts of the PromQL, you can see `offset 1d`: this runs the same query, one day earlier. You can change that according to your needs; for instance, `offset 5h` takes the traffic from five hours ago as the baseline.
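
For example, here is the same baseline query shifted by five hours:

```
sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m] offset 5h)) by (DstK8S_Namespace)
```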

This gives us the formula `100 * (<query now> - <query yesterday>) / <query yesterday>`: the percentage of increase compared to yesterday. It can be negative if the bytes rate today is lower than yesterday's.

Finally, the last part, `> 100`, eliminates increases that are lower than 100%, so that we don't get alerted on them.
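
For instance, if a namespace currently receives 30 kBps from the ingress router and was receiving 10 kBps at the same time yesterday, the expression evaluates to `100 * (30 - 10) / 10 = 200`, which is above 100, so the alert fires.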

### Metadata

Some metadata is required to work with Prometheus and AlertManager (not specific to NetObserv):

```yaml
apiVersion: monitoring.openshift.io/v1
kind: AlertingRule
metadata:
  name: netobserv-alerts
  namespace: openshift-monitoring
spec:
  groups:
  - name: NetObservAlerts
    rules:
    - alert: NetObservIncomingBandwidth
      annotations:
        message: |-
          NetObserv is detecting a surge of incoming traffic: current traffic to {{ $labels.DstK8S_Namespace }} has increased by more than 100% since yesterday.
        summary: "Surge in incoming traffic"
      expr: |-
        (100 *
          (
            (sum(rate(netobserv_namespace_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m])) by (DstK8S_Namespace) > 1000)
            - sum(rate(netobserv_namespace_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m] offset 1d)) by (DstK8S_Namespace)
          )
          / sum(rate(netobserv_namespace_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m] offset 1d)) by (DstK8S_Namespace))
        > 100
      for: 1m
      labels:
        app: netobserv
        severity: warning
```

Let's break it down to understand the PromQL expression. The base query pattern is this:
As you can see, the annotation text can leverage the output labels of the PromQL defined previously: here, since we've grouped the results by `DstK8S_Namespace`, we can use that label in our message.

`sum(rate(netobserv_namespace_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m])) by (DstK8S_Namespace)`
The severity label should be "critical", "warning" or "info".

This is the bytes rate coming from "openshift-ingress" to any of your workload's namespaces, over the last 30 minutes. Note that depending on your configuration, you may need to use `netobserv_workload_ingress_bytes_total` instead of `netobserv_namespace_ingress_bytes_total`.
On top of that, in order to have the alert picked up in the Health dashboard, NetObserv needs other information:

Appending ` > 1000` to this query keeps only the rates observed greater than 1KBps, in order to eliminate the noise from low-bandwidth consumers. 1KBps still isn't a lot, you may want to increase it. Note also that the bytes rate is relative to the sampling ratio defined in the `FlowCollector` agent configuration. If you have a sampling ratio of 1:100, consider that the actual traffic might be approximately 100 times higher than what is reported by the metrics.
```yaml
annotations:
  netobserv_io_network_health: '{"namespaceLabels":["DstK8S_Namespace"],"threshold":"100","unit":"%","upperBound":"500"}'
labels:
  netobserv: "true"
```

In the following parts of the PromQL, you can see `offset 1d`: this is to run the same query, one day before. You can change that according to your needs, for instance `offset 5h` will be five hours ago.
The label `netobserv: "true"` is required.

Which gives us the formula `100 * (<query now> - <query yesterday>) / <query yesterday>`: it's the percentage of increase compared to yesterday. It can be negative, if the bytes rate today is lower than yesterday.
The annotation `netobserv_io_network_health` is optional, and gives you some control over how the alert renders in the Health page. It is a JSON string that consists of:
- `namespaceLabels`: one or more labels that hold namespaces. When provided, the alert will show up under the "Namespaces" tab.
- `nodeLabels`: one or more labels that hold node names. When provided, the alert will show up under the "Nodes" tab.
- `threshold`: the alert threshold as a string, expected to match the one defined in PromQL.
- `unit`: the data unit, used only for display purposes.
- `upperBound`: an upper bound value used to compute a score on a closed scale. It doesn't have to be the maximum of the metric values, but values above the upper bound are clamped.
- `links`: a list of links to be displayed contextually to the alert. Each link consists of:
- `name`: display name.
- `url`: the link URL.
- `trafficLinkFilter`: an additional filter to inject into the URL for the Network Traffic page.

Finally, the last part, `> 100`, eliminates increases that are lower than 100%, so that we don't get alerted by that.
`namespaceLabels` and `nodeLabels` are mutually exclusive. If neither is provided, the alert shows up under the "Global" tab. A fuller annotation example is sketched below.
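
Putting it together, here is a sketch of a fuller annotation using `links`; the link name and URL are purely illustrative, not a real runbook:

```yaml
annotations:
  # Same fields as above, plus a contextual link rendered next to the alert in the Health page.
  netobserv_io_network_health: '{"namespaceLabels":["DstK8S_Namespace"],"threshold":"100","unit":"%","upperBound":"500","links":[{"name":"Incoming traffic surge runbook","url":"https://example.com/runbooks/incoming-traffic-surge"}]}'
```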