27 changes: 27 additions & 0 deletions config/samples/alerts/incoming-traffic-surge.yaml
@@ -0,0 +1,27 @@
apiVersion: monitoring.openshift.io/v1
kind: AlertingRule
metadata:
  name: my-custom-rule
  namespace: openshift-monitoring
spec:
  groups:
  - name: MyAlertsForNetObserv
    rules:
    - alert: IncomingTrafficSurge
      annotations:
        message: |-
          NetObserv is detecting a surge of incoming traffic: current traffic to {{ $labels.DstK8S_Namespace }} has increased by more than 100% since yesterday.
        summary: "Surge in incoming traffic"
        netobserv_io_network_health: '{"namespaceLabels":["DstK8S_Namespace"],"threshold":"100","unit":"%","upperBound":"500"}'
      expr: |-
        (100 *
          (
            (sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m])) by (DstK8S_Namespace) > 1000)
            - sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m] offset 1d)) by (DstK8S_Namespace)
          )
          / sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m] offset 1d)) by (DstK8S_Namespace))
        > 100
      for: 1m
      labels:
        netobserv: "true"
        severity: warning
4 changes: 2 additions & 2 deletions config/samples/alerts/ingress-errors.yaml
@@ -11,8 +11,8 @@ spec:
annotations:
description: There are more than 10% of 5xx HTTP response codes returned from ingress traffic, to namespace
{{ $labels.exported_namespace }}.
netobserv_io_network_health: '{"threshold":"10","unit":"%","namespaceLabels":["exported_namespace"]}'
summary: Too many 5xx errors to namespace
netobserv_io_network_health: '{"threshold":"10","unit":"%","namespaceLabels":["exported_namespace"]}'
expr: 100 * sum(rate(haproxy_server_http_responses_total{code="5xx"}[2m])) by (exported_namespace)
/ sum(rate(haproxy_server_http_responses_total[2m])) by (exported_namespace) > 10
for: 5m
@@ -23,8 +23,8 @@ spec:
annotations:
description: There are more than 10% of 4xx HTTP response codes returned from ingress traffic, to namespace
{{ $labels.exported_namespace }}.
netobserv_io_network_health: '{"threshold":"10","unit":"%","namespaceLabels":["exported_namespace"]}'
summary: Too many 4xx errors to namespace
netobserv_io_network_health: '{"threshold":"10","unit":"%","namespaceLabels":["exported_namespace"]}'
expr: 100 * sum(rate(haproxy_server_http_responses_total{code="4xx"}[2m])) by (exported_namespace)
/ sum(rate(haproxy_server_http_responses_total[2m])) by (exported_namespace) > 10
for: 5m
99 changes: 63 additions & 36 deletions docs/Alerts.md
@@ -13,11 +13,8 @@ By default, NetObserv creates some alerts, contextual to the enabled features. F

Here is the list of alerts installed by default:

- `PacketDropsByDevice`: triggered on high percentage of packet drops from devices (`/proc/net/dev`):
  - grouped by node, with "Warning" severity above 5%
- `PacketDropsByKernel`: triggered on high percentage of packet drops by the kernel; it requires the `PacketDrop` agent feature. 2 variants installed by default:
  - grouped by node, with "Info" severity above 5% and "Warning" above 10%
  - grouped by namespace, with "Info" severity above 10% and "Warning" above 20%
- `PacketDropsByDevice`: triggered on high percentage of packet drops from devices (`/proc/net/dev`).
- `PacketDropsByKernel`: triggered on high percentage of packet drops by the kernel; it requires the `PacketDrop` agent feature.
- `IPsecErrors`: triggered when NetObserv detects IPsec encryption errors; it requires the `IPSec` agent feature.
- `NetpolDenied`: triggered when NetObserv detects traffic denied by network policies; it requires the `NetworkEvents` agent feature.
- `LatencyHighTrend`: triggered when NetObserv detects an increase of TCP latency; it requires the `FlowRTT` agent feature.
@@ -70,57 +67,87 @@ Alert templates can be disabled in `spec.processor.metrics.disableAlerts`. This

If a template is disabled _and_ overridden in `spec.processor.metrics.alerts`, the disable setting takes precedence: the alert rule will not be created.
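
As a minimal sketch of that precedence, assuming the usual `FlowCollector` resource named `cluster` and API version `flows.netobserv.io/v1beta2` (check your installed CRD):

```yaml
apiVersion: flows.netobserv.io/v1beta2
kind: FlowCollector
metadata:
  name: cluster
spec:
  processor:
    metrics:
      # The PacketDropsByKernel template is disabled: no alert rule is generated for it,
      # even if the same template is also overridden in spec.processor.metrics.alerts.
      disableAlerts: [PacketDropsByKernel]
```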

## Creating rules from scratch
## Creating your own alerts that contribute to the Health dashboard

This alerting API in NetObserv `FlowCollector` is simply a mapping to the Prometheus operator API, generating a `PrometheusRule` that you can see in the `netobserv` namespace (by default) by running:
This alerting API in NetObserv `FlowCollector` is simply a mapping to the Prometheus operator API, generating a `PrometheusRule`.

You can check the actual generated resource by running:

```bash
kubectl get prometheusrules -n netobserv -oyaml
```

The sections above explain how you can customize those opinionated alerts, but should you feel limited with this configuration API, you can go further and create your own `AlertingRule` resources. You'll just need to be familiar with PromQL (or to learn).
While the above sections explain how you can customize those opinionated alerts, you are not limited to them: you can go further and create your own `AlertingRule` (or `PrometheusRule`) resources. You'll just need to be familiar with PromQL (or to learn it).

[Click here](../config/samples/alerts) to see sample alerts that are not built into NetObserv.

Let's take [incoming-traffic-surge](../config/samples/alerts/incoming-traffic-surge.yaml) as an example. It raises an alert when the current ingress traffic is more than double the traffic from the day before.

### Anatomy of the PromQL

Here is an example that alerts when the current ingress traffic is more than double the traffic from the day before.
Here's the PromQL:

```
(100 *
  (
    (sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m])) by (DstK8S_Namespace) > 1000)
    - sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m] offset 1d)) by (DstK8S_Namespace)
  )
  / sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m] offset 1d)) by (DstK8S_Namespace))
> 100
```

Let's break it down. The base query pattern is this:

`sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m])) by (DstK8S_Namespace)`

This is the bytes rate coming from "openshift-ingress" to any of your workload's namespaces, over the last 30 minutes. This metric is provided by NetObserv (note that depending on your FlowCollector configuration, you may need to use `netobserv_namespace_ingress_bytes_total` instead of `netobserv_workload_ingress_bytes_total`).
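
If you are unsure which of the two metrics is available, you can check (or adjust) the metrics included in your `FlowCollector`. Here is a sketch, assuming the standard `spec.processor.metrics.includeList` field:

```yaml
apiVersion: flows.netobserv.io/v1beta2
kind: FlowCollector
metadata:
  name: cluster
spec:
  processor:
    metrics:
      includeList:
        # Metric names are listed without the "netobserv_" prefix;
        # this enables netobserv_workload_ingress_bytes_total, used in the example above.
        - workload_ingress_bytes_total
```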

Appending ` > 1000` to this query keeps only the rates observed greater than 1KBps, in order to eliminate the noise from low-bandwidth consumers. 1KBps still isn't a lot, so you may want to increase it. Note also that the bytes rate is relative to the sampling ratio defined in the `FlowCollector` agent configuration: with a sampling ratio of 1:100, the actual traffic might be approximately 100 times higher than what the metrics report. Alternatively, the metric `netobserv_agent_sampling_rate` can be used to normalize the byte rates, decoupling the PromQL from the sampling configuration.
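
For instance, here is one possible way to normalize; a sketch, assuming `netobserv_agent_sampling_rate` exposes the configured sampling value and that all agents share the same setting:

```
sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m])) by (DstK8S_Namespace)
  * scalar(max(netobserv_agent_sampling_rate))
```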

In the following parts of the PromQL, you can see `offset 1d`: this runs the same query, one day earlier. You can change that according to your needs; for instance, `offset 5h` takes the traffic from five hours ago as the baseline.
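
For example, here is the same baseline query shifted by five hours:

```
sum(rate(netobserv_workload_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m] offset 5h)) by (DstK8S_Namespace)
```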

This gives us the formula `100 * (<query now> - <query yesterday>) / <query yesterday>`: the percentage of increase compared to yesterday. It can be negative if the bytes rate today is lower than yesterday's.

Finally, the last part, `> 100`, eliminates increases that are lower than 100%, so that we don't get alerted on them.
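
For instance, if a namespace currently receives 30 kBps from the ingress router and was receiving 10 kBps at the same time yesterday, the expression evaluates to `100 * (30 - 10) / 10 = 200`, which is above 100, so the alert fires.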

### Metadata

Some metadata is required to work with Prometheus and AlertManager (not specific to NetObserv):

```yaml
apiVersion: monitoring.openshift.io/v1
kind: AlertingRule
metadata:
  name: netobserv-alerts
  namespace: openshift-monitoring
spec:
  groups:
  - name: NetObservAlerts
    rules:
    - alert: NetObservIncomingBandwidth
      annotations:
        message: |-
          NetObserv is detecting a surge of incoming traffic: current traffic to {{ $labels.DstK8S_Namespace }} has increased by more than 100% since yesterday.
        summary: "Surge in incoming traffic"
      expr: |-
        (100 *
          (
            (sum(rate(netobserv_namespace_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m])) by (DstK8S_Namespace) > 1000)
            - sum(rate(netobserv_namespace_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m] offset 1d)) by (DstK8S_Namespace)
          )
          / sum(rate(netobserv_namespace_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m] offset 1d)) by (DstK8S_Namespace))
        > 100
      for: 1m
      labels:
        app: netobserv
        severity: warning
```

Let's break it down to understand the PromQL expression. The base query pattern is this:
As you can see, the annotation text can leverage the output labels of the PromQL defined previously: here, since we've grouped the results by `DstK8S_Namespace`, we can use that label in our message.

`sum(rate(netobserv_namespace_ingress_bytes_total{SrcK8S_Namespace="openshift-ingress"}[30m])) by (DstK8S_Namespace)`
The severity label should be "critical", "warning" or "info".

This is the bytes rate coming from "openshift-ingress" to any of your workload's namespaces, over the last 30 minutes. Note that depending on your configuration, you may need to use `netobserv_workload_ingress_bytes_total` instead of `netobserv_namespace_ingress_bytes_total`.
On top of that, in order to have the alert picked up in the Health dashboard, NetObserv needs other information:

Appending ` > 1000` to this query keeps only the rates observed greater than 1KBps, in order to eliminate the noise from low-bandwidth consumers. 1KBps still isn't a lot, you may want to increase it. Note also that the bytes rate is relative to the sampling ratio defined in the `FlowCollector` agent configuration. If you have a sampling ratio of 1:100, consider that the actual traffic might be approximately 100 times higher than what is reported by the metrics.
```yaml
annotations:
  netobserv_io_network_health: '{"namespaceLabels":["DstK8S_Namespace"],"threshold":"100","unit":"%","upperBound":"500"}'
labels:
  netobserv: "true"
```

In the following parts of the PromQL, you can see `offset 1d`: this is to run the same query, one day before. You can change that according to your needs, for instance `offset 5h` will be five hours ago.
The label `netobserv: "true"` is required.

Which gives us the formula `100 * (<query now> - <query yesterday>) / <query yesterday>`: it's the percentage of increase compared to yesterday. It can be negative, if the bytes rate today is lower than yesterday.
The annotation `netobserv_io_network_health` is optional, and gives you some control over how the alert renders in the Health page. It is a JSON string that consists of:
- `namespaceLabels`: one or more labels that hold namespaces. When provided, the alert will show up under the "Namespaces" tab.
- `nodeLabels`: one or more labels that hold node names. When provided, the alert will show up under the "Nodes" tab.
- `threshold`: the alert threshold as a string, expected to match the one defined in PromQL.
- `unit`: the data unit, used only for display purposes.
- `upperBound`: an upper bound value used to compute a score on a closed scale. It doesn't have to be the maximum of the metric values, but values above the upper bound are clamped.
- `links`: a list of links to be displayed contextually to the alert. Each link consists of:
- `name`: display name.
- `url`: the link URL.
- `trafficLinkFilter`: an additional filter to inject into the URL for the Network Traffic page.

Finally, the last part, `> 100`, eliminates increases that are lower than 100%, so that we don't get alerted by that.
`namespaceLabels` and `nodeLabels` are mutually exclusive. If neither is provided, the alert shows up under the "Global" tab. A fuller annotation example is sketched below.
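
Putting it together, here is a sketch of a fuller annotation using `links`; the link name and URL are purely illustrative, not a real runbook:

```yaml
annotations:
  # Same fields as above, plus a contextual link rendered next to the alert in the Health page.
  netobserv_io_network_health: '{"namespaceLabels":["DstK8S_Namespace"],"threshold":"100","unit":"%","upperBound":"500","links":[{"name":"Incoming traffic surge runbook","url":"https://example.com/runbooks/incoming-traffic-surge"}]}'
```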