SeldonIO · Rajakavitha1 · Feb 12, 2025 · lc525 · Feb 17, 2025
@@ -10,24 +10,15 @@
          * [Managed Kafka](installation/production-environment/kafka/managed-kafka.md) 
       * [Ingress Controller](installation/production-environment/ingress-controller/istio.md)
     * [Test the Installation](installation/test-installation.md)
-
-* [Getting Started](getting-started/README.md)
-  * [Docker Installation](getting-started/docker-installation.md)
-  * [Kubernetes Installation](getting-started/kubernetes-installation/README.md)
-    * [Ansible](getting-started/kubernetes-installation/ansible.md)
-    * [Helm](getting-started/kubernetes-installation/helm.md)
-    * [Security](getting-started/kubernetes-installation/security/README.md)
-      * [AWS MSK mTLS](getting-started/kubernetes-installation/security/aws-msk-mtls.md)
-      * [AWS MSK SASL](getting-started/kubernetes-installation/security/aws-msk-sasl.md)
-      * [Azure Event Hub SASL Example](getting-started/kubernetes-installation/security/azure-event-hub-sasl.md)
-      * [Confluent Cloud Oauth 2.0 Example](getting-started/kubernetes-installation/security/confluent-oauth.md)
-      * [Confluent Cloud SASL Example](getting-started/kubernetes-installation/security/confluent-sasl.md)
-      * [Strimzi mTLS Example](getting-started/kubernetes-installation/security/strimzi-mtls.md)
-      * [Strimzi SASL Example](getting-started/kubernetes-installation/security/strimzi-sasl.md)
-      * [Reference](getting-started/kubernetes-installation/security/reference.md)
-  * [Configuration](getting-started/configuration.md)
-      * [Managed Kafka](getting-started/managed-kafka.md)
+
+## User Guide    
+* [Getting Started](getting-started/README.md)    
   * [Seldon CLI](getting-started/cli.md)
+* [Operational Monitoring](operational-monitoring/README.md)
+  * [Observability](operational-monitoring/observability.md)
+  * [Operational Metrics](operational-monitoring/operational.md)
+  * [Usage Metrics](operational-monitoring/usage.md)
+  * [Local Metrics](operational-monitoring/local-metrics-test.md)
 * [APIs](apis/README.md)
   * [Internal](apis/internal/README.md)
     * [Chainer](apis/internal/chainer.md)

@@ -0,0 +1,21 @@
+
+Seldon Core 2 provides robust tools for tracking the performance and health of machine learning models in production.
+
+## Monitoring
+
+* Real-time metrics: collects and displays real-time metrics from deployed models, such as response times, error rates, and resource usage.
+* Model performance tracking: monitors key performance indicators (KPIs) like accuracy, drift detection, and model degradation over time.
+* Custom metrics: allows you to define and track custom metrics specific to your models and use cases.
+* Visualization: provides dashboards and visualizations to easily observe the status and performance of models.
+
+There are two kinds of metrics present in Seldon Core 2 that you can monitor:
+* [operational metrics](./operational.md)
+* [usage metrics](./usage.md)
+
+Operational metrics describe the performance of components in the system. Some examples of common operational
+considerations are memory consumption and CPU usage, request latency and throughput, and cache utilisation rates.
+Generally speaking, these are the metrics system administrators, operations teams, and engineers will be interested in.
+
+Usage metrics describe the system at a higher and less dynamic level. Some examples include the number of deployed
+servers and models, and component versions. These are not typically metrics that engineers need insight into, but
+may be relevant to platform providers and operations teams.
@@ -0,0 +1,128 @@
+---
+description: >-
+  Installing kube-prometheus-stack in the same Kubernetes cluster that hosts the
+  Seldon Core 2.
+---
+
+# Monitoring
+
+`kube-prometheus`, also known as Prometheus Operator, is a popular open-source project that provides complete monitoring and alerting solutions for Kubernetes clusters. It combines tools and components to create a monitoring stack for Kubernetes environments.
+
+{% hint style="info" %}
+**Note**: Always install Prometheus within the same Kubernetes cluster as Seldon Core 2.
+{% endhint %}
+
+Seldon Core 2, along with any deployed models, automatically exposes metrics to Prometheus. By default, certain alerting rules are pre-configured, and an Alertmanager instance is included.
+
+You can install `kube-prometheus` to monitor Seldon components, and ensure that the appropriate `ServiceMonitors` are in place for Seldon deployments. The analytics component is configured with the Prometheus integration. The monitoring for Seldon Core 2 is based on the Prometheus Operator and the related `PodMonitor` and `PrometheusRule` resources.
+
+Monitoring the model deployments in Seldon Core 2 involves:
+
+1. [Installing kube-prometheus](observability.md#installing-kube-prometheus)
+2. [Configuring monitoring](observability.md#configuring-monitoring-for-seldon-core-2)
+
+## Prerequisites
+
+1. Install [Seldon Core 2](../installation/production-environment/).
+2. Install [Ingress Controller](../installation/production-environment/ingress-controller/).
+3. Install [Grafana](https://grafana.com/docs/grafana/latest/setup-grafana/installation/helm/) in the namespace `seldon-monitoring`.
+
+## Installing kube-prometheus
+
+1.  Create a namespace for the monitoring components of Seldon Core 2.
+
+    ```
+    kubectl create ns seldon-monitoring || echo "Namespace seldon-monitoring already exists"
+    ```
+2.  Create a YAML file to specify the initial configuration. For example, create the `prometheus-values.yaml` file. Use your preferred text editor to create and save the file with the following content:
+
+    ```yaml
+    fullnameOverride: seldon-monitoring
+    kube-state-metrics:
+      extraArgs:
+        metric-labels-allowlist: pods=[*]
+    ```
+
+    **Note**: Make sure to include `metric-labels-allowlist: pods=[*]` in the Helm values file. If you are using your own Prometheus Operator installation, ensure that the pod labels, particularly `app.kubernetes.io/managed-by=seldon-core`, are part of the collected metrics. These labels are essential for calculating deployment usage rules.
+3.  Change to the directory that contains the `prometheus-values.yaml` file and run the following command to install version `9.5.12` of `kube-prometheus`.
+
+    ```
+    helm upgrade --install prometheus kube-prometheus \
+     --version 9.5.12 \
+     --namespace seldon-monitoring \
+     --values prometheus-values.yaml \
+     --repo https://charts.bitnami.com/bitnami
+    ```
+
+    When the installation is complete, you should see this:
+
+    ```
+    WARNING: There are "resources" sections in the chart not set. Using "resourcesPreset" is not recommended for production. For production installations, please set the following values according to your workload needs:
+      - alertmanager.resources
+      - blackboxExporter.resources
+      - operator.resources
+      - prometheus.resources
+      - prometheus.thanos.resources
+    +info https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/
+
+    ```
+4.  Check the status of the installation.
+
+    ```
+    kubectl rollout status -n seldon-monitoring deployment/seldon-monitoring-operator
+    ```
+
+    When the installation is complete, you should see this:
+
+    ```
+    Waiting for deployment "seldon-monitoring-operator" rollout to finish: 0 of 1 updated replicas are available...
+    deployment "seldon-monitoring-operator" successfully rolled out
+    ```
+
+## Configuring monitoring for Seldon Core 2
+
+1.  You can access Prometheus from outside the cluster by running the following commands:
+
+    ```
+    echo "Prometheus URL: http://127.0.0.1:9090/"
+    kubectl port-forward --namespace seldon-monitoring svc/seldon-monitoring-prometheus 9090:9090
+    ```
+2.  You can access Alertmanager from outside the cluster by running the following commands:
+
+    ```
+    echo "Alertmanager URL: http://127.0.0.1:9093/"
+    kubectl port-forward --namespace seldon-monitoring svc/seldon-monitoring-alertmanager 9093:9093
+    ```
+3.  Apply the Custom RBAC Configuration settings for kube-prometheus.
+    ```bash
+    CUSTOM_RBAC=https://raw.githubusercontent.com/SeldonIO/seldon-core/v2.8.2/prometheus/rbac
+
+    kubectl apply -f ${CUSTOM_RBAC}/cr.yaml
+    ```
+4.  Configure metrics collection by creating the following `PodMonitor` and `ServiceMonitor` resources.
+    ```bash
+    PODMONITOR_RESOURCE_LOCATION=https://raw.githubusercontent.com/SeldonIO/seldon-core/v2.8.2/prometheus/monitors
+
+    kubectl apply -f ${PODMONITOR_RESOURCE_LOCATION}/agent-podmonitor.yaml
+    kubectl apply -f ${PODMONITOR_RESOURCE_LOCATION}/envoy-servicemonitor.yaml
+    kubectl apply -f ${PODMONITOR_RESOURCE_LOCATION}/pipelinegateway-podmonitor.yaml
+    kubectl apply -f ${PODMONITOR_RESOURCE_LOCATION}/server-podmonitor.yaml
+    ```
+    When the resources are created, you should see this:
+    ```bash
+    podmonitor.monitoring.coreos.com/agent created
+    servicemonitor.monitoring.coreos.com/envoy created
+    podmonitor.monitoring.coreos.com/pipelinegateway created
+    podmonitor.monitoring.coreos.com/server created
+    ```  
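To confirm that the monitors exist, a check along the following lines may help. This is a minimal sketch: the function name `list_seldon_monitors` is hypothetical, and because the namespace the manifests install into is not stated above, it searches all namespaces.

```bash
# Hypothetical helper: list the PodMonitor and ServiceMonitor resources
# known to the cluster. The namespace is an assumption, so search all of them.
list_seldon_monitors() {
  kubectl get podmonitors,servicemonitors --all-namespaces
}
```

Running `list_seldon_monitors` after step 4 should list the `agent`, `pipelinegateway`, and `server` PodMonitors and the `envoy` ServiceMonitor if they were created.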
+## Next
+
+### Prometheus User Interface
+You can now check the status of Seldon components in Prometheus:
+1. Open your browser and navigate to `http://127.0.0.1:9090/` to access Prometheus UI from outside the cluster.
+1. Go to **Status** and select **Targets**.
+
+The status of all the endpoints and the scrape details are displayed.
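
The same target information is also available from the Prometheus HTTP API at `/api/v1/targets`. As a rough sketch, the hypothetical helper `count_down_targets` below counts the targets that Prometheus reports as down. It assumes the port-forward to `127.0.0.1:9090` is still running and that the API returns compact JSON (no space after the colon), so a plain `grep` is enough; a JSON-aware tool would be more robust.

```bash
# Hypothetical helper: count scrape targets that Prometheus reports as down.
# Reads the JSON body of GET /api/v1/targets on stdin and prints a number.
count_down_targets() {
  # grep -o prints one line per match; "|| true" keeps the pipeline
  # successful when there are no matches, so the count is simply 0.
  { grep -o '"health":"down"' || true; } | wc -l | tr -d '[:space:]'
}

# Usage, with the Prometheus port-forward active in another terminal:
#   curl -s http://127.0.0.1:9090/api/v1/targets | count_down_targets
```

A result of `0` suggests that every target found in the response is being scraped successfully; any other value is the number of entries to inspect on the **Targets** page.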
+
+### Grafana
+You can view the metrics in a Grafana dashboard after you set Prometheus as the data source and import the `seldon.json` dashboard located at `seldon-core/v2.8.2/prometheus/dashboards` in the [GitHub repository](https://github.com/SeldonIO/seldon-core/tree/v2/prometheus/dashboards).
@@ -1,13 +1,10 @@
 # Operational Metrics
 
-While the system is running we collect metrics via Prometheus that allow users to observe different
-aspects of SCv2 with regards to throughput, latency, memory, CPU etc. This is in addition to the standard
-Kubernetes metrics that are scraped by Prometheus. There is a also a Grafana dashboard (referenced below)
-that provides an overview of the system.
+While the system runs, Prometheus collects metrics that enable you to observe various aspects of Seldon Core 2, including throughput, latency, memory, and CPU usage. In addition to the standard Kubernetes metrics scraped by Prometheus, a [Grafana dashboard](observability.md#grafana) provides a comprehensive system overview.
 
-## List of SCv2 metrics
+## List of Seldon Core 2 metrics
 
-The list of SCv2 metrics that we are compiling is as follows.
+The Seldon Core 2 metrics that are collected are as follows.
 
 For the agent that sits next to the inference servers:
 
@@ -54,29 +51,10 @@ const (
 )
 ```
 
-Many of these metrics are model and pipeline level counters and gauges. We also aggregate some of
-these metrics to speed up the display of graphs. We don't presently store per-model histogram metrics
-for performance reasons. However, we do presently store per-pipeline histogram metrics.
+Many of these metrics are model-level and pipeline-level counters and gauges. Some of these metrics are aggregated to speed up the display of graphs. Currently, per-model histogram metrics are not stored for performance reasons. However, per-pipeline histogram metrics are stored.
 
-This is experimental and these metrics are bound to change to reflect the trends we want to capture as
-we get more information about the usage of the system.
+This is experimental, and these metrics are expected to evolve to better capture relevant trends as more information becomes available about system usage.
 
-## Grafana dashboard
-
-We have a prebuilt Grafana dashboard that makes use of many of the metrics that we expose.
-
-![kafka](../images/dashboard.png)
-
-### Local Use
-
-Grafana and Prometheus are available when you run Seldon locally. You will be able to connect to the Grafana
-dashboard at `http://localhost:3000`. Prometheus will be available at `http://localhost:9090`.
-
-### Kubernetes Installation
-
-Download the dashboard from [SCv2 dashboard](https://github.com/SeldonIO/seldon-core/blob/v2/prometheus/dashboards/seldon.json)
-and import it in Grafana, making sure that the data source is pointing to the correct Prometheus store.
-Find more information on how to import the dashboard [here](https://grafana.com/docs/grafana/latest/dashboards/export-import/).
 
 ### Local Metrics Examples
 

@@ -1,46 +1,33 @@
 # Usage Metrics
 
-There are various interesting system metrics about how Seldon Core v2 is used. These metrics can be
-recorded **anonymously** and sent to Seldon by a lightweight, optional, stand-alone component called Hodometer.
+There are various interesting system metrics about how Seldon Core 2 is used. These metrics can be recorded **anonymously** and sent to Seldon by a lightweight, optional, stand-alone component called Hodometer.
 
-When provided, these metrics will be used to understand the adoption of Seldon Core v2 and how people interact
-with it. For example, knowing how many clusters Seldon Core v2 is running on, if it is used in Kubernetes or for
-local development, and how many people are benefitting from features like multi-model serving.
+When provided, these metrics are used to understand the adoption of Seldon Core 2 and how you interact with it. For example, they show how many clusters Seldon Core 2 is running on, whether it is used in Kubernetes or for local development, and how many users are benefitting from features such as multi-model serving.
 
 ## Architecture
 
 ![Hodometer architecture](../images/hodometer-architecture.png)
 
-Hodometer is not an integral part of Seldon Core v2, but rather an independent component which connects to
-the public APIs of the Seldon Core v2 scheduler. If deployed in Kubernetes, it will also try to request
-some basic information from the Kubernetes API.
+Hodometer is not an integral part of Seldon Core 2, but rather an independent component which connects to the public APIs of the Seldon Core 2 scheduler. If deployed in Kubernetes, it requests some basic information from the Kubernetes API.
 
 Recorded metrics are sent to Seldon and, optionally, to any [additional endpoints](#extra-publish-urls) you define.
 
 ## Privacy
 
 Hodometer was explicitly designed with privacy of user information and transparency of implementation in mind.
 
-It does not record any sensitive or identifying information. For example, it has no knowledge of IP addresses,
-model names, or user information. All information sent to Seldon is anonymised with a completely random
-cluster identifier.
+It does not record any sensitive or identifying information. For example, it has no knowledge of IP addresses, model names, or user information. All information sent to Seldon is anonymised with a completely random cluster identifier.
 
-Hodometer supports [different information levels](#metrics-levels), so you have full control over what
-metrics are provided to Seldon, if any.
+Hodometer supports [different information levels](#metrics-levels), so you have full control over what metrics are provided to Seldon, if any.
 
-For transparency, the implementation is fully open-source and designed to be easy to read. The full source
-code is available [here](https://github.com/seldonio/seldon-core/tree/v2/hodometer), with metrics defined in
-code [here](https://github.com/seldonio/seldon-core/tree/v2/hodometer/pkg/hodometer/metrics.go). See
+For transparency, the implementation is fully open-source and designed to be easy to read. The full source code is available [here](https://github.com/seldonio/seldon-core/tree/v2/hodometer), with metrics defined in code [here](https://github.com/seldonio/seldon-core/tree/v2/hodometer/pkg/hodometer/metrics.go). See
 [below](#list-of-metrics) for an equivalent table of metrics.
 
 ## Performance
 
-Metrics are collected as periodic snapshots a few times per day. They are lightweight to collect, coming
-mostly from the Seldon Core v2 scheduler, and are heavily aggregated. As such, they should have minimal
-impact on CPU, memory, and network consumption.
+Metrics are collected as periodic snapshots a few times per day. They are lightweight to collect, coming mostly from the Seldon Core 2 scheduler, and are heavily aggregated. As such, they should have minimal impact on CPU, memory, and network consumption.
 
-Hodometer does not store anything it records, so does not have any persistent storage. As a result, it
-should not be considered a replacement for tools like Prometheus.
+Hodometer does not store anything it records, so it does not need any persistent storage. As a result, it should not be considered a replacement for tools like Prometheus.
 
 ## Configuration