[update] Operational Monitoring new IA #6257

Open
wants to merge 1 commit into
base: v2
25 changes: 8 additions & 17 deletions docs-gb/SUMMARY.md
@@ -10,24 +10,15 @@
* [Managed Kafka](installation/production-environment/kafka/managed-kafka.md)
* [Ingress Controller](installation/production-environment/ingress-controller/istio.md)
* [Test the Installation](installation/test-installation.md)

* [Getting Started](getting-started/README.md)
* [Docker Installation](getting-started/docker-installation.md)
* [Kubernetes Installation](getting-started/kubernetes-installation/README.md)
* [Ansible](getting-started/kubernetes-installation/ansible.md)
* [Helm](getting-started/kubernetes-installation/helm.md)
* [Security](getting-started/kubernetes-installation/security/README.md)
* [AWS MSK mTLS](getting-started/kubernetes-installation/security/aws-msk-mtls.md)
* [AWS MSK SASL](getting-started/kubernetes-installation/security/aws-msk-sasl.md)
* [Azure Event Hub SASL Example](getting-started/kubernetes-installation/security/azure-event-hub-sasl.md)
* [Confluent Cloud Oauth 2.0 Example](getting-started/kubernetes-installation/security/confluent-oauth.md)
* [Confluent Cloud SASL Example](getting-started/kubernetes-installation/security/confluent-sasl.md)
* [Strimzi mTLS Example](getting-started/kubernetes-installation/security/strimzi-mtls.md)
* [Strimzi SASL Example](getting-started/kubernetes-installation/security/strimzi-sasl.md)
* [Reference](getting-started/kubernetes-installation/security/reference.md)
* [Configuration](getting-started/configuration.md)
* [Managed Kafka](getting-started/managed-kafka.md)

## User Guide
* [Getting Started](getting-started/README.md)
* [Seldon CLI](getting-started/cli.md)
* [Operational Monitoring](operational-monitoring/README.md)
* [Observability](operational-monitoring/observability.md)
* [Operational Metrics](operational-monitoring/operational.md)
* [Usage Metrics](operational-monitoring/usage.md)
* [Local Metrics](operational-monitoring/local-metrics-test.md)
* [APIs](apis/README.md)
* [Internal](apis/internal/README.md)
* [Chainer](apis/internal/chainer.md)
21 changes: 21 additions & 0 deletions docs-gb/operational-monitoring/README.md
@@ -0,0 +1,21 @@

Seldon Core 2 provides robust tools for tracking the performance and health of machine learning models in production.

## Monitoring

* Real-time metrics: collects and displays real-time metrics from deployed models, such as response times, error rates, and resource usage.
* Model performance tracking: monitors key performance indicators (KPIs) like accuracy, drift detection, and model degradation over time.
* Custom metrics: allows you to define and track custom metrics specific to your models and use cases.
* Visualization: provides dashboards and visualizations to easily observe the status and performance of models.

There are two kinds of metrics present in Seldon Core 2 that you can monitor:
* [operational metrics](./operational.md)
* [usage metrics](./usage.md)

Operational metrics describe the performance of components in the system. Some examples of common operational
considerations are memory consumption and CPU usage, request latency and throughput, and cache utilisation rates.
Generally speaking, these are the metrics system administrators, operations teams, and engineers will be interested in.

Usage metrics describe the system at a higher and less dynamic level. Some examples include the number of deployed
servers and models, and component versions. These are not typically metrics that engineers need insight into, but
may be relevant to platform providers and operations teams.
128 changes: 128 additions & 0 deletions docs-gb/operational-monitoring/observability.md
@@ -0,0 +1,128 @@
---
description: >-
  Installing kube-prometheus in the same Kubernetes cluster that hosts
  Seldon Core 2.
---

# Monitoring

`kube-prometheus` is a popular open-source project, built around the Prometheus Operator, that provides a complete monitoring and alerting solution for Kubernetes clusters. It combines tools and components to create a monitoring stack for Kubernetes environments.

{% hint style="info" %}
**Note**: Always install Prometheus within the same Kubernetes cluster as Seldon Core 2.
Review comment from @lc525 (Member), Feb 17, 2025:
Is there a reason we specify this constraint? Core 2 should run just fine with an externally-managed Prometheus instance (when configured correctly).

{% endhint %}

Seldon Core 2, along with any deployed models, automatically exposes metrics to Prometheus. By default, certain alerting rules are pre-configured, and an Alertmanager instance is included.

You can install `kube-prometheus` to monitor Seldon components, and ensure that the appropriate `ServiceMonitors` are in place for Seldon deployments. The analytics component is configured with the Prometheus integration. The monitoring for Seldon Core 2 is based on the Prometheus Operator and the related `PodMonitor` and `PrometheusRule` resources.

Monitoring the model deployments in Seldon Core 2 involves:

1. [Installing kube-prometheus](observability.md#installing-kube-prometheus)
2. [Configuring monitoring](observability.md#configuring-monitoring-for-seldon-core-2)

## Prerequisites

1. Install [Seldon Core 2](../installation/production-environment/).
2. Install [Ingress Controller](../installation/production-environment/ingress-controller/).
3. Install [Grafana](https://grafana.com/docs/grafana/latest/setup-grafana/installation/helm/) in the namespace `seldon-monitoring`.

## Installing kube-prometheus

1. Create a namespace for the monitoring components of Seldon Core 2.

```
kubectl create ns seldon-monitoring || echo "Namespace seldon-monitoring already exists"
```
2. Create a YAML file that specifies the initial configuration. For example, use your preferred text editor to create and save `prometheus-values.yaml` with the following content:

```yaml
fullnameOverride: seldon-monitoring
kube-state-metrics:
  extraArgs:
    metric-labels-allowlist: pods=[*]
```

**Note**: Make sure to include `metric-labels-allowlist: pods=[*]` in the Helm values file. If you are using your own Prometheus Operator installation, ensure that the pod labels, particularly `app.kubernetes.io/managed-by=seldon-core`, are part of the collected metrics. These labels are essential for calculating deployment usage rules.
3. Change to the directory that contains the `prometheus-values.yaml` file and run the following command to install version `9.5.12` of `kube-prometheus`.

```
helm upgrade --install prometheus kube-prometheus \
  --version 9.5.12 \
  --namespace seldon-monitoring \
  --values prometheus-values.yaml \
  --repo https://charts.bitnami.com/bitnami
```

When the installation is complete, you should see this:

```
WARNING: There are "resources" sections in the chart not set. Using "resourcesPreset" is not recommended for production. For production installations, please set the following values according to your workload needs:
- alertmanager.resources
- blackboxExporter.resources
- operator.resources
- prometheus.resources
- prometheus.thanos.resources
+info https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/

```
4. Check the status of the installation.

```
kubectl rollout status -n seldon-monitoring deployment/seldon-monitoring-operator
```

When the installation is complete, you should see this:

```
Waiting for deployment "seldon-monitoring-operator" rollout to finish: 0 of 1 updated replicas are available...
deployment "seldon-monitoring-operator" successfully rolled out
```
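Because the monitoring setup relies on `PodMonitor`, `ServiceMonitor`, and `PrometheusRule` resources, it can help to confirm that the matching custom resource definitions exist before you continue. The following is a minimal sketch, not part of the official procedure; the function name `check_monitoring_crds` is hypothetical, and it assumes `kubectl` is already pointed at the cluster where you installed `kube-prometheus`.

```bash
# Sketch: confirm that the Prometheus Operator CRDs used by Seldon Core 2 exist.
# check_monitoring_crds is a hypothetical helper name, not part of any Seldon tooling.
check_monitoring_crds() {
  missing=0
  for crd in podmonitors servicemonitors prometheusrules; do
    # "kubectl get crd <name> -o name" prints the resource name and fails if it is absent.
    if ! kubectl get crd "${crd}.monitoring.coreos.com" -o name; then
      echo "missing CRD: ${crd}.monitoring.coreos.com" >&2
      missing=1
    fi
  done
  # Non-zero return value when at least one CRD was not found.
  return "$missing"
}
```

Call `check_monitoring_crds` after the rollout finishes. A non-zero exit status suggests that the operator installation did not register all of its resource types.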

## Configuring monitoring for Seldon Core 2

1. You can access Prometheus from outside the cluster by running the following commands:

```
echo "Prometheus URL: http://127.0.0.1:9090/"
kubectl port-forward --namespace seldon-monitoring svc/seldon-monitoring-prometheus 9090:9090
```
2. You can access Alertmanager from outside the cluster by running the following commands:

```
echo "Alertmanager URL: http://127.0.0.1:9093/"
kubectl port-forward --namespace seldon-monitoring svc/seldon-monitoring-alertmanager 9093:9093
```
3. Apply the custom RBAC configuration for kube-prometheus.
```bash
CUSTOM_RBAC=https://raw.githubusercontent.com/SeldonIO/seldon-core/v2.8.2/prometheus/rbac

kubectl apply -f ${CUSTOM_RBAC}/cr.yaml
```
4. Configure metrics collection by creating the following `PodMonitor` and `ServiceMonitor` resources.
```bash
PODMONITOR_RESOURCE_LOCATION=https://raw.githubusercontent.com/SeldonIO/seldon-core/v2.8.2/prometheus/monitors

kubectl apply -f ${PODMONITOR_RESOURCE_LOCATION}/agent-podmonitor.yaml
kubectl apply -f ${PODMONITOR_RESOURCE_LOCATION}/envoy-servicemonitor.yaml
kubectl apply -f ${PODMONITOR_RESOURCE_LOCATION}/pipelinegateway-podmonitor.yaml
kubectl apply -f ${PODMONITOR_RESOURCE_LOCATION}/server-podmonitor.yaml
```
When the resources are created, you should see this:
```bash
podmonitor.monitoring.coreos.com/agent created
servicemonitor.monitoring.coreos.com/envoy created
podmonitor.monitoring.coreos.com/pipelinegateway created
podmonitor.monitoring.coreos.com/server created
```
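If you track a different Seldon Core 2 release, the same four manifests can be applied from a single version variable. The sketch below is one way to do that; `SELDON_VERSION`, `monitor_manifests`, and `apply_monitors` are hypothetical names introduced here, and the sketch assumes the manifest file names are the same in the release you choose.

```bash
# Sketch: apply the monitor manifests for a chosen release tag.
# SELDON_VERSION is an assumed variable; v2.8.2 matches the steps above.
SELDON_VERSION="${SELDON_VERSION:-v2.8.2}"
MONITORS_URL="https://raw.githubusercontent.com/SeldonIO/seldon-core/${SELDON_VERSION}/prometheus/monitors"

# monitor_manifests prints one manifest URL per line.
monitor_manifests() {
  for name in agent-podmonitor envoy-servicemonitor pipelinegateway-podmonitor server-podmonitor; do
    echo "${MONITORS_URL}/${name}.yaml"
  done
}

# apply_monitors applies each manifest in turn and stops at the first failure.
apply_monitors() {
  monitor_manifests | while read -r url; do
    kubectl apply -f "$url" || exit 1
  done
}
```

After running `apply_monitors`, `kubectl get podmonitors,servicemonitors --all-namespaces` should list the four resources shown in the output above.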
## Next

### Prometheus User Interface
You can now check the status of Seldon components in Prometheus:
1. Open your browser and navigate to `http://127.0.0.1:9090/` to access the Prometheus UI from outside the cluster.
2. Go to **Status** and select **Targets**.

The status of all the endpoints and the scrape details are displayed.
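
The same information is available from the Prometheus HTTP API at `/api/v1/targets`, which can be useful in scripts. The following is a rough sketch that counts targets by health while the port-forward to `127.0.0.1:9090` is running. The names `PROMETHEUS_URL`, `count_health`, and `report_targets` are hypothetical, and the text matching assumes the compact JSON that Prometheus normally returns; a JSON-aware tool would be more robust.

```bash
# Sketch: summarise scrape target health through the Prometheus HTTP API.
# PROMETHEUS_URL is an assumed variable pointing at the port-forwarded service.
PROMETHEUS_URL="${PROMETHEUS_URL:-http://127.0.0.1:9090}"

# count_health reads the targets JSON on stdin and prints how many targets
# report the given health value ($1: up, down, or unknown).
count_health() {
  grep -o "\"health\":\"$1\"" | wc -l | tr -d ' '
}

# report_targets fetches the target list once and prints the up and down counts.
report_targets() {
  targets_json=$(curl -s "${PROMETHEUS_URL}/api/v1/targets") || return 1
  echo "up: $(printf '%s' "$targets_json" | count_health up)"
  echo "down: $(printf '%s' "$targets_json" | count_health down)"
}
```

Run `report_targets` in a second terminal while the port-forward is active. A non-zero `down` count points to endpoints worth inspecting on the **Targets** page.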

### Grafana
You can view the metrics in a Grafana dashboard after you set Prometheus as the data source and import the `seldon.json` dashboard, located at `seldon-core/v2.8.2/prometheus/dashboards` in the [GitHub repository](https://github.com/SeldonIO/seldon-core/tree/v2/prometheus/dashboards).
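
To import the dashboard you first need a local copy of `seldon.json`. The sketch below downloads it with `curl`. It assumes the raw-file URL follows the same `raw.githubusercontent.com` pattern as the manifests used earlier on this page; `SELDON_VERSION`, `dashboard_url`, and `download_dashboard` are hypothetical names.

```bash
# Sketch: fetch the Seldon dashboard definition for import into Grafana.
# SELDON_VERSION is an assumed variable; v2.8.2 matches the version used on this page.
SELDON_VERSION="${SELDON_VERSION:-v2.8.2}"

# dashboard_url prints the assumed raw URL of seldon.json for the chosen release.
dashboard_url() {
  echo "https://raw.githubusercontent.com/SeldonIO/seldon-core/${SELDON_VERSION}/prometheus/dashboards/seldon.json"
}

# download_dashboard saves the dashboard to the given path (default: seldon.json).
download_dashboard() {
  curl -fsSL -o "${1:-seldon.json}" "$(dashboard_url)"
}
```

After running `download_dashboard`, upload the resulting file through the Grafana import dialog and select the Prometheus data source.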
@@ -1,13 +1,10 @@
# Operational Metrics

While the system is running we collect metrics via Prometheus that allow users to observe different
aspects of SCv2 with regards to throughput, latency, memory, CPU etc. This is in addition to the standard
Kubernetes metrics that are scraped by Prometheus. There is a also a Grafana dashboard (referenced below)
that provides an overview of the system.
While the system runs, Prometheus collects metrics that enable you to observe various aspects of Seldon Core 2, including throughput, latency, memory, and CPU usage. In addition to the standard Kubernetes metrics scraped by Prometheus, a [Grafana dashboard](observability.md/#grafana) provides a comprehensive system overview.

## List of SCv2 metrics
## List of Seldon Core 2 metrics

The list of SCv2 metrics that we are compiling is as follows.
The list of Seldon Core 2 metrics being compiled is as follows.

For the agent that sits next to the inference servers:

@@ -54,29 +51,10 @@ const (
)
```

Many of these metrics are model and pipeline level counters and gauges. We also aggregate some of
these metrics to speed up the display of graphs. We don't presently store per-model histogram metrics
for performance reasons. However, we do presently store per-pipeline histogram metrics.
Many of these metrics are model and pipeline level counters and gauges. Some of these metrics are aggregated to speed up the display of graphs. Currently, per-model histogram metrics are not stored for performance reasons. However, per-pipeline histogram metrics are stored.

This is experimental and these metrics are bound to change to reflect the trends we want to capture as
we get more information about the usage of the system.
This is experimental, and these metrics are expected to evolve to better capture relevant trends as more information becomes available about system usage.

## Grafana dashboard

We have a prebuilt Grafana dashboard that makes use of many of the metrics that we expose.

![kafka](../images/dashboard.png)

### Local Use

Grafana and Prometheus are available when you run Seldon locally. You will be able to connect to the Grafana
dashboard at `http://localhost:3000`. Prometheus will be available at `http://localhost:9090`.

### Kubernetes Installation

Download the dashboard from [SCv2 dashboard](https://github.com/SeldonIO/seldon-core/blob/v2/prometheus/dashboards/seldon.json)
and import it in Grafana, making sure that the data source is pointing to the correct Prometheus store.
Find more information on how to import the dashboard [here](https://grafana.com/docs/grafana/latest/dashboards/export-import/).

### Local Metrics Examples

@@ -1,46 +1,33 @@
# Usage Metrics

There are various interesting system metrics about how Seldon Core v2 is used. These metrics can be
recorded **anonymously** and sent to Seldon by a lightweight, optional, stand-alone component called Hodometer.
There are various interesting system metrics about how Seldon Core 2 is used. These metrics can be recorded **anonymously** and sent to Seldon by a lightweight, optional, stand-alone component called Hodometer.

When provided, these metrics will be used to understand the adoption of Seldon Core v2 and how people interact
with it. For example, knowing how many clusters Seldon Core v2 is running on, if it is used in Kubernetes or for
local development, and how many people are benefitting from features like multi-model serving.
When provided, these metrics are used to understand the adoption of Seldon Core 2 and how you interact with it. For example, knowing how many clusters Seldon Core 2 is running on, if it is used in Kubernetes or for local development, and how many users are benefitting from features such as multi-model serving.

## Architecture

![Hodometer architecture](../images/hodometer-architecture.png)

Hodometer is not an integral part of Seldon Core v2, but rather an independent component which connects to
the public APIs of the Seldon Core v2 scheduler. If deployed in Kubernetes, it will also try to request
some basic information from the Kubernetes API.
Hodometer is not an integral part of Seldon Core 2, but rather an independent component which connects to the public APIs of the Seldon Core 2 scheduler. If deployed in Kubernetes, it requests some basic information from the Kubernetes API.

Recorded metrics are sent to Seldon and, optionally, to any [additional endpoints](#extra-publish-urls) you define.

## Privacy

Hodometer was explicitly designed with privacy of user information and transparency of implementation in mind.

It does not record any sensitive or identifying information. For example, it has no knowledge of IP addresses,
model names, or user information. All information sent to Seldon is anonymised with a completely random
cluster identifier.
It does not record any sensitive or identifying information. For example, it has no knowledge of IP addresses, model names, or user information. All information sent to Seldon is anonymised with a completely random cluster identifier.

Hodometer supports [different information levels](#metrics-levels), so you have full control over what
metrics are provided to Seldon, if any.
Hodometer supports [different information levels](#metrics-levels), so you have full control over what metrics are provided to Seldon, if any.

For transparency, the implementation is fully open-source and designed to be easy to read. The full source
code is available [here](https://github.com/seldonio/seldon-core/tree/v2/hodometer), with metrics defined in
code [here](https://github.com/seldonio/seldon-core/tree/v2/hodometer/pkg/hodometer/metrics.go). See
For transparency, the implementation is fully open-source and designed to be easy to read. The full source code is available [here](https://github.com/seldonio/seldon-core/tree/v2/hodometer), with metrics defined in code [here](https://github.com/seldonio/seldon-core/tree/v2/hodometer/pkg/hodometer/metrics.go). See
[below](#list-of-metrics) for an equivalent table of metrics.

## Performance

Metrics are collected as periodic snapshots a few times per day. They are lightweight to collect, coming
mostly from the Seldon Core v2 scheduler, and are heavily aggregated. As such, they should have minimal
impact on CPU, memory, and network consumption.
Metrics are collected as periodic snapshots a few times per day. They are lightweight to collect, coming mostly from the Seldon Core 2 scheduler, and are heavily aggregated. As such, they should have minimal impact on CPU, memory, and network consumption.

Hodometer does not store anything it records, so does not have any persistent storage. As a result, it
should not be considered a replacement for tools like Prometheus.
Hodometer does not store anything it records, so it does not have any persistent storage. As a result, it should not be considered a replacement for tools like Prometheus.

## Configuration
